feat(search): open_taki v2 protocol — LLM-powered extraction with parallel indexing by flash7777 · Pull Request #3036 · opencloud-eu/opencloud

flash7777 · 2026-06-28T11:37:25Z

Summary

Drop-in support for open_taki as an Apache Tika replacement. When open_taki is detected at the configured Tika URL, the search service automatically switches to the v2 protocol, which adds enhanced extraction and parallel indexing.

No config changes needed — same SEARCH_EXTRACTOR_TIKA_TIKA_URL env var. Just point it at an open_taki instance instead of Apache Tika.

What open_taki provides over Tika

LLM Vision OCR instead of Tesseract — reads scanned documents Tika can't
Parallel extraction — 8 concurrent workers (configurable), ~18x faster reindexing
Document intelligence — entities, summaries, embeddings per document
Content routing — meta always to bleve, content selectively, embeddings for vector search
Image descriptions — actual content, not just EXIF metadata
268 MB container instead of 1 GB Java — uses Collabora for office conversion

Changes

content/content.go — TakiExtraction struct in Document (method, summary, entities, embedding, routing)

content/tika.go — Auto-detect open_taki via /tika health endpoint. When detected:

Sends X-Taki-Protocol: v2 with feature headers
Parses extended response (meta, entities, summary, embedding, routing)
Falls back to classic Tika v1 if v2 request fails

search/service.go — Parallel IndexSpace when taki v2 detected:

Configurable worker pool (SEARCH_EXTRACTOR_TIKA_MAX_WORKERS, default 8)
Content routing: excludes content from bleve when routing.content_target=vector
Sequential fallback for classic Tika

config/content.go + defaults/defaultconfig.go — MaxWorkers config field
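The intended semantics of the setting (optional, default 8) can be illustrated with a small stand-alone helper. This is not how the PR wires the field into the service's config structs; `maxWorkers` and the `lookup` parameter are hypothetical, and treating invalid or non-positive values as "use the default" is an assumption.

```go
package main

import "strconv"

// defaultMaxWorkers is the default stated in the PR description.
const defaultMaxWorkers = 8

// maxWorkers resolves the worker count from SEARCH_EXTRACTOR_TIKA_MAX_WORKERS.
// lookup has the same signature as os.LookupEnv so it can be swapped in tests.
func maxWorkers(lookup func(string) (string, bool)) int {
	raw, ok := lookup("SEARCH_EXTRACTOR_TIKA_MAX_WORKERS")
	if !ok {
		return defaultMaxWorkers
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultMaxWorkers // assumed: invalid values fall back to the default
	}
	return n
}
```

In a running service the call would be `maxWorkers(os.LookupEnv)`.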

Backward compatible

Without open_taki: behaves exactly as before (classic Tika path unchanged)
With open_taki but without v2 support: graceful fallback to v1
No new env vars required — MaxWorkers optional (default 8)

Tested

Deployed and tested on cloud.brandis.eu with OpenCloud kosmos edition:

135 files indexed, 93% success rate (failures: binary files + videos without ffmpeg)
43 files/minute with 8 workers (vs. 2.4/min sequential, roughly an 18x speedup)
Scanned PDFs: full text, 20 entities, and a summary where Tika returned empty output

Auto-detect open_taki as drop-in Tika replacement. When detected, uses v2 protocol for LLM-powered extraction with routing:

- Meta (title, entities, summary) → always bleve
- Content → bleve or vector-only based on routing config
- Embedding → returned for vector DB integration
- Graceful fallback to classic Tika v1 if v2 fails

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
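The routing rules in this commit message can be expressed as a small pure function. The sketch below is an illustration under assumptions: the `extraction` and `bleveDoc` types and the `route` function are hypothetical, and the `"vector"` value for `ContentTarget` is taken from the `routing.content_target=vector` case mentioned in the PR description.

```go
package main

// extraction is a hypothetical, flattened view of a v2 result.
type extraction struct {
	Title         string
	Summary       string
	Entities      []string
	Content       string
	Embedding     []float32
	ContentTarget string // value of routing.content_target
}

// bleveDoc is a hypothetical shape of what gets written to the bleve index.
type bleveDoc struct {
	Title    string
	Summary  string
	Entities []string
	Content  string
}

// route applies the rules from the commit message:
// meta always goes to bleve, content only when it is not vector-only,
// and the embedding is handed back for a vector store.
func route(e extraction) (doc bleveDoc, vector []float32) {
	doc = bleveDoc{Title: e.Title, Summary: e.Summary, Entities: e.Entities}
	if e.ContentTarget != "vector" {
		doc.Content = e.Content
	}
	return doc, e.Embedding
}
```

Keeping the decision in one function means a classic Tika result, which has an empty `ContentTarget`, takes the full-text path unchanged.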

When open_taki is detected, IndexSpace uses a configurable worker pool for parallel extraction instead of sequential processing. This leverages the LLM backend's batch capacity (vLLM max-num-seqs=16).

- 8 workers by default (configurable via SEARCH_EXTRACTOR_TIKA_MAX_WORKERS)
- Only active with open_taki v2 (classic Tika stays sequential)
- Workers use direct upsert (no batch, thread-safe)
- IsTaki() exported on Tika extractor for runtime detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codacy-production · 2026-06-28T11:39:00Z

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: 0 duplication

flash and others added 2 commits June 28, 2026 12:45

flash7777 mentioned this pull request Jun 28, 2026

feat(search): Qdrant vector store for semantic search #3037

Open

flash7777 wants to merge 2 commits into opencloud-eu:main from flash7777:feat/taki-protocol-v2
