
feat(search): open_taki v2 protocol - LLM-powered extraction with parallel indexing #3036

Open
flash7777 wants to merge 2 commits into
opencloud-eu:mainfrom
flash7777:feat/taki-protocol-v2

Conversation

@flash7777


Summary

Drop-in support for open_taki as an Apache Tika replacement. When open_taki is detected at the configured Tika URL, the search service automatically uses the v2 protocol for enhanced extraction with parallel indexing.

No config changes needed: the same SEARCH_EXTRACTOR_TIKA_TIKA_URL env var is used. Just point it at an open_taki instance instead of Apache Tika.

What open_taki provides over Tika

  • LLM Vision OCR instead of Tesseract: reads scanned documents Tika can't
  • Parallel extraction: 8 concurrent workers (configurable), ~18x faster reindexing
  • Document intelligence: entities, summaries, embeddings per document
  • Content routing: meta always to bleve, content selectively, embeddings for vector search
  • Image descriptions: actual content, not just EXIF metadata
  • 268 MB container instead of 1 GB Java; uses Collabora for office conversion

Changes

content/content.go: TakiExtraction struct in Document (method, summary, entities, embedding, routing)
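To make the shape of that struct concrete, here is a minimal sketch of how the listed fields could be modelled and decoded. Only the field names (method, summary, entities, embedding, routing) and the `content_target` routing key come from this description; the Go types, JSON tags, the `Routing` type and the `parseExtraction` helper are assumptions for illustration and may differ from the actual definitions in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Routing is a hypothetical shape for the routing hints. Only the
// content_target key is mentioned in the PR description.
type Routing struct {
	ContentTarget string `json:"content_target"`
}

// TakiExtraction mirrors the fields listed in the PR description.
// Field types and JSON names are assumptions, not the PR's real code.
type TakiExtraction struct {
	Method    string    `json:"method"`
	Summary   string    `json:"summary"`
	Entities  []string  `json:"entities"`
	Embedding []float32 `json:"embedding"`
	Routing   Routing   `json:"routing"`
}

// parseExtraction decodes a v2-style JSON payload into a TakiExtraction.
func parseExtraction(data []byte) (TakiExtraction, error) {
	var ext TakiExtraction
	err := json.Unmarshal(data, &ext)
	return ext, err
}

func main() {
	// Made-up payload for illustration only.
	payload := []byte(`{"method":"vision-ocr","summary":"Invoice","entities":["ACME"],"embedding":[0.1,0.2],"routing":{"content_target":"vector"}}`)
	ext, err := parseExtraction(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("method=%s target=%s dims=%d\n", ext.Method, ext.Routing.ContentTarget, len(ext.Embedding))
}
```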

content/tika.go: Auto-detect open_taki via the /tika health endpoint. When detected:

  • Sends X-Taki-Protocol: v2 with feature headers
  • Parses extended response (meta, entities, summary, embedding, routing)
  • Falls back to classic Tika v1 if v2 request fails
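The v2-then-v1 behaviour described above can be sketched roughly as follows. This is a standalone illustration, not the code in content/tika.go: the `extract` and `put` helpers, the use of `PUT /tika`, the fake server and its response bodies are all assumptions, and the unnamed "feature headers" are left out because this description does not list them. Only the `X-Taki-Protocol: v2` header and the fallback rule are taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// put sends the document to baseURL+"/tika" (assumed path), optionally
// with the X-Taki-Protocol header, and returns the response body.
func put(client *http.Client, baseURL string, doc []byte, protocol string) (string, error) {
	req, err := http.NewRequest(http.MethodPut, baseURL+"/tika", bytes.NewReader(doc))
	if err != nil {
		return "", err
	}
	if protocol != "" {
		req.Header.Set("X-Taki-Protocol", protocol)
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return string(body), nil
}

// extract tries the v2 protocol first when useV2 is set and falls back to
// a plain (v1) request if the v2 request fails. It returns the body and
// the protocol that actually produced it.
func extract(client *http.Client, baseURL string, doc []byte, useV2 bool) (string, string, error) {
	if useV2 {
		if body, err := put(client, baseURL, doc, "v2"); err == nil {
			return body, "v2", nil
		}
	}
	body, err := put(client, baseURL, doc, "")
	if err != nil {
		return "", "", err
	}
	return body, "v1", nil
}

// newFakeServer is a test stub; with rejectV2 it refuses v2 requests.
func newFakeServer(rejectV2 bool) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Taki-Protocol") == "v2" {
			if rejectV2 {
				http.Error(w, "v2 unsupported", http.StatusBadRequest)
				return
			}
			fmt.Fprint(w, `{"summary":"ok"}`)
			return
		}
		fmt.Fprint(w, "plain text")
	}))
}

func main() {
	srv := newFakeServer(true) // server that does not accept v2
	defer srv.Close()
	body, proto, err := extract(srv.Client(), srv.URL, []byte("doc"), true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("protocol=%s body=%q\n", proto, body)
}
```

Returning the protocol alongside the body lets the caller decide whether extended fields (entities, summary, embedding, routing) are available to parse.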

search/service.go: Parallel IndexSpace when taki v2 is detected:

  • Configurable worker pool (SEARCH_EXTRACTOR_TIKA_MAX_WORKERS, default 8)
  • Content routing: excludes content from bleve when routing.content_target=vector
  • Sequential fallback for classic Tika

config/content.go + defaults/defaultconfig.go: MaxWorkers config field
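As an illustration of the default handling, the following standalone sketch resolves the worker count from `SEARCH_EXTRACTOR_TIKA_MAX_WORKERS` and falls back to 8. The `maxWorkers` helper is hypothetical and reads the variable directly; the real service presumably wires the value through its own config structs, and how it treats invalid values is not stated here (falling back to the default is an assumption).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultMaxWorkers matches the default stated in the PR description.
const defaultMaxWorkers = 8

// maxWorkers returns the configured worker count, or the default when the
// variable is unset, not a number, or less than 1 (assumed behaviour).
// getenv is injected so the function can be tested without touching the
// process environment.
func maxWorkers(getenv func(string) string) int {
	n, err := strconv.Atoi(getenv("SEARCH_EXTRACTOR_TIKA_MAX_WORKERS"))
	if err != nil || n < 1 {
		return defaultMaxWorkers
	}
	return n
}

func main() {
	fmt.Println("workers:", maxWorkers(os.Getenv))
}
```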

Backward compatible

  • Without open_taki: behaves exactly as before (classic Tika path unchanged)
  • With open_taki but without v2 support: graceful fallback to v1
  • No new env vars required: MaxWorkers is optional (default 8)

Tested

Deployed and tested on cloud.brandis.eu with OpenCloud kosmos edition:

  • 135 files indexed, 93% success rate (failures: binary files + videos without ffmpeg)
  • 43 files/minute with 8 workers (vs 2.4/min sequential, roughly an 18x speedup)
  • Scan PDFs: full text + 20 entities + summary where Tika returned empty

flash and others added 2 commits June 28, 2026 12:45
Auto-detect open_taki as drop-in Tika replacement. When detected,
uses v2 protocol for LLM-powered extraction with routing:
- Meta (title, entities, summary) → always bleve
- Content → bleve or vector-only based on routing config
- Embedding → returned for vector DB integration
- Graceful fallback to classic Tika v1 if v2 fails

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When open_taki is detected, IndexSpace uses a configurable worker pool
for parallel extraction instead of sequential processing. This leverages
the LLM backend's batch capacity (vLLM max-num-seqs=16).

- 8 workers by default (configurable via SEARCH_EXTRACTOR_TIKA_MAX_WORKERS)
- Only active with open_taki v2 (classic Tika stays sequential)
- Workers use direct upsert (no batch, thread-safe)
- IsTaki() exported on Tika extractor for runtime detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codacy-production


Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 0 duplication

Metric Results
Duplication 0

