
feat(semantic): [merge candidate build] FTS5 index and search tools, provider-aware typed embeddings, reranking, diagnostics, and eval harness #87

Open
Zireael wants to merge 208 commits into cortexkit:main from Zireael:semantic-search-enhancement

@Zireael Zireael commented Jun 2, 2026

Contributor

Summary

This PR hardens AFT’s semantic retrieval layer and adds the surrounding machinery needed to configure, inspect, rerank, and benchmark it.

The main changes are:

  • provider-aware semantic indexing;
  • typed vector storage with correct metric selection;
  • fingerprint-based index invalidation;
  • optional local model2vec embeddings;
  • optional reranking through chat-completion or /v1/rerank endpoints;
  • semantic diagnostics, doctor output, and JSONL eval support;
  • Semble-inspired benchmark scripts for local regression testing;
  • opt-in FTS5 search graph.

Default behavior should remain conservative. Users can continue using the existing semantic path, while advanced users can opt into local model2vec, OpenAI-compatible embedding servers, rerankers, and FTS5 benchmark paths.

Why this PR exists

AFT’s previous semantic search path was too thin for serious agent use. It could embed and rank, but it lacked enough structure around provider capabilities, index invalidation, vector format handling, diagnostics, and repeatable evaluation.

This PR addresses those missing pieces. The goal is to make semantic retrieval easier to configure, safer to change, easier to debug, and measurable enough to improve without guessing.

Key user-facing improvements

Provider-aware semantic configuration

Embedding backends now carry explicit capability information such as:

  • supported output encoding;
  • vector kind;
  • distance metric;
  • dimension expectations;
  • batch-size limits;
  • provider/model fingerprint fields.

This prevents invalid combinations from silently producing a bad index. Switching meaningful provider settings triggers the appropriate rebuild or cache invalidation path.
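
As a rough illustration of the invalidation decision described above, here is a minimal sketch. The struct fields, the enum, and `diff_fingerprints` are hypothetical names chosen for this example and are not taken from the PR's code; the only behavior it tries to mirror is that changes affecting stored vectors force a rebuild, a query-prompt-only change clears the query cache, and a rebuild takes precedence when both change.

```rust
/// Hypothetical fingerprint; the real one in the PR carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndexFingerprint {
    pub backend: String,
    pub model: String,
    pub dimension: usize,
    pub vector_kind: String,
    pub file_policy_hash: String,
    pub query_prompt_hash: String,
}

/// What the caller should do after comparing two fingerprints.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FingerprintDiff {
    NoChange,
    ClearQueryCache,
    Rebuild,
}

/// Compares the fingerprint stored with the index against the current config.
pub fn diff_fingerprints(old: &IndexFingerprint, new: &IndexFingerprint) -> FingerprintDiff {
    // Any field that changes what is stored on disk invalidates the index.
    let stored_vectors_changed = old.backend != new.backend
        || old.model != new.model
        || old.dimension != new.dimension
        || old.vector_kind != new.vector_kind
        || old.file_policy_hash != new.file_policy_hash;

    if stored_vectors_changed {
        // A rebuild wins even if the query prompt changed as well.
        FingerprintDiff::Rebuild
    } else if old.query_prompt_hash != new.query_prompt_hash {
        // Only query-side embeddings are affected; stored vectors stay valid.
        FingerprintDiff::ClearQueryCache
    } else {
        FingerprintDiff::NoChange
    }
}
```

Checking the rebuild condition first keeps the precedence explicit, so a combined change can never be downgraded to a cache clear.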

Typed vector storage

Stored vectors are no longer treated as undifferentiated f32 arrays. The search path can distinguish dense vectors, decoded int8 vectors, and binary-packed vectors.

That matters because binary vectors need Hamming-style comparison, not cosine scoring. The PR adds the plumbing needed for correct metric selection instead of relying on one ranking method for every provider output shape.
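
To make the metric distinction concrete, the sketch below pairs each vector kind with its own comparison. It is a simplified stand-in, assuming a two-variant `StoredVector` enum and a `similarity` helper that are illustrative only; the PR's actual types and score normalization may differ.

```rust
/// Hypothetical stored-vector type with only two of the kinds discussed.
#[derive(Debug, Clone, PartialEq)]
pub enum StoredVector {
    /// Dense float vector, compared with cosine similarity.
    DenseF32(Vec<f32>),
    /// Bit-packed binary vector (8 dimensions per byte), compared by Hamming distance.
    BinaryPacked(Vec<u8>),
}

/// Cosine similarity; returns 0.0 for mismatched lengths, empty, or zero-norm input.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a * norm_b)).clamp(-1.0, 1.0)
}

/// Number of differing bits between two packed vectors of equal length.
pub fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Higher means more similar. Returns None when kinds or lengths do not match,
/// so a mixed index cannot silently produce a meaningless score.
pub fn similarity(query: &StoredVector, candidate: &StoredVector) -> Option<f32> {
    match (query, candidate) {
        (StoredVector::DenseF32(q), StoredVector::DenseF32(c)) if q.len() == c.len() => {
            Some(cosine_similarity(q, c))
        }
        (StoredVector::BinaryPacked(q), StoredVector::BinaryPacked(c))
            if q.len() == c.len() && !q.is_empty() =>
        {
            // Convert a bit-difference count into a similarity in [0, 1].
            let total_bits = (q.len() * 8) as f32;
            Some(1.0 - hamming_distance(q, c) as f32 / total_bits)
        }
        _ => None,
    }
}
```

Returning `None` for mismatched kinds is one way to keep a dense query from being scored against a binary index; an error type would work equally well.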

Local model2vec support

The branch adds an optional local model2vec backend, intended for fast local code retrieval. The current recommended model for this workstream is Potion Code 16M:

{
  "semantic_search": true,
  "semantic": {
    "backend": "model2vec",
    "model": "minishlab/potion-code-16M",
    "model_path": "D:/AI/LLM_models/potion-code-16M"
  }
}

This backend is local-only and is gated behind the semantic-model2vec Cargo feature.

Optional reranking

Search results can be reranked after initial retrieval.

Supported reranker API modes:

  • chat: OpenAI-compatible /v1/chat/completions;
  • rerank: cross-encoder-style /v1/rerank.

Example cross-encoder configuration:

{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "CodeRankEmbed",
    "base_url": "http://127.0.0.1:8090/v1",
    "api_key_env": "OPENAI_API_KEY",

    "rerank_enabled": true,
    "rerank_api_type": "rerank",
    "rerank_model": "GTE-Reranker-Modernbert",
    "rerank_base_url": "http://127.0.0.1:8090/v1",
    "rerank_max_candidates": 30,
    "rerank_max_candidate_chars_cross_encoder": 512
  }
}

Reranking is fail-safe. If the reranker fails, returns malformed output, or exceeds safety limits, search falls back to the original retrieval order.
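
The fallback behavior can be sketched roughly as follows. This is an assumption-laden illustration, not the code in semantic_rerank.rs: `apply_rerank_order` is a hypothetical helper, the reranker result is modeled as a plain `Result<Vec<usize>, String>`, and appending unmentioned candidates in retrieval order is a design choice made for this example.

```rust
/// Reorders `candidates` using indices returned by a reranker.
///
/// Returns the reordered list plus any warnings. On reranker failure the
/// original retrieval order is returned unchanged.
pub fn apply_rerank_order<T: Clone>(
    candidates: &[T],
    ranked: Result<Vec<usize>, String>,
) -> (Vec<T>, Vec<String>) {
    let mut warnings = Vec::new();

    let indices = match ranked {
        Ok(indices) => indices,
        Err(reason) => {
            warnings.push(format!("reranker failed, keeping retrieval order: {reason}"));
            return (candidates.to_vec(), warnings);
        }
    };

    let mut seen = vec![false; candidates.len()];
    let mut ordered = Vec::with_capacity(candidates.len());

    for index in indices {
        if index >= candidates.len() {
            // Out-of-bounds indices are reported and skipped.
            warnings.push(format!("reranker index {index} is out of bounds"));
        } else if !seen[index] {
            // Duplicate indices are ignored after their first occurrence.
            seen[index] = true;
            ordered.push(candidates[index].clone());
        }
    }

    // Candidates the reranker never mentioned keep their retrieval order,
    // so a partial response cannot drop results.
    for (index, candidate) in candidates.iter().enumerate() {
        if !seen[index] {
            ordered.push(candidate.clone());
        }
    }

    (ordered, warnings)
}
```

With candidates `["a", "b", "c"]` and reranker output `[2, 2, 7, 0]`, this sketch yields `["c", "a", "b"]` and one out-of-bounds warning.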

Diagnostics and doctor output

The PR adds semantic health-reporting surfaces for:

  • active config;
  • index state;
  • provider/model state;
  • model2vec health;
  • warning conditions;
  • suggested fixes.

This is meant to make retrieval failures debuggable. Examples: wrong model path, unavailable provider, stale index, failed reranker, or feature-gate mismatch.

Eval and benchmark support

The semantic eval harness accepts JSONL cases with expected paths and symbols, then reports retrieval metrics such as recall@k and MRR.

Example shape:

{"query":"where is JWT validation handled","expected_paths":["src/auth/session.ts"],"expected_symbols":["validateJwt"],"top_k":10}

The benchmark scripts are intended for regression testing and local backend comparison. They should not be treated as publishable model-quality claims yet.
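
For readers unfamiliar with the two metrics, here is a small sketch of how recall@k and MRR can be computed over ranked paths. The function names are hypothetical, and the sketch assumes the reciprocal rank only counts hits within the first k results; the harness in semantic_eval.rs may define the cutoff differently.

```rust
/// Fraction of expected paths that appear in the first `k` results.
pub fn recall_at_k(expected: &[&str], results: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top = &results[..results.len().min(k)];
    let hits = expected.iter().filter(|path| top.contains(*path)).count();
    hits as f64 / expected.len() as f64
}

/// Reciprocal rank of the first expected path within the first `k` results.
/// Returns 0.0 when no expected path appears in that window.
pub fn reciprocal_rank_at_k(expected: &[&str], results: &[&str], k: usize) -> f64 {
    results
        .iter()
        .take(k)
        .position(|result| expected.contains(result))
        .map(|index| 1.0 / (index as f64 + 1.0))
        .unwrap_or(0.0)
}

/// Mean of per-case scores; averaging reciprocal ranks this way gives MRR.
pub fn mean(values: &[f64]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    values.iter().sum::<f64>() / values.len() as f64
}
```

For a case expecting `a.rs` and `b.rs` with results `x.rs, a.rs, y.rs, b.rs` and k = 3, recall@k is 0.5 (only `a.rs` is in the window) and the reciprocal rank is 0.5 (first hit at rank 2).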

How to try it

Minimal semantic search

{
  "semantic_search": true
}

Then use the normal AFT search path:

aft_search({ "query": "authentication middleware" })

OpenAI-compatible embedding backend

{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "CodeRankEmbed",
    "base_url": "http://127.0.0.1:8090/v1",
    "api_key_env": "OPENAI_API_KEY",
    "max_batch_size": 16
  }
}

Local model2vec backend

{
  "semantic_search": true,
  "semantic": {
    "backend": "model2vec",
    "model": "minishlab/potion-code-16M",
    "model_path": "/path/to/potion-code-16M"
  }
}

Reranking with /v1/rerank

{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "CodeRankEmbed",
    "base_url": "http://127.0.0.1:8090/v1",

    "rerank_enabled": true,
    "rerank_api_type": "rerank",
    "rerank_model": "GTE-Reranker-Modernbert",
    "rerank_base_url": "http://127.0.0.1:8090/v1",
    "rerank_max_candidates": 30
  }
}

Benchmark quick start

Build with the semantic features you want to test:

cargo build --release -p agent-file-tools --features semantic-model2vec,semantic-fts5

Clone the small pinned pilot corpus:

bun run benchmarks/semble/corpus.ts sync --pilot

Run the pilot comparison:

bun run benchmarks/semble/pilot.ts --binary ./target/release/aft --k 10

The current pilot corpus covers:

  • axum;
  • serde;
  • express;
  • pydantic;
  • gin.

For embedding and reranker server comparisons:

AFT_BENCH_EMBED_BASE_URL=http://127.0.0.1:8090/v1 \
AFT_BENCH_EMBED_MODEL=CodeRankEmbed \
AFT_BENCH_RERANK_BASE_URL=http://127.0.0.1:8090/v1 \
AFT_BENCH_RERANK_MODEL=GTE-Reranker-Modernbert \
bun run benchmarks/semble/run-semble-bench.ts \
  --profile c \
  --k 10 \
  --rerank \
  --rerank-candidates 30 \
  --binary ./target/release/aft

Recommended benchmark interpretation:

  • use the pilot for fast regression checks;
  • use rerank candidate sweeps to detect candidate starvation;
  • compare semantic-only, hybrid, FTS5, and reranked variants separately;
  • do not use the pilot as a final claim about model quality.

Review guide

This PR is easier to review by subsystem than by file order.

1. Config and trust boundary

Review:

  • crates/aft/src/config.rs
  • packages/opencode-plugin/src/config.ts
  • packages/pi-plugin/src/config.ts

Check:

  • Rust and TypeScript schemas match;
  • user-only provider fields are not accepted from untrusted project config;
  • prompt template fields are handled consistently;
  • defaults match docs and tests;
  • feature-gated settings fail clearly when the feature is unavailable.

2. Semantic index and vector storage

Review:

  • crates/aft/src/semantic_index.rs
  • crates/aft/src/vector_store.rs
  • crates/aft/src/local_embed.rs
  • crates/aft/src/commands/semantic_search.rs

Check:

  • vector kind selects the correct metric;
  • fingerprint changes trigger correct rebuild/cache behavior;
  • stale vectors are pruned;
  • changed/deleted/excluded files do not leave searchable old entries;
  • fallback paths are explicit and observable.

3. Reranking

Review:

  • crates/aft/src/semantic_rerank.rs
  • crates/aft/src/commands/semantic_search.rs

Check:

  • chat and /v1/rerank request shapes;
  • sorting behavior for provider response formats;
  • duplicate and out-of-bounds reranker indices;
  • fallback behavior;
  • response-size caps;
  • candidate count defaults;
  • cross-encoder snippet limits.

4. Diagnostics, doctor, and eval

Review:

  • crates/aft/src/semantic_diagnostics.rs
  • crates/aft/src/semantic_doctor.rs
  • crates/aft/src/semantic_eval.rs
  • crates/aft/src/commands/semantic_doctor.rs
  • crates/aft/src/commands/semantic_eval.rs

Check:

  • diagnostics expose useful failures without leaking secrets;
  • minimal diagnostic mode does not hide important reranker/provider failures;
  • eval metrics match documented behavior;
  • JSONL parsing is strict enough to avoid misleading benchmark output.

5. Benchmark scripts

Review:

  • benchmarks/SEMBLE-BENCHMARKS.md
  • benchmarks/semble/run-semble-bench.ts
  • benchmarks/semble/pilot.ts
  • benchmarks/semble/repos-pilot.json
  • benchmarks/semble/annotations/*

Check:

  • failed backends are reported instead of silently skipped;
  • benchmark profiles are reproducible;
  • local endpoint assumptions are documented;
  • FTS5 results are labeled experimental unless the FTS5 e2e path is complete;
  • pilot results are not overstated.

Suggested validation before merge

cargo fmt --check
cargo clippy -p agent-file-tools --all-targets --features semantic-model2vec,semantic-fts5 -- -D warnings
cargo test -p agent-file-tools --features semantic-model2vec,semantic-fts5
bun install
bun run typecheck
bun test

Optional benchmark smoke:

cargo build --release -p agent-file-tools --features semantic-model2vec,semantic-fts5
bun run benchmarks/semble/corpus.ts sync --pilot
bun run benchmarks/semble/pilot.ts --binary ./target/release/aft --k 10

Optional local reranker comparison:

AFT_BENCH_EMBED_BASE_URL=http://127.0.0.1:8090/v1 \
AFT_BENCH_EMBED_MODEL=CodeRankEmbed \
AFT_BENCH_RERANK_BASE_URL=http://127.0.0.1:8090/v1 \
AFT_BENCH_RERANK_MODEL=GTE-Reranker-Modernbert \
bun run benchmarks/semble/run-semble-bench.ts \
  --profile c \
  --k 10 \
  --rerank \
  --rerank-candidates 30 \
  --binary ./target/release/aft

Known limits

The following should remain explicit:

  • FTS5 is opt-in and should not be presented as the default search backend.
  • Benchmark tooling is useful for local comparison and regression detection, not final external claims.
  • The five-repo pilot is intentionally small.
  • Full Semble, RepoQA, SWE-Explore, and CORE-Bench adapters are follow-up work.
  • Reranking quality depends heavily on candidate pool size. A weak rerank delta may indicate candidate starvation rather than a bad reranker.
  • Semantic search should complement exact symbol and lexical search, not replace them.

Feature traceability

Relevant workstreams:

  • aft-t6p — semantic search upgrade;
  • aft-t6p.m2v — model2vec / Potion Code 16M backend;
  • aft-t6p.tok — tokenizer fixture compatibility;
  • aft-t6p.bench — benchmark adaptation;
  • aft-t6p.bench.quick — targeted quick benchmark mode;
  • aft-fts5e2e — FTS5 end-to-end feature work.

Merge posture

I would merge this when the semantic retrieval foundation is safe, tested, and diagnosable.

I would not block the PR on full benchmark coverage or publishable external comparisons. Those are useful follow-up work, but they should not be required for merging the retrieval substrate.

I would block the PR if any of these are true:

  • untrusted project config can redirect embedding or reranker endpoints;
  • switching vector format/provider/model can reuse an incompatible index;
  • reranker failures are silent in normal diagnostics;
  • benchmark scripts skip failed backends without surfacing that failure;
  • FTS5 is documented as complete before its e2e path is actually validated.

The FTS5 index will make it possible to introduce more advanced symbol operations later.




Summary by cubic

Production‑ready semantic retrieval with provider‑aware typed embeddings, safe reranking, contextualized doc‑chunk indexing, full diagnostics/doctor/eval, and an optional FTS5 path. CI/release build with semantic-model2vec and semantic-fts5 by default, plus a cross‑platform build-aft workflow and Docker‑based Rust validation.

  • New Features

    • Reranking: overfetch → rerank → truncate; chat and /v1/rerank; 2 MiB body cap; cross‑encoder support; fence stripping; overflow chunking (max_embed_tokens, chunk_overlap_tokens).
    • Contextualized embeddings: Perplexity document‑chunk mode with split/retry diagnostics; model prompt profiles (use_model_profiles), document_prompt_template, deterministic ordering.
    • Typed vectors/storage: correct metrics per kind (incl. Hamming for BinaryPacked), file manifest + chunk_hash, fingerprint invalidation, include/exclude file policy, per‑file caps.
    • Backends/tooling: local model2vec with HF auto‑download, catalog/health/version checks; OpenAI‑compatible embeddings; experimental FTS5 search/symbol commands; Semble benchmark suite (pilot, runners, reports); eval harness runs real searches with per‑case k.
    • Build/CI: default‑feature release builds (semantic-model2vec, semantic-fts5), manual multi‑platform build-aft workflow, and Docker‑based Rust validation for fmt/check/clippy/test.
  • Bug Fixes

    • Trust/safety: strip rerank_prompt_template from untrusted configs; SSRF guard for rerank_base_url; traversal‑safe jsonl_path.
    • Rerank correctness/UX: score‑sorted and de‑duplicated outputs; out‑of‑bounds index warnings; fix candidate starvation; more_available signals correctly after truncation.
    • Stability/diagnostics: clamp cosine NaNs; bounded streaming reads; warning dedup; surface RerankerFailure in minimal mode; accurate query cache‑hit.
    • Indexing/eval: hardened contextualized split/retry mapping; fixed chunk line bounds; configure parses all semantic/JSONL fields; eval harness wired to real search with per‑case k.

Written for commit 92744a1. Summary will update on new commits.


Greptile Summary

This alpha PR replaces the prototype semantic search subsystem with a production-oriented retrieval pipeline: typed vectors (DenseF32, Int8SourceDecoded, BinaryPacked), provider capability profiles, fingerprint-driven index invalidation, background build with cooperative cancellation, optional reranking, diagnostics/doctor/eval tooling, and a security trust boundary that prevents project configs from redirecting embeddings to attacker-controlled endpoints.

  • Core pipeline overhaul (semantic_index.rs, vector_store.rs): typed vectors with correct distance metrics (Hamming for binary, cosine for f32), stale-vector pruning, priority-ordered cold-start build, and fingerprint-based rebuild decisions.
  • Reranking pipeline (semantic_rerank.rs): supports chat-completions and cross-encoder /v1/rerank endpoints with multiple response-format parsers, body-size cap, and safe fallback to original cosine order on any failure.
  • Diagnostics, doctor, and eval (semantic_diagnostics.rs, semantic_doctor.rs, semantic_eval.rs): JSONL diagnostic logging, health-check command, and JSONL-based eval harness with recall@k and MRR scoring.

Confidence Score: 4/5

Safe to merge with awareness that several known issues from the previous review cycle remain open (reranker failures silent in minimal mode, more_available undercount, data-format rerank ordering, non-ASCII eval file corruption), plus the new gap where project configs can supply an adversarial rerank prompt template.

The new rerank_prompt_template stripping gap is the only genuinely new finding in this update; all other open issues were already identified in earlier review rounds. The change is an alpha build with documented limitations and strong test coverage (~93 tests). The trust boundary is mostly correct, the TypeScript enum mismatches are now fixed, and the vector-store and Hamming-distance math are sound.

packages/opencode-plugin/src/config.ts — add rerank_prompt_template to the stripProjectSemanticFields function. crates/aft/src/semantic_diagnostics.rs — surface RerankerFailure in minimal output mode.

Security Review

  • Prompt injection via rerank_prompt_template in project config (packages/opencode-plugin/src/config.ts:1481): query_prompt_template and document_prompt_template are correctly stripped from project-level configs; rerank_prompt_template is not. A hostile repository can supply an adversarial reranker prompt that manipulates search result ordering for the user. Data exfiltration is not possible (the reranker endpoint URL is user-only), but result integrity can be silently degraded.
  • No new credential-leakage, injection, or authentication bypass issues introduced by this PR. The trust-boundary stripping of backend, base_url, api_key_env, rerank_base_url, and rerank_api_key_env from project configs is correctly implemented.
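
The trust-boundary rule described in this review can be sketched as a simple key filter. The real stripping lives in the TypeScript plugin config; the Rust below is only an illustration, with a hypothetical `strip_project_semantic_fields` helper, string-valued settings as a simplification, and a key list assembled from the fields named in this review (including rerank_prompt_template, which the review says should also be stripped).

```rust
use std::collections::BTreeMap;

/// Keys that, per the review above, must only come from user-level config.
/// Assembled for illustration; the authoritative list is in the plugin code.
const USER_ONLY_KEYS: &[&str] = &[
    "backend",
    "base_url",
    "api_key_env",
    "rerank_base_url",
    "rerank_api_key_env",
    "query_prompt_template",
    "document_prompt_template",
    "rerank_prompt_template",
];

/// Removes user-only keys from an untrusted project-level semantic config.
/// Returns the sanitized settings and the names of the keys that were dropped,
/// so the caller can surface a warning instead of ignoring them silently.
pub fn strip_project_semantic_fields(
    project: &BTreeMap<String, String>,
) -> (BTreeMap<String, String>, Vec<String>) {
    let mut sanitized = BTreeMap::new();
    let mut stripped = Vec::new();

    for (key, value) in project {
        if USER_ONLY_KEYS.contains(&key.as_str()) {
            stripped.push(key.clone());
        } else {
            sanitized.insert(key.clone(), value.clone());
        }
    }

    (sanitized, stripped)
}
```

A deny-list like this has to be updated whenever a new sensitive field is added; an allow-list of project-safe keys would fail closed instead, which may be the safer default.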

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/aft/src/semantic_rerank.rs | New reranking pipeline with body-size cap, fence stripping, and multiple response-format parsers. The build_rerank_endpoint trailing-slash fix is present. The data/results formats in extract_indices_from_rerank_results return insertion order rather than score-sorted order (flagged in previous review cycle). |
| crates/aft/src/semantic_eval.rs | New eval harness with JSONL parsing, recall@k, and MRR scoring. strip_trailing_commas is implemented but corrupts non-ASCII bytes (previously flagged). The score_case docstring says hits beyond k don't affect first_hit_rank, but the code does set it, so MRR includes full-list ranks, not @k ranks. |
| crates/aft/src/semantic_diagnostics.rs | New diagnostic subsystem. format_warning_minimal returns None for RerankerFailure, silently suppressing reranker failures in the default Minimal output mode (flagged in previous review cycle). |
| crates/aft/src/commands/semantic_search.rs | Reranking, diagnostics, and fusion-limit integration. more_available is computed against fusion_limit rather than top_k, so candidates between top_k and fusion_limit are silently dropped after reranking without setting more_available = true (flagged in previous cycle). OOB reranker-index warning still gated on diagnostics_enabled. |
| packages/opencode-plugin/src/config.ts | TypeScript Zod schema updated with 16+ new fields including corrected enum values (base64_binary, binary_packed, dot_product, perplexity). Trust-boundary stripping function added. rerank_prompt_template is exposed in the schema but not stripped from project configs, unlike query_prompt_template and document_prompt_template. |
| crates/aft/src/vector_store.rs | New VectorStore trait with FlatF32 (cosine) and FlatBinaryHamming (Hamming distance) implementations. Score normalization, orphan pruning, and stale-vector pruning look correct. Hamming similarity computation is sound. |
| crates/aft/src/config.rs | Provider capability profiles, RerankApiType enum, trust-boundary doc comments, and expanded SemanticBackendConfig. RerankApiType uses snake_case serde, matching the TypeScript ["chat", "rerank"] values. |
| crates/aft/src/compress/trust.rs | Added security-focused tests: atomic writes, multi-project trust, idempotent untrust, and reload survival. No logic changes. |

Comments Outside Diff (2)

  1. packages/opencode-plugin/src/config.ts, line 37-54 (link)

    P1 TypeScript enum values don't match the Rust serde strings — config will fail to deserialize

    Several new enum schemas use values that don't align with the Rust serde representation:

    • SemanticOutputEncodingEnum allows "binary", "ubinary", "int8", "uint8" but Rust OutputEncoding deserializes from "base64_binary" and "base64_int8".
    • SemanticStorageStrategyEnum allows "flat" and "binary_pack" but Rust StorageStrategy expects "native_f32" and "binary_packed".
    • SemanticInputModeEnum includes "chunk_extracts" and "contextualized" but Rust InputMode only has "flat_texts" and "document_chunks".
    • SemanticDistanceMetricEnum uses "dot" but Rust DistanceMetric expects "dot_product".
    • SemanticBackendEnum is missing the new "perplexity" variant added to Rust.

    A user who follows the TypeScript autocomplete and picks output_encoding: "int8" will pass TypeScript validation but receive a deserialization error (or silent fallback to default) from the Rust binary at runtime.

  2. crates/aft/src/commands/semantic_search.rs, line 116-119 (link)

    P1 more_available understates available results when reranking is active

    fused_more_available is now computed as results.len() > fusion_limit (e.g., > 20) rather than > top_k (e.g., > 10). After reranking, results.truncate(top_k) discards any candidates between positions top_k and fusion_limit, but more_available has already been set and stays false. Concretely: if the fused pool yields 15 candidates (top_k=10, rerank_max_candidates=20), fused_more_available = 15 > 20 = false, more_available = false, and the 5 reranked-but-discarded candidates are silently dropped with no "more results" hint surfaced to the agent.

    Capture the pool size before truncation and fold it into more_available after the rerank block, before results.truncate(top_k).
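
    The suggested fix can be sketched as below. `finalize_results` is a hypothetical helper written for this illustration, not a function from semantic_search.rs; it only shows the ordering of "record pool size, then truncate".

```rust
/// Truncates reranked results to `top_k` and reports whether more were available.
///
/// `fused_more_available` is the flag computed earlier against the fusion limit.
pub fn finalize_results<T>(
    mut results: Vec<T>,
    top_k: usize,
    fused_more_available: bool,
) -> (Vec<T>, bool) {
    // Record the pool size before truncation so discarded candidates still count.
    let pool_size = results.len();
    results.truncate(top_k);
    let more_available = fused_more_available || pool_size > top_k;
    (results, more_available)
}
```

    With the numbers from the comment above (15 candidates, top_k = 10, fused flag false), this returns 10 results and `more_available = true`.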

Reviews (29): Last reviewed commit: "feat(semantic): add model prompt profile..."

Zireael and others added 30 commits May 24, 2026 11:10
Add scripts, docs, Dockerfile, and package.json scripts for Docker-based
Rust validation (fmt/check/clippy/test) so Windows users without MSVC
Build Tools can still validate Rust code.

- scripts/docker-rust.ps1: PowerShell script supporting fmt/check/clippy/
  test/validate/shell tasks with persistent Docker volumes
- Dockerfile.rust: minimal Rust image with rustfmt + clippy pre-installed
- docs/docker-rust-validation.md: full usage and design documentation
- package.json: 6 new docker:rust:* convenience scripts

Design: Linux-target validation via rust:1-bookworm, persistent cargo
volumes for caching, fail-fast sequential validation.
- SemanticFilePolicy config struct with include_code/include_docs/
  include_configs/binary_detection/generated_file_detection/globs
- parse_semantic_files_config handler in configure.rs
- File policy evaluation: should_index_file(), is_generated_file(),
  is_config_file(), is_docs_file()
- Docs chunker: collect_docs_chunks() with heading-based splitting
  for markdown, splitting by file for other doc types
- collect_chunks routes doc files through docs chunker, skips
  binary/generated/config files per policy
- SemanticIndexFingerprint extended with file_policy_hash and
  docs_chunker_version; diff() triggers rebuild on policy change
- build_with_progress/refresh_stale_files accept &SemanticFilePolicy
- compute_file_policy_hash() deterministic hash of policy fields
- Re-export SemanticFilePolicy from semantic_index module
- All test callers updated with &SemanticFilePolicy::default()
…iority ordering, backoff

- CancellationToken (Arc<AtomicU64> generation counter) for cooperative build cancellation on reconfigure
- Cancel old semantic index builds instead of detaching when config changes
- Priority file ordering: README/docs first, then core source, then tests, then rest
- Embedding backoff: exponential retry with jitter for remote provider rate limits
- SemanticIndexStatus::Partial variant with completeness percentage for partial builds
- Search reports partial index state during cold start
- Phase-boundary cancellation checks between model init, disk read, incremental refresh, and full rebuild
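
For illustration only, a generation-counter cancellation token along the lines named in this commit might look like the sketch below. The method names (`begin_build`, `cancel_all`, `is_cancelled`) and the `BuildTicket` type are assumptions for this example, not the PR's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared generation counter; every new build or cancel bumps it.
#[derive(Clone, Default)]
pub struct CancellationToken {
    generation: Arc<AtomicU64>,
}

/// Handle held by one build; it is cancelled once the generation moves on.
pub struct BuildTicket {
    generation: Arc<AtomicU64>,
    started_at: u64,
}

impl CancellationToken {
    pub fn new() -> Self {
        Self::default()
    }

    /// Starts a build and implicitly cancels every earlier one.
    pub fn begin_build(&self) -> BuildTicket {
        let started_at = self.generation.fetch_add(1, Ordering::SeqCst) + 1;
        BuildTicket {
            generation: Arc::clone(&self.generation),
            started_at,
        }
    }

    /// Cancels all outstanding builds without starting a new one.
    pub fn cancel_all(&self) {
        self.generation.fetch_add(1, Ordering::SeqCst);
    }
}

impl BuildTicket {
    /// Cooperative check, intended to be called at phase boundaries.
    pub fn is_cancelled(&self) -> bool {
        self.generation.load(Ordering::SeqCst) != self.started_at
    }
}
```

A build would call `is_cancelled()` between phases such as model init, disk read, and refresh, and return early when it reports true.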
Add Perplexity backend with InputMode::DocumentChunks support for
contextualized embedding where chunks carry document-level context.

- SemanticBackend::Perplexity variant with config, profile, engine
- DocumentChunks/PerDocumentChunks/DocumentEmbeddings structs
- embed_document_chunks() routes Perplexity to grouped embedding API
- build_with_progress_contextualized() groups chunks by document
- Wire configure.rs to branch on input_mode: DocumentChunks
- SemanticEmbeddingModel::input_mode() public accessor
- EmbeddingModelProfile with contextualized_supported guard
- Response validation: index continuity, missing documents, dimension
…to trait-backed module

Bead: aft-t6p.12

Extracts Vec<EmbeddingEntry> storage and search from SemanticIndexSnapshot
into a VectorStore trait with FlatF32VectorStore implementation. This
decouples the storage layer from the lifecycle logic and prepares for
alternative backends (binary Hamming, approximate ANN).

Key changes:
- vector_store.rs: VectorStore trait + ScoredChunk/PruneStats types
- FlatF32VectorStore: flat scan with cosine similarity (preserves existing
  behaviour exactly)
- FlatBinaryHammingVectorStore: forward-looking Hamming-search impl
- SemanticIndexSnapshot delegates search/len/prune/entries to store
- Fixed dimension-sync bug where set_dimension updated the snapshot
  dimension but not the store dimension, causing search to return 0
- EmbeddingEntry and IndexedFileMetadata made pub for trait compatibility
On Windows, use copyFileSync for the binary replacement (which overwrites
the target — renameSync fails with EEXIST). If it fails, the original
binary at binaryPath is preserved.

The temp file cleanup is now wrapped in its own try/catch so a cleanup
failure does NOT propagate as a download failure — the binary was already
successfully placed at binaryPath.

Addresses PR cortexkit#69 cubic review finding P2.
Implement bead aft-t6p.24: file identity manifest + vector ownership records.

Changes:
- **FileRecord struct**: identity record with content_hash, size_bytes, mtime,
  language, document_kind, inclusion_policy_hash, indexed_at
- **file_manifest on SemanticIndexSnapshot**: HashMap<PathBuf, FileRecord>
  tracking which files produced which vectors, enabling precise stale-vector
  pruning when files are edited, deleted, or excluded
- **V8 serialization format**: extends V7 with per-entry chunk_hash (after
  each vector) and file manifest block (after all entry vectors). Full
  backward compatibility with V1-V7 reads.
- **chunk_hash on EmbeddingEntry**: deterministic hash of chunk content fields
  for tracing which version of a chunk produced a stored vector
- **compute_chunk_hash**: blake3-based deterministic hash
- **build_manifest_from_store helper**: populates file_manifest from store's
  file_metadata, called in all builder functions (build_from_chunks,
  build_with_progress_contextualized, refresh_stale_files) and from_bytes
  for V1-V7 cache migration
- **next_chunk_id, fingerprint_string**: forward-looking fields on snapshot
  for future unique ID assignment and fingerprint tracking
…rmalization, and model profiles

Adds aft-t6p.20 (Typed embedding vector representation +
storage-strategy resolution):

- TypedVector (source-side) and StoredVector (persisted) enums
  with DenseF32, DenseInt8, BinaryPacked, and Quantized variants
- StorageStrategy (NativeF32, DecodeNormalizeF32, BinaryPacked)
- VectorKind enum for runtime type tagging
- DistanceMetric (Cosine, DotProduct, Euclidean, Hamming)
- NormalizationPolicy (AlreadyNormalized, NormalizeOnInsertQuery,
  NotApplicable)
- EmbeddingModelProfile fields: source_vector_kind, stored_vector_kind,
  metric, normalization
- convert_vector() / validate_compatible() on EmbeddingModelProfile
- blake3 dependency for chunk hashing
… + dummy base_url for Perplexity profile test

Two fixes for `fingerprint_invalidation_tests`:
- Mock HTTP server now lowercases header names before matching
  Content-Length (reqwest/hyper sends lowercase `content-length:`).
- `base64_int8_profile_from_config_selects_correctly` test provides a
  dummy `base_url` for the Perplexity backend (required by `from_config`).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add StorageStrategy::BinaryPacked variant for packed-bit vector storage
- Add EmbeddingModelProfile::perplexity_binary() with BinaryPacked → Hamming path
- Wire from_config to select perplexity_binary profile when Base64Binary encoding
- Implement parse_embedding_value for Base64Binary (decode → 0.0/1.0 f32 vec)
- Implement into_stored for TypedVector::BinaryPacked (requires BinaryPacked strategy)
- Update validate_config and validate_compatible to accept Base64Binary+BinaryPacked
- Replace old "not yet supported" test with parse_embedding_value_base64_binary_succeeds
- 886/893 tests pass (7 pre-existing Docker failures)

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add semantic_diagnostics module with SearchDiagnostics, SearchPipelineType,
SearchWarning, SearchMetricsCollector, PhaseTimer, score_statistics,
top1_margin. Instrument handle_semantic_search with per-phase timing
and warning collection. Wire SearchMetricsCollector into AppContext.
17 new tests, 902/910 lib tests pass (8 pre-existing Docker failures).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add SemanticDiagnosticsLogger with file append, rotation (50 MB), and
  retention cleanup (file-deletion based on mtime)
- Add SearchDiagnosticsEvent struct for JSONL serialization with
  raw_query redaction (opt-in via include_raw_queries) and snippet
  placeholder (include_snippets)
- Add config fields: jsonl_logging, jsonl_path, include_raw_queries,
  include_snippets, retention_days to SemanticBackendConfig
- Add lazy-init diagnostics_logger on AppContext with
  resolve_diagnostics_log_path helper (env var → project root → ~/.cache)
- Wire JSONL record into handle_semantic_search diagnostics block
- 4 new tests: raw query redaction, raw query inclusion, disk write
  verification, missing-file recovery
- 907/914 lib tests pass (7 pre-existing Docker failures)

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
…rch output

Add DiagnosticsOutputMode enum (Off/Minimal/Verbose) and output_mode field
to SemanticBackendConfig. Implement format_diagnostics_prefix() for
Minimal (warnings only) and Verbose (scores + latency + warnings)
output modes. Wire into handle_semantic_search response text.
4 new tests, 25 diagnostics tests total. 910/918 lib tests pass
(8 pre-existing Docker failures).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add optional reranking via OpenAI-compatible chat endpoint. When
enabled, aft_search overfetches candidates, sends them to a reranker
model, and re-sorts by relevance. Falls back gracefully on any error.

- Add RerankConfig fields to SemanticBackendConfig (rerank_enabled,
  rerank_model, rerank_base_url, rerank_api_key_env, rerank_timeout_ms,
  rerank_max_candidates)
- Create semantic_rerank.rs with RerankerClient, RerankOutcome enum,
  and rerank_candidates function
- Add RerankerFailure warning variant to SearchWarning
- Wire reranking into handle_semantic_search (overfetch → rerank → re-sort)
- Add rerank_latency_ms to SearchDiagnostics and SearchDiagnosticsEvent
- Include rerank latency in verbose diagnostics output
- 6 unit tests for reranker parsing, skip conditions, and failure handling

All 25 diagnostics + 6 reranker tests pass. 917/924 total tests pass
(7 pre-existing Docker infrastructure failures).
Add 40+ unit tests to fingerprint_invalidation_tests covering:
- SemanticBackendConfig deserialization (minimal, all-fields, defaults)
- EmbeddingModelProfile validation for all encoding types
- TypedVector conversion and StoredVector roundtrip
- convert_vector and validate_compatible rejection paths
- Distance metric auto-resolution for f32/int8/binary
- base64_int8 signed int8 decode correctness
- Template hashing, enum roundtrips, resolve helpers

Minor: add #[derive(Debug)] to StoredVector for test ergonomics.

Closes aft-t6p.6.1
Add 6 new tests to fingerprint_invalidation_tests covering:
- file_policy_hash mismatch triggers rebuild
- docs_chunker_version mismatch triggers rebuild
- multi-field changes still trigger rebuild
- rebuild+query_prompt: rebuild wins
- only query_prompt change: ClearQueryCache
- non-fingerprint field changes: NoChange

Total: 22 fingerprint tests. Closes aft-t6p.6.2
Add 29 tests covering:
- is_generated_file: protobuf, minified, dist, build, generated, dart
- is_doc_extension and is_config_extension validation
- classify_semantic_file for code/doc/config
- collect_docs_chunks markdown heading splitting
- SemanticFilePolicy defaults and builtin globs
- FileRecord field population
- build_manifest_from_store construction and cleanup

Closes aft-t6p.6.3
… tests

Add 23 tests covering:
- FlatF32VectorStore: search, empty, dimension mismatch, CRUD, prune, stats
- FlatBinaryHammingVectorStore: search, ranking, prune, delete, stats
- hamming_distance and popcount64 correctness
- Binary decode: byte-aligned, non-byte-aligned, padding, error

Closes aft-t6p.6.4
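For readers unfamiliar with binary vector search, here is a minimal sketch of Hamming-distance ranking over packed bit vectors. It is not the Rust implementation: it counts bits per byte rather than per 64-bit word (as popcount64 presumably does), and the names `popcount8`, `hammingDistance`, and `rankByHamming` are hypothetical.

```typescript
// Count set bits in one byte using Kernighan's trick (clears the lowest set bit each step).
function popcount8(byte: number): number {
  let n = byte & 0xff;
  let count = 0;
  while (n !== 0) {
    n &= n - 1;
    count++;
  }
  return count;
}

// Hamming distance between two packed bit vectors of equal byte length.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length} bytes`);
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    distance += popcount8(a[i] ^ b[i]);
  }
  return distance;
}

interface StoredBinaryVector { id: string; bits: Uint8Array; }

// Brute-force ("flat") search: smaller distance ranks first.
function rankByHamming(query: Uint8Array, stored: StoredBinaryVector[], k: number): string[] {
  return stored
    .map((v) => ({ id: v.id, distance: hammingDistance(query, v.bits) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map((v) => v.id);
}
```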
Add 8 tests covering:
- SemanticIndexLifecycle: cold start, set/get, failed+error, all variants
- SemanticIndexSnapshot: search ranking, immutability after clone
- VectorStore: prune_stale_vectors, prune_orphans

Closes aft-t6p.6.5
Add 10 tests covering:
- HybridRerank pipeline type display
- Metrics collector: window size 1, cache hit rate, zero result rate,
  low confidence rate, latency percentiles
- Diagnostics output mode defaults
- Warning formatting: minimal (all variants, verifies suppressed),
  verbose (all 9 variants)
- SearchWarning serde roundtrip for all 8 variants

Closes aft-t6p.6.6
Add 4 tests covering:
- Concurrent snapshot clones produce independent results
- Concurrent read threads see identical data via Arc
- Mutex contention across 10 threads does not deadlock
- Arc strong_count tracks clone/drop correctly

Closes aft-t6p.6.7
Add 6 tests covering:
- Trust file atomic write (no tmp files left behind)
- Multiple projects trusted independently
- Untrust is idempotent
- Trust state survives reload (serde roundtrip)
- Nonexistent project path is untrusted (fail-closed)

Closes aft-t6p.6.8
The validate_compatible_rejects_binary_stored_with_cosine_metric test
was missing source_vector_kind: BinaryPacked, causing the first match
block to fail with 'unsupported source→stored vector conversion' instead
of reaching the metric compatibility check.
Zireael added 30 commits June 16, 2026 22:35
…, add lexical query execution

1. Rename rerankResults() function to applyRerank() to avoid collision
   with rerankResults variable (metrics object). This caused
   "rerankResults is not a function" TypeError when --rerank was used.

2. Auto-clone missing repos: when a repo is not found in .bench-cache,
   automatically clone it via git clone --depth 1. Handles both semantic
   repos (from fixtures.json) and lexical repos (opencode-aft, reth).

3. Execute lexical identifier queries: the LEXICAL_QUERIES array from
   search-bench-v2.py was defined but never actually processed. Now runs
   rg, fts5, aft-grep, and all semantic backends for each identifier
   query against its target repos.
- Add "verify" dispatch arm in main.rs to route to handle_verify
- Create aft_verify tool for OpenCode plugin (verify.ts)
- Create aft_verify tool for Pi plugin (verify.ts)
- Register tools in both plugin index files

This wires the existing verify handler (which suggests verification
actions for changed files) into the production NDJSON dispatch path,
making it available to agents through both plugin surfaces.
- Add "semantic_doctor" dispatch arm in main.rs
- Create aft_semantic_doctor tool for OpenCode plugin
- Create aft_semantic_doctor tool for Pi plugin
- Register tools in both plugin index files

This wires the existing semantic_doctor handler (which produces semantic
search health reports) into the production NDJSON dispatch path.
- Add context parameter to all execute functions
- Return Promise<string> instead of Promise<Record<string, unknown>>
- Add error handling for failed responses
- Return text or JSON stringified response

This fixes TypeScript compilation errors that prevented the OpenCode
plugin from building.
- Add "semantic_eval" dispatch arm in main.rs
- Create aft_semantic_eval tool for OpenCode plugin
- Create aft_semantic_eval tool for Pi plugin
- Register tools in both plugin index files

This wires the existing semantic_eval handler (which runs JSONL eval
suites against semantic search) into the production NDJSON dispatch path.
- Add METHODOLOGY.md with binding decisions:
  - No runtime ripgrep oracle for relevance truth
  - Suite-separated aggregation (semantic_nl, identifier_exact/prefix, path_lookup, structural)
  - Strict attempt-row denominators (empty/error/unavailable score zero)
  - Latency decomposition (configure, index, model_load, warm_query, rerank, e2e)
  - Hybrid/rerank paired metrics
  - Mode eligibility per canon query
- Update README.md to reference methodology document
…6p.bench.quick.01)

- Copy canon files from aft-lexical-canon-package to benchmarks/semble/canon/
  - identifier-exact.json (31 queries, all seed)
  - identifier-prefix.json (14 queries, all seed)
  - path-lookup.json (29 queries, all seed)
  - structural.json (10 queries, all seed)
  - unverified-seeds.json (8 queries)
  - repos.json, mode-matrix.json, lexical-canon.schema.json
- Add benchmarks/semble/tools/validate-lexical-canon.ts validator
- Add benchmarks/semble/bench-profiles.ts with smoke/quick/extended/manual-full profiles
- Add benchmarks/semble/canon-loader.ts for suite loading and profile filtering
- Validator passes: 92 canon query rows across 5 files
…ch.quick.02)

- Add benchmarks/semble/bench-cli.ts with:
  - Profile/suite/mode CLI parsing (smoke/quick/extended/manual-full)
  - Legacy flag normalization (--output, --backend both, etc.)
  - Preflight checks for binary, repos, mode availability, FTS5, semantic
  - Preflight table output
  - Strict mode (exit non-zero on unavailable) and degraded mode (emit rows)
- Integrate preflight into pilot.ts main function
- CLI validates unknown profiles/suites/modes with helpful errors
- --backend both,semantic-api correctly parses to 3 backends
- --help shows full flag reference
- Add benchmarks/semble/bench-modes.ts with adapters for:
  - rg (ripgrep baseline)
  - aft-grep (trigram-indexed)
  - fts5_search (FTS5 full-text)
  - fts5_find_symbol_exact (exact symbol lookup)
  - fts5_find_symbol_prefix (prefix symbol lookup)
  - glob (path pattern matching)
  - ast_search (structural AST pattern)
  - semantic_m2v/fe/api (dense embedding search)
  - rrfFusion (hybrid RRF combination)
- Session initializers: initGrepSession, initFts5Session, initSemanticSession
- Each adapter returns ModeAttempt with status, results, latency
- Mode adapters use consistent input/output behavior
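As background for the rrfFusion adapter, reciprocal rank fusion scores each result by summing 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic version, not the code in bench-modes.ts: the signature, the constant k = 60 (a commonly used default, not stated in this PR), and the assumption that each list contains a path at most once are all illustrative.

```typescript
interface FusedResult { path: string; score: number; }

// Generic reciprocal rank fusion over any number of ranked path lists.
function rrfFusion(rankings: string[][], k = 60, limit = 10): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((path, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(path, (scores.get(path) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([path, score]) => ({ path, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Because the score depends only on rank, the lexical and semantic lists do not need comparable raw scores, and an empty list simply contributes nothing.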
….bench.quick.04)

- Add benchmarks/semble/bench-metrics.ts with:
  - QueryAttempt type (schema_version, suite, mode, query_id, status, latency_parts, results, recall/mrr/ndcg)
  - LatencyParts decomposition (configure, index_update, model_load, warmup, query, rerank, end_to_end)
  - createAttempt factory (scores 0 for empty/error/unavailable)
  - aggregateBySuiteMode (denominator = ALL attempted rows, not only non-empty)
  - computeRerankPair (paired pre/post metrics per query)
  - Robust path matching (string/object/undefined relevant entries)
- Add benchmarks/semble/bench-metrics.test.ts with 11 tests covering:
  - Denominator includes empty attempts (4 attempted, 2 ok -> avg over 4)
  - Suite-separated aggregation
  - Unavailable modes visible in aggregation
  - Rerank pairing correctness
  - Path matching for string and object relevant entries
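The denominator rule is easiest to see in code. The following is a reduced sketch, not the contents of bench-metrics.ts: the real QueryAttempt carries more fields (schema_version, query_id, latency_parts, results, ndcg), and the signatures and the `suite/mode` key format here are assumptions.

```typescript
type AttemptStatus = "ok" | "empty" | "error" | "unavailable";

interface QueryAttempt {
  suite: string;
  mode: string;
  status: AttemptStatus;
  recall: number;
  mrr: number;
}

// Non-ok attempts are recorded with zero scores rather than being dropped.
function createAttempt(
  suite: string,
  mode: string,
  status: AttemptStatus,
  scores?: { recall: number; mrr: number },
): QueryAttempt {
  const s = status === "ok" && scores ? scores : { recall: 0, mrr: 0 };
  return { suite, mode, status, recall: s.recall, mrr: s.mrr };
}

interface Aggregate { attempted: number; ok: number; meanRecall: number; meanMrr: number; }

// Means are taken over ALL attempted rows per suite+mode, not only the ok ones.
function aggregateBySuiteMode(attempts: QueryAttempt[]): Map<string, Aggregate> {
  const sums = new Map<string, { attempted: number; ok: number; recall: number; mrr: number }>();
  for (const a of attempts) {
    const key = `${a.suite}/${a.mode}`;
    const s = sums.get(key) ?? { attempted: 0, ok: 0, recall: 0, mrr: 0 };
    s.attempted += 1;
    if (a.status === "ok") s.ok += 1;
    s.recall += a.recall;
    s.mrr += a.mrr;
    sums.set(key, s);
  }
  const out = new Map<string, Aggregate>();
  sums.forEach((s, key) => {
    out.set(key, {
      attempted: s.attempted,
      ok: s.ok,
      meanRecall: s.recall / s.attempted,
      meanMrr: s.mrr / s.attempted,
    });
  });
  return out;
}
```

With four attempts of which two are ok (recall 1.0 and 0.5), the mean recall is 1.5 / 4 = 0.375 rather than 0.75, so a mode cannot look better by failing on hard queries.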
…bench.quick.05)

- Add benchmarks/semble/bench-report.ts with:
  - BenchmarkReport type (schema_version, config, repos, preflight, suites, failures, warnings, summary)
  - writeJsonReport: full JSON report with per-suite attempts and aggregates
  - writeJsonlReport: one attempt per line for streaming
  - writeMarkdownReport: per-suite tables with status counts
  - compareBaseline: threshold-based regression detection by suite+mode
  - ThresholdConfig: recall_drop, mrr_drop, ndcg_drop, empty_rate_increase
  - printComparison: console output of regressions and improvements
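A threshold-based comparison of this kind might be sketched as below. Only the ThresholdConfig field names come from the commit message; the function name `findRegressions`, the `SuiteModeMetrics` shape, the `suite/mode` keys, and the choice to flag a suite+mode that is missing from the current run are assumptions.

```typescript
interface SuiteModeMetrics { recall: number; mrr: number; ndcg: number; empty_rate: number; }

interface ThresholdConfig {
  recall_drop: number;
  mrr_drop: number;
  ndcg_drop: number;
  empty_rate_increase: number;
}

// Both inputs are keyed by "suite/mode"; returns one message per detected regression.
function findRegressions(
  baseline: Record<string, SuiteModeMetrics>,
  current: Record<string, SuiteModeMetrics>,
  thresholds: ThresholdConfig,
): string[] {
  const regressions: string[] = [];
  for (const key of Object.keys(baseline)) {
    const base = baseline[key];
    const cur = current[key];
    if (cur === undefined) {
      // Assumption: a suite+mode that disappeared counts as a regression.
      regressions.push(`${key}: missing from current run`);
      continue;
    }
    if (base.recall - cur.recall > thresholds.recall_drop) regressions.push(`${key}: recall dropped`);
    if (base.mrr - cur.mrr > thresholds.mrr_drop) regressions.push(`${key}: mrr dropped`);
    if (base.ndcg - cur.ndcg > thresholds.ndcg_drop) regressions.push(`${key}: ndcg dropped`);
    if (cur.empty_rate - base.empty_rate > thresholds.empty_rate_increase) {
      regressions.push(`${key}: empty rate increased`);
    }
  }
  return regressions;
}
```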
…t6p.bench.quick.06)

- Add benchmarks/semble/QUICK-BENCHMARK.md with:
  - Quick start commands for smoke/quick/extended profiles
  - Profile table (queries, repetitions, seed rows)
  - Suite table (query type, primary modes)
  - Mode table (command, notes)
  - Metric model (attempt rows, denominator rule, latency decomposition, rerank pairing)
  - Canon files documentation (review status, validation)
  - Command examples (specific suite+mode, rerank, baseline comparison)
  - Limitations (small corpus, seed canon, local variability, feature-gated, decision support only)
  - Extension guide (add repo, promote seed rows)
- Replace rg.results.map(...) as allRelevant with checked-in canon relevance
- Load identifier canon from benchmarks/semble/canon/ for ground truth
- Skip queries with no canon relevance instead of using rg as oracle
- rg remains as baseline contestant, scored against canon truth
- Add benchmarks/semble/model-discovery.ts:
  - discoverModels(): queries /v1/models, probes /v1/embeddings and /v1/rerank
  - Classifies models as embedding (with dim), reranker, chat, or unknown
  - selectBestEmbeddingModel/selectBestRerankerModel helpers
  - formatDiscoveredModels() for terminal display
- Integrate discovery into pilot.ts main function:
  - Auto-discovers models from --semantic-api-url endpoint
  - Auto-detects embedding model name if not specified
  - Discovers reranker models from --rerank-url endpoint
  - Displays discovered models in terminal header instead of hardcoded names
- Tested against live localhost:8090 endpoint:
  - CodeRankEmbed (dim=768), GTE-Reranker-Modernbert (dim=768), OASIS (dim=1536)
  - CodeRankLLM.Q4_K_M (reranker), 5 chat/LLM models
- QUICK-BENCHMARK.md: expanded Modes table with AFT Tool and Rust Command columns
- pilot.ts printHeader: add mode mapping reference table to terminal output
- bench-report.ts: add MODE_MAPPING constant and mode_mapping field to JSON report schema
- bench-report.ts: add mode mapping table to Markdown report output
- Users can now see how each benchmark mode relates to the AFT tool used in OpenCode/Pi
- Fix repo name: opencode-aft → aft (https://github.com/cortexkit/aft)
  - repos.json: URL and name corrected
  - unverified-seeds.json: repo_name and IDs updated
  - pilot.ts: LEXICAL_QUERIES and LEXICAL_REPOS updated
- Add --interactive flag for model selection:
  - Discovers models from API endpoints
  - Prompts user to select embedding and reranker models
  - Shows model dimensions and types
  - Confirms selection before proceeding
  - Integrates with existing auto-detection as fallback
- Skip legacy detectModel() call when --interactive flag is set
- Interactive discovery handles model selection instead
- Updated warning message to suggest --interactive as alternative
- Reranker probe now runs before embedding probe
- GTE-Reranker-Modernbert correctly classified as reranker (was misclassified as embedding)
- Reranker endpoint is more specific than embedding endpoint
- When --semantic-api-model and --rerank-model are both specified:
  - Skip full model discovery (avoids probing/unloading other models from GPU)
  - Only verify specified models work
  - Re-probe desired models after any discovery to reload into GPU
- Add verifySpecificModels() for targeted model verification
- Add ensureModelLoaded() to re-probe and reload models
- Show m2v and fe model names in terminal header (not just hardcoded defaults)
- Updated printHeader to accept m2vModel and feModel parameters
- Hybrid: allow semantic-only when FTS5 returns empty (previously required both)
  - Before: fts5Results.length > 0 && semResults.length > 0
  - After: semResults.length > 0 (FTS5 optional)
- Rerank: add fallback path resolution for Windows UNC paths
  - Try normalized path first, then raw path as fallback
  - Include file read time in total latency measurement
  - Add verbose warning when documents are short (path strings, not content)
- rerankerModel → rerankModel (variable name, not property)
- interactiveResult.rerankerModel stays as-is (property from model-discovery)
- When user passes --rerank-url http://localhost:8090 (no path)
  - Before: sends to http://localhost:8090 → 404
  - After: normalizes to http://localhost:8090/v1/rerank
- Also handles /rerank without /v1/ prefix
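The normalization described above can be approximated with plain string handling. This is a sketch under assumptions: the function name `normalizeRerankUrl` is hypothetical, and the handling of trailing slashes and of a base URL that already ends in /v1 is not stated in the commit message.

```typescript
// Normalize a user-supplied reranker URL so it points at the /v1/rerank endpoint.
function normalizeRerankUrl(raw: string): string {
  // Drop surrounding whitespace and trailing slashes (assumed behavior).
  const trimmed = raw.trim().replace(/\/+$/, "");

  if (trimmed.endsWith("/v1/rerank")) return trimmed;

  // "/rerank" without the "/v1/" prefix.
  if (trimmed.endsWith("/rerank")) {
    return trimmed.slice(0, -"/rerank".length) + "/v1/rerank";
  }

  // Base URL that already ends in "/v1" (assumed case, avoids "/v1/v1/rerank").
  if (trimmed.endsWith("/v1")) return trimmed + "/rerank";

  // Bare host such as http://localhost:8090.
  return trimmed + "/v1/rerank";
}
```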
- Use snippet field from semantic search results instead of reading
  entire files (20KB → ~500 chars per document). Fixes slow reranking
  (18s → <1s) and improves quality.
- Falls back to line-range extraction when snippet unavailable.
- Auto-apply CodeRankEmbed query prefix ('Represent this query for
  searching relevant code:') when model name matches.
- Add --query-prompt flag for manual prompt template override.
- Map snippet/start_line/end_line from semantic search results.
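The query-prefix behavior might be sketched as follows. The prefix string is the one quoted above; the function name `buildQueryText`, the case-insensitive substring match on the model name, the single space between prefix and query, and treating --query-prompt as a plain prefix rather than a template with placeholders are all assumptions.

```typescript
const CODERANKEMBED_QUERY_PREFIX = "Represent this query for searching relevant code:";

// Build the text that is sent to the embedding endpoint for a query.
function buildQueryText(query: string, modelName: string, queryPrompt?: string): string {
  // A manual override (e.g. from --query-prompt) takes precedence over auto-detection.
  if (queryPrompt !== undefined) {
    return queryPrompt.length > 0 ? `${queryPrompt} ${query}` : query;
  }
  // Auto-apply the CodeRankEmbed prefix when the model name matches.
  if (modelName.toLowerCase().includes("coderankembed")) {
    return `${CODERANKEMBED_QUERY_PREFIX} ${query}`;
  }
  // Other models receive the raw query.
  return query;
}
```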
Add approximate token counting and per-query logging of chunk sizes for
semantic search results and reranker documents. The log shows:
- Number of chunks, average/max tokens, count over 2048
- Helps diagnose context window overflows and quality issues
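A rough version of these statistics can be computed as below. The commit does not say how tokens are approximated, so the characters-divided-by-four heuristic is an assumption, as are the names `approxTokens` and `chunkStats`; only the 2048 limit comes from the text above.

```typescript
interface ChunkStats {
  chunks: number;
  avgTokens: number;
  maxTokens: number;
  overLimit: number; // chunks whose approximate token count exceeds the limit
}

// Assumed heuristic: roughly four characters per token.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function chunkStats(texts: string[], limit = 2048): ChunkStats {
  const counts = texts.map(approxTokens);
  const total = counts.reduce((sum, n) => sum + n, 0);
  return {
    chunks: counts.length,
    avgTokens: counts.length === 0 ? 0 : total / counts.length,
    maxTokens: counts.reduce((max, n) => Math.max(max, n), 0),
    overLimit: counts.filter((n) => n > limit).length,
  };
}
```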
The reranker was reading arbitrary line ranges when snippet was missing,
which is not a semantic unit. AFT's semantic index already chunks code
into tree-sitter symbols (functions, structs, classes, etc.) and returns
them in the snippet field. The line range fallback was incorrect.

Now the benchmark only uses:
1. snippet field from AFT results (tree-sitter-based code blocks)
2. File path as label when snippet is missing

This ensures the reranker gets actual code snippets, not arbitrary
line ranges.
Short snippets are now treated as valid small symbols rather than path
strings. The warning now triggers only when documents are file paths
(no snippet available), which is the actual problem.
Default oversampling increased from k*5 to k*10 to give reranker more
candidates to reorder. Use --oversample <n> to control the multiplier.

Examples:
  --oversample 5   (previous default)
  --oversample 10  (new default)
  --oversample 20  (aggressive oversampling)
- Add --rerank-instruction flag for reranker instruction prompt
- Update QUICK-BENCHMARK.md with:
  - Complete CLI reference table
  - Semantic API model discovery section
  - CodeRankEmbed prompt template docs
  - Reranker instruction docs
  - Oversampling documentation
  - Updated commands with new flag examples
- Add PLAN-snippet-enrichment.md for configurable snippet enrichment
- Fix reranker to read symbol line ranges when snippet is missing
- Log document source breakdown (snippets/line-ranges/paths)
- Add SemanticDoctor and SemanticEval to architecture layers
- Update tool group locations for new tool files
- Update analysis engine layer with new modules