GitHub - mrsladoje/sweet-search: Local code search for AI agents: six fast, purpose-built tools that return ranked answers, not raw grep. Because maybe grep isn't all you need... 🍬

Local code search for AI coding agents. Six fast, purpose-built tools that hand Claude Code, Codex & friends ranked answers, not raw grep. Zero API keys, 100% on-device.

Maybe grep isn't all you need… 🍬
Every coding agent today reaches for grep + Read by reflex. sweet-search challenges that reflex. 😎

✨ Highlights

Hybrid retrieval — one of the six tools uses BM25F lexical + dense semantic + structural graph signals, fused per query and reranked by late-interaction
Agent-native by design — token-budgeted output tiers, an optional MCP server (and default zero-overhead CLI), and a GEPA-evolved system prompt installed into Claude Code, Codex, Gemini CLI, and Cursor with one command
Indexed grep, ~10× faster than ripgrep — a sparse n-gram prefilter skips the files that provably can't match
ColBERT-style reranking, locally — per-token MaxSim late interaction on hand-written SIMD kernels
GPU-accelerated indexing — Apple Metal, CUDA, CoreML Neural Engine, or plain CPU via ORT; same engine, auto-selected
Never stale — incremental indexing keeps the index aligned with your working tree, uncommitted edits included
No storage hassle — indexed artifacts maximally optimized without any accuracy tradeoff; up to INT4 quantization
Local-first — all models run on-device; nothing is sent anywhere, ever. CPU-inference supported for all models

📚 Table of Contents

GET STARTED

🚀 Quickstart
_{three commands to a searchable repo}

🖥️ Platform Support
_{macOS · Linux · WASM fallback}

USE IT

🧰 The Six Tools
_{search · grep · find · semantic · trace · read}

🧠 The Evolved Agent Prompt
_{GEPA-optimized search discipline}

🔌 Works With Your Agent
_{MCP · Claude Code · Codex · Gemini · Cursor}

UNDER THE HOOD

⚡ GPU-Accelerated Indexing
_{candle · fused kernels · cAST chunking}

🔄 An Index That Never Goes Stale
_{reconcile daemon tracks your working tree}

🦀 The Native Engine Room
_{four Rust crates + INT4 LI compression}

THE RECEIPTS

📊 Benchmarks
_{agent cost savings · engine speed · full-corpus MRR}

🧭 Where sweet-search Fits
_{honest wins & trade-offs vs peers}

🙏 Prior Art & Acknowledgements
_{the shoulders we stand on}

📄 License
_{Apache-2.0}

🚀 Quickstart

npm install -g sweet-search

cd your-repo
sweet-search init     # one-time: downloads local models, wires up your agent
sweet-search index    # builds the index — GPU-accelerated where available

sweet-search "where do we validate JWT tokens?"

That's it. init is idempotent and SHA256-verifies every model binary; re-running it is always safe. From then on the index maintains itself — edit, save, search.
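To make the SHA256 check concrete, here is a minimal Python sketch of verifying a downloaded file against a pinned digest. The function names, the chunk size, and the demo file are hypothetical; this is not sweet-search's installer code, only an illustration of the idea.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a large binary never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: str, expected_hex: str) -> bool:
    """Return True only when the file exists and its digest matches the pinned value."""
    return os.path.isfile(path) and sha256_of(path) == expected_hex.lower()


def demo_verify() -> tuple:
    """Write a throwaway file, then check it against a correct and a wrong digest."""
    payload = b"pretend model weights"  # hypothetical stand-in for a model binary
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        good = hashlib.sha256(payload).hexdigest()
        return verify_model(path, good), verify_model(path, "0" * 64)
    finally:
        os.remove(path)
```

A check like this is one way a setup step can be safe to re-run: a file whose digest already matches can be skipped, and anything else is fetched again.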

Setup options & details

sweet-search init --wizard          # interactive: shows your hardware, recommends a model tier
sweet-search init --profile core    # lexical-only, no model downloads (CI-friendly)
sweet-search init --li-model edge   # compact late-interaction model for constrained machines
sweet-search uninstall              # clean removal: models, caches, config — never your code

Requirements: Node ≥ 18. macOS (arm64/x64) and Linux (x64/arm64) ship native binaries; other platforms fall back to WASM/JS automatically.
Footprint: CPU-only hosts download a few hundred MB of INT8 models; GPU hosts add ~1.2 GB of FP32 backbones (skipped automatically where they'd be useless); M3+ Macs can additionally fetch a ~3.2 GB CoreML cascade for Neural Engine acceleration. Everything lands in ~/.cache/sweet-search/models/ and is used strictly on-device.
Agent wiring: init injects the tool-routing system prompt into CLAUDE.md (and AGENTS.md, GEMINI.md, Cursor rules via flags), registers a session-start prewarm hook so your first query hits a warm daemon, and installs a /sweet-index skill in Claude Code.
What gets indexed: the files you'd expect. The indexer respects .gitignore, excludes node_modules, build directories, and minified artifacts, skips files over 1 MB, and reads extra rules from a .sweet-search-ignore file.

📊 Benchmarks

We measure sweet-search four ways — from how much it helps a real agent down to raw engine throughput:

🤖 ① Code-retrieval (agent-in-the-loop) _{Does it make a real coding agent cheaper and more useful when it searches your repo? Paired against each model's own grep-and-read loop.}
🚧 ② Task-completion (coming soon) _{Does cheaper, denser context compound into a higher resolve-rate on multi-step engineering tasks? Harness in progress.}
📄 ③ Paper-type IR (academic) _{The standard NL→code retrieval suites (GCSN, M2CRB, CoSQA…), full-corpus MRR@10.}
⚡ ④ Engine speed _{Raw systems numbers: grep throughput, query latency, rerank kernels, HNSW.}

🤖 1. Code-retrieval benchmarks — the agent-in-the-loop test

We install the evolved agent prompt (the GEPA-evolved search discipline), point a coding agent at a real repo, and pair it probe-for-probe against the same model running its own native grep-and-read loop. Same model, same tasks, same judge — the only difference is whether sweet-search is wired in.

_{top-of-range figures · full per-harness ranges in the dropdown · 11 model×harness cells, paired, multiplicity-controlled}

The headline, in four claims:

💰 Cheaper where the agent thrashes — up to −34% realized cost on Codex; −18 to −32% across the GPT-5.5 / opencode / bare-API harnesses.
🔧 Fewer round-trips — up to −56% tool calls, significant on 9 of 11 cells.
✨ More useful per response — +0.18 to +0.31 on a 5-dimension usefulness score, and still denser when length-matched (significant on 8 of 11 cells).
🎯 Accuracy held — and lifted on the weak — a statistical tie on flagship models (saturated at 0.94–0.99), and +3 pp (up to +8 pp out-of-distribution) on weaker models like GLM-5.1 and DeepSeek.

📋 Full per-harness results & how it's measured

The win is harness-adaptive: where the native loop is disciplined (Claude Code) it shows up as denser, more useful context per token; where it thrashes (Codex floods 30k+ tokens of its own grep output into context) it shows up as a large cost and tool-call cut. Either way, final-answer accuracy never significantly regresses.

🧰 Native agent harness	💰 Realized cost	🔧 Tool calls	✨ Useful content / response	🎯 Final accuracy
🤖 Codex (GPT-5.5)	−30 to −34%	−44 to −56%	+0.06 → +0.17 ↑	tie (saturated)
🐚 opencode (GPT-5.5 / GLM-5.1)	−18 to −22%	−15 to −49%	+0.23 to +0.31 ↑	tie
🔌 bare API (GPT-5.5 / GLM / DeepSeek)	−15 to −32% ᵃ	−15 to −33%	+0.08 to +0.24 ↑	tie · +3 pp on weak models
🟣 Claude Code (Sonnet / Opus)	−10% to +14% ᵇ	−5 to −33%	+0.18 to +0.29 ↑	tie

_{↑ "Useful content / response" is the per-response delta on a 5-dimension usefulness score (answer-grounding · workable-code · navigability · edit-locality · sufficiency), 0–1 scale. "tie" = final-answer correctness statistically indistinguishable (saturated in the 0.94–0.99 band on flagships).
ᵃ the two cheapest bare models cost fractions of a cent either way (GLM +27% of $0.008; DeepSeek −15% of $0.004). ᵇ Opus −5/−10%; Sonnet +8–14%, which is ≈1¢ on a flat-rate subscription for a richer answer.}

Denser, not just longer. The usefulness lift survives length-matching: when sweet-search and native responses of equal token length are compared, sweet-search's content score is significantly higher on 8 of 11 cells. The validated single-number usefulness composite (grounding × content × density) is significant on all 11 sealed cells.

What's being compared: the installed sweet-search agent prompt + tools vs. the same model using only its built-in file-reading and shell-grep tools. Not a different model — the same model, with and without sweet-search.
Design: 11 model×harness cells. Sealed vault (n=60/arm, the pre-registered primary) opened once; plus held-out (n=30) and out-of-distribution (n=40) sets for generalization. Stratified, fixed-seed splits.
Judging: 3-judge panel (DeepSeek-V4-flash + Gemini-3.1-flash-lite + MiniMax-M2.7), paired by probe, 20k-sample bootstrap CIs, Benjamini–Hochberg FDR multiplicity correction across each metric family. We report family-level survival counts, never a single cherry-picked cell.
What survives FDR (vault): useful-content 10/11, density-composite 11/11, length-matched content 8/11, fewer-tool-calls 9/11. Generalization (held-out + OOD): content 17–18/20, fewer calls 14/20.
The token fact that drives everything: sweet-search's footprint is nearly constant (~1.3k–3.3k tokens) because the tool responses are capped; native's footprint is whatever the model decides to grep — up to 37k tokens on Codex. That single fact is what drives the cost and tool-call gaps.
Honest caveats we keep attached: (1) Accuracy ties on flagship models. That is saturation, not an accuracy win; the accuracy gains are real only on weaker models. (2) The two weakest cells for length-matched density (Codex-low, DeepSeek) have the correct sign but are underpowered: Codex's responses diverge so much in token length that too few equal-length pairs exist to reach significance, and DeepSeek is simply underpowered. We count both as non-wins.
Full methodology and per-cell tables: docs/PHASE7.md.
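For readers unfamiliar with the multiplicity correction named above, the Benjamini–Hochberg step-up procedure can be sketched in a few lines of Python. This is a generic textbook version, not the project's evaluation code; the function name and the `q=0.05` default are assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True where it survives FDR control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k whose sorted p-value
    # satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every hypothesis ranked at or below the cutoff survives, even if its own
    # p-value missed its own threshold.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[idx] = True
    return survives
```

Applied to one metric family (say, eleven per-cell p-values for "fewer tool calls"), `sum(benjamini_hochberg(p_values))` would give a family-level survival count of the kind reported above.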

🚧 2. Task-completion benchmarks — coming soon

Retrieval quality is necessary but not sufficient. Cheaper, denser context only matters if it compounds across a real, multi-step engineering task — finding the code, understanding it, changing it, and not breaking anything. The next suite measures exactly that: resolve-rate on SWE-bench-style multi-file tasks, sweet-search-wired vs. native, on the same paired, multiplicity-controlled bar as above. Harness and pilot are in progress — numbers land here when they clear that bar, and not before.

📄 3. Paper-type retrieval benchmarks — academic NL→code IR

Every number below is the ss-search pipeline end-to-end — the same binary you install — run against the full benchmark corpus (no 99-distractor shortcuts), zero-shot (we never fine-tune on these tasks). Where a benchmark's queries are docstrings, we strip the docstring out of the indexed code so the query can't trivially match itself — the standard retrieval protocol.

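As of June 2026, we hold the best published result we can find on 3 of the 4 benchmarks we attempted (zero-shot only, in CoSQA's case), and under a harder setting (ranking against the full pool) than most published runs use!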

📚 Benchmark	🔍 What it tests	# Queries	📂 Pool	🎯 MRR@10	🏆 SOTA?
🌐 GenCodeSearchNet	NL→code, 6 languages	6,000	full 6,000	86.6	YES ✅
🐍 CoSQA	web queries → Python	500	full 6,267	65.5	✅ (zero-shot)
🗺️ M2CRB	multilingual NL→code (ES/PT/DE/FR → Py/Java/JS)	5,795	full 5,795	54.0	YES ✅
🛡️ AdvTest	adversarial, identifier-obfuscated Python	19,210	full 19,210	51.4	NO ❌

_{SOTA = best result we can find in the published literature as of June 2026; cross-metric/protocol comparisons are spelled out per benchmark below.}
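As a reference for how the metric in the table is computed, here is a small Python sketch of MRR@10 under the assumption of exactly one gold document per query. The function and argument names are hypothetical. The table appears to report the value multiplied by 100 (86.6 for 0.866).

```python
def mrr_at_k(ranked_lists, gold_ids, k=10):
    """Mean reciprocal rank of the gold document, counting only the top k positions."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        for position, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold:
                total += 1.0 / position  # gold at rank 1 scores 1.0, at rank 2 scores 0.5
                break
        # A gold document outside the top k contributes 0 for that query.
    return total / len(gold_ids)
```

In the full-pool setting, each entry of `ranked_lists` is cut from a ranking over the entire corpus; in the 99-distractor setting it is cut from a ranking over 100 candidates, which is why the same model scores higher there.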

🌐 GenCodeSearchNet → `86.6` · 🏆 SOTA in June 2026

The BEST PUBLISHED number we can find, anywhere
The benchmark's own paper caps at MRR ≤ 0.42 for fine-tuned baselines (≤ 0.10 cross-lingual). Zero-shot OpenAI Ada-2 reaches 0.79–0.94, but all of those numbers are measured against a small 99-distractor pool.
We score 0.866 (the 86.6 in the table, on a 0–100 scale) against the entire 6,000-document corpus, a strictly harder setting, and zero-shot. 🔥

🐍 CoSQA → `65.5` · 🥇 Zero-shot SOTA in June 2026

Beats EVERY PUBLISHED zero-shot model
Canonical setup: 500 real web queries → the fixed 6,267-code database, no fine-tuning.
Clears the strongest zero-shot results out there — CodeSage-Large 47.5 · OpenAI text-embedding-3-large 55.4 · OASIS 55.8 — and goes toe-to-toe with fine-tuned CodeBERT / GraphCodeBERT (64.7 / 67.5). 💪
_{CoSQA has known label noise, so we read the absolute height with a pinch of salt.}

🗺️ M2CRB → `54.0` · 🏆 SOTA in June 2026

The BEST PUBLISHED number we can find, anywhere, and zero-shot
🇪🇸 Spanish · 🇵🇹 Portuguese · 🇩🇪 German · 🇫🇷 French → Python / Java / JavaScript.
The paper's best — a CodeBERT fine-tuned on the task — reaches 52.7 auMRRc, a metric that averages over easier, smaller pools (so auMRRc ≥ full-pool MRR for any model). Our 54.0 is full-pool MRR@10 over all 5,795 functions in one pool — a strictly harder measure, cleared with no fine-tuning. 🔥

🛡️ AdvTest → `51.4` · 🧪 our honest worst case — and we publish it anyway

Adversarial obfuscation (def Func(arg_0):) deletes the lexical + graph signals our hybrid feeds on — yet we still beat the classic fine-tuned baselines (CodeBERT 27 · GraphCodeBERT 35 · UniXcoder 41), and our stack still lifts our own encoder ~3pp even here.
🔍 Full transparency: we could not reproduce the often-cited 59.5 for the bare CodeRankEmbed encoder — the reference FP32 model scores 54.7 on our leak-free corpus, our shipped INT8 build 51.4. The gap is stricter preprocessing + INT8 quantization, not the retrieval pipeline. We report exactly what we measured.

Methodology, protocol & honesty notes

Reproduction: result artifacts live in eval/results/; rerun via eval/run_all.js. The canonical full-pool loaders are in eval/download_data.py.
Full corpus, not distractors. Published baselines for GCSN- and CoSQA-style benchmarks typically rank the gold against 99 sampled distractors; every number here ranks against the benchmark's full corpus (6k–19k candidates) — strictly harder.
Zero-shot + docstring-stripped. We never fine-tune on these tasks. For docstring-derived benchmarks (AdvTest, M2CRB) we strip the docstring from the indexed code — otherwise the NL query matches itself verbatim (a no-strip AdvTest run scores a meaningless 0.98). This is the standard protocol; it is also why our AdvTest is lower than naïve setups that leave the docstring in.
What we deliberately don't claim yet. CoIR (official metric NDCG@10 over per-subtask corpora up to ~1M docs), CoSQA+ (multi-positive, MAP-primary), and CLARC (per-group pools) use protocols and metrics our single-pool MRR@10 harness doesn't currently match. Rather than publish apples-to-oranges numbers, we omit them; faithful per-subtask CoIR (NDCG@10) runs are queued.
M2CRB — the paper's metric is auMRRc (area under the MRR-vs-pool-size curve; best published 52.7, fine-tuned). Because that area averages over easier small pools, auMRRc ≥ full-pool MRR for any model — so our 54.0 full-pool MRR@10 (all 5,795 functions, zero-shot) clears their best on a strictly harder measure. No one publishes a plain full-corpus MRR@10 on M2CRB, so ours is the best available.
AdvTest honesty note. We could not reproduce the commonly-cited 59.5 for the bare CodeRankEmbed encoder on our corpus: the reference FP32 model scores 54.7 on our leak-free, docstring-stripped, full-19,210 setup, and our shipped INT8 build 51.4. We report our measured numbers and the reference check rather than the leaderboard figure.
Honesty corner: CrossCodeEval (cross-file completion-context retrieval, a different task from NL search) sits at 0.12. We don't optimize for it and report it anyway.

⚡ 4. Engine speed — systems benchmarks, measured in-repo

10.2× faster than ripgrep at the median · 2.9 ms warm queries · 47× faster MaxSim kernels · −33% HNSW search p50

⚙️ What	📈 Result	📄 Source
⚡ Indexed grep vs ripgrep	10.2× faster at the median (8.5–17.7× across 5 repos, 353 realistic queries, 1 ms p50 — identical match counts on every query)	`docs/GREP_INDEXING_STRATEGY.md`
⏱️ Warm query latency (native CLI)	2.9 ms warm · 108 ms cold	`docs/INIT_STRATEGY.md`
🧮 MaxSim rerank kernels	1.26 s → 27 ms for a 231-candidate pass (47× native Rust; 16× WASM SIMD)	`docs/MAXSIM_OPTIMIZATION.md`
🧠 HNSW tuning for code	−33% search p50, +5.9 pp recall@200	`docs/HNSW_APPROACH.md`
💾 Indexing memory	peak JS heap 785 MB → 213 MB	`docs/DISK_FLUSHING_STRATEGY.md`
🍏 CoreML cascade (M3 Max)	18% faster full indexing vs the Metal baseline	`docs/INIT_STRATEGY.md`

🧭 Where sweet-search Fits

Code search is a crowded space. Here's an honest read on where sweet-search wins and where it gives ground, against the trending leaders and our closest local peers.

Capability	sweet-search	claude-context	Cursor index	codebase-memory	SocratiCode
100% local — code never leaves your machine	✅	⚠️¹	❌	✅	✅
Works with zero API keys	✅	❌	❌	✅	✅
No external service to run (vector DB · Ollama · Docker)	✅	❌ Milvus	❌ cloud	✅	⚠️⁵
ColBERT late-interaction rerank	✅	❌	❌	❌	❌
Faster-than-ripgrep exact grep	✅	❌	✅⁷	❌	❌
Call-graph trace (callers · callees · impact)	✅	❌	❌	✅	✅
Drives any terminal agent (Claude Code · Codex · Gemini CLI)	✅	✅	❌²	✅	✅
Published NL→code retrieval benchmarks	✅	⚠️³	❌	⚠️³	⚠️³
…and where sweet-search gives ground
Native Windows	❌⁴	✅	✅	✅	⚠️⁸
Deep-AST language coverage	⚠️ 14 (+70 via regex)	⚠️	⚠️	✅ 158	⚠️
In-editor GUI · writes & edits code	❌	❌	✅	❌	❌⁶
Org-wide, multi-repo scale	❌	⚠️	⚠️	⚠️	✅

_{✅ yes · ⚠️ partial / with caveats · ❌ no. Verified June 2026; capabilities drift.

¹ claude-context can run local via Milvus Lite + Ollama, but defaults to OpenAI/Voyage embeddings + Zilliz Cloud. ² Cursor's index is editor-locked — external terminal agents can't query it. ³ Reports token-reduction / efficiency, not a public NL→code retrieval-quality leaderboard. ⁴ Runs on Windows via WSL2. ⁵ SocratiCode manages a bundled Qdrant for you, but uses an auto-detected Ollama for local embeddings. ⁶ Ships an interactive HTML graph viewer, but doesn't edit code. ⁷ Cursor's local Instant Grep — a literal + regex index it benchmarks at ripgrep 16.8 s → 13 ms (the post that inspired our own n-gram prefilter). ⁸ SocratiCode runs on Windows via Docker only — no native binary, and no GPU there.}

Where we lose, plainly: no native Windows yet, no editor GUI, and we index one repo at a time. If you need org-wide search across many repos and branches, that's where SocratiCode and Sourcegraph are built to win. If you live inside one editor, Cursor's index is already there. sweet-search is for the terminal agent that wants the best local retrieval on the repo in front of it. No one else combines all of it: ColBERT late-interaction reranking and faster-than-grep search, fully on-device, with nothing to sign up for.

_{Also in the space: Sourcegraph/Cody (org-scale, server-based), Continue.dev (local-default RAG), Serena (LSP symbol search, no embeddings), grepai (local CLI + trace), and cocoindex-code (embedded AST search).}

🧰 The Six Tools

Six small tools, one shared index. Each returns ranked, deduplicated, token-budgeted output designed to be consumed by an agent — a useful answer, not a wall of matches to scroll through.

Tool	What you give it	What you get back
1. `ss-search`	a natural-language query	ranked, self-contained code blocks
2. `ss-grep`	an exact regex/literal	every `file:line` hit, ripgrep-identical
3. `ss-find`	a regex + a query	regex matches, semantically re-ranked, as code blocks
4. `ss-semantic`	a file + a question	just the relevant spans of that file
5. `ss-trace`	a symbol	callers + callees + impact, in one call
6. `ss-read`	a file (± line range)	exact bytes + symbol metadata

1. 🔍 `ss-search` — hybrid search powerhouse

A hybrid search pipeline with late-interaction reranking that returns actual code blocks.

Leading published-benchmark results — strongest we can find on GenCodeSearchNet, and above every published zero-shot model on CoSQA. See benchmarks.

flowchart TD
    Q(["🔍  natural-language query"]) --> ROUTE{{"🧭 WASM CatBoost router · lexical / hybrid"}}

    ROUTE --> BM["📑 <b>BM25F</b><br/>field-weighted FTS5"]
    ROUTE --> ANN

    subgraph ANN ["🧬 three-stage ANN cascade"]
        direction LR
        BIN["binary <b>HNSW</b><br/>Hamming · ~100µs"] --> INT["INT8<br/>rescore"] --> FL["float32<br/>mmap sidecar"]
    end

    BM  --> FUSE
    ANN --> FUSE
    FUSE["🔀 <b>CCFusion</b><br/>convex combo · RRF fallback"] --> ROW1

    subgraph ROW1 [" "]
        direction LR
        IAR["⚓ <b>IAR</b><br/>exact-symbol injection"] --> INTENT["🎯 intent rerank<br/>demote docs · tests · config"]
    end

    ROW1 --> ROW2

    subgraph ROW2 [" "]
        direction LR
        GRAPH["🕸️ graph expansion<br/>typed edges · 1–2 hops · <b>PathRAG</b>"] --> MAXSIM["🧮 <b>Late-Interaction Rerank</b><br/>⚡ native Rust MaxSim kernel"] --> OUT(["🏁 <b>self-contained code blocks</b><br/>whole functions · 3k/8k/12k budget"])
    end

    classDef io    fill:#fde68a,stroke:#f59e0b,color:#000;
    classDef out   fill:#bbf7d0,stroke:#15803d,color:#000,stroke-width:3px;
    classDef route fill:#e0e7ff,stroke:#818cf8,color:#000;
    classDef lex   fill:#dbeafe,stroke:#60a5fa,color:#000;
    classDef fuse  fill:#f3e8ff,stroke:#c084fc,color:#000;
    classDef rank  fill:#ffe4e6,stroke:#fb7185,color:#000;

    class Q io;
    class OUT out;
    class ROUTE route;
    class BM,BIN,INT,FL lex;
    class FUSE,IAR fuse;
    class INTENT,GRAPH,MAXSIM rank;

    style ANN  fill:#eff6ff,stroke:#93c5fd,color:#000;
    style ROW1 fill:none,stroke:none;
    style ROW2 fill:none,stroke:none;

_{↑ The diagram traces the hybrid route. A pure-lexical query — or a literal file path — short-circuits at the router straight to BM25F, skipping the vector cascade and fusion.}

Stage	What it actually does
🧭 Route	WASM-exported CatBoost · lexical / hybrid · ~10 µs routing · low-confidence → max-recall hybrid
🧬 Retrieve	• Lexical — BM25F over field-weighted FTS5 (name 10× · signature 5× · alias 4× · doc 1×) • Embed — query vectorized by the local CodeRankEmbed model (swappable for Voyage / Jina / Codestral) • Vector cascade — binary HNSW (Hamming, 64-byte, ~100 µs) → INT8 rescore → exact float32 from a memory-mapped sidecar
🔀 Fuse	• CCFusion — convex-combine both rankings · per-route weights · quantile-normalized • MMR (λ=0.9) diversity pass over the fused list • auto RRF (k=60) fallback on degenerate score distributions
⚓ Anchor	• IAR (Identifier Anchor Retrieval) — a real symbol in the query fires an exact-name code-graph lookup that injects that entity, even when the encoder ranked it too low
🎯 Intent Rerank	• demote docs / tests / config when you want implementation • log-scaled call-site boosts surface the most-referenced function
🕸️ Graph Expansion	• typed-edge walks (`imports`/`extends`/`calls`/`uses`) · adaptive 2-hop on the AST graph · edges picked by intent • PathRAG flow pruning + degree normalization → hubs can't dominate
🧮 Late interaction Rerank	• Query embedded per-token by LateOn-Code (149M; a 17M edge variant auto-selected on low-RAM hosts) • MaxSim against the pre-indexed quantized token vectors • native Rust+Rayon MaxSim kernel ⚡ · WASM-SIMD fallback (1.26 s → 27 ms on a 231-candidate rerank)
📦 Package	• entity-aware expansion → whole functions (imports, docstrings, decorators) • same-file overlap demotion → diverse, non-overlapping spans • auto-selected 3k / 8k / 12k token budget
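The Fuse stage in the table can be illustrated with a short Python sketch: quantile-normalize each score list, take a convex combination, and fall back to Reciprocal Rank Fusion when a score distribution is degenerate. This is a sketch under stated assumptions, not CCFusion itself: the `alpha` weight, the "all scores equal" degeneracy test, and every function name are hypothetical, and the MMR diversity pass is left out.

```python
import bisect


def quantile_normalize(scores):
    """Map each raw score to its empirical quantile in (0, 1]."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {doc: bisect.bisect_right(ordered, s) / n for doc, s in scores.items()}


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused


def is_degenerate(scores):
    """Assumed test: fewer than two scores, or no spread between them."""
    values = list(scores.values())
    return len(values) < 2 or max(values) == min(values)


def fuse(lexical, dense, alpha=0.5, k=60):
    """Fuse two {doc: score} maps and return doc ids, best first."""
    if is_degenerate(lexical) or is_degenerate(dense):
        # Scores carry no usable signal, so fuse on ranks only.
        rankings = [sorted(s, key=s.get, reverse=True) for s in (lexical, dense)]
        fused = rrf(rankings, k)
    else:
        lex_q, den_q = quantile_normalize(lexical), quantile_normalize(dense)
        docs = set(lex_q) | set(den_q)
        fused = {
            d: alpha * lex_q.get(d, 0.0) + (1.0 - alpha) * den_q.get(d, 0.0)
            for d in docs
        }
    return sorted(fused, key=lambda d: (-fused[d], d))
```

Quantile normalization puts BM25-style scores and cosine-style scores on the same (0, 1] scale before they are mixed, which is what makes a single convex weight meaningful across two very different scoring functions.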

🌶️ Extra spice — the bits that didn't fit the diagram

🧠 The HNSW, in full (full writeup). Stage 1 is a from-scratch binary HNSW, and every "advanced" trick ships on by default:

Heuristic neighbor selection (HNSW Algorithm 4) + M0 = 2M on layer 0 — a real graph backbone, not naïve closest-M
Shuffled insertion order — no filesystem-ordering bias baked into the highway structure
Discovery-rate adaptive early termination + adaptive ef — easy queries stop early, hard ones keep their budget
A denser graph than most vendors ship (M=64 · efC=800 · efS=400) — which broke an 80.6 % → 86.5 % recall@200 plateau and cut p50 latency ~33 %
Zero-GC search: typed-array heaps + generation-stamped visited lists — no per-query allocation
64-byte sign-bit vectors (Hamming) → INT8 → exact float32 from a memory-mapped sidecar
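The coarse-to-exact idea in the last bullet can be sketched in Python as follows. This sketch swaps the HNSW graph for a brute-force scan and skips the INT8 middle stage, so it only illustrates the two ends of the cascade: Hamming distance on sign bits to build a shortlist, then exact float scoring. All names are hypothetical.

```python
def sign_bits(vec):
    """Pack one sign bit per dimension into an int (512 dimensions would fill 64 bytes)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cascade_search(query, corpus, shortlist=4, top_k=1):
    """Shortlist by Hamming distance, then rescore the shortlist with exact dot products."""
    q_bits = sign_bits(query)
    codes = [sign_bits(v) for v in corpus]  # would be precomputed at index time
    coarse = sorted(range(len(corpus)), key=lambda i: hamming(q_bits, codes[i]))
    exact = sorted(coarse[:shortlist], key=lambda i: -dot(query, corpus[i]))
    return exact[:top_k]
```

The cheap stage only has to keep the true neighbors somewhere in the shortlist; the expensive float comparison then runs on a handful of vectors instead of the whole corpus.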

⚡ Why it's quick. A native Rust + Rayon MaxSim kernel (47× over scalar; 16× WASM-SIMD fallback) · int4-quantized, binary-packed token vectors (plain INT4 is the shipped path — the full TurboQuant algorithm is researched but deferred; binary packing alone cut the LI index ~3.4×, 1.34 GiB → ~396 MiB) · a memory-mapped float32 sidecar that skips SQL on the rescore hot path · score-spread adaptive pooling (decisive queries shrink the rescore pool, ambiguous ones widen it) · and a warm daemon that answers in a single NAPI call — no process is ever forked.

🎛️ Priors & structure.

Quality priors: every chunk carries a 0–1 prior from test proximity, git recency, symbol centrality (PageRank), comment density, and complexity — production code surfaces, stale fixtures sink.
Community structure: a canonical Leiden pass detects code communities on the entity graph at index time, feeding vocabulary prewarming and structural signals — it understands your modules, not just your directories.
Multilingual: 14 languages get full tree-sitter AST treatment; a 39-config registry covers 70+ extensions beyond that. Router features handle camelCase/snake_case, CJK density, and German compounds.
Format-gated signals: structure-aware boosts and demotions (symbol-exact, path-token, mega-entity) fire only in agent mode — they help agent-shaped queries and would hurt plain NL, so they stay gated by default.

🛟 Rescues & honest trade-offs.

Long-query rescue: wordy NL queries that FTS5 would tokenize into an unsatisfiable AND fall back to multi-query BM25F + RRF — one query per content keyword, fused.
Near-duplicate dedup: a SimHash + MinHash-LSH pass (Jaccard τ=0.9) clusters copy-paste and vendored code at index time; aliases reuse their exemplar's vectors and skip both the bi-encoder and late-interaction encoding.
A negative result we ship anyway: we built a full cross-encoder rerank cascade behind an adaptive confidence gate, measured it on our eval sets — and it didn't beat MaxSim at 3× the latency. So it ships disabled (SWEET_SEARCH_CASCADE_ENABLED=true to try it). We'd rather ship the faster path than a fancier diagram.
Budget tiers: the expensive 8k/12k tiers fire on ~1–5 % of queries — the default stays cheap. Force one with --full / --xl, or pick a mode with --mode lexical|semantic|hybrid|pattern.
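The near-duplicate pass described in this list relies on Jaccard similarity over shingle sets, with MinHash as a cheap estimator of it. A minimal Python sketch, assuming whitespace tokenization and 3-token shingles (both hypothetical choices), and leaving out SimHash and the LSH banding that would pick candidate pairs at scale:

```python
import hashlib


def shingles(text, width=3):
    """Set of overlapping token n-grams; whitespace tokenization keeps the sketch simple."""
    tokens = text.split()
    if len(tokens) < width:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + width]) for i in range(len(tokens) - width + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash_signature(items, num_hashes=64):
    """Keep the minimum of each salted hash; matching minima estimate Jaccard similarity."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots where the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """Final check with exact Jaccard against the threshold (0.9 in the text above)."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Signatures are fixed-size, so comparing two chunks costs the same regardless of how long the code is; the exact check only needs to run on pairs the estimate flags.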

Also available as sweet-search "<query>" on the CLI and the search MCP tool.

2. ⚡ `ss-grep` — grep, minus every wasted millisecond

10.2× faster than ripgrep end-to-end at the median — measured across 353 realistic queries on 5 real repos (range 8.5–17.7× per repo, 1 ms p50), with identical match counts on every single query. Three things buy that:

A sparse n-gram index (inspired by Cursor's fast-regex-search and GitHub's Blackbird): instead of a fixed trigram table, gram boundaries adapt to your codebase's character-pair frequencies, so common trigrams get absorbed into longer, more selective grams.
Regex-AST literal extraction + SIMD intersection: required substrings are pulled from the pattern's syntax tree, posting lists are intersected with NEON/SSE2 block merges (galloping search for skewed sizes), and only the files that can match — typically 0.1–5% of the corpus — see the real regex.
Fully in-process: verification runs on Rust's regex crate with Rayon across all cores, inside the warm daemon, in a single NAPI call. No child process is ever spawned — zero fork/exec, zero pipe I/O, zero JSON re-parsing.

Every match comes back in stable file:line order — ripgrep-identical counts, optional context lines — with no relevance guessing, no subprocess, in one warm call.
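The prefilter idea can be shown with a deliberately simplified Python sketch. It uses a plain fixed-trigram index rather than the sparse, frequency-adaptive grams described above, and the required literal is passed in by hand instead of being extracted from the regex syntax tree. All names are hypothetical.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All overlapping 3-character substrings of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def candidates(index, all_names, required_literal):
    """Files containing every trigram of the literal; all others cannot match."""
    grams = trigrams(required_literal)
    if not grams:  # literal shorter than 3 characters: nothing to prune with
        return set(all_names)
    # Intersect the smallest posting list first so the working set shrinks fast.
    posting_lists = sorted((index.get(g, set()) for g in grams), key=len)
    result = set(posting_lists[0])
    for postings in posting_lists[1:]:
        result &= postings
    return result


def indexed_grep(files, index, pattern, required_literal):
    """Run the real regex only on surviving files; return sorted (file, line) hits."""
    regex = re.compile(pattern)
    hits = []
    for name in sorted(candidates(index, files, required_literal)):
        for line_no, line in enumerate(files[name].splitlines(), start=1):
            if regex.search(line):
                hits.append((name, line_no))
    return hits
```

The prefilter can produce false positives (a file holding all the trigrams in the wrong order) but never false negatives, as long as the literal really is required by the pattern. That is why match counts can stay identical to a full scan.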

More

Full methodology, per-repo table, and the optimization log: docs/GREP_INDEXING_STRATEGY.md.
Regexes with no extractable literals fall back to native grep over the indexed file set; fixed-string and glob queries use a ripgrep fallback.

3. `ss-find` — ColGrep, on a faster engine

ss-find "token refresh logic" --regex "refresh.*[Tt]oken"

Inspired by LightOn's ColGrep — regex precision, semantically ranked — but rebuilt on our own substrate:

The regex stage runs on the same indexed sparse-gram engine as ss-grep (in-process, no subprocess), not a filesystem scan.
The ranking stage scores candidates with per-token MaxSim over pre-indexed late-interaction embeddings — no model inference over documents at query time — on our custom kernels: native Rust + Rayon takes a 231-candidate MaxSim pass from 1.26 s down to 27 ms (WASM SIMD fallback at 16×).
Regex tokens are merged into the semantic query, so the ranking sees both what you typed and what you matched.
Like ss-search, it answers with ranked, self-contained code snippets — not bare file:line — so the find and the read collapse into one tool call. In our 30-question agent-workflow eval that eliminated every follow-up read and cut tokens 25.4% vs a grep + read workflow, at quality parity (gap of 0.01 on a 5-point scale).
On the 60-query pattern benchmark, MaxSim ranking lifts MRR@10 to 0.45, versus 0.11 for raw grep ordering: roughly a 4× higher mean reciprocal rank for the right hit.
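The per-token MaxSim score used for ranking is easy to state in code. The sketch below is plain Python over small float vectors; the shipped path described above works on quantized vectors with native kernels, so treat this only as a definition of the scoring rule. Function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(query_tokens, candidates):
    """candidates maps doc id to its list of token vectors; returns ids, best first."""
    scores = {doc_id: maxsim(query_tokens, toks) for doc_id, toks in candidates.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because document token vectors are computed at index time, the only model inference at query time is embedding the query tokens; the rest is dot products and maxima, which is why the kernel speed matters so much.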

More

Requires the late-interaction index (built by default; --li-model none disables pattern mode).
Also available as sweet-search --mode pattern and via the search MCP tool's regex argument.

4. `ss-semantic` — hybrid retrieval, scoped to one file

ss-semantic src/auth/session.ts "where does the cookie get its expiry?"

You know the file; this finds the lines. Every indexed chunk of the file is scored by three independent signals — BM25-style lexical term match, exact symbol-name match (weighted 1.5×), and per-token MaxSim late interaction over the LateOn-Code embeddings — fused with Reciprocal Rank Fusion (k=60), with symbol-less fragment chunks demoted 0.85× so real definitions win ties. The top spans are then re-read from disk (±2 context lines, overlapping spans merged), so the answer is filesystem ground truth even mid-edit; if the file is newer than its index entry you get an explicit staleness warning.

The useful answer: just the relevant spans with line numbers — not the whole file through your context window.
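The span handling mentioned above (pad each hit by two context lines, then merge overlaps) is a standard interval merge. A minimal Python sketch, assuming 1-based inclusive line spans and assuming that directly adjacent spans are merged as well; the function name is hypothetical.

```python
def merge_spans(spans, context=2, last_line=None):
    """Pad each (start, end) span by `context` lines, clamp to the file, merge overlaps."""
    padded = []
    for start, end in spans:
        lo = max(1, start - context)
        hi = end + context if last_line is None else min(last_line, end + context)
        padded.append((lo, hi))
    padded.sort()
    merged = []
    for lo, hi in padded:
        if merged and lo <= merged[-1][1] + 1:  # overlapping or directly adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

Merging before the re-read means each line of the file is returned at most once, so two nearby hits cost one span of context rather than two overlapping ones.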

More

Unindexed files degrade gracefully to a plain read. Defaults: top 5 spans, relevance threshold 0.4, 8k-char cap.
Also available as sweet-search read-semantic and the read-semantic MCP tool.

5. `ss-trace` — graph algorithms, not grep guesswork

ss-trace processOrder --in src/orders/service.py

One call returns a symbol's callers, callees, and transitive impact paths from the AST-derived code graph (entities + typed calls/imports/extends/uses edges, persisted in SQLite at index time). Ranking fuses three signals:

Query-time Personalized PageRank via Forward Push — a local algorithm that spreads mass directionally from your target symbol and touches only the neighborhood it reaches, never the whole graph;
Index-time edge-weighted global PageRank (damping 0.85), precomputed into a page_rank column — a function called from five sites carries five units of mass, and it costs zero at query time;
Structural heuristics — relationship type, depth, exported-API status, fan-in — with penalties for test-only and external paths.

Because the graph is prebuilt, the global ranking is precomputed, and the personalized walk is local, a full three-section trace costs milliseconds. The relation word (callers / callees / impact) re-weights how the response token budget is split; --in disambiguates duplicate names; --depth bounds impact traversal (1–4).
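Forward Push itself is compact enough to sketch in Python. This is the generic textbook algorithm over an adjacency-list graph, not the project's implementation; `alpha = 0.15` (the teleport probability matching a 0.85 damping factor), the `epsilon` threshold, and the handling of nodes without outgoing edges are all assumptions.

```python
def forward_push(graph, source, alpha=0.15, epsilon=1e-4):
    """Approximate Personalized PageRank from `source`.

    graph maps each node to a list of out-neighbors. Only nodes reachable
    from the source are ever touched.
    """
    estimate = {}             # settled PPR mass per node
    residual = {source: 1.0}  # mass still waiting to be pushed
    queue = [source]
    while queue:
        node = queue.pop()
        mass = residual.get(node, 0.0)
        neighbors = graph.get(node, [])
        if mass < epsilon * max(len(neighbors), 1):
            continue  # too little residual to be worth pushing
        residual[node] = 0.0
        if not neighbors:
            # Assumed policy for dangling nodes: absorb the whole residual.
            estimate[node] = estimate.get(node, 0.0) + mass
            continue
        estimate[node] = estimate.get(node, 0.0) + alpha * mass
        share = (1.0 - alpha) * mass / len(neighbors)
        for nxt in neighbors:
            residual[nxt] = residual.get(nxt, 0.0) + share
            if residual[nxt] >= epsilon * max(len(graph.get(nxt, [])), 1):
                queue.append(nxt)
    return estimate
```

Running it on a caller graph (edges reversed) versus a callee graph gives the two directional views from one routine. The work done depends on `epsilon` and the reached neighborhood, not on the size of the whole graph, which is the property the paragraph above relies on.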

More

Honest caveat: call-graph extraction is precise but incomplete on highly dynamic code (bare-name dispatch, metaprogramming) — traces can be sparse there, and the agent prompt teaches a recovery strategy for exactly that case.
Also available as sweet-search trace and the trace MCP tool.

6. `ss-read` — exact bytes, with the index's knowledge attached

ss-read src/db/pool.js 120 180

A read tool that is filesystem-grounded by construction: bytes come straight from disk (never from the index, so never stale), but each indexed file arrives annotated with its cAST chunk metadata — symbol name, entity type, signature, line span — joined from the AST chunk index. The agent gets the code and the structural map of what it's looking at in one call: cite, navigate, or trace next without another search.

More

The CLI/MCP form scales it up: sweet-search read <file...> (and the read MCP tool) batches 1–20 files in a single call, each with the same symbol metadata — twenty files for the price of one tool invocation.

The ss-* wrappers ship in the npm package and are what the installed agent prompt drives. Every capability is equally available as sweet-search CLI subcommands and as MCP tools — see Works With Your Agent.

🧠 An Agent Prompt That Was Evolved, Not Written

Shipping six tools is easy. Getting an agent to stop grepping in circles is the hard part.

So sweet-search init installs a ~1k-token system prompt that we didn't write — we grew it. A GEPA-style loop mutated candidate prompts, scored each on a dual Pareto front (accuracy × cost) against two different production agents at once — Claude Code (Sonnet) and Codex (GPT-5.5) — kept the survivors, and repeated. A final correctness pass hardened the winner. ~1k tokens, one job: teach the agent to search well.

🎓 The five rules it encodes:

	Rule	What it kills
🥇	Cheapest tool first	Got an exact symbol? One `ss-grep`, trust the top hit, stop — no semantic search "just to confirm."
🎯	Trust the ranking	At most one narrow read to confirm; never re-run a hit that already matched.
🚫	Absence is an answer	Two empty probes (one semantic, one lexical) settle a negative — no third synonym, no `find`/`ls` spiral.
⛔	No raw-shell escape	The #1 token-waster in our trace analysis: agents bailing to dozens of raw `grep`/`find` calls after one miss. Door closed.
📝	Think before you dig	Before a third probe, the agent states what it knows and what its blind spot is.

🧾 The receipts — held-out discipline throughout: a dev set to iterate on, a held-out set touched only at milestones, a sealed vault opened exactly once.

Validation gate	Result
🎯 Held-out (30 probes × both agents)	joint score (worst of the two) 0.988
🌍 Out-of-distribution (8 languages never seen in the loop)	0.952 — every language ≥ 0.79, zero weak spots
🛡️ Adversarial counter-probes	1.00 / 1.00
🔀 Held-out model families (never optimized on)	MiMo 0.988 · Qwen 0.980 — it generalizes, it doesn't memorize
🧩 Paraphrase robustness (reword the prompt, same behavior)	correctness-weighted 0.95 / 0.93

🔬 How it was actually built (the honest version)

Seeds → survivors: 15 hand-authored seed prompts entered a reflective-evolution loop (an agent reads the real tool-call traces, proposes one targeted edit, we keep what helps). Operators included trajectory crossover, structural pivots, tool-name masking, and a pruner that fights prompt bloat.
Two targets, jointly: every candidate was scored on both Claude Code/Sonnet and Codex/GPT-5.5 with Maximin discipline (a prompt is only as good as its worse target), so it can't overfit one model's quirks.
What actually won: not clever phrasing — terseness (a shorter prompt re-sent every turn is cheaper), a leaner tool mix (grep/read over heavy semantic blocks that fatten the transcript), and decisiveness on no-match (stop spiraling). We report this plainly because it's what the traces showed.
The correctness pass: the shipped prompt ("M++") is the cost-winner plus 7 edits that fix factual descriptions of the tools — routing byte-identical, accuracy held, cost unchanged. A lateral move that buys honesty.
Held-out everything: dev to iterate, held-out checked only at milestones, a sealed vault opened once, plus held-out model families (MiMo, Qwen) and a reasoning-mode replay (MiniMax 0.963) it never trained against. Figures: docs/PHASE7.md (internal probe suites; an externally-reproducible suite is in progress).
Idempotent install: init writes a marker-delimited block into CLAUDE.md / AGENTS.md / GEMINI.md / .cursor/rules — re-run it freely, it never touches anything else you wrote.
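The joint scoring described above (a prompt is only as good as its worse target, then accuracy is traded against cost) can be sketched as follows. This is one plausible reading, not the actual optimization loop: the Candidate fields, the candidate names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: dict      # per-target accuracy, e.g. {"claude": 0.99, "codex": 0.90}
    cost_tokens: float  # hypothetical cost measure


def joint_accuracy(c):
    """Maximin: a candidate is scored by its worst target."""
    return min(c.accuracy.values())


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both accuracy and cost."""
    front = []
    for c in candidates:
        dominated = any(
            joint_accuracy(o) >= joint_accuracy(c)
            and o.cost_tokens <= c.cost_tokens
            and (joint_accuracy(o) > joint_accuracy(c) or o.cost_tokens < c.cost_tokens)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


pool = [
    Candidate("A", {"claude": 0.99, "codex": 0.90}, 1000),
    Candidate("B", {"claude": 0.95, "codex": 0.96}, 1100),
    Candidate("C", {"claude": 0.94, "codex": 0.95}, 1200),  # worse and pricier than B
    Candidate("D", {"claude": 0.97, "codex": 0.80}, 900),
]
survivors = [c.name for c in pareto_front(pool)]
```

Candidate A looks strongest on one target, but under maximin it is scored at 0.90, so it cannot win by overfitting a single model.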

⚡ GPU-Accelerated Indexing, Fully Local

Chunk → enrich → embed → quantize — every step on-device and in Rust. Batches are sized to your CPU's actual cache, two open code-models do the encoding, and two separate quantizations make the index both faster to build and small enough to live in RAM. Zero API keys; nothing ever leaves the machine.

① 🧩 Structure-aware chunk: cAST over tree-sitter ASTs; whole functions, never sliced mid-body
② 🏷️ Enrich from structure: deterministic preamble from the code graph; no LLM call
③ 🤖 Embed (two models): dense CodeRankEmbed + per-token LateOn-Code
④ 🗜️ Quantize + persist: INT8 weights → 2× faster build · INT4 vectors → fits in RAM

The inference engine, picked for your silicon:

Your hardware	What runs
🍏 Apple Silicon (M1+)	candle Metal, BF16, fused SDPA attention
🍏 Apple Silicon (M3+)	… plus a CoreML Neural Engine cascade — ~18% faster full index (measured, M3 Max)
🟩 NVIDIA GPU (SM 7.0+)	candle CUDA; flash-attention on Ampere+
💻 No accelerator	ONNX Runtime INT8 — tuned CPU path, 132 MB model, zero GPU weights downloaded

🧩 Chunking — every chunk is whole code, never a fixed window

cAST structure-aware chunking over real tree-sitter ASTs: a recursive split-then-merge greedily packs sibling AST nodes up to the size cap and recurses into nodes too big to fit. So a chunk is always a function, a class, or a contiguous run of declarations — never a body cut in half, never a string split mid-literal.
14 languages get true AST grammars — JS · TS · TSX · Python · Go · Rust · Java · C · C++ · Ruby · PHP · Kotlin · Swift · C# — and a 39-config regex registry carries structure-aware chunking to 70+ more extensions.
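The split-then-merge idea can be sketched with Python's built-in ast module standing in for tree-sitter. This is an illustration under stated assumptions: size is measured in lines rather than whatever unit the real chunker uses, only classes and functions are recursed into, and the header of an oversized node is simply glued onto its first piece.

```python
import ast


def node_span(node):
    """1-based inclusive line span of a statement, decorators included."""
    first = min([node.lineno] + [d.lineno for d in getattr(node, "decorator_list", [])])
    return first, node.end_lineno


def split_then_merge(nodes, max_lines):
    """Greedily pack sibling nodes into spans of at most max_lines lines."""
    spans = []
    current = None  # (start, end) of the chunk being filled
    for node in nodes:
        start, end = node_span(node)
        too_big = end - start + 1 > max_lines
        if too_big and isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            # Too big to fit: close the open chunk and recurse into the body.
            if current:
                spans.append(current)
                current = None
            inner = split_then_merge(node.body, max_lines)
            # Simplification: attach the header (decorators, signature) to the first piece.
            inner[0] = (start, inner[0][1])
            spans.extend(inner)
        elif current and end - current[0] + 1 <= max_lines:
            current = (current[0], end)  # merge with the previous sibling(s)
        else:
            if current:
                spans.append(current)
            current = (start, end)
    if current:
        spans.append(current)
    return spans


def chunk_python(source, max_lines=12):
    lines = source.splitlines()
    tree = ast.parse(source)
    return ["\n".join(lines[s - 1:e]) for s, e in split_then_merge(tree.body, max_lines)]


SAMPLE = """
import os

def a():
    return 1

def b():
    return 2

class Big:
    def m1(self):
        x = 1
        y = 2
        return x + y

    def m2(self):
        x = 3
        y = 4
        return x + y

    def m3(self):
        return 0
"""
chunks = chunk_python(SAMPLE, max_lines=8)
```

With an 8-line cap, the import and the two small functions merge into one chunk, while the oversized class is split at method boundaries instead of at an arbitrary line.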

🏷️ Metadata — context the encoder can actually see

Every chunk ships its symbol name · entity type · signature · line span — the metadata that powers the code graph, ss-read annotations, and the self-contained answers everywhere else.
Contextual enrichment: before embedding, each chunk is prefixed with a structured preamble assembled from the AST + code graph — file path · enclosing-scope breadcrumb · name & type · merged siblings · the imports it actually uses. Both encoders see it, so a bare getId() still retrieves on the class and module around it.
Our nod to Anthropic's Contextual Retrieval — except they prepend an LLM-generated summary (one model call per chunk); we derive the context deterministically from structure: no LLM, no per-chunk inference, regenerated for free on every reindex. Tuned per language from GenCodeSearchNet ablations — Python stays minimal, the Java family keeps a slug-stripped path, JS/Ruby/Go/C/C++/Rust get the full preamble where closures and imports earn their keep.

🧠 Cache-aware batching — we read your CPU before we batch it

We detect your last-level cache at runtime (hw.perflevel0.l2cachesize on Apple Silicon, meaning the 16 MB P-cluster rather than the smaller E-cluster; L3 on Intel; /sys/.../cache on Linux), then size every embedding batch so one transformer layer's weights plus the batch's activations stay resident in cache. Nothing spills to main memory mid-layer; on a long-sequence tail that is the difference between running at B=1 and taking a measured 2.1× per-chunk slowdown.
Uses every core the hardware really has — full count on ARM/Apple Silicon; x86 SMT siblings discounted because they don't scale inference linearly.
ONNX Runtime (ORT) drives the CPU path; GPU hosts swap in fused kernels (below). Either way, inference runs off the event loop as a napi AsyncTask, so tokenization and SQLite writes overlap compute instead of stalling behind it.
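As a rough sketch of the sizing rule above: subtract one layer's weights from the cache budget and see how many sequences' activations fit in what remains. The function name, the parameters, the ceiling, and every number below are hypothetical; only the 16 MB cache figure comes from the text.

```python
def max_batch_size(cache_bytes, layer_weight_bytes, activation_bytes_per_token,
                   seq_len, floor=1, ceiling=64):
    """Largest batch whose activations fit in cache next to one layer's weights."""
    budget = cache_bytes - layer_weight_bytes
    per_sequence = activation_bytes_per_token * seq_len
    if budget <= 0:
        return floor
    return max(floor, min(ceiling, budget // per_sequence))


MIB = 2 ** 20
# Hypothetical figures: 16 MiB cache, 7 MiB of weights per layer, 12 KiB per token.
short_batch = max_batch_size(16 * MIB, 7 * MIB, 12288, seq_len=128)
long_batch = max_batch_size(16 * MIB, 7 * MIB, 12288, seq_len=2048)
```

Under these assumed numbers, short sequences batch six at a time, while a long-sequence tail drops to a batch of one rather than overflowing the cache.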

🗜️ Two quantizations — one buys speed, one buys size

	Model weights · INT8 ORT	Index vectors · INT4 binary
Job	build the index faster on CPU	keep the on-disk index tiny
Win	~2× faster indexing · 4× smaller model (132 MB)	LI index 1.34 GiB → ~396 MiB · INT4 nibble-packing halves it again
Fidelity	≥ 0.96 cosine vs FP32	no measurable retrieval loss (A/B-tested vs INT8)

🤖 Two models — both open, both local, both code-specialized

CodeRankEmbed — 768-d dense bi-encoder (137M, Apache-2.0) for first-stage recall.
LateOn-Code — ModernBERT per-token late interaction (149M) for the rerank.
Edge fallback for leaner machines: a 17M edge LateOn-Code (~9× smaller FP32 backbone) auto-selects on low-RAM hosts, and the whole CPU path runs INT8 with no GPU weights ever downloaded — full local search on a laptop with no accelerator.

What's actually custom here — the kernels we hand-wrote

Surgical attention swap: we vendor the upstream model implementations (NomicBERT for embeddings, ModernBERT for late interaction) and replace only the attention forward pass — an MLX-ported fused SDPA kernel on Metal, candle-flash-attn with varlen packing on CUDA Ampere+, and byte-for-byte upstream math on CPU so the fallback is provably identical.
A silent-NaN bug, found and fixed: Apple's Metal SDPA kernel downcasts attention masks to F16, which saturates the standard f32::MIN mask to -Inf and quietly produces NaN on padded rows — collapsing retrieval quality. We clamp the mask and serialize Metal command-buffer submissions (concurrent submission corrupts outputs on shared queues). Details in crates/sweet-search-native/src/inference/.
CoreML cascade: 18 pre-traced .mlpackage variants (bucketed by sequence length) dispatched to the Apple Neural Engine through an Objective-C shim; oversized batches fall through to Metal. Gated to M3+ because on M1/M2 the ANE doesn't beat its own compile overhead — we measured, so it's off there.
Structure-routed enrichment: the preamble (path · scope chain · symbol · siblings · imports) is assembled at index time from a code-graph line-range overlap query — never an LLM call — then routed per language family (full enriched text for JS/Ruby/Go/C-family/Rust, a slimmer path policy for Python and the Java family), every decision settled by per-language ablation rather than a global default.
Pipelined, crash-safe indexing: while batch N+1 embeds, batch N's vectors stream into SQLite through zero-copy buffer views; full rebuilds write to a temp file and atomically swap, so a crash never leaves you serving half an index.
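The silent-NaN failure described above is easy to reproduce in miniature. The sketch below models only the overflow behaviour of an f32-to-f16 cast (rounding is ignored) and uses a plain softmax. The clamp value of -65504 is an assumption for illustration; the text does not say what value the real fix clamps to.

```python
import math

F32_MIN = -3.4028234663852886e38  # most negative finite f32, as in Rust's f32::MIN
F16_MAX = 65504.0                 # largest finite IEEE 754 half-precision value


def to_f16_saturating(x):
    """Model only the overflow behaviour of an f32 -> f16 cast."""
    if x > F16_MAX:
        return math.inf
    if x < -F16_MAX:
        return -math.inf
    return x


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.3, -1.2, 0.8]
padded_mask = [F32_MIN] * 3  # a fully padded row: every position is masked

# Downcasting the mask turns every entry into -inf, and (-inf) - (-inf) is NaN.
broken = softmax([s + to_f16_saturating(m) for s, m in zip(scores, padded_mask)])

# Clamping the mask into f16 range first keeps the row finite.
clamped = softmax([s + max(m, -F16_MAX) for s, m in zip(scores, padded_mask)])
```

The broken row contains only NaN, while the clamped row is an ordinary probability distribution that downstream code can safely ignore via the padding mask.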

🔄 An Index That Never Goes Stale

Most code indexes rot the moment you start typing. sweet-search ships a reconcile daemon that keeps every tier of the index converged with your working tree — uncommitted edits included — without you ever running a command.

Save → searchable at the next reconcile tick — auto-tuned per machine between 15 s and 300 s, typically 15–60 s on a warm, idle box
Tracks the filesystem, not git — unstaged and uncommitted changes are first-class; deleted or newly-gitignored files disappear from results automatically
Atomic by construction — every tick publishes all five index tiers (float HNSW, binary HNSW, late-interaction segments, sparse-gram, code graph) through a single fsync-renamed epoch manifest, so a query never sees a half-updated index
No-op edits cost almost nothing — content hashing collapses byte-identical rewrites and editor touch events into skipped re-encoding work
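Two of the ideas above, hash-based skipping of no-op edits and publishing through an fsync-then-rename manifest, can be sketched with the standard library. The file name, the manifest contents, and both function names are hypothetical; this shows the general pattern, not the daemon's code.

```python
import hashlib
import json
import os
import tempfile


def needs_reencode(seen_hashes, path, data):
    """Return False when `data` is byte-identical to what was last indexed for `path`."""
    digest = hashlib.sha256(data).hexdigest()
    if seen_hashes.get(path) == digest:
        return False
    seen_hashes[path] = digest
    return True


def publish_manifest(directory, manifest, name="epoch.json"):
    """Write the manifest to a temp file, fsync it, then atomically rename it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old manifest or the new one, never a partial write.
        os.replace(tmp_path, os.path.join(directory, name))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A query that opens the manifest mid-tick still reads a complete, consistent epoch, because the rename is the only step that makes the new state visible.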

Deep dive

Baseline gate: the daemon never plays first-index-builder. It verifies a full-indexer fingerprint (epoch manifest + Merkle config fingerprint + the vectors DB it names) before touching anything, and reports waiting_for_initial_index otherwise, so it never builds on a corrupted partial baseline.
One admission policy: the full indexer and the reconciler share a single createAdmissionPolicy module (include globs → deny list → .sweet-search-ignore → 1 MB size cap → batched git check-ignore), so the two paths cannot drift.
Orphan sweep: files that are deleted, newly excluded, or newly oversized get tombstoned across every tier; the index converges to exactly what a fresh full rebuild would produce.
Self-maintenance: per-tier health watermarks (tombstone fraction, stale-doc ratio, delta ratio) schedule low-priority background compaction in a separate worker — the index stays fast over months without a manual rebuild.
Worktree-safe: a worktree stamp plus a single-writer lockfile prevent two daemons from silently interleaving index histories across git worktrees.
Resource-polite: ticks are budgeted (≤50 files / ≤2 s CPU per tick), run CPU-only (the GPU is reserved for cold full indexing), and the interval auto-tunes from load average, churn, and backlog.
sweet-search reconcile status / reconcile inspect <path> explain exactly what the daemon thinks and why. Opt out any time with SWEET_SEARCH_RECONCILE_V2=0.
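The ordered admission checks described above (include globs, deny list, ignore file, size cap, git check-ignore) amount to a short-circuiting chain. The sketch below is a hypothetical stand-in for createAdmissionPolicy: the argument names, the reason strings, the exact byte cap, and the use of fnmatch-style patterns are all assumptions, and the batched git check-ignore result is passed in as a precomputed set.

```python
from fnmatch import fnmatchcase

MAX_BYTES = 1_000_000  # "1 MB size cap"; the exact byte count is an assumption


def make_admission_policy(include_globs, deny_globs, ignore_globs, git_ignored):
    """Build one admit() function that both indexing paths could share."""

    def matches(path, patterns):
        return any(fnmatchcase(path, p) for p in patterns)

    def admit(path, size_bytes):
        if not matches(path, include_globs):
            return False, "not included"
        if matches(path, deny_globs):
            return False, "deny list"
        if matches(path, ignore_globs):
            return False, "ignore file"
        if size_bytes > MAX_BYTES:
            return False, "size cap"
        if path in git_ignored:
            return False, "git ignored"
        return True, "admitted"

    return admit


admit = make_admission_policy(
    include_globs=["*.js", "*.py"],
    deny_globs=["node_modules/*"],
    ignore_globs=["vendor/*"],
    git_ignored={"dist/bundle.js"},
)
```

Because both callers would go through the same admit function, a file is either indexable for both the full indexer and the reconciler or for neither.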

🦀 The Native Engine Room

Four Rust crates do the heavy lifting, each with a graceful fallback so the engine runs everywhere:

Crate	What it does
`sweet-search-native`	candle GPU/CPU inference, sparse-gram grep engine, SIMD posting-list intersection, SimHash/MinHash-LSH dedup, HuggingFace tokenizers — all over zero-copy NAPI
`wasm-maxsim`	a hand-written WASM SIMD kernel computing ColBERT MaxSim in ~4 KB (~1.6 KB gzipped), with fused INT8 dequantization inside the SIMD pipeline plus a 4-bit nibble-packed path
`wasm-router`	the 498-tree CatBoost query router, loop-unrolled, zero-allocation
`sweet-search-cli`	a native CLI that talks to a warm search daemon over a per-project Unix socket — 2.9 ms measured warm-path queries
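For readers new to late interaction, ColBERT-style MaxSim is small enough to state in a few lines: each query token takes its best-matching document token, and those maxima are summed. This is a plain-float reference sketch with made-up 2-d vectors; the SIMD kernels, INT8 dequantization, and nibble-packed path mentioned above are not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (hypothetical, unit-length for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, 1.0]]

ranked = sorted(
    [("doc_a", maxsim(query, doc_a)), ("doc_b", maxsim(query, doc_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Here doc_a wins because it has a good match for both query tokens, while doc_b matches only one of them.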

Deep dive

MaxSim, three speeds: scoring auto-selects the best available tier — native Rust + Rayon across all cores (47× vs baseline JS in our microbenchmark), portable WASM SIMD (16×), or a norm-cached pure-JS fallback (3.5×). Equivalent rankings, any platform.
SIMD set intersection: posting-list intersection dispatches per-pair — galloping search when one list is ≥8× smaller, 4-wide NEON/SSE2 block merges for balanced lists, scalar merge for small ones — following the Lemire/Clausecker line of work.
Dedup at index time: near-duplicate chunks are fingerprinted (64-bit SimHash + 128-permutation MinHash), clustered with banded LSH + union-find, then re-validated pairwise against the exemplar so transitive weak links can't glue unrelated clusters together. Duplicates skip embedding entirely — and at query time the best-matching sibling can take the exemplar's slot, so collapsing copies never hides the right answer.
Per-project warm daemon: the CLI derives an isolated socket path from an FNV-1a hash of the project root, auto-starts the server on first use, and falls back to pure JS where no native binary exists (measured: 2.9 ms warm / 108 ms cold / 64.7 ms JS fallback).
Native tokenization: the official HuggingFace tokenizers crate over NAPI — batched, cached, no Python anywhere in the stack.
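The per-pair dispatch for posting-list intersection can be illustrated in scalar form. The sketch below switches to galloping search when one sorted list is at least 8× smaller, as the text describes, and otherwise uses an ordinary two-pointer merge; the NEON/SSE2 block merge is not modelled, and all function names are hypothetical.

```python
from bisect import bisect_left


def gallop_to(large, x, lo):
    """First index i >= lo with large[i] >= x, found by doubling and then bisecting."""
    n = len(large)
    if lo >= n or large[lo] >= x:
        return lo
    step = 1
    while lo + step < n and large[lo + step] < x:
        step *= 2
    # large[lo + step // 2] < x is known, so the answer lies just after it.
    return bisect_left(large, x, lo + step // 2 + 1, min(lo + step, n))


def gallop_intersect(small, large):
    out, pos = [], 0
    for x in small:
        pos = gallop_to(large, x, pos)
        if pos == len(large):
            break
        if large[pos] == x:
            out.append(x)
    return out


def merge_intersect(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect(a, b, ratio=8):
    """Intersect two sorted, duplicate-free posting lists."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if len(small) * ratio <= len(large):
        return gallop_intersect(small, large)
    return merge_intersect(a, b)
```

Galloping pays off on skewed pairs because it skips over long runs of the larger list in logarithmic rather than linear time.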

🗜️ INT4 binary segments: the on-disk format behind the RAM-sized index

The headline quantization numbers are in the indexing section above: 1.34 GiB → ~396 MiB, with INT4 nibble-packing halving it again. This section covers the SSLX segment format that delivers them, how it stays crash-safe, and the three-stage retrieval it feeds at query time.

Deep dive

INT4 by default: per-token min/scale quantization with nibble packing (two values per byte), A/B-tested against the INT8 baseline with no meaningful retrieval regression before becoming the default. We borrowed the rotation insight from Google's TurboQuant, but ship plain INT4 — the full TurboQuant algorithm (WHT + PolarQuant + QJL) is researched and deferred, not in the product path.
SSLX binary segments: the index persists as ~10k-document binary segment files with structured headers and CRC32 footers — a crash costs you at most one segment, not the index.
Three-stage retrieval: a binary HNSW (Hamming distance over 64-byte binarized vectors, ~32× smaller than float HNSW) produces candidates in ~100 µs, INT8 rescoring narrows them, and a float32 sidecar rescores the final pool — speed without giving up top-result quality.
Memory-mapped HNSW: the float graph index loads via mmap (USearch view()), contributing 0 MB to the V8 heap at search time; the OS reclaims pages under pressure.
Streaming indexer: vectors stream from SQLite cursors instead of materializing in arrays — peak JS heap during indexing dropped from ~785 MB to ~213 MB, with 30-second fsync-ordered checkpoints bounding crash loss. The OOM cliff that used to appear above ~200k chunks is gone; large repos index comfortably on an 8 GB machine.
Tuned HNSW parameters and zero-GC search internals (typed-array heaps, generation-stamped visited lists) cut search p50 by 33% while raising recall@200 by 5.9 pp in our internal evaluation (docs/HNSW_APPROACH.md).
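Per-token min/scale quantization with nibble packing, as described in the first item above, can be sketched as follows. The layout here (low nibble first, a zero pad for odd dimensions, min and scale returned alongside the bytes) is an assumption for illustration and says nothing about the actual SSLX byte format.

```python
def quantize_int4(vec):
    """Map floats to 4-bit codes using the vector's own min and scale, two codes per byte."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant vectors
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad so every byte holds two nibbles
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return lo, scale, packed


def dequantize_int4(lo, scale, packed, dim):
    out = []
    for byte in packed:
        out.append(lo + (byte & 0x0F) * scale)  # low nibble
        out.append(lo + (byte >> 4) * scale)    # high nibble
    return out[:dim]


token = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.09, 0.27]
lo, scale, packed = quantize_int4(token)
restored = dequantize_int4(lo, scale, packed, len(token))
```

Eight float32 values (32 bytes) shrink to 4 bytes plus the two per-token constants, and each restored value lands within half a quantization step of the original.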

🔌 Works With Your Agent

sweet-search meets your agent wherever it is — shell tools, MCP, or injected instructions:

Put this in .mcp.json at the project root (that's the whole integration), or just run sweet-search init --mcp:
{
  "mcpServers": {
    "sweet-search": {
      "command": "npx",
      "args": ["-y", "sweet-search-mcp", "--project-root", "/absolute/path/to/your/repo"]
    }
  }
}

MCP server — 8 tools (search, trace, read, read-semantic, index, health, repo-map, vocab-prewarm), 2 resources, 2 prompts; all search tools declared read-only and idempotent
Harness injection — init writes the evolved system prompt into Claude Code, Codex (--codex, including session hooks), Gemini CLI (--gemini), and Cursor (--cursor) from one canonical source
Repo maps for sub-agents — the repo-map tool returns a PageRank-ranked symbol overview squeezed into any token budget, perfect for briefing a delegated agent
Warm from the first query — a SessionStart hook pre-launches the search daemon so models, vocabulary, and indexes are loaded before you ask anything

Deep dive

Tool routing enforcement (opt-in): init --enforce-tools denies the native Grep tool in Claude Code and installs a hint hook nudging native Read toward ss-read/ss-semantic — for when you want the discipline guaranteed, not suggested.
/sweet-index skill: a Claude Code slash command for a full GPU-aware reindex, installed by init.
Vocabulary prewarm: sweet-search prewarm-vocab mines your repo's real identifiers, detects code communities (Leiden), and pre-warms all three search modes so even the first semantic query of a session is cache-warm.
Honest committed-state: init never writes machine-specific absolute paths into committed settings files, and all instruction injection is marker-delimited and reversible.

🖥️ Platform Support

Platform	Engine	Acceleration
macOS arm64 (Apple Silicon)	native	Metal (M1+) · CoreML Neural Engine (M3+)
macOS x64 (Intel)	native	ONNX Runtime INT8 CPU
Linux x64 (glibc)	native	CUDA (SM 7.0+, flash-attn on Ampere+) or INT8 CPU
Linux arm64 (glibc)	native	CUDA (Jetson Orin / Grace) or INT8 CPU
Windows	—	via WSL2 (= Linux x64)
Everything else	WASM/JS fallback	runs everywhere Node ≥ 18 runs

Native binaries are selected automatically at npm install time via optionalDependencies — no flags, no postinstall scripts to debug. Every native fast path has a WASM or JS fallback that produces the same results.

🙏 Prior Art & Acknowledgements

sweet-search stands on a lot of shoulders, and we'd rather name them than pretend otherwise:

ColBERT (Khattab & Zaharia) — late interaction; LightOn for the LateOn-Code models and the ColGrep concept our pattern mode parallels
ripgrep (BurntSushi) — the bar for grep, and our verification baseline
GitHub's Blackbird — the sparse n-gram indexing idea we tuned per-codebase
candle & MLX — Rust ML and the fused SDPA kernels we build on; HuggingFace tokenizers
Aider — the repo-map idea, here rebuilt on a real knowledge graph
USearch — memory-mapped HNSW; Malkov & Yashunin for HNSW itself
CatBoost — the query router model; Traag et al. for the Leiden algorithm; Cormack et al. for RRF; PathRAG for flow-pruned graph expansion; cAST for structure-aware chunking
GEPA — the reflective evolutionary prompt-optimization paradigm behind our agent prompt
nomic-ai — the CodeRankEmbed embedding model
Anthropic — the Contextual Retrieval idea behind our chunk enrichment, here derived from code structure instead of an LLM summary

📄 License

Found it useful?

If sweet-search saves your agent's tokens, a ⭐ helps other agents' humans find it.
