mrsladoje/sweet-search

sweet-search — local code search for AI coding agents

Local code search for AI coding agents. Six fast, purpose-built tools that hand Claude Code, Codex & friends ranked answers, not raw grep. Zero API keys, 100% on-device.

Maybe grep isn't all you need… 🍬
Every coding agent today reaches for grep + Read by reflex. sweet-search challenges the narrative. 😎

✨ Highlights

  • Hybrid retrieval — one of the six tools uses BM25F lexical + dense semantic + structural graph signals, fused per query and reranked by late-interaction
  • Agent-native by design — token-budgeted output tiers, an optional MCP server (and default zero-overhead CLI), and a GEPA-evolved system prompt installed into Claude Code, Codex, Gemini CLI, and Cursor with one command
  • Indexed grep, ~10× faster than ripgrep — a sparse n-gram prefilter skips the files that provably can't match
  • ColBERT-style reranking, locally — per-token MaxSim late interaction on hand-written SIMD kernels
  • GPU-accelerated indexing — Apple Metal, CUDA, CoreML Neural Engine, or plain CPU via ORT; same engine, auto-selected
  • Never stale — incremental indexing keeps the index aligned with your working tree, uncommitted edits included
  • No storage hassle — indexed artifacts maximally optimized without any accuracy tradeoff; up to INT4 quantization
  • Local-first — all models run on-device; nothing is sent anywhere, ever. CPU-inference supported for all models

📚 Table of Contents

GET STARTED

🚀 Quickstart
three commands to a searchable repo

🖥️ Platform Support
macOS · Linux · WASM fallback

USE IT

🧰 The Six Tools
search · grep · find · semantic · trace · read

🧠 The Evolved Agent Prompt
GEPA-optimized search discipline

🔌 Works With Your Agent
MCP · Claude Code · Codex · Gemini · Cursor

UNDER THE HOOD

⚡ GPU-Accelerated Indexing
candle · fused kernels · cAST chunking

🔄 An Index That Never Goes Stale
reconcile daemon tracks your working tree

🦀 The Native Engine Room
four Rust crates + INT4 LI compression

THE RECEIPTS

📊 Benchmarks
agent cost savings · engine speed · full-corpus MRR

🧭 Where sweet-search Fits
honest wins & trade-offs vs peers

🙏 Prior Art & Acknowledgements
the shoulders we stand on

📄 License
Apache-2.0

🚀 Quickstart

npm install -g sweet-search

cd your-repo
sweet-search init     # one-time: downloads local models, wires up your agent
sweet-search index    # builds the index — GPU-accelerated where available

sweet-search "where do we validate JWT tokens?"

That's it. init is idempotent and SHA256-verifies every model binary; re-running it is always safe. From then on the index maintains itself — edit, save, search.

Setup options & details
sweet-search init --wizard          # interactive: shows your hardware, recommends a model tier
sweet-search init --profile core    # lexical-only, no model downloads (CI-friendly)
sweet-search init --li-model edge   # compact late-interaction model for constrained machines
sweet-search uninstall              # clean removal: models, caches, config — never your code
  • Requirements: Node ≥ 18. macOS (arm64/x64) and Linux (x64/arm64) ship native binaries; other platforms fall back to WASM/JS automatically.
  • Footprint: CPU-only hosts download a few hundred MB of INT8 models; GPU hosts add ~1.2 GB of FP32 backbones (skipped automatically where they'd be useless); M3+ Macs can additionally fetch a ~3.2 GB CoreML cascade for Neural Engine acceleration. Everything lands in ~/.cache/sweet-search/models/ and is used strictly on-device.
  • Agent wiring: init injects the tool-routing system prompt into CLAUDE.md (and AGENTS.md, GEMINI.md, Cursor rules via flags), registers a session-start prewarm hook so your first query hits a warm daemon, and installs a /sweet-index skill in Claude Code.
  • What gets indexed: what you'd expect — .gitignore is respected, node_modules/build dirs/minified artifacts are denied, files over 1 MB skipped, with a .sweet-search-ignore for extra rules.

📊 Benchmarks

We measure sweet-search four ways — from how much it helps a real agent down to raw engine throughput:

🤖 Code-retrieval (agent-in-the-loop)
Does it make a real coding agent cheaper and more useful when it searches your repo? Paired against each model's own grep-and-read loop.

🚧 Task-completion (coming soon)
Does cheaper, denser context compound into a higher resolve-rate on multi-step engineering tasks? Harness in progress.

📄 Paper-type IR (academic)
The standard NL→code retrieval suites (GCSN, M2CRB, CoSQA…), full-corpus MRR@10.

Engine speed
Raw systems numbers — grep throughput, query latency, rerank kernels, HNSW.


🤖 1. Code-retrieval benchmarks — the agent-in-the-loop test

We install the evolved agent prompt (the GEPA-evolved search discipline), point a coding agent at a real repo, and pair it probe-for-probe against the same model running its own native grep-and-read loop. Same model, same tasks, same judge — the only difference is whether sweet-search is wired in.

up to 34% lower cost on Codex · up to 56% fewer tool calls · 1.5–2× more useful context per response · +3pp accuracy on weak models

top-of-range figures · full per-harness ranges in the dropdown · 11 model×harness cells, paired, multiplicity-controlled

The headline, in four claims:

  • 💰 Cheaper where the agent thrashes — up to −34% realized cost on Codex; −18 to −32% across the GPT-5.5 / opencode / bare-API harnesses.
  • 🔧 Fewer round-trips — up to −56% tool calls, significant on 9 of 11 cells.
  • More useful per response: +0.18 to +0.31 on a 5-dimension usefulness score, and still denser when length-matched (significant on 8 of 11 cells).
  • 🎯 Accuracy held — and lifted on the weak — a statistical tie on flagship models (saturated at 0.94–0.99), and +3 pp (up to +8 pp out-of-distribution) on weaker models like GLM-5.1 and DeepSeek.
📋 Full per-harness results & how it's measured

The win is harness-adaptive: where the native loop is disciplined (Claude Code) it shows up as denser, more useful context per token; where it thrashes (Codex floods 30k+ tokens of its own grep output into context) it shows up as a large cost and tool-call cut. Either way, final-answer accuracy never significantly regresses.

| 🧰 Native agent harness | 💰 Realized cost | 🔧 Tool calls | ✨ Useful content / response | 🎯 Final accuracy |
| --- | --- | --- | --- | --- |
| 🤖 Codex (GPT-5.5) | −30 to −34% | −44 to −56% | +0.06 → +0.17 ↑ | tie (saturated) |
| 🐚 opencode (GPT-5.5 / GLM-5.1) | −18 to −22% | −15 to −49% | +0.23 to +0.31 | tie |
| 🔌 bare API (GPT-5.5 / GLM / DeepSeek) | −15 to −32% ᵃ | −15 to −33% | +0.08 to +0.24 ↑ | tie · +3 pp on weak models |
| 🟣 Claude Code (Sonnet / Opus) | −10% to +14% ᵇ | −5 to −33% | +0.18 to +0.29 ↑ | tie |

↑ "Useful content / response" is the per-response delta on a 5-dimension usefulness score (answer-grounding · workable-code · navigability · edit-locality · sufficiency), 0–1 scale. "tie" = final-answer correctness statistically indistinguishable (saturated in the 0.94–0.99 band on flagships).
ᵃ the two cheapest bare models cost fractions of a cent either way (GLM +27% of $0.008; DeepSeek −15% of $0.004). ᵇ Opus −5/−10%; Sonnet +8–14%, which is ≈1¢ on a flat-rate subscription for a richer answer.

Denser, not just longer. The usefulness lift survives length-matching — comparing sweet-search and native responses of equal token length, sweet-search's content is significantly higher on 8 of 11 cells. The validated single-number usefulness composite (grounding × content × density) is significant on all 11 sealed cells.

  • What's being compared: the installed sweet-search agent prompt + tools vs. the same model using only its built-in file-reading and shell-grep tools. Not a different model — the same model, with and without sweet-search.
  • Design: 11 model×harness cells. Sealed vault (n=60/arm, the pre-registered primary) opened once; plus held-out (n=30) and out-of-distribution (n=40) sets for generalization. Stratified, fixed-seed splits.
  • Judging: 3-judge panel (DeepSeek-V4-flash + Gemini-3.1-flash-lite + MiniMax-M2.7), paired by probe, 20k-sample bootstrap CIs, Benjamini–Hochberg FDR multiplicity correction across each metric family. We report family-level survival counts, never a single cherry-picked cell.
  • What survives FDR (vault): useful-content 10/11, density-composite 11/11, length-matched content 8/11, fewer-tool-calls 9/11. Generalization (held-out + OOD): content 17–18/20, fewer calls 14/20.
  • The token fact that drives everything: sweet-search's footprint is nearly constant (~1.3k–3.3k tokens) because the tool responses are capped; native's footprint is whatever the model decides to grep — up to 37k tokens on Codex. That single fact is what drives the cost and tool-call gaps.
  • Honest caveats we keep attached: (1) accuracy ties on flagship models — it is not an accuracy win there, it's saturated; the accuracy gains are real only on weaker models. (2) The two weakest cells for length-matched density (Codex-low, DeepSeek) are correct-sign but underpowered — Codex's responses are so token-divergent that too few equal-length pairs exist to reach significance, and DeepSeek is simply under-powered. Those are honest non-victories, not wins.
  • Full methodology and per-cell tables: docs/PHASE7.md.
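
For readers who want the multiplicity correction spelled out, here is a minimal Python sketch of the Benjamini–Hochberg step-up procedure. The function name and the p-values are made up for illustration; this is not the harness code described in docs/PHASE7.md.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it survives FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= alpha * k / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Every hypothesis ranked at or below the cutoff survives.
    survives = [False] * m
    for idx in order[:cutoff]:
        survives[idx] = True
    return survives

# Hypothetical p-values for four cells of one metric family.
p_values = [0.001, 0.03, 0.035, 0.2]
survivors = benjamini_hochberg(p_values)
```

Note the step-up behavior: 0.03 misses its own threshold (0.025) but still survives because 0.035 clears the threshold at rank 3.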

🚧 2. Task-completion benchmarks — coming soon

Retrieval quality is necessary but not sufficient. Cheaper, denser context only matters if it compounds across a real, multi-step engineering task — finding the code, understanding it, changing it, and not breaking anything. The next suite measures exactly that: resolve-rate on SWE-bench-style multi-file tasks, sweet-search-wired vs. native, on the same paired, multiplicity-controlled bar as above. Harness and pilot are in progress — numbers land here when they clear that bar, and not before.


📄 3. Paper-type retrieval benchmarks — academic NL→code IR

Every number below is the ss-search pipeline end-to-end — the same binary you install — run against the full benchmark corpus (no 99-distractor shortcuts), zero-shot (we never fine-tune on these tasks). Where a benchmark's queries are docstrings, we strip the docstring out of the indexed code so the query can't trivially match itself — the standard retrieval protocol.
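
For readers new to the metric, here is a minimal Python sketch of MRR@10 as it is usually defined. The function name and toy rankings are illustrative assumptions, not the code under eval/.

```python
def mrr_at_k(ranked_ids_per_query, gold_id_per_query, k=10):
    """Mean reciprocal rank, crediting a query only if its gold doc is in the top k."""
    total = 0.0
    for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(gold_id_per_query)

# Toy data: gold at rank 1, gold at rank 2, gold not retrieved.
rankings = [["a", "b", "c"], ["x", "gold2", "y"], ["p", "q", "r"]]
golds = ["a", "gold2", "missing"]
score = mrr_at_k(rankings, golds)  # (1 + 1/2 + 0) / 3
```

Ranking against the full corpus instead of 99 sampled distractors changes only what goes into each ranked list; the scoring itself is identical.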

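As of June 2026, sweet-search holds the best published result we can find on 3 of the 4 benchmarks we attempted (zero-shot only, in CoSQA's case), and at a harder setting than most published attempts: the full candidate pool.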

| 📚 Benchmark | 🔍 What it tests | # Queries | 📂 Pool | 🎯 MRR@10 | 🏆 SOTA? |
| --- | --- | --- | --- | --- | --- |
| 🌐 GenCodeSearchNet | NL→code, 6 languages | 6,000 | full 6,000 | 86.6 | YES ✅ |
| 🐍 CoSQA | web queries → Python | 500 | full 6,267 | 65.5 | ✅ (zero-shot) |
| 🗺️ M2CRB | multilingual NL→code (ES/PT/DE/FR → Py/Java/JS) | 5,795 | full 5,795 | 54.0 | YES ✅ |
| 🛡️ AdvTest | adversarial, identifier-obfuscated Python | 19,210 | full 19,210 | 51.4 | NO ❌ |

SOTA = best result we can find in the published literature as of June 2026; cross-metric/protocol comparisons are spelled out per benchmark below.

🌐 GenCodeSearchNet → 86.6  ·  🏆 SOTA in June 2026

  • The BEST PUBLISHED number we can find, anywhere
  • The benchmark's own paper caps at MRR ≤ 0.42 for fine-tuned baselines (≤ 0.10 cross-lingual); even zero-shot OpenAI Ada-2 reaches 0.79–0.94 — but all of it against a tiny 99-distractor pool.
  • We score 0.866 against the entire 6,000-document corpus, a strictly harder setting, and zero-shot. 🔥

🐍 CoSQA → 65.5  ·  🥇 Zero-shot SOTA in June 2026

  • Beats EVERY PUBLISHED zero-shot model
  • Canonical setup: 500 real web queries → the fixed 6,267-code database, no fine-tuning.
  • Clears the strongest zero-shot results out there — CodeSage-Large 47.5 · OpenAI text-embedding-3-large 55.4 · OASIS 55.8 — and goes toe-to-toe with fine-tuned CodeBERT / GraphCodeBERT (64.7 / 67.5). 💪
  • CoSQA has known label noise, so we read the absolute height with a pinch of salt.

🗺️ M2CRB → 54.0  ·  🏆 SOTA in June 2026

  • the BEST PUBLISHED number we can find, anywhere — and zero-shot
  • 🇪🇸 Spanish · 🇵🇹 Portuguese · 🇩🇪 German · 🇫🇷 French → Python / Java / JavaScript.
  • The paper's best — a CodeBERT fine-tuned on the task — reaches 52.7 auMRRc, a metric that averages over easier, smaller pools (so auMRRc ≥ full-pool MRR for any model). Our 54.0 is full-pool MRR@10 over all 5,795 functions in one pool — a strictly harder measure, cleared with no fine-tuning. 🔥

🛡️ AdvTest → 51.4  ·  🧪 our honest worst case — and we publish it anyway

  • Adversarial obfuscation (def Func(arg_0):) deletes the lexical + graph signals our hybrid feeds on — yet we still beat the classic fine-tuned baselines (CodeBERT 27 · GraphCodeBERT 35 · UniXcoder 41), and our stack still lifts our own encoder ~3pp even here.
  • 🔍 Full transparency: we could not reproduce the often-cited 59.5 for the bare CodeRankEmbed encoder — the reference FP32 model scores 54.7 on our leak-free corpus, our shipped INT8 build 51.4. The gap is stricter preprocessing + INT8 quantization, not the retrieval pipeline. We report exactly what we measured.
Methodology, protocol & honesty notes
  • Reproduction: result artifacts live in eval/results/; rerun via eval/run_all.js. The canonical full-pool loaders are in eval/download_data.py.
  • Full corpus, not distractors. Published baselines for GCSN- and CoSQA-style benchmarks typically rank the gold against 99 sampled distractors; every number here ranks against the benchmark's full corpus (6k–19k candidates) — strictly harder.
  • Zero-shot + docstring-stripped. We never fine-tune on these tasks. For docstring-derived benchmarks (AdvTest, M2CRB) we strip the docstring from the indexed code — otherwise the NL query matches itself verbatim (a no-strip AdvTest run scores a meaningless 0.98). This is the standard protocol; it is also why our AdvTest is lower than naïve setups that leave the docstring in.
  • What we deliberately don't claim yet. CoIR (official metric NDCG@10 over per-subtask corpora up to ~1M docs), CoSQA+ (multi-positive, MAP-primary), and CLARC (per-group pools) use protocols and metrics our single-pool MRR@10 harness doesn't currently match. Rather than publish apples-to-oranges numbers, we omit them; faithful per-subtask CoIR (NDCG@10) runs are queued.
  • M2CRB — the paper's metric is auMRRc (area under the MRR-vs-pool-size curve; best published 52.7, fine-tuned). Because that area averages over easier small pools, auMRRc ≥ full-pool MRR for any model — so our 54.0 full-pool MRR@10 (all 5,795 functions, zero-shot) clears their best on a strictly harder measure. No one publishes a plain full-corpus MRR@10 on M2CRB, so ours is the best available.
  • AdvTest honesty note. We could not reproduce the commonly-cited 59.5 for the bare CodeRankEmbed encoder on our corpus: the reference FP32 model scores 54.7 on our leak-free, docstring-stripped, full-19,210 setup, and our shipped INT8 build 51.4. We report our measured numbers and the reference check rather than the leaderboard figure.
  • Honesty corner: CrossCodeEval — cross-file completion-context retrieval, a different task than NL search — sits at 0.12. We don't optimize for it and report it anyway.

⚡ 4. Engine speed — systems benchmarks, measured in-repo

10.2× ripgrep's median grep  ·  2.9 ms warm queries  ·  47× MaxSim kernels  ·  −33% HNSW search p50

| ⚙️ What | 📈 Result | 📄 Source |
| --- | --- | --- |
| ⚡ Indexed grep vs ripgrep | 10.2× faster at the median (8.5–17.7× across 5 repos, 353 realistic queries, 1 ms p50; identical match counts on every query) | docs/GREP_INDEXING_STRATEGY.md |
| ⏱️ Warm query latency (native CLI) | 2.9 ms warm · 108 ms cold | docs/INIT_STRATEGY.md |
| 🧮 MaxSim rerank kernels | 1.26 s → 27 ms for a 231-candidate pass (47× native Rust; 16× WASM SIMD) | docs/MAXSIM_OPTIMIZATION.md |
| 🧠 HNSW tuning for code | −33% search p50, +5.9 pp recall@200 | docs/HNSW_APPROACH.md |
| 💾 Indexing memory | peak JS heap 785 MB → 213 MB | docs/DISK_FLUSHING_STRATEGY.md |
| 🍏 CoreML cascade (M3 Max) | 18% faster full indexing vs the Metal baseline | docs/INIT_STRATEGY.md |

🧭 Where sweet-search Fits

Code search is a crowded space. Here's an honest read on where sweet-search wins and where it gives ground, against the trending leaders and our closest local peers.

Capability sweet-search claude-context Cursor index codebase-memory SocratiCode
100% local — code never leaves your machine ⚠️¹
Works with zero API keys
No external service to run (vector DB · Ollama · Docker) ❌ Milvus ❌ cloud ⚠️
ColBERT late-interaction rerank
Faster-than-ripgrep exact grep ✅⁷
Call-graph trace (callers · callees · impact)
Drives any terminal agent (Claude Code · Codex · Gemini CLI) ❌²
Published NL→code retrieval benchmarks ⚠️³ ⚠️³ ⚠️³
…and where sweet-search gives ground
Native Windows ❌⁴ ⚠️
Deep-AST language coverage ⚠️ 14 (+70 via regex) ⚠️ ⚠️ ✅ 158 ⚠️
In-editor GUI · writes & edits code ❌⁶
Org-wide, multi-repo scale ⚠️ ⚠️ ⚠️

✅ yes · ⚠️ partial / with caveats · ❌ no. Verified June 2026; capabilities drift.
¹ claude-context can run local via Milvus Lite + Ollama, but defaults to OpenAI/Voyage embeddings + Zilliz Cloud. ² Cursor's index is editor-locked — external terminal agents can't query it. ³ Reports token-reduction / efficiency, not a public NL→code retrieval-quality leaderboard. ⁴ Runs on Windows via WSL2. ⁵ SocratiCode manages a bundled Qdrant for you, but uses an auto-detected Ollama for local embeddings. ⁶ Ships an interactive HTML graph viewer, but doesn't edit code. ⁷ Cursor's local Instant Grep — a literal + regex index it benchmarks at ripgrep 16.8 s → 13 ms (the post that inspired our own n-gram prefilter). ⁸ SocratiCode runs on Windows via Docker only — no native binary, and no GPU there.

Where we lose, plainly: no native Windows yet, no editor GUI, and we index one repo at a time. If you need org-wide search across many repos and branches, that's where SocratiCode and Sourcegraph are built to win. If you live inside one editor, Cursor's index is already there. sweet-search is for the terminal agent that wants the best local retrieval on the repo in front of it. No one else combines all of it: ColBERT late-interaction reranking and faster-than-grep search, fully on-device, with nothing to sign up for.

Also in the space: Sourcegraph/Cody (org-scale, server-based), Continue.dev (local-default RAG), Serena (LSP symbol search, no embeddings), grepai (local CLI + trace), and cocoindex-code (embedded AST search).

🧰 The Six Tools

Six small tools, one shared index. Each returns ranked, deduplicated, token-budgeted output designed to be consumed by an agent — a useful answer, not a wall of matches to scroll through.

| Tool | What you give it | What you get back |
| --- | --- | --- |
| 1. ss-search | a natural-language query | ranked, self-contained code blocks |
| 2. ss-grep | an exact regex/literal | every file:line hit, ripgrep-identical |
| 3. ss-find | a regex + a query | regex matches, semantically re-ranked, as code blocks |
| 4. ss-semantic | a file + a question | just the relevant spans of that file |
| 5. ss-trace | a symbol | callers + callees + impact, in one call |
| 6. ss-read | a file (± line range) | exact bytes + symbol metadata |

1. 🔍 ss-search — hybrid search powerhouse

A hybrid search pipeline with late interaction reranking that returns actual code blocks.

Leading published-benchmark results — strongest we can find on GenCodeSearchNet, and above every published zero-shot model on CoSQA. See benchmarks.

flowchart TD
    Q(["🔍  natural-language query"]) --> ROUTE{{"🧭 WASM CatBoost router · lexical / hybrid"}}

    ROUTE --> BM["📑 <b>BM25F</b><br/>field-weighted FTS5"]
    ROUTE --> ANN

    subgraph ANN ["🧬 three-stage ANN cascade"]
        direction LR
        BIN["binary <b>HNSW</b><br/>Hamming · ~100µs"] --> INT["INT8<br/>rescore"] --> FL["float32<br/>mmap sidecar"]
    end

    BM  --> FUSE
    ANN --> FUSE
    FUSE["🔀 <b>CCFusion</b><br/>convex combo · RRF fallback"] --> ROW1

    subgraph ROW1 [" "]
        direction LR
        IAR["⚓ <b>IAR</b><br/>exact-symbol injection"] --> INTENT["🎯 intent rerank<br/>demote docs · tests · config"]
    end

    ROW1 --> ROW2

    subgraph ROW2 [" "]
        direction LR
        GRAPH["🕸️ graph expansion<br/>typed edges · 1–2 hops · <b>PathRAG</b>"] --> MAXSIM["🧮 <b>Late-Interaction Rerank</b><br/>⚡ native Rust MaxSim kernel"] --> OUT(["🏁 <b>self-contained code blocks</b><br/>whole functions · 3k/8k/12k budget"])
    end

    classDef io    fill:#fde68a,stroke:#f59e0b,color:#000;
    classDef out   fill:#bbf7d0,stroke:#15803d,color:#000,stroke-width:3px;
    classDef route fill:#e0e7ff,stroke:#818cf8,color:#000;
    classDef lex   fill:#dbeafe,stroke:#60a5fa,color:#000;
    classDef fuse  fill:#f3e8ff,stroke:#c084fc,color:#000;
    classDef rank  fill:#ffe4e6,stroke:#fb7185,color:#000;

    class Q io;
    class OUT out;
    class ROUTE route;
    class BM,BIN,INT,FL lex;
    class FUSE,IAR fuse;
    class INTENT,GRAPH,MAXSIM rank;

    style ANN  fill:#eff6ff,stroke:#93c5fd,color:#000;
    style ROW1 fill:none,stroke:none;
    style ROW2 fill:none,stroke:none;

↑ The diagram traces the hybrid route. A pure-lexical query — or a literal file path — short-circuits at the router straight to BM25F, skipping the vector cascade and fusion.

| Stage | What it actually does |
| --- | --- |
| 🧭 Route | WASM-exported CatBoost · lexical / hybrid · ~10 µs routing · low-confidence → max-recall hybrid |
| 🧬 Retrieve | • Lexical: BM25F over field-weighted FTS5 (name 10× · signature 5× · alias 4× · doc 1×)<br/>• Embed: query vectorized by the local CodeRankEmbed model (swappable for Voyage / Jina / Codestral)<br/>• Vector cascade: binary HNSW (Hamming, 64-byte, ~100 µs) → INT8 rescore → exact float32 from a memory-mapped sidecar |
| 🔀 Fuse | • CCFusion: convex-combine both rankings · per-route weights · quantile-normalized<br/>• MMR (λ=0.9) diversity pass over the fused list<br/>• auto RRF (k=60) fallback on degenerate score distributions |
| Anchor | IAR (Identifier Anchor Retrieval): a real symbol in the query fires an exact-name code-graph lookup that injects that entity, even when the encoder ranked it too low |
| 🎯 Intent Rerank | • demote docs / tests / config when you want implementation<br/>• log-scaled call-site boosts surface the most-referenced function |
| 🕸️ Graph Expansion | • typed-edge walks (imports/extends/calls/uses) · adaptive 2-hop on the AST graph · edges picked by intent<br/>• PathRAG flow pruning + degree normalization → hubs can't dominate |
| 🧮 Late interaction Rerank | • Query embedded per-token by LateOn-Code (149M; a 17M edge variant auto-selected on low-RAM hosts)<br/>• MaxSim against the pre-indexed quantized token vectors<br/>• native Rust+Rayon MaxSim kernel ⚡ · WASM-SIMD fallback (1.26 s → 27 ms on a 231-candidate rerank) |
| 📦 Package | • entity-aware expansion → whole functions (imports, docstrings, decorators)<br/>• same-file overlap demotion → diverse, non-overlapping spans<br/>• auto-selected 3k / 8k / 12k token budget |
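
To make the Fuse stage concrete, here is a minimal Python sketch of a convex combination of two score lists with a Reciprocal Rank Fusion fallback. It is a simplification under stated assumptions: min-max normalization stands in for the quantile normalization named above, "degenerate" is taken to mean zero score spread, and all function names and the 0.5 weight are hypothetical.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over every ranking a doc appears in."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused

def min_max(scores):
    """Scale scores to [0, 1]; return None when there is no spread to normalize."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return None
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(lexical, dense, weight=0.5):
    """Convex-combine two score maps; fall back to RRF if either is degenerate."""
    lex_n, den_n = min_max(lexical), min_max(dense)
    if lex_n is None or den_n is None:
        by_score = lambda s: sorted(s, key=s.get, reverse=True)
        fused = rrf([by_score(lexical), by_score(dense)])
    else:
        docs = set(lex_n) | set(den_n)
        fused = {d: weight * lex_n.get(d, 0.0) + (1 - weight) * den_n.get(d, 0.0)
                 for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Normal case: "c" is second-best in both lists and wins after fusion.
normal = fuse({"a": 3.0, "b": 1.0, "c": 2.0}, {"a": 0.2, "b": 0.9, "c": 0.8})
# Degenerate case: flat lexical scores trigger the rank-based fallback.
fallback = fuse({"a": 1.0, "b": 1.0, "c": 1.0}, {"c": 0.9, "a": 0.5, "b": 0.1})
```

Rank-based fusion ignores score magnitudes entirely, which is why it is a safe fallback when the scores themselves carry no usable signal.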
🌶️ Extra spice — the bits that didn't fit the diagram

🧠 The HNSW, in full (full writeup). Stage 1 is a from-scratch binary HNSW, and every "advanced" trick ships on by default:

  • Heuristic neighbor selection (HNSW Algorithm 4) + M0 = 2M on layer 0 — a real graph backbone, not naïve closest-M
  • Shuffled insertion order — no filesystem-ordering bias baked into the highway structure
  • Discovery-rate adaptive early termination + adaptive ef — easy queries stop early, hard ones keep their budget
  • A denser graph than most vendors ship (M=64 · efC=800 · efS=400) — which broke an 80.6 % → 86.5 % recall@200 plateau and cut p50 latency ~33 %
  • Zero-GC search: typed-array heaps + generation-stamped visited lists — no per-query allocation
  • 64-byte sign-bit vectors (Hamming) → INT8 → exact float32 from a memory-mapped sidecar
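
The first cascade stage can be pictured with a few lines of Python: keep only the sign of each dimension, then compare codes by Hamming distance. This is an illustrative sketch with hypothetical function names and toy 4-dimensional vectors, not the shipped HNSW code.

```python
def sign_bits(vector):
    """Pack the sign of each dimension into an int: bit i is 1 when vector[i] > 0."""
    code = 0
    for i, x in enumerate(vector):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

query = [0.3, -0.1, 0.8, -0.5]
near = [0.2, -0.4, 0.1, -0.9]   # same sign pattern as the query
far = [-0.2, 0.4, 0.1, -0.9]    # signs flipped in two dimensions

near_distance = hamming(sign_bits(query), sign_bits(near))
far_distance = hamming(sign_bits(query), sign_bits(far))

# A 512-dimensional vector packs into exactly 64 bytes.
packed = sign_bits([1.0] * 512).to_bytes(64, "little")
```

The binary pass only has to produce a good candidate pool; the INT8 and float32 stages then restore the precision the sign bits throw away.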

⚡ Why it's quick. A native Rust + Rayon MaxSim kernel (47× over scalar; 16× WASM-SIMD fallback) · int4-quantized, binary-packed token vectors (plain INT4 is the shipped path — the full TurboQuant algorithm is researched but deferred; binary packing alone cut the LI index ~3.4×, 1.34 GiB → ~396 MiB) · a memory-mapped float32 sidecar that skips SQL on the rescore hot path · score-spread adaptive pooling (decisive queries shrink the rescore pool, ambiguous ones widen it) · and a warm daemon that answers in a single NAPI call — no process is ever forked.
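
For reference, the MaxSim score itself is small enough to write out. The sketch below is plain scalar Python with toy 2-dimensional token vectors; it shows the operation the native kernel accelerates, not how that kernel is implemented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: each query token takes its best-matching
    document token (max dot product), and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens exactly
doc_b = [[1.0, 0.0], [0.5, 0.5]]   # covers the second token only partially

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

The cost is one dot product per (query token, document token) pair for every candidate, which is why this inner loop is worth a SIMD kernel.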

🎛️ Priors & structure.

  • Quality priors: every chunk carries a 0–1 prior from test proximity, git recency, symbol centrality (PageRank), comment density, and complexity — production code surfaces, stale fixtures sink.
  • Community structure: a canonical Leiden pass detects code communities on the entity graph at index time, feeding vocabulary prewarming and structural signals — it understands your modules, not just your directories.
  • Multilingual: 14 languages get full tree-sitter AST treatment; a 39-config registry covers 70+ extensions beyond that. Router features handle camelCase/snake_case, CJK density, and German compounds.
  • Format-gated signals: structure-aware boosts and demotions (symbol-exact, path-token, mega-entity) fire only in agent mode — they help agent-shaped queries and would hurt plain NL, so they stay gated by default.

🛟 Rescues & honest trade-offs.

  • Long-query rescue: wordy NL queries that FTS5 would tokenize into an unsatisfiable AND fall back to multi-query BM25F + RRF — one query per content keyword, fused.
  • Near-duplicate dedup: a SimHash + MinHash-LSH pass (Jaccard τ=0.9) clusters copy-paste and vendored code at index time; aliases reuse their exemplar's vectors and skip both the bi-encoder and late-interaction encoding.
  • A negative result we ship anyway: we built a full cross-encoder rerank cascade behind an adaptive confidence gate, measured it on our eval sets — and it didn't beat MaxSim at 3× the latency. So it ships disabled (SWEET_SEARCH_CASCADE_ENABLED=true to try it). We'd rather ship the faster path than a fancier diagram.
  • Budget tiers: the expensive 8k/12k tiers fire on ~1–5 % of queries — the default stays cheap. Force one with --full / --xl, or pick a mode with --mode lexical|semantic|hybrid|pattern.
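
As background for the near-duplicate pass, here is a minimal MinHash sketch in Python. It covers only the Jaccard-estimation idea; the SimHash stage and the LSH banding are omitted, and the shingle size, hash function, and signature length are assumptions chosen for illustration.

```python
import hashlib

def shingles(text, size=3):
    """Overlapping word n-grams of a chunk."""
    tokens = text.split()
    return {" ".join(tokens[i:i + size]) for i in range(max(1, len(tokens) - size + 1))}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(), "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

original = "def add(a, b): return a + b"
vendored_copy = "def add(a, b): return a + b"
unrelated = "class Pool: pass # unrelated code here"

sig = minhash_signature(shingles(original))
same = estimated_jaccard(sig, minhash_signature(shingles(vendored_copy)))
different = estimated_jaccard(sig, minhash_signature(shingles(unrelated)))
```

Two chunks would be clustered when the estimate clears the threshold (τ=0.9 above); the exact copy scores 1.0 and the unrelated chunk shares no shingles.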

Also available as sweet-search "<query>" on the CLI and the search MCP tool.


2. ⚡ ss-grep — grep, minus every wasted millisecond

10.2× faster than ripgrep end-to-end at the median — measured across 353 realistic queries on 5 real repos (range 8.5–17.7× per repo, 1 ms p50), with identical match counts on every single query. Three things buy that:

  • A sparse n-gram index (inspired by Cursor's fast-regex-search and GitHub's Blackbird): instead of a fixed trigram table, gram boundaries adapt to your codebase's character-pair frequencies, so common trigrams get absorbed into longer, more selective grams.
  • Regex-AST literal extraction + SIMD intersection: required substrings are pulled from the pattern's syntax tree, posting lists are intersected with NEON/SSE2 block merges (galloping search for skewed sizes), and only the files that can match — typically 0.1–5% of the corpus — see the real regex.
  • Fully in-process: verification runs on Rust's regex crate with Rayon across all cores, inside the warm daemon, in a single NAPI call. No child process is ever spawned — zero fork/exec, zero pipe I/O, zero JSON re-parsing.
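
The prefilter idea can be sketched in Python with a plain fixed-trigram index. This is a deliberate simplification: the engine described above uses adaptive sparse grams and extracts literals from the regex AST, whereas here the required literal is passed in by hand and every name is hypothetical.

```python
import re

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = {}
    for name, content in files.items():
        for gram in trigrams(content):
            index.setdefault(gram, set()).add(name)
    return index

def indexed_grep(files, index, pattern, required_literal):
    """Intersect posting lists for the literal's trigrams, then run the
    real regex only on the surviving candidate files."""
    grams = trigrams(required_literal)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(files)  # literal too short to prefilter
    regex = re.compile(pattern)
    hits = []
    for name in sorted(candidates):
        for lineno, line in enumerate(files[name].splitlines(), start=1):
            if regex.search(line):
                hits.append((name, lineno))
    return hits, candidates

files = {
    "auth.py": "def refresh_token(user):\n    return issue(user)\n",
    "db.py": "def connect():\n    pass\n",
    "api.py": "token = refresh_token(u)\n",
}
index = build_index(files)
hits, candidates = indexed_grep(files, index, r"refresh_\w+\(", "refresh_")
```

The filter can only produce false positives, never false negatives: a file missing any trigram of a required literal cannot contain that literal, so skipping it is safe.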

Every match comes back in stable file:line order — ripgrep-identical counts, optional context lines — with no relevance guessing, no subprocess, in one warm call.

More
  • Full methodology, per-repo table, and the optimization log: docs/GREP_INDEXING_STRATEGY.md.
  • Regexes with no extractable literals fall back to native grep over the indexed file set; fixed-string and glob queries use a ripgrep fallback.

3. ss-find — ColGrep, on a faster engine

ss-find "token refresh logic" --regex "refresh.*[Tt]oken"

Inspired by LightOn's ColGrep — regex precision, semantically ranked — but rebuilt on our own substrate:

  • The regex stage runs on the same indexed sparse-gram engine as ss-grep (in-process, no subprocess), not a filesystem scan.
  • The ranking stage scores candidates with per-token MaxSim over pre-indexed late-interaction embeddings — no model inference over documents at query time — on our custom kernels: native Rust + Rayon takes a 231-candidate MaxSim pass from 1.26 s down to 27 ms (WASM SIMD fallback at 16×).
  • Regex tokens are merged into the semantic query, so the ranking sees both what you typed and what you matched.
  • Like ss-search, it answers with ranked, self-contained code snippets — not bare file:line — so the find and the read collapse into one tool call. In our 30-question agent-workflow eval that eliminated every follow-up read and cut tokens 25.4% vs a grep + read workflow, at quality parity (gap of 0.01 on a 5-point scale).
  • On the 60-query pattern benchmark, MaxSim ranking lifts MRR@10 to 0.45 vs 0.11 for raw grep ordering, roughly a 4× higher mean reciprocal rank.
More
  • Requires the late-interaction index (built by default; --li-model none disables pattern mode).
  • Also available as sweet-search --mode pattern and via the search MCP tool's regex argument.

4. ss-semantic — hybrid retrieval, scoped to one file

ss-semantic src/auth/session.ts "where does the cookie get its expiry?"

You know the file; this finds the lines. Every indexed chunk of the file is scored by three independent signals — BM25-style lexical term match, exact symbol-name match (weighted 1.5×), and per-token MaxSim late interaction over the LateOn-Code embeddings — fused with Reciprocal Rank Fusion (k=60), with symbol-less fragment chunks demoted 0.85× so real definitions win ties. The top spans are then re-read from disk (±2 context lines, overlapping spans merged), so the answer is filesystem ground truth even mid-edit; if the file is newer than its index entry you get an explicit staleness warning.

The useful answer: just the relevant spans with line numbers — not the whole file through your context window.
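
The span step (pad by two context lines, merge what overlaps) is simple interval merging. Here is a minimal Python sketch; the function name is hypothetical, and treating directly adjacent spans as mergeable is an assumption.

```python
def merge_spans(spans, context=2, last_line=None):
    """Pad each (start, end) line span by `context` lines, then merge overlaps."""
    padded = []
    for start, end in spans:
        lo = max(1, start - context)
        hi = end + context if last_line is None else min(last_line, end + context)
        padded.append((lo, hi))
    padded.sort()
    merged = []
    for lo, hi in padded:
        if merged and lo <= merged[-1][1] + 1:  # overlapping or directly adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# Three hits in a 42-line file: the first two collapse into one span.
result = merge_spans([(10, 12), (15, 18), (40, 41)], last_line=42)
```

Merging before the re-read means each line of the file is returned at most once, however many chunks matched around it.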

More
  • Unindexed files degrade gracefully to a plain read. Defaults: top 5 spans, relevance threshold 0.4, 8k-char cap.
  • Also available as sweet-search read-semantic and the read-semantic MCP tool.

5. ss-trace — graph algorithms, not grep guesswork

ss-trace processOrder --in src/orders/service.py

One call returns a symbol's callers, callees, and transitive impact paths from the AST-derived code graph (entities + typed calls/imports/extends/uses edges, persisted in SQLite at index time). Ranking fuses three signals:

  • Query-time Personalized PageRank via Forward Push — a local algorithm that spreads mass directionally from your target symbol and touches only the neighborhood it reaches, never the whole graph;
  • Index-time edge-weighted global PageRank (damping 0.85), precomputed into a page_rank column — a function called from five sites carries five units of mass, and it costs zero at query time;
  • Structural heuristics — relationship type, depth, exported-API status, fan-in — with penalties for test-only and external paths.

Because the graph is prebuilt, the global ranking is precomputed, and the personalized walk is local, a full three-section trace costs milliseconds. The relation word (callers / callees / impact) re-weights how the response token budget is split; --in disambiguates duplicate names; --depth bounds impact traversal (1–4).
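
To show why the personalized walk stays local, here is a minimal Python sketch of Forward Push on a toy call graph. The teleport probability, tolerance, dangling-node handling, and graph are all assumptions for illustration; the shipped ranking also fuses the other two signals listed above.

```python
from collections import deque

def forward_push(graph, source, alpha=0.15, eps=1e-4):
    """Approximate Personalized PageRank from `source`.
    graph maps a node to its out-neighbors. Only nodes that receive
    residual mass are ever touched."""
    estimate = {}              # settled PPR mass per node
    residual = {source: 1.0}   # mass still waiting to be pushed
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        r_u = residual.get(u, 0.0)
        neighbors = graph.get(u, [])
        if r_u <= eps * max(len(neighbors), 1):
            continue  # too little mass to be worth pushing
        residual[u] = 0.0
        if not neighbors:
            # Assumption: a node with no out-edges absorbs all of its residual.
            estimate[u] = estimate.get(u, 0.0) + r_u
            continue
        estimate[u] = estimate.get(u, 0.0) + alpha * r_u
        share = (1.0 - alpha) * r_u / len(neighbors)
        for v in neighbors:
            residual[v] = residual.get(v, 0.0) + share
            threshold = eps * max(len(graph.get(v, [])), 1)
            if v not in queued and residual[v] > threshold:
                queue.append(v)
                queued.add(v)
    return estimate

graph = {
    "handler": ["processOrder"],
    "processOrder": ["validate", "charge"],
    "validate": [],
    "charge": ["log"],
    "log": [],
    "unrelated": ["log"],
}
scores = forward_push(graph, "processOrder")
```

Here `handler` and `unrelated` never appear in `scores` because no mass flows to them from the target; running the same push over reversed edges would rank callers instead of callees.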

More
  • Honest caveat: call-graph extraction is precise but incomplete on highly dynamic code (bare-name dispatch, metaprogramming) — traces can be sparse there, and the agent prompt teaches a recovery strategy for exactly that case.
  • Also available as sweet-search trace and the trace MCP tool.

6. ss-read — exact bytes, with the index's knowledge attached

ss-read src/db/pool.js 120 180

A read tool that is filesystem-grounded by construction: bytes come straight from disk (never from the index, so never stale), but each indexed file arrives annotated with its cAST chunk metadata — symbol name, entity type, signature, line span — joined from the AST chunk index. The agent gets the code and the structural map of what it's looking at in one call: cite, navigate, or trace next without another search.

More
  • The CLI/MCP form scales it up: sweet-search read <file...> (and the read MCP tool) batches 1–20 files in a single call, each with the same symbol metadata — twenty files for the price of one tool invocation.

The ss-* wrappers ship in the npm package and are what the installed agent prompt drives. Every capability is equally available as sweet-search CLI subcommands and as MCP tools — see Works With Your Agent.


🧠 An Agent Prompt That Was Evolved, Not Written

Shipping six tools is easy. Getting an agent to stop grepping in circles is the hard part.

So sweet-search init installs a ~1k-token system prompt that we didn't write — we grew it. A GEPA-style loop mutated candidate prompts, scored each on a dual Pareto front (accuracy × cost) against two different production agents at once — Claude Code (Sonnet) and Codex (GPT-5.5) — kept the survivors, and repeated. A final correctness pass hardened the winner. ~1k tokens, one job: teach the agent to search well.
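
As a rough illustration of the selection step, the Python sketch below scores each candidate by its worse agent and keeps the accuracy × cost Pareto front. The `Candidate` class, the agent keys, and all numbers are hypothetical; the actual loop's data structures are not described here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: dict  # agent name -> accuracy in [0, 1]
    cost: float     # e.g. mean tokens per probe; lower is better


def joint_accuracy(candidate):
    """Maximin scoring: a prompt is only as good as its worse target."""
    return min(candidate.accuracy.values())


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both axes."""
    def dominates(a, b):
        no_worse = joint_accuracy(a) >= joint_accuracy(b) and a.cost <= b.cost
        better = joint_accuracy(a) > joint_accuracy(b) or a.cost < b.cost
        return no_worse and better

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up candidates, for illustration only.
pool = [
    Candidate("A", {"claude": 0.99, "codex": 0.90}, cost=1200),
    Candidate("B", {"claude": 0.95, "codex": 0.96}, cost=1000),
    Candidate("C", {"claude": 0.97, "codex": 0.97}, cost=1500),
    Candidate("D", {"claude": 0.93, "codex": 0.94}, cost=1100),
]
survivors = pareto_front(pool)
```

Candidate A looks strong on one agent but is dragged down by the other, which is the behavior the worst-of-two score is meant to punish.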

🎓 The five rules it encodes:

| Rule | What it kills |
| --- | --- |
| 🥇 Cheapest tool first | Got an exact symbol? One ss-grep, trust the top hit, stop. No semantic search "just to confirm." |
| 🎯 Trust the ranking | At most one narrow read to confirm; never re-run a hit that already matched. |
| 🚫 Absence is an answer | Two empty probes (one semantic, one lexical) settle a negative: no third synonym, no find/ls spiral. |
| No raw-shell escape | The #1 token-waster in our trace analysis: agents bailing to dozens of raw grep/find calls after one miss. Door closed. |
| 📝 Think before you dig | Before a third probe, the agent states what it knows and what its blind spot is. |

🧾 The receipts. Held-out discipline throughout: a dev set to iterate on, a held-out set touched only at milestones, a sealed vault opened exactly once.

| Validation gate | Result |
| --- | --- |
| 🎯 Held-out (30 probes × both agents) | joint score (worst of the two) 0.988 |
| 🌍 Out-of-distribution (8 languages never seen in the loop) | 0.952; every language ≥ 0.79, zero weak spots |
| 🛡️ Adversarial counter-probes | 1.00 / 1.00 |
| 🔀 Held-out model families (never optimized on) | MiMo 0.988 · Qwen 0.980; it generalizes, it doesn't memorize |
| 🧩 Paraphrase robustness (reword the prompt, same behavior) | correctness-weighted 0.95 / 0.93 |

🔬 How it was actually built (the honest version)
  • Seeds → survivors: 15 hand-authored seed prompts entered a reflective-evolution loop (an agent reads the real tool-call traces, proposes one targeted edit, we keep what helps). Operators included trajectory crossover, structural pivots, tool-name masking, and a pruner that fights prompt bloat.
  • Two targets, jointly: every candidate was scored on both Claude Code/Sonnet and Codex/GPT-5.5 with Maximin discipline (a prompt is only as good as its worse target), so it can't overfit one model's quirks.
  • What actually won: not clever phrasing — terseness (a shorter prompt re-sent every turn is cheaper), a leaner tool mix (grep/read over heavy semantic blocks that fatten the transcript), and decisiveness on no-match (stop spiraling). We report this plainly because it's what the traces showed.
  • The correctness pass: the shipped prompt ("M++") is the cost-winner plus 7 edits that fix factual descriptions of the tools — routing byte-identical, accuracy held, cost unchanged. A lateral move that buys honesty.
  • Held-out everything: dev to iterate, held-out checked only at milestones, a sealed vault opened once, plus held-out model families (MiMo, Qwen) and a reasoning-mode replay (MiniMax 0.963) it never trained against. Figures: docs/PHASE7.md (internal probe suites; an externally-reproducible suite is in progress).
  • Idempotent install: init writes a marker-delimited block into CLAUDE.md / AGENTS.md / GEMINI.md / .cursor/rules — re-run it freely, it never touches anything else you wrote.
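
The marker-delimited install can be pictured with a small Python sketch. The marker strings and the `upsert_block` helper are hypothetical (the real markers written by init are not shown in this README); the point is only that replacing the region between two markers makes a re-run a no-op and leaves surrounding text alone.

```python
import re

# Hypothetical markers; the strings init actually writes may differ.
BEGIN = "<!-- sweet-search:begin -->"
END = "<!-- sweet-search:end -->"


def upsert_block(text, body):
    """Insert or replace the marker-delimited block, touching nothing else."""
    block = f"{BEGIN}\n{body.strip()}\n{END}"
    pattern = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)
    if pattern.search(text):
        # A block already exists: swap its contents in place.
        return pattern.sub(lambda _match: block, text, count=1)
    # No block yet: append one, separated from existing text by a blank line.
    if not text or text.endswith("\n\n"):
        separator = ""
    elif text.endswith("\n"):
        separator = "\n"
    else:
        separator = "\n\n"
    return f"{text}{separator}{block}\n"


notes = "# My notes\n\nKeep this.\n"
once = upsert_block(notes, "evolved prompt v1")
twice = upsert_block(once, "evolved prompt v1")  # same content as `once`
```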

⚡ GPU-Accelerated Indexing, Fully Local

Chunk → enrich → embed → quantize — every step on-device and in Rust. Batches are sized to your CPU's actual cache, two open code-models do the encoding, and two separate quantizations make the index both faster to build and small enough to live in RAM. Zero API keys; nothing ever leaves the machine.

① 🧩 Structure-aware chunk
cAST over tree-sitter ASTs — whole functions, never sliced mid-body

② 🏷️ Enrich from structure
deterministic preamble from the code graph — no LLM call

③ 🤖 Embed — two models
dense CodeRankEmbed + per-token LateOn-Code

④ 🗜️ Quantize + persist
INT8 weights → 2× faster build · INT4 vectors → fits in RAM

The inference engine, picked for your silicon:

| Your hardware | What runs |
| --- | --- |
| 🍏 Apple Silicon (M1+) | candle Metal, BF16, fused SDPA attention |
| 🍏 Apple Silicon (M3+) | … plus a CoreML Neural Engine cascade: ~18% faster full index (measured, M3 Max) |
| 🟩 NVIDIA GPU (SM 7.0+) | candle CUDA; flash-attention on Ampere+ |
| 💻 No accelerator | ONNX Runtime INT8: tuned CPU path, 132 MB model, zero GPU weights downloaded |

🧩 Chunking — every chunk is whole code, never a fixed window

  • cAST structure-aware chunking over real tree-sitter ASTs: a recursive split-then-merge greedily packs sibling AST nodes up to the size cap and recurses into nodes too big to fit. So a chunk is always a function, a class, or a contiguous run of declarations — never a body cut in half, never a string split mid-literal.
  • 14 languages get true AST grammars — JS · TS · TSX · Python · Go · Rust · Java · C · C++ · Ruby · PHP · Kotlin · Swift · C# — and a 39-config regex registry carries structure-aware chunking to 70+ more extensions.
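
The recursive split-then-merge described above can be sketched in a few lines of Python. This is a simplified model, not the project's chunker: `Node` stands in for a tree-sitter node, size is counted in characters, and a parent's children are assumed to cover its text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for an AST node: its source text plus child nodes."""
    text: str
    children: list = field(default_factory=list)


def chunk_nodes(nodes, cap):
    """Greedily pack sibling nodes up to `cap` chars; recurse into oversized ones."""
    chunks, current, size = [], [], 0

    def flush():
        nonlocal current, size
        if current:
            chunks.append("\n".join(current))
            current, size = [], 0

    for node in nodes:
        length = len(node.text)
        if length > cap:
            flush()
            if node.children:
                # Too big to fit: split along its own AST children.
                chunks.extend(chunk_nodes(node.children, cap))
            else:
                # Indivisible leaf: keep it whole rather than cut mid-body.
                chunks.append(node.text)
        elif size + length > cap:
            flush()
            current, size = [node.text], length
        else:
            current.append(node.text)
            size += length
    flush()
    return chunks


big_class = Node("X" * 100, [Node("m" * 30), Node("n" * 30), Node("o" * 30)])
top_level = [Node("a" * 10), Node("b" * 20), Node("c" * 30), big_class]
chunks = chunk_nodes(top_level, cap=50)
```

Every chunk boundary falls between nodes, so no chunk ever ends in the middle of a function body.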

🏷️ Metadata — context the encoder can actually see

  • Every chunk ships its symbol name · entity type · signature · line span — the metadata that powers the code graph, ss-read annotations, and the self-contained answers everywhere else.
  • Contextual enrichment: before embedding, each chunk is prefixed with a structured preamble assembled from the AST + code graph — file path · enclosing-scope breadcrumb · name & type · merged siblings · the imports it actually uses. Both encoders see it, so a bare getId() still retrieves on the class and module around it.
  • Our nod to Anthropic's Contextual Retrieval — except they prepend an LLM-generated summary (one model call per chunk); we derive the context deterministically from structure: no LLM, no per-chunk inference, regenerated for free on every reindex. Tuned per language from GenCodeSearchNet ablations — Python stays minimal, the Java family keeps a slug-stripped path, JS/Ruby/Go/C/C++/Rust get the full preamble where closures and imports earn their keep.

🧠 Cache-aware batching — we read your CPU before we batch it

  • We detect your last-level cache at runtime, via hw.perflevel0.l2cachesize (the 16 MB P-cluster on Apple Silicon, not the smaller E-cluster), Intel L3, or /sys/.../cache on Linux, then size every embedding batch so one transformer layer's weights plus the batch's activations stay resident in cache. No spilling to main memory mid-layer; on a long-sequence tail that's the difference between B=1 and a measured 2.1× per-chunk slowdown.
  • Uses every core the hardware really has — full count on ARM/Apple Silicon; x86 SMT siblings discounted because they don't scale inference linearly.
  • ORT drives the CPU path (ONNX Runtime); GPU hosts swap in fused kernels (below). Either way inference runs off the event loop as a napi AsyncTask, so tokenization and SQLite writes overlap compute instead of stalling behind it.
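
A back-of-the-envelope version of the batch-sizing rule, in Python. The formula and every constant below (bytes per value, the activation multiplier, the 7 MiB layer size, the batch ceiling) are assumptions chosen for illustration; the real heuristic may weigh things differently.

```python
MIB = 1024 * 1024


def cache_aware_batch_size(cache_bytes, layer_weight_bytes, seq_len, hidden_dim,
                           bytes_per_value=2, activation_factor=4, max_batch=64):
    """Largest batch whose activations fit in cache next to one layer's weights.

    activation_factor is a hypothetical multiplier for the intermediate
    tensors a layer keeps alive per token (Q, K, V, MLP scratch, ...).
    """
    per_sequence = seq_len * hidden_dim * bytes_per_value * activation_factor
    budget = cache_bytes - layer_weight_bytes
    if budget < per_sequence:
        return 1  # even a single sequence spills; fall back to B=1
    return max(1, min(max_batch, budget // per_sequence))


# Hypothetical 16 MiB cache and 7 MiB of weights per layer.
short_batch = cache_aware_batch_size(16 * MIB, 7 * MIB, seq_len=128, hidden_dim=768)
long_batch = cache_aware_batch_size(16 * MIB, 7 * MIB, seq_len=2048, hidden_dim=768)
```

Under these made-up numbers, short sequences batch together while a long-sequence tail drops to a batch of one.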

🗜️ Two quantizations — one buys speed, one buys size

|  | Model weights · INT8 ORT | Index vectors · INT4 binary |
| --- | --- | --- |
| Job | build the index faster on CPU | keep the on-disk index tiny |
| Win | ~2× faster indexing · 4× smaller model (132 MB) | LI index 1.34 GiB → ~396 MiB · INT4 nibble-packing halves it again |
| Fidelity | ≥ 0.96 cosine vs FP32 | no measurable retrieval loss (A/B-tested vs INT8) |

🤖 Two models — both open, both local, both code-specialized

  • CodeRankEmbed — 768-d dense bi-encoder (137M, Apache-2.0) for first-stage recall.
  • LateOn-Code — ModernBERT per-token late interaction (149M) for the rerank.
  • Edge fallback for leaner machines: a 17M edge LateOn-Code (~9× smaller FP32 backbone) auto-selects on low-RAM hosts, and the whole CPU path runs INT8 with no GPU weights ever downloaded — full local search on a laptop with no accelerator.
What's actually custom here — the kernels we hand-wrote
  • Surgical attention swap: we vendor the upstream model implementations (NomicBERT for embeddings, ModernBERT for late interaction) and replace only the attention forward pass — an MLX-ported fused SDPA kernel on Metal, candle-flash-attn with varlen packing on CUDA Ampere+, and byte-for-byte upstream math on CPU so the fallback is provably identical.
  • A silent-NaN bug, found and fixed: Apple's Metal SDPA kernel downcasts attention masks to F16, which saturates the standard f32::MIN mask to -Inf and quietly produces NaN on padded rows — collapsing retrieval quality. We clamp the mask and serialize Metal command-buffer submissions (concurrent submission corrupts outputs on shared queues). Details in crates/sweet-search-native/src/inference/.
  • CoreML cascade: 18 pre-traced .mlpackage variants (bucketed by sequence length) dispatched to the Apple Neural Engine through an Objective-C shim; oversized batches fall through to Metal. Gated to M3+ because on M1/M2 the ANE doesn't beat its own compile overhead — we measured, so it's off there.
  • Structure-routed enrichment: the preamble (path · scope chain · symbol · siblings · imports) is assembled at index time from a code-graph line-range overlap query — never an LLM call — then routed per language family (full enriched text for JS/Ruby/Go/C-family/Rust, a slimmer path policy for Python and the Java family), every decision settled by per-language ablation rather than a global default.
  • Pipelined, crash-safe indexing: while batch N+1 embeds, batch N's vectors stream into SQLite through zero-copy buffer views; full rebuilds write to a temp file and atomically swap, so a crash never leaves you serving half an index.

🔄 An Index That Never Goes Stale

Most code indexes rot the moment you start typing. sweet-search ships a reconcile daemon that keeps every tier of the index converged with your working tree — uncommitted edits included — without you ever running a command.

  • Save → searchable at the next reconcile tick — auto-tuned per machine between 15 s and 300 s, typically 15–60 s on a warm, idle box
  • Tracks the filesystem, not git — unstaged and uncommitted changes are first-class; deleted or newly-gitignored files disappear from results automatically
  • Atomic by construction — every tick publishes all five index tiers (float HNSW, binary HNSW, late-interaction segments, sparse-gram, code graph) through a single fsync-renamed epoch manifest, so a query never sees a half-updated index
  • No-op edits cost almost nothing — content hashing collapses byte-identical rewrites and editor touch events into skipped re-encoding work
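
The "fsync-renamed manifest" idea is a standard POSIX pattern, sketched here in Python. The file name `epoch.json`, the manifest fields, and the `publish_manifest` helper are hypothetical; only the write, fsync, rename ordering is the point.

```python
import json
import os
import tempfile


def publish_manifest(directory, manifest, name="epoch.json"):
    """Atomically replace the manifest so readers see the old or new epoch, never a mix."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".epoch-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(manifest, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # data is durable before the rename
        os.replace(tmp_path, os.path.join(directory, name))  # atomic swap
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # POSIX only: fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def read_manifest(directory, name="epoch.json"):
    with open(os.path.join(directory, name), encoding="utf-8") as handle:
        return json.load(handle)
```

A query that opens the manifest at any moment gets one complete epoch, because the rename is the single point at which the new set of tiers becomes visible.
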
Deep dive
  • Baseline gate: the daemon never plays first-index-builder. It verifies a full-indexer fingerprint (epoch manifest + merkle config fingerprint + the vectors DB it names) before touching anything, and reports waiting_for_initial_index otherwise — no corrupted partial baselines.
  • One admission policy: the full indexer and the reconciler share a single createAdmissionPolicy module (include globs → deny list → .sweet-search-ignore → 1 MB size cap → batched git check-ignore), so the two paths cannot drift.
  • Orphan sweep: files that are deleted, newly excluded, or newly oversized get tombstoned across every tier; the index converges to exactly what a fresh full rebuild would produce.
  • Self-maintenance: per-tier health watermarks (tombstone fraction, stale-doc ratio, delta ratio) schedule low-priority background compaction in a separate worker — the index stays fast over months without a manual rebuild.
  • Worktree-safe: a worktree stamp plus a single-writer lockfile prevent two daemons from silently interleaving index histories across git worktrees.
  • Resource-polite: ticks are budgeted (≤50 files / ≤2 s CPU per tick), run CPU-only (the GPU is reserved for cold full indexing), and the interval auto-tunes from load average, churn, and backlog.
  • sweet-search reconcile status / reconcile inspect <path> explain exactly what the daemon thinks and why. Opt out any time with SWEET_SEARCH_RECONCILE_V2=0.

🦀 The Native Engine Room

Four Rust crates do the heavy lifting, each with a graceful fallback so the engine runs everywhere:

| Crate | What it does |
| --- | --- |
| sweet-search-native | candle GPU/CPU inference, sparse-gram grep engine, SIMD posting-list intersection, SimHash/MinHash-LSH dedup, HuggingFace tokenizers, all over zero-copy NAPI |
| wasm-maxsim | a hand-written WASM SIMD kernel computing ColBERT MaxSim in ~4 KB (~1.6 KB gzipped), with fused INT8 dequantization inside the SIMD pipeline plus a 4-bit nibble-packed path |
| wasm-router | the 498-tree CatBoost query router, loop-unrolled, zero-allocation |
| sweet-search-cli | a native CLI that talks to a warm search daemon over a per-project Unix socket, with 2.9 ms measured warm-path queries |

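
For readers new to late interaction, this is the MaxSim score those kernels compute, written as plain Python for clarity. It is a reference sketch of the ColBERT formula only; it has none of the SIMD, INT8, or nibble-packed machinery, and the toy vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token, take its best
    matching document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-d token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_covering_both = [[1.0, 0.0], [0.0, 1.0]]
doc_covering_one = [[1.0, 0.0], [1.0, 0.0]]
ranked = sorted(
    [("both", doc_covering_both), ("one", doc_covering_one)],
    key=lambda item: maxsim(query, item[1]),
    reverse=True,
)
```

A document scores well only when every query token finds a good partner, which is what makes the rerank sensitive to partial matches.
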
Deep dive
  • MaxSim, three speeds: scoring auto-selects the best available tier — native Rust + Rayon across all cores (47× vs baseline JS in our microbenchmark), portable WASM SIMD (16×), or a norm-cached pure-JS fallback (3.5×). Equivalent rankings, any platform.
  • SIMD set intersection: posting-list intersection dispatches per-pair — galloping search when one list is ≥8× smaller, 4-wide NEON/SSE2 block merges for balanced lists, scalar merge for small ones — following the Lemire/Clausecker line of work.
  • Dedup at index time: near-duplicate chunks are fingerprinted (64-bit SimHash + 128-permutation MinHash), clustered with banded LSH + union-find, then re-validated pairwise against the exemplar so transitive weak links can't glue unrelated clusters together. Duplicates skip embedding entirely — and at query time the best-matching sibling can take the exemplar's slot, so collapsing copies never hides the right answer.
  • Per-project warm daemon: the CLI derives an isolated socket path from an FNV-1a hash of the project root, auto-starts the server on first use, and falls back to pure JS where no native binary exists (measured: 2.9 ms warm / 108 ms cold / 64.7 ms JS fallback).
  • Native tokenization: the official HuggingFace tokenizers crate over NAPI — batched, cached, no Python anywhere in the stack.
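
The per-pair dispatch for posting-list intersection can be sketched in scalar Python. The 8× ratio comes from the text above; everything else (function names, the absence of SIMD block merges) is a simplification for illustration.

```python
from bisect import bisect_left


def gallop_intersect(small, large):
    """Intersect two sorted lists by galloping through the larger one."""
    out, lo, n = [], 0, len(large)
    for x in small:
        # Exponential probe: double the step until we overshoot x or the end.
        bound = 1
        while lo + bound < n and large[lo + bound] < x:
            bound *= 2
        # Binary search inside the bracket the probe established.
        idx = bisect_left(large, x, lo + bound // 2, min(lo + bound, n))
        if idx < n and large[idx] == x:
            out.append(x)
            lo = idx + 1
        else:
            lo = idx
        if lo >= n:
            break
    return out


def merge_intersect(a, b):
    """Plain two-pointer merge for lists of similar length."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect(a, b, ratio=8):
    """Dispatch per pair: gallop when one list is at least `ratio` times smaller."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if len(large) >= ratio * max(len(small), 1):
        return gallop_intersect(small, large)
    return merge_intersect(small, large)
```

Galloping costs roughly logarithmic work per element of the short list, so it wins when a rare n-gram is intersected with a very common one.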

🗜️ INT4 binary segments: the on-disk format behind the RAM-sized index

The quantization headline lives up in indexing: 1.34 GiB → ~396 MiB, INT4-halved again. Here's the SSLX segment format that delivers it, crash-safe by construction, plus the three-stage retrieval it feeds at query time.

Deep dive
  • INT4 by default: per-token min/scale quantization with nibble packing (two values per byte), A/B-tested against the INT8 baseline with no meaningful retrieval regression before becoming the default. We borrowed the rotation insight from Google's TurboQuant, but ship plain INT4 — the full TurboQuant algorithm (WHT + PolarQuant + QJL) is researched and deferred, not in the product path.
  • SSLX binary segments: the index persists as ~10k-document binary segment files with structured headers and CRC32 footers — a crash costs you at most one segment, not the index.
  • Three-stage retrieval: a binary HNSW (Hamming distance over 64-byte binarized vectors, ~32× smaller than float HNSW) produces candidates in ~100 µs, INT8 rescoring narrows them, and a float32 sidecar rescores the final pool — speed without giving up top-result quality.
  • Memory-mapped HNSW: the float graph index loads via mmap (USearch view()), contributing 0 MB to the V8 heap at search time; the OS reclaims pages under pressure.
  • Streaming indexer: vectors stream from SQLite cursors instead of materializing in arrays — peak JS heap during indexing dropped from ~785 MB to ~213 MB, with 30-second fsync-ordered checkpoints bounding crash loss. The OOM cliff that used to appear above ~200k chunks is gone; large repos index comfortably on an 8 GB machine.
  • Tuned HNSW parameters and zero-GC search internals (typed-array heaps, generation-stamped visited lists) cut search p50 by 33% while raising recall@200 by 5.9 pp in our internal evaluation (docs/HNSW_APPROACH.md).
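
Per-token min/scale quantization with nibble packing is simple enough to show end to end. The Python sketch below is a generic version of the technique, not the SSLX encoder: function names and the exact rounding and padding choices are assumptions.

```python
def quantize_int4(vector):
    """Map floats to 4-bit codes (0..15) and pack two codes per byte.

    Returns (packed_bytes, minimum, scale) so the vector can be rebuilt.
    """
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 15 or 1.0  # constant vectors get a dummy scale
    codes = [min(15, max(0, round((v - lo) / scale))) for v in vector]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale


def dequantize_int4(packed, lo, scale, length):
    """Unpack the nibbles and map codes back to approximate floats."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return [lo + code * scale for code in codes[:length]]


token_vector = [-1.0, -0.5, 0.0, 0.5, 1.0]
packed, lo, scale = quantize_int4(token_vector)
restored = dequantize_int4(packed, lo, scale, len(token_vector))
```

Each value costs half a byte plus a shared minimum and scale per token, and the reconstruction error is bounded by half a quantization step.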

🔌 Works With Your Agent

sweet-search meets your agent wherever it is — shell tools, MCP, or injected instructions:

.mcp.json (project root) is the whole integration; or just run sweet-search init --mcp:

```json
{
  "mcpServers": {
    "sweet-search": {
      "command": "npx",
      "args": ["-y", "sweet-search-mcp", "--project-root", "/absolute/path/to/your/repo"]
    }
  }
}
```

  • MCP server — 8 tools (search, trace, read, read-semantic, index, health, repo-map, vocab-prewarm), 2 resources, 2 prompts; all search tools declared read-only and idempotent
  • Harness injection: init writes the evolved system prompt into Claude Code, Codex (--codex, including session hooks), Gemini CLI (--gemini), and Cursor (--cursor) from one canonical source
  • Repo maps for sub-agents — the repo-map tool returns a PageRank-ranked symbol overview squeezed into any token budget, perfect for briefing a delegated agent
  • Warm from the first query — a SessionStart hook pre-launches the search daemon so models, vocabulary, and indexes are loaded before you ask anything
Deep dive
  • Tool routing enforcement (opt-in): init --enforce-tools denies the native Grep tool in Claude Code and installs a hint hook nudging native Read toward ss-read/ss-semantic — for when you want the discipline guaranteed, not suggested.
  • /sweet-index skill: a Claude Code slash command for a full GPU-aware reindex, installed by init.
  • Vocabulary prewarm: sweet-search prewarm-vocab mines your repo's real identifiers, detects code communities (Leiden), and pre-warms all three search modes so even the first semantic query of a session is cache-warm.
  • Honest committed-state: init never writes machine-specific absolute paths into committed settings files, and all instruction injection is marker-delimited and reversible.

🖥️ Platform Support

| Platform | Engine | Acceleration |
| --- | --- | --- |
| macOS arm64 (Apple Silicon) | native | Metal (M1+) · CoreML Neural Engine (M3+) |
| macOS x64 (Intel) | native | ONNX Runtime INT8 CPU |
| Linux x64 (glibc) | native | CUDA (SM 7.0+, flash-attn on Ampere+) or INT8 CPU |
| Linux arm64 (glibc) | native | CUDA (Jetson Orin / Grace) or INT8 CPU |
| Windows | via WSL2 | (= Linux x64) |
| Everything else | WASM/JS fallback | runs everywhere Node ≥ 18 runs |

Native binaries are selected automatically at npm install time via optionalDependencies — no flags, no postinstall scripts to debug. Every native fast path has a WASM or JS fallback that produces the same results.

🙏 Prior Art & Acknowledgements

sweet-search stands on a lot of shoulders, and we'd rather name them than pretend otherwise:

  • ColBERT (Khattab & Zaharia) — late interaction; LightOn for the LateOn-Code models and the ColGrep concept our pattern mode parallels
  • ripgrep (BurntSushi) — the bar for grep, and our verification baseline
  • GitHub's Blackbird — the sparse n-gram indexing idea we tuned per-codebase
  • candle & MLX — Rust ML and the fused SDPA kernels we build on; HuggingFace tokenizers
  • Aider — the repo-map idea, here rebuilt on a real knowledge graph
  • USearch — memory-mapped HNSW; Malkov & Yashunin for HNSW itself
  • CatBoost — the query router model; Traag et al. for the Leiden algorithm; Cormack et al. for RRF; PathRAG for flow-pruned graph expansion; cAST for structure-aware chunking
  • GEPA — the reflective evolutionary prompt-optimization paradigm behind our agent prompt
  • nomic-ai — the CodeRankEmbed embedding model
  • Anthropic — the Contextual Retrieval idea behind our chunk enrichment, here derived from code structure instead of an LLM summary

📄 License

Apache-2.0 © PanonIT


Found it useful?

If sweet-search saves your agent's tokens, a ⭐ helps other agents' humans find it.

Star sweet-search on GitHub


