Weight recall exact-match ranking by term rarity and slug match by Kashkovsky · Pull Request #49 · Kashkovsky/threadnote

Kashkovsky · 2026-07-01T11:08:44Z

Summary

Recall ranking was dominated by lexical term count, not relevance. The v1.2.0 exact-match boost (#28) fixed "burial" but over-corrected into flooding: applyExactMatchBoost sorted by category → exactTerms.length → score and promoted exact-only docs at score 0, with no term weighting. A query's common words (background, rollout, engineer, spec, ...) matched dozens of memories, and those memories led the results. An incidental body mention (a branch name in a CI note, "spec" inside "the author's spec doc") counted as a full topic match, so a desktop-layout PR review outranked the memory whose topic is the query.

This reworks the exact-match signal to be weighted, filters tooling noise, and makes low-confidence results legible. All changes are in src/utils.ts; both the CLI (runRecall) and MCP (runRecallTool) go through the shared buildRecallSections, so no caller changes are needed.

Detail

applyExactMatchBoost now ranks each category by exactMatchStrength + score:

exactMatchStrength = Σ over matched terms of (1/df) × (RECALL_EXACT_SLUG_BONUS if the term is in the doc slug else 1).
df (document frequency) is computed from the exact-match set itself (exactTermDocumentFrequency) — a self-contained inverse-frequency signal, no OpenViking engine change. A term matching many docs (common) → ~0; a term matching one (sharding) → full weight.
Slug match = the term names the memory topic / resource filename (mobile-observability-alerting-spec), a title-level signal, vs an incidental body mention.
Blending with score keeps a distinctive exact match leading (strength ≥ 1 beats any score ≤ 1 — preserves the intended high-precision promotion and every existing test) while a common-word-only promotion (strength ≈ 0.2) sinks below a genuine 0.6 semantic hit instead of flooding.
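
The weighting described above can be sketched roughly as follows. This is an illustrative approximation, not the code in src/utils.ts: the ExactHit shape, the uniqueTerms helper, the value 3 for RECALL_EXACT_SLUG_BONUS, and the simplified hyphen-split slug check are all assumptions made for the example.

```typescript
// Hypothetical hit shape; the real types in src/utils.ts may differ.
interface ExactHit {
  uri: string;
  slug: string;          // memory topic / resource filename without extension
  score: number;         // semantic score; 0 for an exact-only promotion
  exactTerms: string[];  // query terms the grep pass found in this doc
}

// Assumed value for illustration; the PR does not state the real constant.
const RECALL_EXACT_SLUG_BONUS = 3;

// Keep the first occurrence of each term so a doc counts once per term.
function uniqueTerms(terms: string[]): string[] {
  return terms.filter((term, index) => terms.indexOf(term) === index);
}

// Document frequency per term, counted over the exact-match set itself.
function exactTermDocumentFrequency(hits: ExactHit[]): Record<string, number> {
  const df: Record<string, number> = Object.create(null);
  for (const hit of hits) {
    for (const term of uniqueTerms(hit.exactTerms)) {
      df[term] = (df[term] ?? 0) + 1;
    }
  }
  return df;
}

// Sum of (1/df) per matched term, multiplied by the bonus when the term is a slug token.
function exactMatchStrength(hit: ExactHit, df: Record<string, number>): number {
  const slugTokens = hit.slug.toLowerCase().split("-"); // simplified token check
  let strength = 0;
  for (const term of uniqueTerms(hit.exactTerms)) {
    const weight = 1 / (df[term] ?? 1); // defensive fallback if a term is missing
    const inSlug = slugTokens.indexOf(term.toLowerCase()) !== -1;
    strength += weight * (inSlug ? RECALL_EXACT_SLUG_BONUS : 1);
  }
  return strength;
}

// Rank one category by exactMatchStrength + score, highest first.
function rankByBlendedRelevance(hits: ExactHit[]): ExactHit[] {
  const df = exactTermDocumentFrequency(hits);
  return hits.slice().sort(
    (a, b) => exactMatchStrength(b, df) + b.score - (exactMatchStrength(a, df) + a.score),
  );
}
```

Under these assumptions, a term shared by five docs contributes 0.2 to each of them, so a common-word-only promotion at score 0 ranks below a 0.6 semantic hit, while a single-doc term that also appears in the slug contributes the full bonus and leads.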

Other changes:

isAgentArtifactPackUri drops .../agent-artifacts/packs/** (review-pack manifests that list many review dimensions and lexically match almost any query) from both parseRecallHits and grepUrisFromJson, like summary sidecars. categoryForUri routes any /agent-artifacts/ URI to the skills band, out of the leading memory band.
renderRecallHits labels score-0 promotions "keyword-only:" (vs "exact:" for a semantic hit that a term also matched), so an agent can tell "contains the word" from "semantically relevant".
When the whole shown window is score 0 (no semantic pass matched above threshold), the list leads with RECALL_LOW_CONFIDENCE_NOTE. Trigger is conservative (all-zero only) to avoid false "nothing found".
Expanded the exactRecallTerms stopword set with pure function words (when, what, which, from, into, have, does, ...); df-weighting is the primary fix.
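
The labeling and low-confidence behaviour listed above might look roughly like this. It is a sketch only: the ShownHit shape, the output line format, the function names exactLabelSketch and renderHitsSketch, and the wording of the note are hypothetical, since the PR does not show the real renderRecallHits output or the text of RECALL_LOW_CONFIDENCE_NOTE.

```typescript
// Hypothetical shape of a hit in the shown window.
interface ShownHit {
  uri: string;
  score: number;         // 0 means the hit was promoted by the exact pass only
  exactTerms: string[];
}

// Placeholder wording; the real RECALL_LOW_CONFIDENCE_NOTE text is not given in the PR.
const LOW_CONFIDENCE_NOTE_SKETCH =
  "Low confidence: keyword matches only, no semantic match above threshold.";

// "keyword-only" for score-0 promotions, "exact" when a semantic hit also matched a term.
function exactLabelSketch(hit: ShownHit): string {
  if (hit.exactTerms.length === 0) return "";
  const kind = hit.score === 0 ? "keyword-only" : "exact";
  return kind + ": " + hit.exactTerms.join(", ");
}

function renderHitsSketch(hits: ShownHit[]): string[] {
  const lines: string[] = [];
  // Conservative trigger: the note appears only when every shown hit has score 0.
  const allZero = hits.length > 0 && hits.every((hit) => hit.score === 0);
  if (allZero) lines.push(LOW_CONFIDENCE_NOTE_SKETCH);
  for (const hit of hits) {
    const label = exactLabelSketch(hit);
    lines.push("- " + hit.uri + " [" + hit.score.toFixed(2) + "]" + (label ? " (" + label + ")" : ""));
  }
  return lines;
}
```

Keeping the trigger at all-zero means a window with even one genuine semantic hit renders without the note, which matches the stated goal of avoiding a false "nothing found".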

Ranking before → after, verified on the live corpus via the built CLI:

| Query | Memory | Before | After |
| --- | --- | --- | --- |
| `mobile observability alerting spec` | `mobile-observability-alerting-spec.md` (the bullseye) | #4 | #2 |
| | `pr-152757-763-desktop-layout-review` (incidental "observability/alerting/spec") | #2 | #9 |
| | `docs-desktop-pagerduty-onboarding` (4 common terms, none in slug) | #1 | #7 |
| `iOS CI test sharding to speed up builds` | `speed-up-test-ios-ci` / `ios-ci-test-sharding` | #1 / #2 | #1 / #2 (no regression; distinctive terms still win) |

Note: the residual mis-rank on ios ... observability and alerting (where native-doc-sidebar-web leads at a genuine semantic 0.61) stems from OpenViking embedding quality, not ranking, and is out of scope here.

Test plan

npm run test        # 302 pass (24 files); added coverage for IDF weighting,
                    # slug boost, common-word-only demotion, pack filtering,
                    # keyword-only label, low-confidence note
npm run typecheck   # clean (both root + test/tsconfig.json)
npm run lint        # clean

E2E: node ./bin/threadnote.cjs recall --query "..." against the live corpus produced the before/after table above.

applyExactMatchBoost promoted exact (grep) matches by raw term count, so a query's common words (background, rollout, engineer, spec, ...) flooded the top and incidental body mentions outranked memories whose topic named the query. Rank instead by exactMatchStrength + score: each matched term is weighted by inverse document frequency (computed from the match set — no engine change) and multiplied when it appears in the document slug. A distinctive/title match still leads its category; a common-word-only promotion no longer outranks a genuine semantic hit. Also drop agent-artifacts/packs/** (review-pack manifests that lexically match almost any query) from recall like summary sidecars, route agent-artifacts out of the leading memory band, label score-0 promotions "keyword-only" (vs "exact"), and lead an all-keyword window with a low-confidence note.
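
As a rough illustration of the pack filtering and band routing mentioned in this commit, the sketch below uses simple substring checks. The function names carry a Sketch suffix because the real isAgentArtifactPackUri and categoryForUri may be implemented differently; the two-value band type, the default to "memory", and the store:// URIs in the usage are hypothetical.

```typescript
// Only the two bands named in the PR; the real category set is likely larger.
type RecallBand = "memory" | "skills";

// True for anything under .../agent-artifacts/packs/ (review-pack manifests).
function isAgentArtifactPackUriSketch(uri: string): boolean {
  return uri.indexOf("/agent-artifacts/packs/") !== -1;
}

// Any /agent-artifacts/ URI goes to the skills band; everything else defaults
// to memory here (an assumption made to keep the sketch small).
function categoryForUriSketch(uri: string): RecallBand {
  return uri.indexOf("/agent-artifacts/") !== -1 ? "skills" : "memory";
}

// Applied to both the semantic and the exact pass so packs never reach ranking.
function dropPackUris(uris: string[]): string[] {
  return uris.filter((uri) => !isAgentArtifactPackUriSketch(uri));
}
```

With these checks, a pack manifest is removed before ranking, while a non-pack artifact stays in the results but no longer competes in the leading memory band.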

…lter

Address dev-cycle review of the exact-match weighting:

- slug bonus matches whole tokens on word boundaries, so "spec" no longer boosts "design-respec-notes"; hyphenated terms (valencia-v1) still match.
- extract isExcludedRecallUri, shared by the semantic (parseRecallHits) and exact (grepUrisFromJson) passes so their exclusion set cannot drift.
- key the blended-relevance pre-pass by stripAnchor, matching the termsByUri convention; note the ?? 1 df fallback is defensive-only.
- add tests: expanded stop-word list, slug substring-collision, empty-result guard, and positive assertions on low-confidence-note suppression.
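
One way to get the whole-token slug match described in this commit is a boundary-anchored regular expression. The sketch below is an assumption about the approach, not the merged code: termMatchesSlug and escapeRegExp are hypothetical names, and a boundary is taken to be the start or end of the slug or any non-alphanumeric character.

```typescript
// Escape regex metacharacters so a query term is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when the term appears in the slug delimited by boundaries on both sides.
// Hyphens inside the term are matched literally, so "valencia-v1" still works.
function termMatchesSlug(term: string, slug: string): boolean {
  if (term.length === 0) return false;
  const pattern = new RegExp(
    "(^|[^a-z0-9])" + escapeRegExp(term.toLowerCase()) + "([^a-z0-9]|$)",
  );
  return pattern.test(slug.toLowerCase());
}
```

A plain substring check would accept "spec" inside "respec"; requiring a boundary on both sides rejects that collision while still accepting multi-token terms.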

Kashkovsky added 2 commits July 1, 2026 13:07

Kashkovsky merged commit e884ca9 into main Jul 1, 2026
7 checks passed

Kashkovsky merged 2 commits into main from threadnote-recall-evaluation
