Weight recall exact-match ranking by term rarity and slug match #49

Merged
Kashkovsky merged 2 commits into main from threadnote-recall-evaluation
Jul 1, 2026
@Kashkovsky


Summary

Recall ranking was dominated by lexical term count, not relevance. The v1.2.0 exact-match boost (#28) fixed "burial" but over-corrected into flooding: applyExactMatchBoost sorted by category → exactTerms.length → score and promoted exact-only docs at score 0, with no term weighting. A query's common words (background, rollout, engineer, spec, ...) matched dozens of memories, which then led the results, and an incidental body mention (a branch name in a CI note, "spec" inside "the author's spec doc") counted as a full topic match. As a result, a desktop-layout PR review outranked the memory whose topic is the query.

This reworks the exact-match signal to be weighted, filters tooling noise, and makes low-confidence results legible. All changes are in src/utils.ts; both the CLI (runRecall) and MCP (runRecallTool) go through the shared buildRecallSections, so no caller changes are needed.

Detail

applyExactMatchBoost now ranks each category by exactMatchStrength + score:

  • exactMatchStrength = Σ over matched terms of (1/df) × (RECALL_EXACT_SLUG_BONUS if the term is in the doc slug else 1).
  • df (document frequency) is computed from the exact-match set itself (exactTermDocumentFrequency) — a self-contained inverse-frequency signal, no OpenViking engine change. A term matching many docs (common) → ~0; a term matching one (sharding) → full weight.
  • Slug match = the term names the memory topic / resource filename (mobile-observability-alerting-spec), a title-level signal, vs an incidental body mention.
  • Blending with score keeps a distinctive exact match leading (strength ≥ 1 beats any score ≤ 1 — preserves the intended high-precision promotion and every existing test) while a common-word-only promotion (strength ≈ 0.2) sinks below a genuine 0.6 semantic hit instead of flooding.
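
The weighting described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/utils.ts: the `RecallHit` shape, the bonus value of 2, the `termInSlug` helper, and the `rankByStrengthPlusScore` wrapper are all assumptions made for the example, and the per-category grouping is left out.

```typescript
// Hypothetical hit shape; the real type in src/utils.ts may differ.
interface RecallHit {
  uri: string;
  slug: string;         // e.g. "mobile-observability-alerting-spec"
  score: number;        // semantic score in [0, 1]; 0 for exact-only promotions
  exactTerms: string[]; // query terms that the exact (grep) pass matched
}

// Assumed value for illustration; the PR does not state the real constant.
const RECALL_EXACT_SLUG_BONUS = 2;

// df: in how many docs of the exact-match set does each term appear?
function exactTermDocumentFrequency(hits: RecallHit[]): Map<string, number> {
  const df = new Map<string, number>();
  for (const hit of hits) {
    for (const term of new Set(hit.exactTerms)) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }
  return df;
}

// Simplification: a term counts as a slug match when it equals one
// hyphen-separated token of the slug.
function termInSlug(slug: string, term: string): boolean {
  return slug.toLowerCase().split("-").includes(term.toLowerCase());
}

// Sum over matched terms of (1 / df) * (slug bonus or 1).
function exactMatchStrength(hit: RecallHit, df: Map<string, number>): number {
  let strength = 0;
  for (const term of new Set(hit.exactTerms)) {
    const frequency = df.get(term) ?? 1; // defensive fallback
    const bonus = termInSlug(hit.slug, term) ? RECALL_EXACT_SLUG_BONUS : 1;
    strength += (1 / frequency) * bonus;
  }
  return strength;
}

// Rank by exactMatchStrength + score, highest first (category grouping omitted).
function rankByStrengthPlusScore(hits: RecallHit[]): RecallHit[] {
  const df = exactTermDocumentFrequency(hits);
  const relevance = (hit: RecallHit) => exactMatchStrength(hit, df) + hit.score;
  return [...hits].sort((a, b) => relevance(b) - relevance(a));
}
```

Under these assumptions, a term shared by five documents contributes 0.2 per document, so a score-0 hit that matched only that term ranks below a 0.6 semantic hit, while a hit whose slug contains a term unique to it reaches a strength of 2 and leads.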

Other changes:

  • isAgentArtifactPackUri drops .../agent-artifacts/packs/** (review-pack manifests that list many review dimensions and lexically match almost any query) from both parseRecallHits and grepUrisFromJson, like summary sidecars. categoryForUri routes any /agent-artifacts/ URI to the skills band, out of the leading memory band.
  • renderRecallHits labels score-0 promotions keyword-only: and keeps exact: for semantic hits that a query term also matched, so an agent can tell "contains the word" from "semantically relevant".
  • When the whole shown window is score 0 (no semantic pass matched above threshold), the list leads with RECALL_LOW_CONFIDENCE_NOTE. Trigger is conservative (all-zero only) to avoid false "nothing found".
  • Expanded the exactRecallTerms stopword set with pure function words (when, what, which, from, into, have, does, ...); df-weighting is the primary fix.
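
As a rough illustration of the filtering, labelling, and low-confidence behaviour listed above, the sketch below combines them in one rendering function. The `ShownHit` shape, the note wording, the output line format, and the names `exactLabel` and `renderRecallHitsSketch` are hypothetical; only `isAgentArtifactPackUri`, `RECALL_LOW_CONFIDENCE_NOTE`, and the `keyword-only:` / `exact:` labels come from the PR, and their real implementations may differ.

```typescript
// Hypothetical shape of a hit that is about to be rendered.
interface ShownHit {
  uri: string;
  score: number;        // 0 means the hit was promoted by the exact pass only
  exactTerms: string[]; // query terms matched by the exact pass
}

// Assumed wording; the PR does not quote the real note text.
const RECALL_LOW_CONFIDENCE_NOTE =
  "Note: no semantic matches above threshold; results below are keyword-only.";

// Assumed pattern for ".../agent-artifacts/packs/**".
function isAgentArtifactPackUri(uri: string): boolean {
  return uri.includes("/agent-artifacts/packs/");
}

// "keyword-only:" for score-0 promotions, "exact:" when a semantic hit
// also matched a term; no label when the exact pass matched nothing.
function exactLabel(hit: ShownHit): string {
  if (hit.exactTerms.length === 0) return "";
  const prefix = hit.score === 0 ? "keyword-only" : "exact";
  return ` [${prefix}: ${hit.exactTerms.join(", ")}]`;
}

function renderRecallHitsSketch(hits: ShownHit[]): string[] {
  const shown = hits.filter((hit) => !isAgentArtifactPackUri(hit.uri));
  const lines = shown.map(
    (hit) => `${hit.uri} (score ${hit.score.toFixed(2)})${exactLabel(hit)}`,
  );
  // Conservative trigger: only when every shown hit has score 0.
  // The length check keeps an empty result from getting the note.
  const allZero = shown.length > 0 && shown.every((hit) => hit.score === 0);
  return allZero ? [RECALL_LOW_CONFIDENCE_NOTE, ...lines] : lines;
}
```

The explicit length check matters because `Array.prototype.every` returns true for an empty array, which would otherwise attach the note to an empty result.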

Ranking before → after, verified on the live corpus via the built CLI:

| Query | Memory | Before | After |
| --- | --- | --- | --- |
| mobile observability alerting spec | mobile-observability-alerting-spec.md (the bullseye) | #4 | #2 |
| | pr-152757-763-desktop-layout-review (incidental "observability/alerting/spec") | #2 | #9 |
| | docs-desktop-pagerduty-onboarding (4 common terms, none in slug) | #1 | #7 |
| iOS CI test sharding to speed up builds | speed-up-test-ios-ci / ios-ci-test-sharding | #1 / #2 | #1 / #2 (no regression; distinctive terms still win) |

Note: the residual mis-rank on ios ... observability and alerting (where native-doc-sidebar-web leads at a genuine semantic 0.61) is OpenViking embedding quality, not ranking — out of scope here.

Test plan

npm run test        # 302 pass (24 files); added coverage for IDF weighting,
                    # slug boost, common-word-only demotion, pack filtering,
                    # keyword-only label, low-confidence note
npm run typecheck   # clean (both root + test/tsconfig.json)
npm run lint        # clean

E2E: node ./bin/threadnote.cjs recall --query "..." against the live corpus produced the before/after table above.

applyExactMatchBoost promoted exact (grep) matches by raw term count, so a
query's common words (background, rollout, engineer, spec, ...) flooded the top
and incidental body mentions outranked memories whose topic named the query.
Rank instead by exactMatchStrength + score: each matched term is weighted by
inverse document frequency (computed from the match set — no engine change) and
multiplied when it appears in the document slug. A distinctive/title match still
leads its category; a common-word-only promotion no longer outranks a genuine
semantic hit.

Also drop agent-artifacts/packs/** (review-pack manifests that lexically match
almost any query) from recall like summary sidecars, route agent-artifacts out
of the leading memory band, label score-0 promotions "keyword-only" (vs
"exact"), and lead an all-keyword window with a low-confidence note.
…lter

Address dev-cycle review of the exact-match weighting:
- slug bonus matches whole tokens on word boundaries, so "spec" no longer
  boosts "design-respec-notes"; hyphenated terms (valencia-v1) still match.
- extract isExcludedRecallUri, shared by the semantic (parseRecallHits) and
  exact (grepUrisFromJson) passes so their exclusion set cannot drift.
- key the blended-relevance pre-pass by stripAnchor, matching the termsByUri
  convention; note the ?? 1 df fallback is defensive-only.
- add tests: expanded stop-word list, slug substring-collision, empty-result
  guard, and positive assertions on low-confidence-note suppression.
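
One way to get the whole-token slug match described in the first item is a boundary-aware regular expression. This is a sketch only: the helper names `escapeRegExp` and `slugContainsWholeTerm` are hypothetical and the actual implementation may use a different approach.

```typescript
// Escape regex metacharacters so a query term is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when `term` appears in `slug` bounded on both sides by the slug's
// start/end or by a non-alphanumeric character (such as "-").
function slugContainsWholeTerm(slug: string, term: string): boolean {
  if (term.length === 0) return false;
  const pattern = new RegExp(
    `(^|[^a-z0-9])${escapeRegExp(term.toLowerCase())}($|[^a-z0-9])`,
  );
  return pattern.test(slug.toLowerCase());
}
```

With this check, "spec" matches "mobile-observability-alerting-spec" but not "design-respec-notes", and a hyphenated term such as "valencia-v1" still matches a slug that contains it as consecutive tokens.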
@Kashkovsky Kashkovsky merged commit e884ca9 into main Jul 1, 2026
7 checks passed