Weight recall exact-match ranking by term rarity and slug match #49
Merged
applyExactMatchBoost promoted exact (grep) matches by raw term count, so a query's common words (background, rollout, engineer, spec, ...) flooded the top and incidental body mentions outranked memories whose topic named the query. Rank instead by exactMatchStrength + score: each matched term is weighted by inverse document frequency (computed from the match set — no engine change) and multiplied when it appears in the document slug. A distinctive/title match still leads its category; a common-word-only promotion no longer outranks a genuine semantic hit. Also drop agent-artifacts/packs/** (review-pack manifests that lexically match almost any query) from recall like summary sidecars, route agent-artifacts out of the leading memory band, label score-0 promotions "keyword-only" (vs "exact"), and lead an all-keyword window with a low-confidence note.
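The labelling and low-confidence behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `RecallHit` shape, the `exactLabel` and `renderWindow` helpers, the output line format, and the wording of the note are all assumptions made for the example. Only the label names and the `RECALL_LOW_CONFIDENCE_NOTE` constant name come from the PR.

```typescript
// Hypothetical hit shape; the real type in src/utils.ts is not shown in this PR.
interface RecallHit {
  uri: string;
  score: number;        // semantic score; 0 means promoted by grep alone
  exactTerms: string[]; // query terms that matched the document text
}

// Assumed wording; the PR names the constant but does not quote its text.
const RECALL_LOW_CONFIDENCE_NOTE =
  "Low confidence: every result below matched on keywords only.";

// "keyword-only" for score-0 promotions, "exact" when a semantic hit also
// matched a term, and no label when nothing matched lexically.
function exactLabel(hit: RecallHit): "keyword-only" | "exact" | null {
  if (hit.exactTerms.length === 0) return null;
  return hit.score === 0 ? "keyword-only" : "exact";
}

// Hypothetical renderer: one line per hit, with the note leading the window
// only when the window is non-empty and every hit has score 0.
function renderWindow(hits: RecallHit[]): string[] {
  const lines: string[] = [];
  if (hits.length > 0 && hits.every((h) => h.score === 0)) {
    lines.push(RECALL_LOW_CONFIDENCE_NOTE);
  }
  for (const hit of hits) {
    const label = exactLabel(hit);
    const suffix = label ? ` [${label}: ${hit.exactTerms.join(", ")}]` : "";
    lines.push(`${hit.score.toFixed(2)} ${hit.uri}${suffix}`);
  }
  return lines;
}
```

The empty-window guard matters in this sketch because `every` returns true for an empty array, which would otherwise emit the note when there are no results at all.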
…lter
Address dev-cycle review of the exact-match weighting:
- slug bonus matches whole tokens on word boundaries, so "spec" no longer boosts "design-respec-notes"; hyphenated terms (valencia-v1) still match.
- extract isExcludedRecallUri, shared by the semantic (parseRecallHits) and exact (grepUrisFromJson) passes so their exclusion set cannot drift.
- key the blended-relevance pre-pass by stripAnchor, matching the termsByUri convention; note the ?? 1 df fallback is defensive-only.
- add tests: expanded stop-word list, slug substring-collision, empty-result guard, and positive assertions on low-confidence-note suppression.
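One way to get the whole-token slug match described in this commit is to compare token sequences rather than substrings. The sketch below is an assumption about the approach, not the PR's implementation: `slugContainsTerm` is a hypothetical helper name, and the slug `valencia-v1-rollout` in the usage note is invented for illustration.

```typescript
// Split on anything that is not a letter or digit, so "valencia-v1" becomes
// ["valencia", "v1"] and "design-respec-notes" becomes
// ["design", "respec", "notes"].
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Hypothetical helper: true when the term's tokens appear in the slug as a
// contiguous run of whole tokens. Substrings inside a token never match.
function slugContainsTerm(slug: string, term: string): boolean {
  const slugTokens = tokenize(slug);
  const termTokens = tokenize(term);
  if (termTokens.length === 0) return false;
  for (let start = 0; start + termTokens.length <= slugTokens.length; start++) {
    if (termTokens.every((token, offset) => slugTokens[start + offset] === token)) {
      return true;
    }
  }
  return false;
}
```

Under these assumptions, `slugContainsTerm("design-respec-notes", "spec")` is false, while `slugContainsTerm("mobile-observability-alerting-spec", "spec")` and `slugContainsTerm("valencia-v1-rollout", "valencia-v1")` are both true.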
Summary
Recall ranking was dominated by lexical term count, not relevance. The v1.2.0 exact-match boost (#28) fixed "burial" but over-corrected into flooding:
`applyExactMatchBoost` sorted by category → `exactTerms.length` → `score` and promoted exact-only docs at score 0, with no term weighting. A query's common words (`background`, `rollout`, `engineer`, `spec`, ...) matched dozens of memories and led, and an incidental body mention (a branch name in a CI note, "spec" inside "the author's spec doc") counted as a full topic match, so a desktop-layout PR review outranked the memory whose topic is the query.
This reworks the exact-match signal to be weighted, filters tooling noise, and makes low-confidence results legible. All in `src/utils.ts`; both the CLI (`runRecall`) and MCP (`runRecallTool`) go through the shared `buildRecallSections`, so no caller changes.
Detail
`applyExactMatchBoost` now ranks each category by `exactMatchStrength + score`:
- `exactMatchStrength` = Σ over matched terms of (1/df) × (`RECALL_EXACT_SLUG_BONUS` if the term is in the doc slug, else 1).
- `df` (document frequency) is computed from the exact-match set itself (`exactTermDocumentFrequency`): a self-contained inverse-frequency signal, no OpenViking engine change. A term matching many docs (common) → ~0; a term matching one (`sharding`) → full weight.
- The slug bonus applies when a term appears in the doc slug (e.g. `mobile-observability-alerting-spec`), a title-level signal, vs an incidental body mention.
- Adding `score` keeps a distinctive exact match leading (strength ≥ 1 beats any score ≤ 1, which preserves the intended high-precision promotion and every existing test) while a common-word-only promotion (strength ≈ 0.2) sinks below a genuine 0.6 semantic hit instead of flooding.
Other changes:
- `isAgentArtifactPackUri` drops `.../agent-artifacts/packs/**` (review-pack manifests that list many review dimensions and lexically match almost any query) from both `parseRecallHits` and `grepUrisFromJson`, like summary sidecars.
- `categoryForUri` routes any `/agent-artifacts/` URI to the skills band, out of the leading memory band.
- `renderRecallHits` labels score-0 promotions `keyword-only:` (vs `exact:` for a semantic hit a term also matched), so an agent can tell "contains the word" from "semantically relevant".
- An all-keyword-only window leads with `RECALL_LOW_CONFIDENCE_NOTE`. The trigger is conservative (all-zero only) to avoid a false "nothing found".
- Expanded the `exactRecallTerms` stopword set with pure function words (`when`, `what`, `which`, `from`, `into`, `have`, `does`, ...); df-weighting is the primary fix.
Ranking before → after, verified on the live corpus via the built CLI:
- Query `mobile observability alerting spec`: `mobile-observability-alerting-spec.md` (the bullseye); `pr-152757-763-desktop-layout-review` (incidental "observability/alerting/spec"); `docs-desktop-pagerduty-onboarding` (4 common terms, none in slug).
- Query `iOS CI test sharding to speed up builds`: `speed-up-test-ios-ci` / `ios-ci-test-sharding`.
Note: the residual mis-rank on `ios ... observability and alerting` (where `native-doc-sidebar-web` leads at a genuine semantic 0.61) is OpenViking embedding quality, not ranking, and is out of scope here.
Test plan
E2E: `node ./bin/threadnote.cjs recall --query "..."` against the live corpus produced the before/after comparison above.
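For readers who want to see the Detail formula in motion, here is a minimal sketch of the `exactMatchStrength + score` ordering. It is an illustration under stated assumptions rather than the code in `src/utils.ts`: the `Hit` shape is hypothetical, the value `2` for `RECALL_EXACT_SLUG_BONUS` is a guess (the PR names the constant but not its value), the slug check is simplified to a single-token lookup, and the per-category grouping is left out.

```typescript
// Hypothetical hit shape; the real types in src/utils.ts are not shown here.
interface Hit {
  uri: string;
  slug: string;         // e.g. "mobile-observability-alerting-spec"
  score: number;        // semantic score; 0 for exact-only promotions
  exactTerms: string[]; // query terms that grep-matched this document
}

// Assumed value for illustration only.
const RECALL_EXACT_SLUG_BONUS = 2;

// Document frequency per term, counted over the exact-match set itself.
function exactTermDocumentFrequency(hits: Hit[]): Map<string, number> {
  const df = new Map<string, number>();
  for (const hit of hits) {
    for (const term of new Set(hit.exactTerms)) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }
  return df;
}

// Sum over matched terms of (1/df) x (slug bonus if the term is a slug token).
function exactMatchStrength(hit: Hit, df: Map<string, number>): number {
  const slugTokens = hit.slug.toLowerCase().split("-");
  let strength = 0;
  for (const term of new Set(hit.exactTerms)) {
    const inverseFrequency = 1 / (df.get(term) ?? 1); // fallback is defensive
    const inSlug = slugTokens.indexOf(term.toLowerCase()) !== -1;
    strength += inverseFrequency * (inSlug ? RECALL_EXACT_SLUG_BONUS : 1);
  }
  return strength;
}

// Order hits by strength + score, highest first (categories ignored here).
function rankByStrengthPlusScore(hits: Hit[]): Hit[] {
  const df = exactTermDocumentFrequency(hits);
  const key = (hit: Hit) => exactMatchStrength(hit, df) + hit.score;
  return hits.slice().sort((a, b) => key(b) - key(a));
}
```

With five score-0 documents that only mention `spec` in the body, each gets a strength of 1/5 = 0.2 and falls below a 0.6 semantic hit, while a single document whose slug contains `sharding` gets 1 × the slug bonus and leads.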