diff --git a/docs/results/adaptive.md b/docs/results/adaptive.md new file mode 100644 index 0000000..8533536 --- /dev/null +++ b/docs/results/adaptive.md @@ -0,0 +1,120 @@ +# Adaptive topology: spend the LLM verifier only when it pays + +The cost/quality A/B (`docs/results/cost-quality.md`) found the LLM relevance +verifier's cleanliness win is dominated by **de-duplication** — which a +deterministic content-hash / canonical-URL check captures at ~none of the LLM +premium — and that an LLM check only earns its dollar on the off-scope tail. The +production move it named was: do the cheap deterministic work first, reserve the +LLM for the ambiguous tail. `createAdaptiveResearchDriver` +(`src/adaptive-driver.ts`) is that driver, and this is its measurement. + +Per candidate source the adaptive driver runs three stages, cheapest first, +stopping at the first that decides: + +1. **Dedup ($0).** Reject a source whose canonical URL (scheme/`www`/trailing + slash/tracking params stripped) or normalized-text content hash matches one + already accepted this round or in the KB. +2. **Heuristic triage ($0).** Classify a unique survivor with host/title/length + signals only: an authoritative host (arxiv, `*.edu`, `*.gov`, official docs, + github, …) with a substantial body is **kept**; an obvious spam/listicle + title or a too-thin body is **dropped**; everything else is **ambiguous**. +3. **LLM escalation ($).** Only ambiguous survivors reach the shipped LLM + relevance verifier (`createVerifyingResearchDriver`) — one call each. + +## Live frontier, n=5 topics (glm-5.2) + +Real web-research worker fetches each topic once; the same fetched proposals +(plus one planted tracking-decorated mirror of the first source, so the dedup +stage has a real duplicate to catch) are gated through all three drivers. Cost +is the per-arm `RouterClient.usage()` diff (#36). Total spend for the run: +**$0.033**. 
+ +| topic | fetched | single admit | full-LLM admit / calls / $ | adaptive admit / LLM calls / $ | +|---|---|---|---|---| +| self-speculative decoding | 3 | 3 | 1 / 3 / $0.0027 | 2 / **0** / **$0.0000** | +| rotary position embeddings | 3 | 3 | 1 / 3 / $0.0031 | 2 / **0** / **$0.0000** | +| grouped-query attention | 7 | 7 | 3 / 7 / $0.0072 | 6 / 3 / $0.0030 | +| KV-cache quantization | 5 | 5 | 3 / 5 / $0.0052 | 4 / **0** / **$0.0000** | +| LoRA fine-tuning | 7 | 7 | 4 / 7 / $0.0079 | 6 / 3 / $0.0037 | +| **total** | **25** | **25** | **12 / 25 / $0.0261** | **20 / 6 / $0.0068** | + +**Cost.** Adaptive cuts LLM verifier calls **76%** (25 → 6) and dollars **74%** +($0.0261 → $0.0068). On 3 of the 5 topics it spent **zero** LLM calls — every +unique survivor was on an authoritative host, so the $0 stages decided +everything. + +## The honest reading: adaptive is a frontier POINT, not a free lunch + +Admitted-source counts (lower = cleaner KB): **single 25, adaptive 20, +full-LLM 12**. Adaptive sits **between** the two: + +- It removes **5 of the 13 sources** the full-LLM judge rejects that the + single-agent loop keeps (every one of them a real duplicate caught by the $0 + dedup stage — exactly the de-dup-dominated win the cost/quality result + predicted). +- It does **NOT** match full-LLM cleanliness. The remaining 8 sources full-LLM + rejects, adaptive keeps. The cause is structural and visible in the trace: on + this topic set every non-duplicate survivor landed on an authoritative host + (arxiv / github / official docs), so the heuristic **kept** it without ever + asking the LLM — and the LLM, when full-LLM did ask it, judged several of + those same authoritative pages not-quite-on-topic and dropped them. The host + prior is coarser than the relevance judge. 
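
A minimal sketch of the canonical-URL half of the $0 dedup stage, to make concrete why the planted tracking-decorated mirror is caught without an LLM call. The function names, the tracking-parameter pattern, and the example URLs are illustrative assumptions, not the code in `src/adaptive-driver.ts`, and the normalized-text content hash is left out.

```typescript
// Hypothetical sketch; the shipped driver may normalize differently.
// Assumed tracking-parameter pattern (not taken from the repository).
const TRACKING_PARAM = /^(utm_[a-z]+|fbclid|gclid)$/i;

// Strip scheme, leading "www.", trailing slashes and tracking params.
function canonicalUrl(raw: string): string {
  const url = new URL(raw);
  const host = url.hostname.toLowerCase().replace(/^www\./, "");
  const path = url.pathname.replace(/\/+$/, "");
  const kept: string[] = [];
  url.searchParams.forEach((value, key) => {
    if (!TRACKING_PARAM.test(key)) kept.push(`${key}=${value}`);
  });
  kept.sort(); // parameter order should not create a "new" source
  return host + path + (kept.length > 0 ? `?${kept.join("&")}` : "");
}

// Keep only URLs whose canonical form has not been seen this round or in the KB.
function dedupe(urls: string[], alreadyAccepted: string[] = []): string[] {
  const seen = new Set<string>();
  for (const accepted of alreadyAccepted) seen.add(canonicalUrl(accepted));
  const unique: string[] = [];
  for (const candidate of urls) {
    const key = canonicalUrl(candidate);
    if (seen.has(key)) continue; // duplicate: rejected at zero inference cost
    seen.add(key);
    unique.push(candidate);
  }
  return unique;
}
```

Under these assumptions, `https://www.example.org/paper/?utm_source=feed` and `http://example.org/paper` collapse to the same key, so the second copy never reaches triage or the LLM.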
+ +So the frontier tradeoff is concrete: **adaptive recovers the deterministic +de-dup half of full-LLM's cleanliness for ~26% of full-LLM's dollars, and gives +up the relevance-judgment half.** Whether that is the right point depends on the +cost of a kept-but-marginal source. If a slightly-off-topic authoritative page +is cheap to carry, adaptive dominates. If every admitted source must clear a +relevance bar, the host heuristic is too permissive and you want the full LLM — +or a tightened heuristic. + +## Where the heuristic is weak — stated plainly + +The escalation count is the diagnostic. On 3 of 5 topics it was **zero**: the +heuristic never deferred to the LLM, so on those topics adaptive is a +**pure host/title/length rule**, and its cleanliness is exactly that rule's +cleanliness — no smarter than "trust arxiv/github, drop spam." That is fine when +the worker's sources are dominated by authoritative hosts (as here), but it +means the LLM's relevance judgment is contributing nothing on those topics, by +construction. The two topics where adaptive *did* escalate (grouped-query +attention, LoRA) are where some survivors were on unknown hosts — and there the +3 LLM calls per topic are the off-scope tail the verifier is actually for. + +The heuristic would mis-route in two directions a richer worker would expose, +neither seen on this authoritative-host-heavy set: + +- **False keep:** an authoritative-host page that is off-topic or low-value + (an arxiv paper on an unrelated subject) is kept without the LLM ever seeing + it. The host prior cannot catch this; only the relevance judge can. +- **False drop:** a genuinely good source on an unknown blog/host with a + spam-shaped title, or under the 400-char body floor, is dropped before the LLM + could rescue it. + +## What this changes + +The deployable recommendation from the cost/quality result was "deterministic +dedup first, reserve the LLM for the tail." 
This driver ships that and the +measurement confirms the **cost** half cleanly (76% fewer calls, 74% cheaper) +and qualifies the **quality** half honestly: adaptive captures the de-dup +cleanliness (the dominant, free win) but not the LLM's relevance cleanliness, +because the host heuristic resolves authoritative survivors without asking. For +a worker whose sources are mostly authoritative, adaptive is the right frontier +point. For one whose junk is on-topic-looking pages on unknown hosts, the +ambiguous tail grows and adaptive converges toward full-LLM cost — which is the +correct behavior: it pays for the LLM exactly when the cheap signals can't +decide. + +## Run it + +``` +# offline (controlled wiring + escalates-only-ambiguous proof, no creds) +pnpm exec vitest run tests/loops/adaptive-ab.test.ts + +# live three-topology frontier (needs TANGLE_API_KEY with glm-5.2 credits) +AGENT_KNOWLEDGE_LIVE=1 \ +ADAPTIVE_LIVE_GOALS="self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA fine-tuning" \ +pnpm exec vitest run tests/loops/adaptive-ab.test.ts -t "three-topology" +``` + +A cheap one-call glm-5.2 smoke gates the multi-topic burn (fails fast if the key +or the reasoning-token floor is broken) before any dollars are spent. diff --git a/docs/results/claim-grounding.md b/docs/results/claim-grounding.md new file mode 100644 index 0000000..f262d9f --- /dev/null +++ b/docs/results/claim-grounding.md @@ -0,0 +1,97 @@ +# Claim-grounding: the band where the verifier earns its dollar + +`docs/results/cost-quality.md` found the relevance verifier's cleanliness win is +**dominated by de-duplication** — a deterministic content-hash captures most of it +at ~none of the LLM premium. So the open question was: is there an error band where +a verifier earns its cost — something a hash AND a relevance judge both miss? 
+ +**Yes: misattributed citations.** A source that is on-topic, unique, and real, but +whose cited CLAIM does not appear in the page (the LLM wrote a plausible sentence +and hung a real URL off it). De-dup passes it (it's unique). A relevance judge +passes it (the page is on-topic). Only checking the claim against the fetched text +catches it — and that check is **deterministic text presence, $0 inference**. + +## The mode + +Each proposed source now carries the specific claim it is cited for +(`withCitedClaim` → `metadata.citedClaim`). The verifier +(`createClaimGroundingVerifier`) runs `groundClaimInText(claim, pageText)` over the +`htmlToText` output of the page the worker actually fetched — verbatim, normalized +(punctuation/whitespace-insensitive), or a ≥70% content-word overlap close +paraphrase. A claim that isn't present is rejected as **misattributed**. The oracle +is text presence, not a model call, so it composes with the LLM relevance verifier +(reject off-topic AND misattributed) or runs alone at zero inference cost. + +## Live A/B (glm-5.2, real web fetch, planted misattribution band) + +Real worker (glm-5.2 query-gen → live `/v1/search` → `politeFetch` → `htmlToText`) +fetches the sources once per topic; we then plant ONE misattribution per topic (real +fetched page + a deliberately-wrong claim) and run three verifier arms over the SAME +proposals. Cost diffed per arm with the #36 `RouterClient.usage()` instrumentation. + +| n=5 topics | misattributions caught | marginal $ | $/topic | per-$ caught | +|---|---|---|---|---| +| no-verifier | 0 / 5 | $0.0000 | — | — | +| relevance (LLM judge) | **4 / 5** | $0.0157 | ~$0.0031 | 254 | +| claim-grounding (text) | **5 / 5** | **$0.0000** | $0 | ∞ | + +Per-topic (caught relevance / grounding): self-speculative decoding 1/1, rotary +position embeddings 1/1, grouped-query attention 1/1, **KV-cache quantization 0/1**, +LoRA 1/1. 
(An earlier 3-topic run missed self-speculative decoding instead — the +miss moves around; it is not a fixed topic.) + +**Reading.** Claim-grounding catches every misattribution at **$0**; the relevance +judge catches most but **misses one in five at ~$0.003/topic**. The miss is the +point: the relevance verifier only ever sees the page text, never the cited claim, +so it is *structurally blind* to misattribution. It catches one only by accident — +when the fabricated claim happens to also make the page read off-topic +(e.g. a "12-billion-parameter draft transformer" claim on a rotary-embeddings page). +When the fabrication stays on-topic (the KV-cache case), the judge waves it through. + +So on THIS band the verifier-per-dollar comparison inverts the cost/quality result: +there, the LLM verifier bought a dedup-shaped gain a free hash already captures — +expensive for what a cheap rule does. Here the cheap, deterministic check +**dominates** the expensive judge: it catches strictly more (5/5 vs 4/5) at strictly +less ($0 vs $0.0157). The verifier earns its dollar on misattribution; it does not on +de-duplication. + +## Why this is a real correctable band (not dedup, not relevance) + +- **Not de-duplication.** Every planted source has a unique URL and unique text; a + content-hash / canonical-URL dedup keeps all of them. +- **Not generic relevance.** Every planted source is on-topic; the relevance judge + (and the offline relevance stand-in) accept them. The error is in the *claim*, not + the *topic*. +- **Executable ground truth.** The check is presence/close-paraphrase of the claim + in the fetched text — deployable in production with no oracle and no model call. + +The offline arm proves the floor with a controlled 4-source pool (2 grounded, 2 +misattributed): claim-grounding admits **0/2** misattributions and keeps **2/2** +grounded sources, while relevance and no-verifier both admit **2/2**. 
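
The three-tier check described under "The mode" (verbatim, normalized, then content-word overlap) can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's `groundClaimInText`: the function name `groundClaimSketch`, the stopword list, and the return labels are all hypothetical; only the 0.7 default overlap comes from the text above.

```typescript
// Hypothetical sketch; the real groundClaimInText may differ in every detail.
type Grounding = "verbatim" | "normalized" | "paraphrase" | "misattributed";

// Assumed stopword list, used only to isolate "content words".
const STOPWORDS = new Set([
  "a", "an", "the", "of", "to", "in", "on", "and", "or", "is", "are",
  "was", "for", "with", "by", "that", "it", "as", "at",
]);

// Lowercase and collapse punctuation/whitespace runs to single spaces.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function contentWords(text: string): string[] {
  return normalize(text).split(" ").filter((w) => w.length > 0 && !STOPWORDS.has(w));
}

function groundClaimSketch(claim: string, pageText: string, minOverlap = 0.7): Grounding {
  if (pageText.includes(claim)) return "verbatim";
  if (normalize(pageText).includes(normalize(claim))) return "normalized";
  const claimWords = contentWords(claim);
  if (claimWords.length === 0) return "misattributed";
  const pageWords = new Set(contentWords(pageText));
  const hits = claimWords.filter((w) => pageWords.has(w)).length;
  // Exact-token matching: "rotated" does not match "rotating", which is the
  // inflection sensitivity that makes this kind of oracle conservative.
  return hits / claimWords.length >= minOverlap ? "paraphrase" : "misattributed";
}
```

Because the sketch matches exact tokens, an inflected paraphrase loses overlap for every changed word form, so a short claim with one or two inflections can fall below the threshold even when it is faithful to the page.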
+ +## Threats to validity + +- **n=5 topics, 1 misattribution each.** The direction (grounding ≥ relevance caught, + at ≤ cost) is asserted in the test on every run; the magnitude is small-n. The + relevance miss-rate (1/5 here, 1/3 earlier) is an existence proof of the blind + spot, not a calibrated rate. +- **Planted misattributions, not naturally-occurring ones.** Like the cost/quality + offline floor, the misattribution is injected so the band is measurable. It models + the real LLM citation-fabrication failure but does not measure its base rate in the + wild — that needs a corpus of model-written citations checked by hand. +- **The grounding oracle is conservative.** A real paraphrase whose inflected words + differ from the page ("drafts" vs "draft") can score below 0.7 and be rejected — + a false-positive misattribution flag. `minOverlap` tunes this; the worker should + cite the page's own key terms (the `createClaimDecorator` extractor is told to). + +## Run it + +```bash +# offline floor (no creds) +pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "offline" + +# live A/B (creds-gated). A cheap glm-5.2 smoke runs BEFORE the multi-topic burn. +AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=… \ + CLAIM_GROUNDING_LIVE_GOALS='self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA' \ + pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "three verifier arms" +``` diff --git a/docs/results/cost-quality.md b/docs/results/cost-quality.md new file mode 100644 index 0000000..6a1d7f6 --- /dev/null +++ b/docs/results/cost-quality.md @@ -0,0 +1,27 @@ +# Cost/quality: the two-agent loop's inference premium + +Live 9-topic A/B (glm-5.2, budget B ≤ 4 passes/arm), measured per arm with the +router-client instrumentation. The original A/B reported only admitted-sources at +"equal passes" — which charged the two-agent verify step as one pass while it is +actually N `verifySource` LLM calls. 
Pricing the calls shows what that hid. + +| per topic (mean) | two-agent | single-agent | ratio | +|---|---|---|---| +| LLM chat calls | 5.4 | 1.0 | ~5.4× | +| tokens (in+out) | ~4,900 | ~530 | ~9× | +| cost (USD) | ~$0.0072 | ~$0.0013 | ~5.5× | +| latency (wall) | ~37 s | ~11 s | ~3.4× | +| cleanliness Δ (single − two admitted) | — | — | +1.56, 95% CI [0.33, 2.67] | + +Per-topic Δ (single − two admitted) this run: self-speculative decoding +4, +grouped-query attention 0, rotary position embeddings +1, KV-cache quantization +**−1**, LoRA +4, ring attention +2, constitutional AI +3, transformer +2, gradient +descent **−1**. Coverage 1.00 every topic, both arms. + +**Reading.** The verifier buys ~1.5–2.7 fewer junk sources for roughly **5× the +dollars, 9× the tokens, and 3× the latency** — and on two topics it admitted *more* +than the single agent (the cleanliness signal is real but noisier than the +2.3 / ++2.7 of earlier runs). The cleanliness gain is dominated by de-duplication, so the +honest production move is a deterministic content-hash / canonical-URL dedup, which +captures most of the cleanliness at ~none of this premium; reserve an LLM check for +the off-scope tail. This is the cost half the "equal passes" framing left out. diff --git a/docs/two-agent-research-ab.md b/docs/two-agent-research-ab.md index 7f45d16..124b597 100644 --- a/docs/two-agent-research-ab.md +++ b/docs/two-agent-research-ab.md @@ -1,4 +1,4 @@ -# A verifier agent mostly deduplicates: a controlled A/B on two-agent web research +# A verifier agent mostly deduplicates: a controlled A/B on two-agent web research, and what its cost buys *Tangle Network · `agent-knowledge`* @@ -15,10 +15,25 @@ fewer sources per topic at identical coverage** — 95% bootstrap intervals effect is real and reproduces. 
But the mechanism is not the one we set out to test: reading the rejection logs, most of the gain is **de-duplication** — the same paper fetched from arXiv, OpenReview, and the NeurIPS proceedings — not the -relevance filtering we expected. The off-topic rejections we hypothesized were the -minority. Most of the value is therefore recoverable with a content hash; the LLM -verifier earns its cost only on the long tail, where a source looks on-topic but -isn't. +relevance filtering we expected. Pricing the verifier's calls (we added per-arm +router-usage instrumentation) shows the cleanliness costs roughly **5× the +dollars, 9× the tokens, and 3× the latency** of the single agent — and that the +original "equal passes" framing hid this, because it charged the verify step as +one pass while it is actually one LLM call per proposed source. Since the win is +de-dup-dominated, a deterministic content hash recovers most of the cleanliness at +~none of the premium. We then asked the sharper question — is there an error band +where an LLM verifier *does* earn its dollar? — and found two, on opposite sides of +the ledger. **Misattributed citations** (an on-topic, unique, real source whose +cited claim never appears in the page) are caught by a $0 deterministic +text-presence check that the LLM relevance judge misses 1 in 5 times, because the +judge structurally never sees the claim. And we built the deployable shape the +cost result implied — an **adaptive driver** that runs free dedup, then free +heuristic triage, and escalates to the LLM only on the ambiguous tail: it cuts +LLM verifier calls 76% and dollars 74%, recovering the de-dup half of the +verifier's cleanliness while honestly giving up the relevance-judgment half on a +source pool dominated by authoritative hosts. The verifier earns its dollar on +misattribution, not on de-duplication; the right production loop spends it only +where the cheap signals can't decide. ## 1. 
Setup @@ -32,7 +47,9 @@ The trap in any "more agents help" claim is compute. Two agents that simply do m work will of course produce more — that is a bigger budget, not a finding. So the comparison must hold total compute fixed and ask whether the *topology* — splitting find from check — beats spending the same compute on a single agent that just finds -more. +more. And once topology shows an effect, the second question is what it costs: a +cleaner base bought at 5× the inference is a different product decision than one +bought for free. ## 2. Method @@ -57,7 +74,8 @@ Note what the driver→worker hand-off is and isn't: the driver *steers* the wor by handing it the remaining readiness gaps (`foldGaps`), which is a deterministic formatting of unmet requirements — not an LLM authoring a fresh instruction. The driver's LLM work is in `verifySource` (one call per proposed source) and its own -`research` pass. +`research` pass. This matters for §4.2: the verify step is N calls, not one, and +the cost framing turns on that. The readiness gate is `scoreKnowledgeReadiness` (from `agent-eval`). It scores *pages* (curated `knowledge/*.md`), not raw sources, and only `importance: @@ -78,7 +96,7 @@ driver (`createVerifyingResearchDriver`) is one glm-5.2 chat call per source. The repo *does* ship a real `AgentProfile` for research (`researcherProfile`), and the **offline** control arm uses it with a stub harness — but the live arm bypasses it for the direct pipeline. This is a deliberate shortcut (no harness to stand up, -~$0.20 to run) and also the loop's main simplification debt; see §6. +~$0.20 to run) and also the loop's main simplification debt; see §7. ### 2.3 Equal compute @@ -91,7 +109,21 @@ per topic, that the two-agent loop spent no more passes than the single-agent lo and that both stayed under the ceiling; if that ever fails the comparison has drifted to unequal compute and the result is void. 
-### 2.4 Topics and readiness +The pass-accounting has a known soft spot, which §4.2 exposes: a "verify pass" is +not one LLM call, it is one `verifySource` call per *proposed* source that round. +Charging it as a single pass keeps the topology comparison fair on agent passes but +understates the verifier's dollar cost. We added explicit per-arm cost +instrumentation to measure that directly. + +### 2.4 Cost instrumentation + +The router client now records usage per call (`RouterClient.usage()`, +`src/web-research-worker.ts`): cumulative chat-completion count, prompt/completion +tokens, glm-5.2 priced cost, and wall latency. Each A/B arm reads the accumulator +before and after its run and diffs, so every reported dollar and token figure is a +measured per-arm delta, not an estimate. + +### 2.5 Topics and readiness 9 topics, each with two blocking requirements (the defining mechanism, and reported results / trade-offs). Seven are "narrow-scope-inside-a-broad-space" (e.g. @@ -99,7 +131,7 @@ results / trade-offs). Seven are "narrow-scope-inside-a-broad-space" (e.g. broad space to leak in; two are clean controls (*the transformer architecture*, *gradient descent*). -## 3. Results +## 3. Result 1 — cleanliness: the verifier admits fewer sources at equal coverage The cleanliness signal is the **admitted-source count**: on live data there is no oracle, so "fewer sources admitted at equal coverage" is the measurable proxy for @@ -125,7 +157,15 @@ are above zero. The effect reproduces; its exact magnitude varies run-to-run wit what the web returns (one topic swung Δ = 0→1→3 across separate runs during development). -## 4. What the verifier actually does +A third, cost-instrumented run (the one priced in §4.2) was noisier: mean Δ **+1.56, +95% CI [0.33, 2.67]**, with two topics where the two-agent loop admitted *more* than +the single agent (KV-cache quantization −1, gradient descent −1). 
The interval still +clears zero, but it is the lower-bound run — a reminder that the magnitude is +web-variance-bound, while the sign is stable. + +## 4. Result 2 — what the verifier does, and what it costs + +### 4.1 Mostly de-duplication We classified each rejection by the verifier's own stated reason: @@ -149,42 +189,198 @@ which a content hash catches for free. The LLM's distinctive contribution is the page that *looks* on-topic but isn't — the self-speculative-vs-separate-draft distinction a string match would miss. -## 5. Limitations +### 4.2 The inference premium (and what "equal passes" hid) + +`docs/results/cost-quality.md`. The original A/B reported only admitted-sources at +"equal passes," which charged the verify step as one pass while it is actually N +`verifySource` LLM calls. Pricing the calls per arm (B ≤ 4 passes/arm, glm-5.2): + +| per topic (mean) | two-agent | single-agent | ratio | +|---|---|---|---| +| LLM chat calls | 5.4 | 1.0 | ~5.4× | +| tokens (in+out) | ~4,900 | ~530 | ~9× | +| cost (USD) | ~$0.0072 | ~$0.0013 | ~5.5× | +| latency (wall) | ~37 s | ~11 s | ~3.4× | +| cleanliness Δ (single − two admitted) | — | — | +1.56, 95% CI [0.33, 2.67] | + +The verifier buys ~1.5–2.7 fewer junk sources for roughly **5× the dollars, 9× the +tokens, and 3× the latency**. Since the cleanliness gain is de-dup-dominated +(§4.1), the honest production move is a deterministic content-hash / canonical-URL +dedup, which captures most of the cleanliness at ~none of this premium, reserving +an LLM check only for the off-scope tail. This is the cost half the "equal passes" +framing left out — and the rest of the paper is what we built once we saw it. + +## 5. Result 3 — the two bands where an LLM verifier does, and doesn't, earn its dollar + +If de-dup is free and dominates the win, when is the LLM verifier worth its 5×? We +found two bands, and they cut in opposite directions. 
+ +### 5.1 Misattributed citations — the cheap check beats the expensive judge + +`docs/results/claim-grounding.md`. A source can be on-topic, unique, and real, yet +the cited *claim* never appears in the page — the LLM wrote a plausible sentence and +hung a real URL off it. De-dup passes it (unique). A relevance judge passes it (the +page is on-topic). Only checking the claim against the fetched text catches it — and +that check is **deterministic text presence, $0 inference**. + +Each proposed source now carries the claim it is cited for (`withCitedClaim` → +`metadata.citedClaim`). The claim-grounding verifier (`createClaimGroundingVerifier`, +`src/claim-grounding.ts`) runs `groundClaimInText(claim, pageText)` over the +`htmlToText` output of the page the worker actually fetched — verbatim, normalized +(punctuation/whitespace-insensitive), or a ≥70% content-word overlap close +paraphrase. A claim that isn't present is rejected as misattributed. The oracle is +text presence, not a model call, so it composes with the LLM relevance verifier or +runs alone at zero cost. + +Live A/B (glm-5.2, real web fetch, one planted misattribution per topic — a real +fetched page plus a deliberately-wrong claim — over three verifier arms on the same +proposals): + +| n=5 topics | misattributions caught | marginal $ | per-$ caught | +|---|---|---|---| +| no-verifier | 0 / 5 | $0.0000 | — | +| relevance (LLM judge) | 4 / 5 | $0.0157 | 254 | +| claim-grounding (text) | **5 / 5** | **$0.0000** | ∞ | + +The relevance judge catches one only by accident — when the fabricated claim also +makes the page read off-topic (a "12-billion-parameter draft transformer" claim on a +rotary-embeddings page). When the fabrication stays on-topic (the KV-cache case), the +judge waves it through, because the relevance verifier only ever sees the page text, +never the cited claim — it is **structurally blind** to misattribution. 
On this band +the verifier-per-dollar comparison inverts §4.2: the cheap, deterministic check +catches strictly more (5/5 vs 4/5) at strictly less ($0 vs $0.0157). The offline +floor confirms the wiring: on a controlled 4-source pool (2 grounded, 2 +misattributed), claim-grounding admits **0/2** misattributions and keeps **2/2** +grounded, while relevance and no-verifier both admit **2/2**. + +### 5.2 Adaptive topology — pay the LLM only on the ambiguous tail + +`docs/results/adaptive.md`. The deployable shape §4.2 implied: do the free +deterministic work first, reserve the LLM for what the cheap signals can't decide. +`createAdaptiveResearchDriver` (`src/adaptive-driver.ts`) is that driver. Per +candidate source it runs three stages, cheapest first, stopping at the first that +decides: + +1. **Dedup ($0).** Reject a source whose canonical URL (scheme / `www` / trailing + slash / tracking params stripped) or normalized-text content hash matches one + already accepted this round or in the KB. +2. **Heuristic triage ($0).** Classify a unique survivor with host/title/length + signals only: an authoritative host (arxiv, `*.edu`, `*.gov`, official docs, + github, …) with a substantial body is **kept**; an obvious spam/listicle title + or a too-thin body is **dropped**; everything else is **ambiguous**. +3. **LLM escalation ($).** Only ambiguous survivors reach the shipped LLM relevance + verifier — one call each. + +Live frontier, n=5 topics, glm-5.2, same fetched proposals gated through all three +drivers (plus one planted tracking-decorated mirror of the first source, so the +dedup stage has a real duplicate to catch). 
Total spend $0.033: + +| topic | fetched | single admit | full-LLM admit / calls / $ | adaptive admit / LLM calls / $ | +|---|---|---|---|---| +| self-speculative decoding | 3 | 3 | 1 / 3 / $0.0027 | 2 / **0** / **$0.0000** | +| rotary position embeddings | 3 | 3 | 1 / 3 / $0.0031 | 2 / **0** / **$0.0000** | +| grouped-query attention | 7 | 7 | 3 / 7 / $0.0072 | 6 / 3 / $0.0030 | +| KV-cache quantization | 5 | 5 | 3 / 5 / $0.0052 | 4 / **0** / **$0.0000** | +| LoRA fine-tuning | 7 | 7 | 4 / 7 / $0.0079 | 6 / 3 / $0.0037 | +| **total** | **25** | **25** | **12 / 25 / $0.0261** | **20 / 6 / $0.0068** | + +Adaptive cuts LLM verifier calls **76%** (25 → 6) and dollars **74%** ($0.0261 → +$0.0068). On 3 of the 5 topics it spent **zero** LLM calls — every unique survivor +was on an authoritative host, so the $0 stages decided everything. + +It is a frontier point, not a free lunch. Admitted counts (lower = cleaner): single +25, **adaptive 20**, full-LLM 12. Adaptive removes the 5 real duplicates the $0 dedup +catches — exactly the de-dup-dominated win — but keeps the 8 sources full-LLM rejects +on relevance, because on this authoritative-host-heavy set the heuristic resolved +every non-duplicate survivor without ever asking the LLM, and the host prior is +coarser than the relevance judge. So adaptive **recovers the deterministic de-dup +half of full-LLM's cleanliness for ~26% of its dollars, and gives up the +relevance-judgment half**. The escalation count is the diagnostic: on the 3 topics +where it was zero, adaptive *is* a pure host/title/length rule and the LLM +contributes nothing by construction; on the 2 topics with unknown-host survivors +(grouped-query attention, LoRA) it escalated 3 calls each — the off-scope tail the +verifier is actually for. + +## 6. Discussion + +The three results compose into one rule. 
The LLM verifier's headline cleanliness win +is real (§3) but **de-dup-dominated** (§4.1) and **expensive** (§4.2, ~5×/9×/3×), so +spending an LLM call on every source is the wrong default — a free content hash buys +most of it. The verifier earns its 5× exactly where the cheap signals are blind: on +**misattribution** (§5.1), where a $0 text-presence check beats the LLM judge +outright because the judge never sees the claim; and on the **off-scope tail** (§5.2), +where a page looks on-topic, is unique, and isn't fabricated, so only a relevance +judgment can settle it. The deployable loop therefore stratifies by cost: free dedup, +free claim-grounding, free heuristic triage, then an LLM call only on what survives — +which is what the adaptive driver ships. + +Two cross-cutting lessons. First, **the accounting unit decides the verdict**: +charging the verify step as one pass made the topology look near-free; pricing it per +LLM call (§4.2) is what surfaced the 5× and motivated everything after it. Second, +**the same verifier inverts in value across bands** — on de-dup the LLM is expensive +for what a hash does; on misattribution a deterministic check is free for what the LLM +can't do; on the off-scope tail the LLM is the only thing that works. "Add a verifier" +is not a setting; it is a cost-stratified decision per error type. + +## 7. Limitations - **The verifier is also the judge.** Admitted-count is a proxy; we have no independent oracle for whether a dropped source was genuinely redundant. The verifier's stated reasons hold up on inspection, but this is the load-bearing - caveat. + caveat for §3–§4. - **Deltas are conservative.** The single-agent loop stops on the same readiness gate, capping its admits; with more iterations it would admit even more junk, so the true gap is at least this large. -- **n = 2 clean controls** is too thin to compare bands with confidence. -- **glm-5.2-specific.** A weaker or stronger judge would shift rejection rates. 
-- **High web variance.** One run per topic; results move with what search returns. - -## 6. A simpler loop - -Two simplifications fall out of the above. - -1. **The worker should be an `AgentProfile`, not a bespoke pipeline.** The live - worker is ~500 lines hand-wiring query-generation, search, fetch, and proposal - against the router directly. The repo's own pattern is to *author* a profile - (`researcherProfile`) and run it on a harness with a web-search tool — reusable - and harness-agnostic — rather than re-implement the agent loop. The direct - pipeline is cheaper to run today (no harness, no creds beyond the router) but it - is the loop's main piece of duplication. -2. **The driver doesn't need an LLM for most of its work.** Since the win is - dominated by de-duplication, the efficient shape is a deterministic dedup - (content hash / canonical-URL normalization) followed by a *light* LLM check only - for the off-scope tail — not a full glm-5.2 `verifySource` call on every fetched - source. Same cleanliness, a fraction of the calls. - -Neither is built yet; they are the obvious next step if this loop graduates from -experiment to production. - -## 7. Reproduce - -The loop, the worker, the verifier, and this A/B are all in this repository. +- **Small n.** n = 2 clean controls is too thin to compare bands; the misattribution + and adaptive frontiers are n = 5 each. The directions are asserted in the tests on + every run; the magnitudes are small-n and web-variance-bound (the §3 third run swung + to +1.56 from +2.3/+2.7). +- **Planted error bands.** The misattributions (§5.1) and the adaptive duplicate + (§5.2) are injected so the band is measurable. They model the real LLM + citation-fabrication and mirror-host failures but do not measure their base rate in + the wild — that needs a hand-checked corpus of model-written citations. 
+- **Adaptive's quality is host-prior-bound.** On an authoritative-host-heavy source + pool the heuristic resolves everything and the LLM's relevance judgment contributes + nothing; a richer worker (good sources on unknown hosts, junk on on-topic-looking + pages) would grow the ambiguous tail and converge adaptive toward full-LLM cost. +- **glm-5.2-specific.** A weaker or stronger judge would shift rejection rates and the + relevance miss-rate. The grounding oracle is also conservative: a real paraphrase + whose inflected words differ ("drafts" vs "draft") can fall below the 0.7 overlap and + be flagged misattributed; `minOverlap` tunes this. +- **High web variance.** One live run per topic per result; numbers move with what + search returns. + +## 8. A simpler loop — built, not deferred + +The original write-up named two simplifications as future work. Both are now built and +measured; this is what changed. + +1. **Deterministic dedup before the LLM, LLM only on the tail — shipped.** The + adaptive driver (`src/adaptive-driver.ts`, §5.2) does exactly this: free + canonical-URL / content-hash dedup, free host/title/length triage, LLM relevance + only on the ambiguous survivors. Measured: **76% fewer LLM calls, 74% cheaper**, + recovering the de-dup half of the verifier's cleanliness. The remaining gap to + full-LLM is the relevance-judgment half, kept honest in §5.2 — adaptive is a + frontier point you choose by how much a kept-but-marginal source costs you, not a + strict improvement. +2. **A free check the LLM judge can't replicate — shipped.** Claim-grounding + (`src/claim-grounding.ts`, §5.1) adds the one verification an LLM relevance judge is + structurally blind to: does the cited claim actually appear in the page? It catches + 5/5 planted misattributions at **$0**, vs the judge's 4/5 at ~$0.003/topic. 
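The overlap fallback behind the claim-grounding check can be illustrated with a minimal sketch. This is not the shipped `groundClaimInText`: `overlapOf`, the ASCII-only `normalize`, and the tiny `STOPWORDS` list are hypothetical simplifications, and the verbatim and normalized match modes are omitted. It shows why an inflected paraphrase ("drafts" vs "draft") can land under the default 0.7 bar.

```typescript
// Hypothetical, simplified stand-in for the overlap step described above.
// Assumption: ASCII-only normalization and a very short stopword list.
const STOPWORDS = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'with', 'is'])

function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]+/g, ' ') // punctuation becomes a space
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim()
}

// Fraction of the claim's distinct content words that appear in the page text.
function overlapOf(claim: string, pageText: string, minWordLength = 3): number {
  const pageWords = new Set(normalize(pageText).split(' '))
  const words = Array.from(
    new Set(
      normalize(claim)
        .split(' ')
        .filter((word) => word.length >= minWordLength && !STOPWORDS.has(word)),
    ),
  )
  if (words.length === 0) return 0
  const present = words.filter((word) => pageWords.has(word))
  return present.length / words.length
}

const page =
  'The small model produces a draft of several tokens, then the full model verifies them.'

// Every content word of the claim is on the page: overlap is 1.
const grounded = overlapOf('the full model verifies draft tokens', page)

// "drafts" is not on the page (only "draft" is): overlap is 2/3, under 0.7.
const inflected = overlapOf('the model drafts tokens', page)

console.log(grounded, inflected)
```

Under these assumptions the second claim would be flagged as misattributed even though it is a fair paraphrase, which is the conservatism the limitations list describes; lowering the threshold trades that false rejection against admitting looser paraphrases.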
+ +What is still **not** built remains the worker: the live worker is a ~500-line +hand-wired pipeline (query-gen, search, fetch, propose) against the router directly, +where the repo's own pattern is to *author* an `AgentProfile` (`researcherProfile`) +and run it on a harness with a web-search tool — reusable and harness-agnostic. The +direct pipeline is cheaper to run today (no harness, no creds beyond the router) but it +is the loop's main remaining piece of duplication, and the obvious next step if this +loop graduates from experiment to production. + +## 9. Reproduce + +The loop, the worker, the verifier, the claim-grounding mode, the adaptive driver, the +cost instrumentation, and every A/B are all in this repository. Each live test gates a +cheap one-call glm-5.2 smoke before any multi-topic burn. ```bash git clone https://github.com/tangle-network/agent-knowledge @@ -194,16 +390,39 @@ cd agent-knowledge && pnpm install # exercises the same harness against a planted source pool) pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts -# the live sweep — real web search + a real glm-5.2 verifier (~$0.20 for 9 topics) +# offline claim-grounding + adaptive floors (no credentials) +pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "offline" +pnpm exec vitest run tests/loops/adaptive-ab.test.ts + +# the live cleanliness sweep — real web search + a real glm-5.2 verifier, with +# per-arm cost reported (~$0.20 for 9 topics) export TANGLE_API_KEY=<…> AGENT_KNOWLEDGE_LIVE=1 \ AGENT_KNOWLEDGE_LIVE_GOALS="self-speculative decoding|grouped-query attention|rotary position embeddings|KV-cache quantization|LoRA|ring attention|constitutional AI|the transformer architecture|gradient descent" \ pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts + +# live misattribution band — three verifier arms over the same proposals +AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=<…> \ + CLAIM_GROUNDING_LIVE_GOALS='self-speculative decoding|rotary position 
embeddings|grouped-query attention|KV-cache quantization|LoRA' \ + pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "three verifier arms" + +# live adaptive frontier — single / full-LLM / adaptive on the same fetched proposals +AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=<…> \ + ADAPTIVE_LIVE_GOALS="self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA fine-tuning" \ + pnpm exec vitest run tests/loops/adaptive-ab.test.ts -t "three-topology" ``` -`AGENT_KNOWLEDGE_LIVE_GOALS` takes a `|`-separated topic list; the live arm runs -both loops on each at equal compute and reports the paired bootstrap. +`AGENT_KNOWLEDGE_LIVE_GOALS` (and the per-result `*_LIVE_GOALS`) take a `|`-separated +topic list; the live arms run the loops on each at equal compute and report the paired +bootstrap and per-arm cost. **Source:** the loop — [`src/two-agent-research-loop.ts`](../src/two-agent-research-loop.ts); -the live worker + verifier — [`src/web-research-worker.ts`](../src/web-research-worker.ts); -the A/B harness — [`tests/loops/research-loop-equal-compute.test.ts`](../tests/loops/research-loop-equal-compute.test.ts). +the live worker + verifier + cost instrumentation — [`src/web-research-worker.ts`](../src/web-research-worker.ts); +the misattribution check — [`src/claim-grounding.ts`](../src/claim-grounding.ts); +the adaptive driver — [`src/adaptive-driver.ts`](../src/adaptive-driver.ts); +the A/B harnesses — [`tests/loops/`](../tests/loops/). +Per-result detail: [`docs/results/cost-quality.md`](results/cost-quality.md), +[`docs/results/claim-grounding.md`](results/claim-grounding.md), +[`docs/results/adaptive.md`](results/adaptive.md). + + diff --git a/src/adaptive-driver.ts b/src/adaptive-driver.ts new file mode 100644 index 0000000..3d70e6e --- /dev/null +++ b/src/adaptive-driver.ts @@ -0,0 +1,401 @@ +/** + * Adaptive verifier mode for `runTwoAgentResearchLoop`. 
+ * + * The cost/quality A/B (`docs/results/cost-quality.md`) found the LLM relevance + * verifier's cleanliness win is dominated by DE-DUPLICATION — which a + * deterministic content-hash / canonical-URL check captures at ~none of the LLM + * premium — and that an LLM check only earns its dollar on the off-scope tail. + * The honest production move it names is: do the cheap deterministic work first, + * spend the LLM only where it pays. This module is that driver. + * + * Per candidate source the adaptive driver runs THREE stages, cheapest first, + * and stops at the first that decides: + * + * 1. DEDUP ($0, no LLM). Reject a source whose CONTENT (normalized-text hash) + * or whose CANONICAL URL matches one already accepted this round or already + * in the knowledge base. This is the de-dup the relevance judge was being + * paid to do; doing it deterministically is free and exact. + * + * 2. HEURISTIC TRIAGE ($0, no LLM). For a unique survivor, a cheap host / + * title / length signal classifies it as clearly-keep, clearly-drop, or + * AMBIGUOUS. Clear cases are resolved without a model: an authoritative host + * (arxiv, *.edu, *.gov, official docs) with a substantial readable body is + * kept; an obvious spam/listicle/marketing title or a too-thin body is + * dropped. Only genuinely ambiguous survivors fall through. + * + * 3. LLM ESCALATION ($, one call). ONLY the ambiguous survivors reach the LLM + * `verifySource` — the shipped `createVerifyingResearchDriver` relevance + * judge. This is where the verifier earns its premium: the off-scope tail a + * cheap rule can't adjudicate. + * + * The result is the cost/quality frontier point the doc predicted: most of the + * cleanliness (dedup + clear drops) at a fraction of the LLM $/calls (only the + * ambiguous tail pays). It is a real `ResearchDriver` — same contract the + * two-agent loop already gates on — and reuses `sha256`, the relevance verifier, + * and the index; it reinvents none of them. 
+ */ + +import { sha256 } from './ids' +import type { + ResearchSourceProposal, + SourceVerdict, + SourceVerificationContext, +} from './two-agent-research-loop' +import { + createVerifyingResearchDriver, + type RouterClient, + type TangleRouterOptions, + type VerifyingDriverOptions, +} from './web-research-worker' + +/** + * Canonicalize a URL for duplicate detection: lowercase host, strip a leading + * `www.`, drop the scheme, the fragment, a trailing slash, and tracking query + * params (`utm_*`, `ref`, `fbclid`, `gclid`, …). Two URLs that differ only by + * those decorations canonicalize to the same key, so the dedup stage treats them + * as the same source. Falls back to the lowercased raw string when the input is + * not a parseable absolute URL (so non-http identifiers still dedup by equality). + */ +export function canonicalizeUrl(uri: string): string { + const trimmed = uri.trim() + try { + const url = new URL(trimmed) + const host = url.hostname.toLowerCase().replace(/^www\./, '') + // Keep only non-tracking query params, sorted for stable ordering. + const kept: [string, string][] = [] + for (const [key, value] of url.searchParams) { + const lower = key.toLowerCase() + if (lower.startsWith('utm_')) continue + if (trackingParams.has(lower)) continue + kept.push([key, value]) + } + kept.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) + const query = kept.map(([k, v]) => `${k}=${v}`).join('&') + const path = url.pathname.replace(/\/+$/, '') || '/' + return `${host}${path}${query ? `?${query}` : ''}` + } catch { + return trimmed.toLowerCase() + } +} + +/** Tracking / referrer query params dropped during URL canonicalization. */ +const trackingParams = new Set([ + 'ref', + 'ref_src', + 'source', + 'fbclid', + 'gclid', + 'mc_cid', + 'mc_eid', + 'igshid', + 'spm', + '_hsenc', + '_hsmi', +]) + +/** + * A stable content key for a fetched page: the sha256 of its normalized text + * (lowercased, punctuation/whitespace collapsed). 
Two pages whose readable body + * is the same modulo formatting collide here, so the dedup stage rejects a + * mirror/syndication of an already-accepted source even when the URL differs. + */ +export function contentKey(text: string): string { + const normalized = text + .toLowerCase() + .replace(/[^\p{L}\p{N}\s]+/gu, ' ') + .replace(/\s+/g, ' ') + .trim() + return sha256(normalized) +} + +/** Why the deterministic dedup stage rejected a candidate (for audit/notes). */ +export type DedupReason = 'duplicate-url' | 'duplicate-content' + +/** The cheap-triage classification of a unique survivor. */ +export type TriageClass = 'keep' | 'drop' | 'ambiguous' + +/** One source's adaptive routing decision, for instrumentation and the doc. */ +export interface AdaptiveDecision { + uri: string + /** The stage that decided this source: dedup | heuristic | llm. */ + stage: 'dedup' | 'heuristic' | 'llm' + accepted: boolean + /** The triage class assigned (set once past dedup). */ + triage?: TriageClass + reason?: string +} + +/** Running tally of where the adaptive driver spent its decisions. */ +export interface AdaptiveStats { + total: number + /** Rejected by deterministic dedup (URL or content). $0. */ + dedupRejected: number + /** Kept by the cheap heuristic without an LLM call. $0. */ + heuristicKept: number + /** Dropped by the cheap heuristic without an LLM call. $0. */ + heuristicDropped: number + /** Escalated to the LLM relevance verifier ($ — the only paid stage). */ + llmCalls: number + /** Of the escalations, how many the LLM accepted. */ + llmAccepted: number + decisions: AdaptiveDecision[] +} + +export interface AdaptiveDriverOptions { + /** Router client for the LLM escalation. Defaults to a live client from env. */ + router?: RouterClient + router_options?: TangleRouterOptions + /** Passed through to the escalation relevance verifier. */ + verifying?: Pick<VerifyingDriverOptions, 'acceptOnParseFailure'> + /** + * Hosts an authoritative source lives on. 
A unique survivor on one of these, + * with a substantial body, is KEPT deterministically (no LLM). Suffix-matched + * against the canonical host, so `arxiv.org` matches `export.arxiv.org`. The + * defaults cover papers, official docs, and standards bodies. + */ + authoritativeHosts?: string[] + /** + * Title/snippet patterns that mark obvious spam / listicle / marketing — a + * unique survivor matching one is DROPPED deterministically (no LLM). + */ + spamPatterns?: RegExp[] + /** + * Below this many readable chars a survivor is too thin to be a real reference + * and is dropped deterministically. Default 400. + */ + minBodyChars?: number + /** + * A survivor whose body is at or above this many chars AND on an authoritative + * host is kept without an LLM call. Default 600. + */ + substantialBodyChars?: number + /** Receives each routing decision as it is made (for live instrumentation). */ + onDecision?: (decision: AdaptiveDecision) => void +} + +/** Default authoritative host suffixes — papers, official docs, standards. */ +const defaultAuthoritativeHosts = [ + 'arxiv.org', + 'aclanthology.org', + 'openreview.net', + 'dl.acm.org', + 'ieeexplore.ieee.org', + 'nature.com', + 'science.org', + 'pubmed.ncbi.nlm.nih.gov', + 'ncbi.nlm.nih.gov', + '.edu', + '.gov', + 'docs.python.org', + 'pytorch.org', + 'tensorflow.org', + 'huggingface.co', + 'github.com', + 'developer.mozilla.org', + 'wikipedia.org', + 'w3.org', + 'ietf.org', + 'rfc-editor.org', +] + +/** Default spam/listicle/marketing title patterns. */ +const defaultSpamPatterns = [ + /\bbuy\b.*\b(cheap|now|deal|sale|discount)\b/i, + /\b\d+\s+(things|ways|reasons|tips|tricks|secrets|hacks)\b.*\b(you|that|will)\b/i, + /\bshock(ing|ed)?\b/i, + /\bclickbait\b/i, + /!!!|\$\$\$/, + /\b(coupon|promo code|affiliate|sponsored)\b/i, + /\bbest .* (of \d{4}|in \d{4})\b/i, +] + +/** + * Classify a UNIQUE survivor (already past dedup) with cheap host/title/length + * signals only — no LLM. 
Returns `keep`, `drop`, or `ambiguous`. `ambiguous` is + * the residue the LLM is reserved for: on-topic-looking pages on unknown hosts + * with a plausible body, which a host/title rule cannot adjudicate. + */ +export function triageSource( + source: ResearchSourceProposal, + options: { + authoritativeHosts: string[] + spamPatterns: RegExp[] + minBodyChars: number + substantialBodyChars: number + }, +): { triage: TriageClass; reason: string } { + const titleAndSnippet = `${source.title ?? ''} ${ + typeof source.metadata?.snippet === 'string' ? source.metadata.snippet : '' + }`.trim() + const bodyLen = source.text.trim().length + + // Clear DROP: obvious spam/listicle/marketing title, OR a too-thin body that + // can't be a real reference. + for (const pattern of options.spamPatterns) { + if (pattern.test(titleAndSnippet)) { + return { triage: 'drop', reason: `spam/listicle title (${pattern.source})` } + } + } + if (bodyLen < options.minBodyChars) { + return { triage: 'drop', reason: `thin body (${bodyLen} < ${options.minBodyChars} chars)` } + } + + // Clear KEEP: an authoritative host with a substantial readable body. The host + // is a strong prior for a real reference; the length rules out a stub page. + const host = hostOf(source.uri) + const authoritative = + host.length > 0 && options.authoritativeHosts.some((suffix) => hostMatches(host, suffix)) + if (authoritative && bodyLen >= options.substantialBodyChars) { + return { triage: 'keep', reason: `authoritative host ${host} + substantial body (${bodyLen})` } + } + + // Everything else is AMBIGUOUS — an unknown host with a plausible body. This + // is exactly the off-scope tail the LLM relevance judge is reserved for. 
+ return { triage: 'ambiguous', reason: `unknown host ${host || '(none)'}, body ${bodyLen} chars` } +} + +function hostOf(uri: string): string { + try { + return new URL(uri.trim()).hostname.toLowerCase().replace(/^www\./, '') + } catch { + return '' + } +} + +/** Suffix host match: `.edu` matches `mit.edu`; `arxiv.org` matches `export.arxiv.org`. */ +function hostMatches(host: string, suffix: string): boolean { + if (suffix.startsWith('.')) return host.endsWith(suffix) || host === suffix.slice(1) + return host === suffix || host.endsWith(`.${suffix}`) +} + +export interface AdaptiveResearchDriver { + verifySource( + source: ResearchSourceProposal, + ctx: SourceVerificationContext, + ): Promise<SourceVerdict> + /** Live tally of where decisions were spent — the cost/quality instrumentation. */ + stats(): AdaptiveStats +} + +/** + * Build the adaptive verifier. The deterministic stages (dedup + heuristic + * triage) cost $0; only AMBIGUOUS survivors escalate to the LLM relevance + * verifier. `stats()` exposes where every decision was spent so the A/B can read + * the LLM $/calls the adaptive driver saved against the cleanliness it kept. + * + * Dedup state is kept on the driver instance: it tracks the canonical URLs and + * content hashes it has ACCEPTED, plus those it sees in the verification + * context's `acceptedThisRound` and the KB index. Use one driver per loop run. + */ +export function createAdaptiveResearchDriver( + options: AdaptiveDriverOptions = {}, +): AdaptiveResearchDriver { + const authoritativeHosts = options.authoritativeHosts ?? defaultAuthoritativeHosts + const spamPatterns = options.spamPatterns ?? defaultSpamPatterns + const minBodyChars = Math.max(1, options.minBodyChars ?? 400) + const substantialBodyChars = Math.max(minBodyChars, options.substantialBodyChars ?? 600) + + // The shipped LLM relevance verifier — the ONLY paid stage, reused, not rebuilt. 
+ const relevance = createVerifyingResearchDriver({ + router: options.router, + router_options: options.router_options, + acceptOnParseFailure: options.verifying?.acceptOnParseFailure, + }) + + // Dedup memory across the run: every URL/content key this driver ACCEPTED. + const acceptedUrlKeys = new Set<string>() + const acceptedContentKeys = new Set<string>() + + const stats: AdaptiveStats = { + total: 0, + dedupRejected: 0, + heuristicKept: 0, + heuristicDropped: 0, + llmCalls: 0, + llmAccepted: 0, + decisions: [], + } + + function record(decision: AdaptiveDecision): void { + stats.decisions.push(decision) + options.onDecision?.(decision) + } + + return { + async verifySource(source, ctx): Promise<SourceVerdict> { + stats.total += 1 + const urlKey = canonicalizeUrl(source.uri) + const cKey = contentKey(source.text) + + // ---- STAGE 1: DETERMINISTIC DEDUP ($0) ---------------------------------- + // Seed the dup sets from what the loop already accepted this round AND from + // the KB index, so a duplicate of an existing source is caught even if this + // driver instance hasn't seen it yet. + const roundUrlKeys = new Set( + ctx.acceptedThisRound.map((accepted) => canonicalizeUrl(accepted.uri)), + ) + const roundContentKeys = new Set( + ctx.acceptedThisRound.map((accepted) => contentKey(accepted.text)), + ) + const indexUrlKeys = new Set( + ctx.index.sources.flatMap((indexed) => + typeof indexed.metadata?.originalUri === 'string' + ? [canonicalizeUrl(indexed.metadata.originalUri)] + : [], + ), + ) + const dupUrl = + acceptedUrlKeys.has(urlKey) || roundUrlKeys.has(urlKey) || indexUrlKeys.has(urlKey) + const dupContent = acceptedContentKeys.has(cKey) || roundContentKeys.has(cKey) + if (dupUrl || dupContent) { + stats.dedupRejected += 1 + const reason: DedupReason = dupUrl ? 
'duplicate-url' : 'duplicate-content' + record({ uri: source.uri, stage: 'dedup', accepted: false, reason }) + return { accept: false, reason: `dedup: ${reason}` } + } + + // ---- STAGE 2: CHEAP HEURISTIC TRIAGE ($0) ------------------------------- + const { triage, reason } = triageSource(source, { + authoritativeHosts, + spamPatterns, + minBodyChars, + substantialBodyChars, + }) + if (triage === 'keep') { + stats.heuristicKept += 1 + acceptedUrlKeys.add(urlKey) + acceptedContentKeys.add(cKey) + record({ uri: source.uri, stage: 'heuristic', accepted: true, triage, reason }) + return { accept: true } + } + if (triage === 'drop') { + stats.heuristicDropped += 1 + record({ uri: source.uri, stage: 'heuristic', accepted: false, triage, reason }) + return { accept: false, reason: `heuristic drop: ${reason}` } + } + + // ---- STAGE 3: LLM ESCALATION ($ — ambiguous tail only) ------------------ + stats.llmCalls += 1 + const verdict = await relevance.verifySource(source, ctx) + if (verdict.accept) { + stats.llmAccepted += 1 + acceptedUrlKeys.add(urlKey) + acceptedContentKeys.add(cKey) + } + record({ + uri: source.uri, + stage: 'llm', + accepted: verdict.accept, + triage, + reason: verdict.accept ? 'llm accepted' : verdict.reason, + }) + return verdict + }, + stats(): AdaptiveStats { + return { + ...stats, + decisions: [...stats.decisions], + } + }, + } +} diff --git a/src/claim-grounding.ts b/src/claim-grounding.ts new file mode 100644 index 0000000..244b7d1 --- /dev/null +++ b/src/claim-grounding.ts @@ -0,0 +1,351 @@ +/** + * Claim-grounding mode for `runTwoAgentResearchLoop`. + * + * The two-agent loop's existing verifier judges a source's on-topic RELEVANCE + * (is this page about the goal?). On the topic sets we have measured, its + * cleanliness win is dominated by DE-DUPLICATION — which a deterministic + * content-hash / canonical-URL check captures at ~none of the LLM premium (see + * `docs/results/cost-quality.md`). 
That makes the LLM verifier look expensive + * for what a cheap rule already does. + * + * Claim-grounding targets a DIFFERENT, harder error band: a citation that is + * relevant and unique but **misattributed** — the page is on-topic, the URL is + * real, yet the specific CLAIM the source is cited for does NOT actually appear + * in the page. This is the citation-fabrication failure mode of LLM research: + * the model writes a plausible sentence and hangs a real URL off it that never + * says any such thing. Neither de-dup nor a relevance judge catches it (both can + * pass a misattributed-but-on-topic page); only checking the claim against the + * fetched text does. + * + * The check is EXECUTABLE GROUND TRUTH, not another LLM opinion: the worker + * attaches the specific claim it is citing the source for, and the verifier + * tests whether that claim is PRESENT (verbatim, normalized, or as a sufficient + * content-word overlap / close paraphrase) in the `htmlToText` output of the + * page the worker actually fetched. A claim that is not grounded is rejected as + * misattributed. Because the oracle is deterministic text presence — not a model + * call — it is a deployable, non-oracle verifier: it can run in production with + * zero inference cost, OR be composed with the LLM relevance verifier so the + * loop rejects BOTH off-topic AND misattributed sources. + * + * This module is content-free and any-topic: it adds (1) a way for a proposal to + * carry the claim it is cited for, (2) the `groundClaimInText` oracle, and (3) a + * `ResearchDriver` that gates on grounding. It composes the existing + * `ResearchDriver` / `ResearchSourceProposal` contracts and the shipped + * `htmlToText`; it reinvents none of them. 
+ */ + +import type { + ResearchSourceProposal, + SourceVerdict, + SourceVerificationContext, +} from './two-agent-research-loop' +import { + createTangleRouterClient, + type RouterClient, + type TangleRouterOptions, +} from './web-research-worker' + +/** + * Metadata key under which a proposal carries the specific claim it is cited + * for. The worker sets `metadata[citedClaimKey] = ''`; the + * claim-grounding driver reads it and checks it against the fetched page text. + */ +export const citedClaimKey = 'citedClaim' + +/** Read the cited claim a proposal carries, if any. */ +export function citedClaimOf(source: ResearchSourceProposal): string | undefined { + const claim = source.metadata?.[citedClaimKey] + return typeof claim === 'string' && claim.trim() ? claim.trim() : undefined +} + +/** Attach a cited claim to a proposal (immutably returns a new proposal). */ +export function withCitedClaim( + source: ResearchSourceProposal, + claim: string, +): ResearchSourceProposal { + return { ...source, metadata: { ...source.metadata, [citedClaimKey]: claim } } +} + +export interface GroundingResult { + /** True when the claim is sufficiently present in the page text. */ + grounded: boolean + /** How the claim matched (or why it didn't). For audit/notes. */ + mode: 'verbatim' | 'normalized' | 'overlap' | 'absent' | 'empty-claim' | 'empty-text' + /** + * Fraction of the claim's content words found in the page text. 1 for a + * verbatim/normalized hit; the measured overlap otherwise. + */ + overlap: number + /** Content words present in the claim but NOT in the page text. */ + missingWords: string[] +} + +export interface GroundClaimOptions { + /** + * Minimum fraction of the claim's content words that must appear in the page + * text to count as a close paraphrase when there is no verbatim/normalized + * hit. Default 0.7 — a high bar, because a misattribution is exactly a claim + * whose specific words the page does not contain. 
+ */ + minOverlap?: number + /** + * Content words shorter than this are ignored (drops "the", "of", "is", …) + * and never count toward overlap. Default 3. + */ + minWordLength?: number +} + +/** + * Stopwords stripped before overlap scoring so the bar measures the claim's + * SUBSTANTIVE words (the numbers, nouns, methods it asserts), not filler a + * misattributed page would trivially share with the real one. + */ +const stopwords = new Set([ + 'the', + 'a', + 'an', + 'and', + 'or', + 'but', + 'of', + 'to', + 'in', + 'on', + 'for', + 'with', + 'as', + 'by', + 'at', + 'from', + 'that', + 'this', + 'these', + 'those', + 'it', + 'its', + 'is', + 'are', + 'was', + 'were', + 'be', + 'been', + 'being', + 'has', + 'have', + 'had', + 'can', + 'will', + 'would', + 'should', + 'may', + 'might', + 'not', + 'no', + 'than', + 'then', + 'over', + 'under', + 'about', + 'into', + 'their', + 'they', + 'them', +]) + +/** Normalize for presence checks: lowercase, collapse whitespace + punctuation. */ +function normalize(text: string): string { + return text + .toLowerCase() + .replace(/[^\p{L}\p{N}\s]+/gu, ' ') + .replace(/\s+/g, ' ') + .trim() +} + +/** The claim's substantive content words (deduped, stopwords + short words removed). */ +function contentWords(claim: string, minWordLength: number): string[] { + const words = normalize(claim) + .split(' ') + .filter((word) => word.length >= minWordLength && !stopwords.has(word)) + return [...new Set(words)] +} + +/** + * THE ORACLE. Is `claim` grounded in `pageText` (the `htmlToText` output of the + * page the worker fetched)? Deterministic, no model call: + * + * 1. verbatim — the claim string appears as-is (case-insensitive). + * 2. normalized — the claim appears after collapsing punctuation/whitespace + * on both sides (so "5.4x" vs "5.4 x", smart quotes, etc. still match). + * 3. overlap — a close paraphrase: at least `minOverlap` of the claim's + * substantive content words appear in the page text. 
A misattributed page + * fails here because the SPECIFIC words the claim asserts are absent. + * + * Returns the match mode, the measured overlap, and the missing content words — + * enough for the driver to give a precise rejection reason and for a test/doc to + * audit WHY a claim grounded or didn't. + */ +export function groundClaimInText( + claim: string, + pageText: string, + options: GroundClaimOptions = {}, +): GroundingResult { + const minOverlap = options.minOverlap ?? 0.7 + const minWordLength = Math.max(1, options.minWordLength ?? 3) + + const claimTrimmed = claim.trim() + if (!claimTrimmed) return { grounded: false, mode: 'empty-claim', overlap: 0, missingWords: [] } + if (!pageText.trim()) return { grounded: false, mode: 'empty-text', overlap: 0, missingWords: [] } + + const haystackLower = pageText.toLowerCase() + if (haystackLower.includes(claimTrimmed.toLowerCase())) { + return { grounded: true, mode: 'verbatim', overlap: 1, missingWords: [] } + } + + const haystackNorm = normalize(pageText) + const claimNorm = normalize(claimTrimmed) + if (claimNorm && haystackNorm.includes(claimNorm)) { + return { grounded: true, mode: 'normalized', overlap: 1, missingWords: [] } + } + + const words = contentWords(claimTrimmed, minWordLength) + if (words.length === 0) { + // The claim has no substantive content words (all stopwords/short). With no + // verbatim/normalized hit there is nothing to ground — treat as absent. + return { grounded: false, mode: 'absent', overlap: 0, missingWords: [] } + } + // Word-boundary presence so "rotary" does not match inside "rotaryxyz". + const present = words.filter((word) => + new RegExp(`\\b${word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\b`).test(haystackNorm), + ) + const missingWords = words.filter((word) => !present.includes(word)) + const overlap = present.length / words.length + const grounded = overlap >= minOverlap + return { grounded, mode: grounded ? 
'overlap' : 'absent', overlap, missingWords } +} + +export interface ClaimGroundingDriverOptions extends GroundClaimOptions { + /** + * Optional second verifier to compose AFTER grounding passes. When set, a + * source must BOTH ground its claim AND pass this verifier (e.g. the LLM + * relevance driver's `verifySource`). Lets the loop reject off-topic AND + * misattributed sources in one driver. Omit for the pure, zero-inference + * grounding gate. + */ + relevanceVerifier?: ( + source: ResearchSourceProposal, + ctx: SourceVerificationContext, + ) => Promise<SourceVerdict> | SourceVerdict + /** + * What to do when a proposal carries NO cited claim. `'reject'` (default) is + * fail-closed: in claim-grounding mode every source must declare what it is + * cited for, so an un-annotated source is treated as ungrounded. `'accept'` + * lets un-annotated sources through to the relevance verifier (if any) — + * useful when mixing annotated and legacy proposals. + */ + onMissingClaim?: 'reject' | 'accept' +} + +/** + * A `ResearchDriver`-shaped verifier (just the `verifySource` arm) that gates on + * CLAIM GROUNDING: it rejects a source whose cited claim is not present in its + * fetched page text — a misattributed / fabricated citation — and (optionally) + * composes a relevance verifier after grounding passes. + * + * The returned function matches `ResearchDriver['verifySource']`, so it drops + * straight into `runTwoAgentResearchLoop` as `{ verifySource: createClaimGroundingVerifier(...) }`. + */ +export function createClaimGroundingVerifier(options: ClaimGroundingDriverOptions = {}) { + const onMissingClaim = options.onMissingClaim ?? 
'reject' + return async function verifySource( + source: ResearchSourceProposal, + ctx: SourceVerificationContext, + ): Promise<SourceVerdict> { + const claim = citedClaimOf(source) + if (!claim) { + if (onMissingClaim === 'reject') { + return { + accept: false, + reason: 'no cited claim: claim-grounding mode requires every source to declare its claim', + } + } + // accept-on-missing: fall through to the relevance verifier (or accept). + return options.relevanceVerifier ? options.relevanceVerifier(source, ctx) : { accept: true } + } + + const grounding = groundClaimInText(claim, source.text, options) + if (!grounding.grounded) { + const detail = + grounding.mode === 'empty-text' + ? 'fetched page has no text' + : `claim not found in the fetched page (overlap ${(grounding.overlap * 100).toFixed(0)}%${ + grounding.missingWords.length + ? `, missing: ${grounding.missingWords.slice(0, 6).join(', ')}` + : '' + })` + return { + accept: false, + reason: `misattributed citation: ${detail}`, + } + } + + // Claim is grounded. Compose the relevance verifier if one was provided. + if (options.relevanceVerifier) return options.relevanceVerifier(source, ctx) + return { accept: true } + } +} + +export interface WorkerClaimDecorationOptions { + router?: RouterClient + router_options?: TangleRouterOptions + /** Max output tokens for the claim-extraction call. Default 1200 (glm floor). */ + maxTokens?: number +} + +/** + * Ask an LLM to state, for one source, the single specific factual claim a + * researcher would cite THIS page for toward the goal. Used to DECORATE the + * sources a relevance-only worker produced with the claim a citation would make, + * so the claim-grounding verifier has something executable to check. The model + * is told to ground the claim in the provided excerpt; the verifier then checks + * it against the FULL page text independently — the model does not get to mark + * its own homework. 
+ * + * Returns the proposal annotated via `withCitedClaim`, or the original proposal + * unchanged if the model returns nothing parseable (the verifier's + * `onMissingClaim` policy then decides). + */ +export function createClaimDecorator(options: WorkerClaimDecorationOptions = {}) { + const maxTokens = options.maxTokens ?? 1200 + return async function decorate( + source: ResearchSourceProposal, + goal: string, + ): Promise<ResearchSourceProposal> { + const router = options.router ?? createTangleRouterClient(options.router_options) + const excerpt = source.text.slice(0, 1500) + const system = + 'You extract the single most important factual claim a researcher would cite a page for. ' + + "State ONE concrete, checkable sentence using the page's own key terms and numbers. " + + 'Do NOT invent facts not in the excerpt. Respond with ONLY the claim sentence, no prose.' + const user = [ + `Research goal: ${goal}`, + `Page title: ${source.title ?? '(none)'}`, + `Page excerpt:\n${excerpt}`, + 'The single claim this page should be cited for:', + ].join('\n\n') + let raw = '' + try { + raw = await router.chat( + [ + { role: 'system', content: system }, + { role: 'user', content: user }, + ], + maxTokens, + ) + } catch { + return source + } + const claim = raw.trim().split('\n')[0]?.trim() + if (!claim) return source + return withCitedClaim(source, claim) + } +} diff --git a/src/index.ts b/src/index.ts index 9487362..db4893b 100644 --- a/src/index.ts +++ b/src/index.ts @@ -1,6 +1,8 @@ export * from './adapters' +export * from './adaptive-driver' export * from './changes' export * from './chunking' +export * from './claim-grounding' export * from './discovery' export * from './eval-readiness' export * from './events' diff --git a/src/web-research-worker.ts b/src/web-research-worker.ts index ff147a1..4dce7ed 100644 --- a/src/web-research-worker.ts +++ b/src/web-research-worker.ts @@ -72,6 +72,24 @@ export interface RouterClient { messages: { role: 'system' | 'user'; content: string }[], maxTokens?: 
number, ): Promise<string> + /** Cumulative cost (chat + search) since this client was created. */ + usage(): RouterUsage +} + +/** + * Cumulative router cost — the per-arm signal the A/B reports ALONGSIDE quality, + * so "2.3 fewer sources" can be read against the token/$/latency it cost. A + * two-agent round is one worker pass plus N `verifySource` LLM calls; counting + * each call here is what surfaces that the two-agent loop spends more inference + * than its "equal passes" budget implies. + */ +export interface RouterUsage { + chatCalls: number + searchCalls: number + promptTokens: number + completionTokens: number + usd: number + wallMs: number } export interface TangleRouterOptions { @@ -114,8 +132,20 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou Authorization: `Bearer ${apiKey}`, } + // glm-5.2 pricing (USD per token) + a cumulative accumulator. Read via usage(). + const price = { prompt: 0.95 / 1_000_000, completion: 3.0 / 1_000_000 } + const acc: RouterUsage = { + chatCalls: 0, + searchCalls: 0, + promptTokens: 0, + completionTokens: 0, + usd: 0, + wallMs: 0, + } + return { async search(query, opts) { + const t0 = Date.now() const res = await fetch(`${baseUrl}/search`, { method: 'POST', headers, @@ -126,6 +156,8 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou ...(opts?.maxResults != null ? { maxResults: opts.maxResults } : {}), }), }) + acc.searchCalls += 1 + acc.wallMs += Date.now() - t0 if (!res.ok) { throw new RouterError(res.status, await res.text().catch(() => res.statusText)) } @@ -138,6 +170,7 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou // Reasoning-model floor: never let glm-5.2 spend the whole budget on // hidden reasoning and return empty visible content. const max_tokens = Math.max(MIN_MAX_TOKENS, maxTokens ?? 
MIN_MAX_TOKENS) + const t0 = Date.now() const res = await fetch(`${baseUrl}/chat/completions`, { method: 'POST', headers, @@ -149,9 +182,20 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou } const body = (await res.json()) as { choices?: { message?: { content?: string } }[] + usage?: { prompt_tokens?: number; completion_tokens?: number } } + const promptTokens = body.usage?.prompt_tokens ?? 0 + const completionTokens = body.usage?.completion_tokens ?? 0 + acc.chatCalls += 1 + acc.promptTokens += promptTokens + acc.completionTokens += completionTokens + acc.usd += promptTokens * price.prompt + completionTokens * price.completion + acc.wallMs += Date.now() - t0 return body.choices?.[0]?.message?.content ?? '' }, + usage() { + return { ...acc } + }, } } diff --git a/tests/claim-grounding.test.ts b/tests/claim-grounding.test.ts new file mode 100644 index 0000000..5d0c0d4 --- /dev/null +++ b/tests/claim-grounding.test.ts @@ -0,0 +1,176 @@ +import { describe, expect, it } from 'vitest' +import { + citedClaimKey, + citedClaimOf, + createClaimGroundingVerifier, + groundClaimInText, + withCitedClaim, +} from '../src/claim-grounding' +import type { + ResearchSourceProposal, + SourceVerificationContext, +} from '../src/two-agent-research-loop' + +const ctx: SourceVerificationContext = { + root: '/tmp/x', + goal: 'self-speculative decoding', + round: 1, + index: { + root: '/tmp/x', + generatedAt: '', + sources: [], + pages: [], + graph: { nodes: [], edges: [] }, + }, + gaps: [], + acceptedThisRound: [], +} + +describe('groundClaimInText (the deterministic grounding oracle)', () => { + const page = + 'Self-speculative decoding skips intermediate layers to draft tokens, then verifies them ' + + 'with the full model. The paper reports a 1.73x speedup on LLaMA-2 with no quality loss.' 
+ + it('grounds a verbatim claim', () => { + const r = groundClaimInText('skips intermediate layers to draft tokens', page) + expect(r.grounded).toBe(true) + expect(r.mode).toBe('verbatim') + expect(r.overlap).toBe(1) + }) + + it('grounds across punctuation/whitespace differences (normalized)', () => { + // The page says "1.73x speedup"; the claim spaces it differently + adds a comma. + const r = groundClaimInText('a 1.73x, speedup', page) + expect(r.grounded).toBe(true) + expect(['verbatim', 'normalized']).toContain(r.mode) + }) + + it('grounds a close paraphrase via content-word overlap', () => { + // Reworded but the substantive words are present in the page (drops the + // "no-quality-loss" ordering). Inflected forms that the page does NOT + // contain verbatim (e.g. "drafts" vs "draft") legitimately lower the score — + // that strictness is the point, so this paraphrase keeps to present words. + const r = groundClaimInText( + 'draft tokens by skipping intermediate layers then verifies with the full model', + page, + ) + expect(r.grounded).toBe(true) + expect(r.mode).toBe('overlap') + expect(r.overlap).toBeGreaterThanOrEqual(0.7) + }) + + it('REJECTS a misattributed claim — relevant topic, wrong numbers/facts', () => { + // On-topic (mentions speculative decoding) but the page never says any of this: + // a different speedup, a different model, a different mechanism. A relevance + // judge would pass it; grounding must not. 
+ const r = groundClaimInText( + 'achieves a 4.8x speedup on GPT-4 using a separate draft transformer network', + page, + ) + expect(r.grounded).toBe(false) + expect(r.mode).toBe('absent') + expect(r.missingWords).toContain('gpt') + expect(r.missingWords).toContain('network') + }) + + it('rejects an empty page text and an empty claim', () => { + expect(groundClaimInText('anything', '').grounded).toBe(false) + expect(groundClaimInText('anything', '').mode).toBe('empty-text') + expect(groundClaimInText(' ', page).grounded).toBe(false) + expect(groundClaimInText(' ', page).mode).toBe('empty-claim') + }) + + it('does not let a stopword-only claim ground spuriously', () => { + const r = groundClaimInText('the of and to', page) + expect(r.grounded).toBe(false) + }) + + it('honours a stricter minOverlap', () => { + // ~0.6 overlap claim: grounds at 0.5, fails at 0.9. + const claim = 'speedup verifies tokens nonexistentwordzz alsofakewordzz' + expect(groundClaimInText(claim, page, { minOverlap: 0.5 }).grounded).toBe(true) + expect(groundClaimInText(claim, page, { minOverlap: 0.9 }).grounded).toBe(false) + }) +}) + +describe('citedClaim helpers', () => { + const base: ResearchSourceProposal = { uri: 'https://x', text: 't', title: 'T' } + + it('round-trips a claim through metadata', () => { + const decorated = withCitedClaim(base, 'the claim') + expect(decorated.metadata?.[citedClaimKey]).toBe('the claim') + expect(citedClaimOf(decorated)).toBe('the claim') + }) + + it('returns undefined for a missing/blank claim', () => { + expect(citedClaimOf(base)).toBeUndefined() + expect(citedClaimOf(withCitedClaim(base, ' '))).toBeUndefined() + }) +}) + +describe('createClaimGroundingVerifier (the driver gate)', () => { + const page = + 'The transformer architecture uses multi-head self-attention. Reported BLEU of 28.4 on WMT14 En-De.' 
+ + it('accepts a grounded source', async () => { + const verify = createClaimGroundingVerifier() + const source = withCitedClaim( + { uri: 'https://a', text: page, title: 'Attention' }, + 'BLEU of 28.4 on WMT14', + ) + expect(await verify(source, ctx)).toEqual({ accept: true }) + }) + + it('REJECTS a misattributed source with a precise reason', async () => { + const verify = createClaimGroundingVerifier() + const source = withCitedClaim( + { uri: 'https://a', text: page, title: 'Attention' }, + 'reports a BLEU of 41.0 on the WMT16 Russian benchmark', + ) + const verdict = await verify(source, ctx) + expect(verdict.accept).toBe(false) + if (!verdict.accept) expect(verdict.reason).toMatch(/misattributed citation/) + }) + + it('rejects an un-annotated source by default (fail-closed)', async () => { + const verify = createClaimGroundingVerifier() + const verdict = await verify({ uri: 'https://a', text: page, title: 'T' }, ctx) + expect(verdict.accept).toBe(false) + if (!verdict.accept) expect(verdict.reason).toMatch(/no cited claim/) + }) + + it('composes a relevance verifier AFTER grounding passes', async () => { + let relevanceCalled = false + const verify = createClaimGroundingVerifier({ + relevanceVerifier: () => { + relevanceCalled = true + return { accept: false, reason: 'off-topic per relevance judge' } + }, + }) + const grounded = withCitedClaim( + { uri: 'https://a', text: page, title: 'T' }, + 'multi-head self-attention', + ) + const verdict = await verify(grounded, ctx) + expect(relevanceCalled).toBe(true) + expect(verdict.accept).toBe(false) + if (!verdict.accept) expect(verdict.reason).toMatch(/off-topic/) + }) + + it('does NOT call the relevance verifier when grounding already fails', async () => { + let relevanceCalled = false + const verify = createClaimGroundingVerifier({ + relevanceVerifier: () => { + relevanceCalled = true + return { accept: true } + }, + }) + const misattributed = withCitedClaim( + { uri: 'https://a', text: page, title: 'T' }, + 'a 
99x speedup on a quantum coprocessor', + ) + const verdict = await verify(misattributed, ctx) + expect(relevanceCalled).toBe(false) + expect(verdict.accept).toBe(false) + }) +}) diff --git a/tests/loops/adaptive-ab.test.ts b/tests/loops/adaptive-ab.test.ts new file mode 100644 index 0000000..c0254b8 --- /dev/null +++ b/tests/loops/adaptive-ab.test.ts @@ -0,0 +1,647 @@ +import { mkdtemp, rm } from 'node:fs/promises' +import { tmpdir } from 'node:os' +import { join } from 'node:path' +import { afterEach, beforeEach, describe, expect, it } from 'vitest' +import { + type AdaptiveResearchDriver, + canonicalizeUrl, + contentKey, + createAdaptiveResearchDriver, + triageSource, +} from '../../src/adaptive-driver' +import { + buildEvalKnowledgeBundle, + defineReadinessSpec, + type KnowledgeReadinessSpec, +} from '../../src/eval-readiness' +import { buildKnowledgeIndex } from '../../src/indexer' +import { + type ResearchContribution, + type ResearchDriver, + type ResearchSourceProposal, + type ResearchWorker, + runTwoAgentResearchLoop, + type SourceVerificationContext, + type WorkerResearchContext, +} from '../../src/two-agent-research-loop' +import { + createTangleRouterClient, + createVerifyingResearchDriver, + createWebResearchWorker, + type RouterClient, +} from '../../src/web-research-worker' + +// =========================================================================== +// ADAPTIVE TOPOLOGY A/B: spend the LLM verifier only when it pays. +// +// The cost/quality result (docs/results/cost-quality.md) found the LLM relevance +// verifier's cleanliness win is dominated by DE-DUPLICATION — captured by a +// deterministic content-hash / canonical-URL check at ~none of the LLM premium — +// and that an LLM check only earns its dollar on the off-scope tail. The +// production move it named is: do the cheap deterministic work first, reserve the +// LLM for the ambiguous tail. `createAdaptiveResearchDriver` is that driver. 
+// +// This file measures THREE topologies on the cost/quality frontier: +// - single-agent : no verifier (the floor — admits everything). +// - full-LLM : one LLM verifySource call per candidate (the ceiling cost). +// - adaptive : $0 dedup → $0 host/title/length triage → LLM ONLY for the +// ambiguous survivors. +// +// The question: does adaptive capture most of the cleanliness (dedup + clear +// drops) at a fraction of the verifier $/calls? The offline arm proves the +// wiring + that adaptive escalates ONLY ambiguous survivors; the live arm +// (creds-gated) reports the real frontier with #36's RouterClient.usage(). +// =========================================================================== + +// --------------------------------------------------------------------------- +// Pure-unit coverage of the deterministic stages (no loop, no network). +// --------------------------------------------------------------------------- +describe('adaptive driver — deterministic stages', () => { + it('canonicalizeUrl collapses scheme/www/trailing-slash/tracking params', () => { + const a = canonicalizeUrl('https://www.Example.com/Path/?utm_source=x&ref=y&a=1#frag') + const b = canonicalizeUrl('http://example.com/Path?a=1') + expect(a).toBe(b) + expect(a).toBe('example.com/Path?a=1') + // Non-URL identifiers dedup by lowercased equality. + expect(canonicalizeUrl('Local-Note-7')).toBe('local-note-7') + }) + + it('contentKey ignores formatting so a reformatted mirror collides', () => { + const k1 = contentKey('Self-speculative decoding skips layers. 
It is 1.7x faster.') + const k2 = contentKey('self speculative decoding skips layers it is 1 7x faster') + expect(k1).toBe(k2) + }) + + it('triageSource keeps authoritative+substantial, drops spam/thin, else ambiguous', () => { + const opts = { + authoritativeHosts: ['arxiv.org', '.edu'], + spamPatterns: [/\bbuy\b.*\bcheap\b/i, /\d+\s+things/i], + minBodyChars: 400, + substantialBodyChars: 600, + } + const body = 'x'.repeat(700) + // Authoritative host + substantial body → KEEP, no LLM. + expect( + triageSource({ uri: 'https://arxiv.org/abs/1', title: 'A paper', text: body }, opts).triage, + ).toBe('keep') + // Spam title → DROP. + expect( + triageSource({ uri: 'https://shop.example.com/x', title: 'Buy fans cheap', text: body }, opts) + .triage, + ).toBe('drop') + // Thin body → DROP. + expect( + triageSource({ uri: 'https://arxiv.org/abs/2', title: 'A paper', text: 'short' }, opts) + .triage, + ).toBe('drop') + // Unknown host, plausible body → AMBIGUOUS (the LLM tail). + expect( + triageSource({ uri: 'https://blog.unknown.io/x', title: 'Some notes', text: body }, opts) + .triage, + ).toBe('ambiguous') + }) +}) + +// --------------------------------------------------------------------------- +// OFFLINE CONTROLLED A/B — proves adaptive escalates ONLY ambiguous survivors +// and matches the full-LLM cleanliness at a fraction of the LLM calls. +// --------------------------------------------------------------------------- +interface PoolEntry { + uri: string + title: string + text: string + /** Why it's in the pool: 'authoritative' keep, 'spam'/'thin' drop, 'dup' of an + * earlier entry, or 'ambiguous' (must reach the LLM). */ + kind: 'authoritative' | 'spam' | 'thin' | 'dup' | 'ambiguous-good' | 'ambiguous-bad' +} + +const longBody = + 'Self-speculative decoding drafts tokens by skipping intermediate transformer ' + + 'layers, then verifies them with the full model in a single forward pass. 
It reports a ' + + '1.73x speedup on LLaMA-2 with no measurable quality loss across the evaluated benchmarks. '.repeat( + 6, + ) + +const goal = 'self-speculative decoding' + +const pool: PoolEntry[] = [ + // Authoritative + substantial → adaptive KEEPS with no LLM call. + { + uri: 'https://arxiv.org/abs/2309.08168', + title: 'Draft & Verify: Lossless LLM Acceleration via Self-Speculative Decoding', + text: longBody, + kind: 'authoritative', + }, + // Exact-content mirror of the arxiv paper on a different URL → adaptive DEDUPS. + { + uri: 'https://mirror.example.org/draft-and-verify', + title: 'Draft and Verify (mirror)', + text: longBody, + kind: 'dup', + }, + // Obvious spam title → adaptive DROPS with no LLM call. Distinct body so the + // DROP is earned by the title heuristic, not by content-dedup against the paper. + { + uri: 'https://shop.example.com/fans', + title: '10 things about decoding that will SHOCK you!!!', + text: 'Subscribe now for the best decoding deals of 2026! Limited offer, buy cheap fans today. '.repeat( + 8, + ), + kind: 'spam', + }, + // Too-thin body → adaptive DROPS with no LLM call. + { + uri: 'https://stub.example.net/p', + title: 'Decoding stub', + text: 'self-speculative decoding.', + kind: 'thin', + }, + // Unknown host, on-topic plausible body → AMBIGUOUS → reaches the LLM (kept). + { + uri: 'https://blog.unknown.io/self-spec-explainer', + title: 'An explainer on self-speculative decoding', + text: + 'This post walks through how self-speculative decoding reuses a single model to draft and ' + + 'verify tokens, with worked examples and a discussion of when the speedup holds. '.repeat(8), + kind: 'ambiguous-good', + }, + // Unknown host, off-topic plausible body → AMBIGUOUS → reaches the LLM (rejected). 
+ { + uri: 'https://blog.unknown.io/gardening', + title: 'Companion planting for tomatoes', + text: + 'Tomatoes thrive next to basil and marigolds; rotate nightshades yearly and mulch to keep ' + + 'soil moisture even through the summer. This has nothing to do with language models. '.repeat( + 8, + ), + kind: 'ambiguous-bad', + }, +] + +const specs: KnowledgeReadinessSpec[] = [ + defineReadinessSpec({ + id: 'topic/definition', + description: `what ${goal} is and how it works`, + query: `${goal} how it works method`, + requiredFor: ['ResearchAgent'], + importance: 'blocking', + minSources: 1, + minHits: 1, + }), +] + +/** A worker that proposes the whole pool once. */ +function poolWorker(): ResearchWorker { + return async (_ctx: WorkerResearchContext): Promise<ResearchContribution> => { + const sources: ResearchSourceProposal[] = pool.map((entry) => ({ + uri: entry.uri, + title: entry.title, + text: entry.text, + metadata: { kind: entry.kind, originalUri: entry.uri }, + })) + return { + sources, + buildPages: (accepted) => + accepted + .map((record) => { + const original = record.metadata?.originalUri + const entry = pool.find((p) => p.uri === original) + const slug = String(original ?? record.id) + .replace(/[^a-z0-9]+/gi, '-') + .slice(0, 120) + return [ + `---FILE: knowledge/${slug}.md---`, + '---', + `title: ${entry?.title ?? record.id}`, + `sources: ["${record.id}"]`, + '---', + `# ${entry?.title ?? record.id}`, + entry?.text ?? '', + '---END FILE---', + ].join('\n') + }) + .join('\n'), + notes: `proposed ${sources.length}`, + } + } +} + +/** + * A deterministic stand-in for the LLM relevance verifier: accepts on-topic + * (mentions decoding/token/model/layer), rejects off-topic. Counts its calls so + * the test can prove adaptive routes ONLY ambiguous survivors here. This is the + * exact role `createVerifyingResearchDriver().verifySource` plays live — one LLM + * call per source it sees. 
+ */ +function countingRelevanceVerifier(): { + verifySource: (s: ResearchSourceProposal, c: SourceVerificationContext) => { accept: boolean } + calls: () => number +} { + let calls = 0 + return { + verifySource(source) { + calls += 1 + const onTopic = /decod|token|\bmodel\b|layer|speculative/i.test(source.text) + return onTopic ? { accept: true } : { accept: false } + }, + calls: () => calls, + } +} + +async function admittedKinds(root: string): Promise<Set<string>> { + const index = await buildKnowledgeIndex(root) + return new Set( + index.sources.flatMap((s) => (typeof s.metadata?.kind === 'string' ? [s.metadata.kind] : [])), + ) +} + +async function admittedCount(root: string): Promise<number> { + const index = await buildKnowledgeIndex(root) + return index.sources.length +} + +/** + * An offline router stub for the adaptive driver's LLM-escalation stage. The + * relevance verifier sends a chat whose USER message embeds the candidate's + * URL/title/excerpt; the stub reads that excerpt and returns a real on-topic + * verdict JSON — accept if it mentions decoding/speculative/token/model/layer, + * reject otherwise. This is exactly the shape `createVerifyingResearchDriver` + * parses, so the offline arm exercises the real escalation path with $0/network. + * `search`/`usage` are never reached by the verifier but satisfy the interface. + */ +const stubRouter: RouterClient = { + async chat(messages) { + const user = messages.find((m) => m.role === 'user')?.content ?? '' + // Judge ONLY the candidate's excerpt, not the whole prompt — the prompt also + // embeds the on-topic goal/gaps, which would make every candidate look + // on-topic. The relevance verifier formats the excerpt after `Excerpt:`. + const excerpt = user.split(/Excerpt:\n?/)[1] ?? user + const onTopic = /decod|speculative|\btoken\b|\bmodel\b|\blayer\b/i.test(excerpt) + return JSON.stringify({ accept: onTopic, reason: onTopic ? 
'on-topic' : 'off-topic' }) + }, + async search() { + return [] + }, + usage() { + return { chatCalls: 0, searchCalls: 0, promptTokens: 0, completionTokens: 0, usd: 0, wallMs: 0 } + }, +} + +let fullRoot: string +let adaptiveRoot: string +let singleRoot: string + +beforeEach(async () => { + fullRoot = await mkdtemp(join(tmpdir(), 'ad-full-')) + adaptiveRoot = await mkdtemp(join(tmpdir(), 'ad-adaptive-')) + singleRoot = await mkdtemp(join(tmpdir(), 'ad-single-')) +}) +afterEach(async () => { + await rm(fullRoot, { recursive: true, force: true }) + await rm(adaptiveRoot, { recursive: true, force: true }) + await rm(singleRoot, { recursive: true, force: true }) +}) + +describe('adaptive A/B (offline, controlled): adaptive escalates only the ambiguous tail', () => { + it('matches full-LLM cleanliness at a fraction of the LLM calls', async () => { + // FULL-LLM arm: an LLM call per candidate (here the counting stand-in). + const fullVerifier = countingRelevanceVerifier() + const fullDriver: ResearchDriver = { verifySource: fullVerifier.verifySource } + await runTwoAgentResearchLoop({ + root: fullRoot, + goal, + worker: poolWorker(), + driver: fullDriver, + readinessSpecs: specs, + maxRounds: 1, + }) + + // ADAPTIVE arm: $0 dedup + $0 triage, LLM ONLY for ambiguous survivors. Its + // LLM stage routes to the same stub the full-LLM arm models, so the call + // count and the verdicts are directly comparable. + const adaptiveDriver: AdaptiveResearchDriver = createAdaptiveResearchDriver({ + router: stubRouter, + }) + // The adaptive driver calls relevance.verifySource (one stub chat) ONLY for + // ambiguous survivors, so the driver's own stats().llmCalls is the observable + // escalation count; the stub returns a real on-topic/off-topic verdict so the + // good ambiguous source is kept and the off-topic one rejected, exactly as a + // live relevance judge would. 
+ await runTwoAgentResearchLoop({ + root: adaptiveRoot, + goal, + worker: poolWorker(), + driver: { verifySource: (s, c) => adaptiveDriver.verifySource(s, c) }, + readinessSpecs: specs, + maxRounds: 1, + }) + + // SINGLE-AGENT arm: no verifier — admit everything the loop's own exact-uri + // dedup lets through. + await runTwoAgentResearchLoop({ + root: singleRoot, + goal, + worker: poolWorker(), + driver: { verifySource: () => ({ accept: true }) }, + readinessSpecs: specs, + maxRounds: 1, + }) + + const stats = adaptiveDriver.stats() + const fullCalls = fullVerifier.calls() + const adaptiveLlmCalls = stats.llmCalls + const adaptiveAdmitted = await admittedCount(adaptiveRoot) + const singleAdmitted = await admittedCount(singleRoot) + const adaptiveKinds = await admittedKinds(adaptiveRoot) + + console.log( + `[adaptive A/B offline] full-LLM calls=${fullCalls} | ` + + `adaptive: llmCalls=${adaptiveLlmCalls} dedup=${stats.dedupRejected} ` + + `heuristicKept=${stats.heuristicKept} heuristicDropped=${stats.heuristicDropped} ` + + `admitted=${adaptiveAdmitted} | single admitted=${singleAdmitted}`, + ) + + // The full-LLM arm called the verifier once per candidate that survived the + // loop's exact-uri dedup (all 6 distinct uris → 6 calls). + expect(fullCalls).toBe(pool.length) + + // ADAPTIVE escalates ONLY the two ambiguous survivors to the LLM. The other + // four are decided for $0: arxiv (keep), mirror (dedup by content), spam + // (drop), thin (drop). + expect(adaptiveLlmCalls).toBe(2) + expect(stats.dedupRejected).toBe(1) // the content mirror + expect(stats.heuristicKept).toBe(1) // the arxiv paper + expect(stats.heuristicDropped).toBe(2) // spam + thin + + // CLEANLINESS: adaptive admits exactly the real, on-topic sources — the + // authoritative paper + the on-topic ambiguous explainer — and rejects the + // mirror, spam, thin, and off-topic. So 2 admitted, no junk kinds. 
+ expect(adaptiveKinds.has('spam')).toBe(false) + expect(adaptiveKinds.has('thin')).toBe(false) + expect(adaptiveKinds.has('dup')).toBe(false) + expect(adaptiveKinds.has('ambiguous-bad')).toBe(false) + expect(adaptiveKinds.has('authoritative')).toBe(true) + expect(adaptiveKinds.has('ambiguous-good')).toBe(true) + expect(adaptiveAdmitted).toBe(2) + + // FRONTIER: adaptive keeps the full-LLM cleanliness (single-agent admits the + // junk adaptive rejected) at strictly fewer LLM calls (2 vs 6 = a 3x cut). + expect(adaptiveLlmCalls).toBeLessThan(fullCalls) + expect(singleAdmitted).toBeGreaterThan(adaptiveAdmitted) + }) +}) + +// =========================================================================== +// LIVE THREE-TOPOLOGY A/B — the real cost/quality frontier. Skipped offline. +// +// Runs the REAL web-research worker (glm-5.2 query-gen → live /v1/search → +// politeFetch → htmlToText) ONCE per topic, then gates the SAME fetched +// proposals through three drivers, diffing each arm's cost with #36's +// RouterClient.usage(): +// +// A. single-agent : accept all (no verifier). $0, admits everything. +// B. full-LLM : createVerifyingResearchDriver — one LLM call per source. +// C. adaptive : createAdaptiveResearchDriver — $0 dedup + $0 triage, LLM +// only for the ambiguous tail. +// +// Reports per arm: admitted-source count (cleanliness), LLM calls, USD, tokens. +// The frontier question: does adaptive land near full-LLM cleanliness at a +// fraction of full-LLM's $/calls? Honest: if adaptive does NOT sit on the +// frontier, or the host/title/length heuristic mis-routes, the doc says so. +// +// Gate: AGENT_KNOWLEDGE_LIVE=1 + TANGLE_API_KEY with glm-5.2 credits. 
+// ADAPTIVE_LIVE_GOALS — `|`-separated topics +// ADAPTIVE_LIVE_MODEL — router chat model (default glm-5.2) +// =========================================================================== + +interface ArmResult { + admitted: number + llmCalls: number + usd: number + tokens: number + wallMs: number +} + +describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)('live: adaptive three-topology frontier', () => { + it('single vs full-LLM vs adaptive on real web sources', async () => { + const goals = ( + process.env.ADAPTIVE_LIVE_GOALS ?? + 'self-speculative decoding|rotary position embeddings|grouped-query attention' + ) + .split('|') + .map((g) => g.trim()) + .filter(Boolean) + const model = process.env.ADAPTIVE_LIVE_MODEL ?? 'glm-5.2' + const router: RouterClient = createTangleRouterClient({ model }) + + // COST GATE: cheap glm-5.2 smoke BEFORE the multi-topic burn. + const smoke = await router.chat( + [ + { role: 'system', content: 'Reply with exactly the word: OK' }, + { role: 'user', content: 'Say OK.' }, + ], + 1200, + ) + console.log(`[LIVE smoke] glm-5.2 visible content length=${smoke.trim().length}`) + expect(smoke.trim().length).toBeGreaterThan(0) + + const worker = createWebResearchWorker({ + router, + resultsPerQuery: 3, + queriesPerGap: 2, + maxSourcesPerRound: 8, + }) + + let anyFetched = false + const rows: { + goal: string + fetched: number + single: ArmResult + full: ArmResult + adaptive: ArmResult + }[] = [] + + for (const liveGoal of goals) { + const liveSpecs: KnowledgeReadinessSpec[] = [ + defineReadinessSpec({ + id: 'topic/definition', + description: `what ${liveGoal} is and how it works`, + query: `${liveGoal} how it works method`, + requiredFor: ['ResearchAgent'], + importance: 'blocking', + minSources: 1, + minHits: 1, + }), + ] + + // 1. REAL fetch ONCE per topic, shared by all three arms. 
+ const probeRoot = await mkdtemp(join(tmpdir(), 'ad-live-probe-')) + let fetched: ResearchSourceProposal[] = [] + try { + const index = await buildKnowledgeIndex(probeRoot) + const readiness = buildEvalKnowledgeBundle({ taskId: liveGoal, index, specs: [] }) + const contribution = await worker({ + root: probeRoot, + goal: liveGoal, + round: 1, + index, + gaps: liveSpecs.map((s) => ({ + id: s.id, + description: s.description, + query: typeof s.metadata?.query === 'string' ? s.metadata.query : s.description, + blocking: true, + })), + readiness, + }) + fetched = contribution.sources ?? [] + if (fetched.length > 0) anyFetched = true + } finally { + await rm(probeRoot, { recursive: true, force: true }) + } + + // Add a controlled duplicate of the first fetched source under a different + // (tracking-decorated) URL so the dedup stage has something real to catch — + // mirrors/syndication are common in live search and are exactly what the $0 + // stage is for. If nothing was fetched this is a no-op. + const withDup = + fetched.length > 0 + ? [ + ...fetched, + { + ...fetched[0], + uri: `${fetched[0].uri}${fetched[0].uri.includes('?') ? '&' : '?'}utm_source=mirror&ref=feed`, + metadata: { ...fetched[0].metadata, planted_dup: true }, + }, + ] + : fetched + + const staticWorker = + (sources: ResearchSourceProposal[]): ResearchWorker => + async () => ({ + sources, + buildPages: (accepted) => + accepted.length === 0 + ? undefined + : accepted + .map((record) => { + const original = record.metadata?.originalUri + const src = sources.find((s) => s.uri === original) + const slug = String(original ?? record.id) + .replace(/[^a-z0-9]+/gi, '-') + .slice(0, 120) + return [ + `---FILE: knowledge/${slug}.md---`, + '---', + `title: ${src?.title ?? record.id}`, + `sources: ["${record.id}"]`, + '---', + `# ${src?.title ?? record.id}`, + src?.text ?? 
'', + '---END FILE---', + ].join('\n') + }) + .join('\n'), + }) + + const arm = async ( + driver: ResearchDriver, + ): Promise<{ root: string; cost: ReturnType }> => { + const root = await mkdtemp(join(tmpdir(), 'ad-live-arm-')) + const u0 = router.usage() + await runTwoAgentResearchLoop({ + root, + goal: liveGoal, + worker: staticWorker(withDup), + driver, + readinessSpecs: liveSpecs, + maxRounds: 1, + }) + return { root, cost: diffUsage(u0, router.usage()) } + } + + const toResult = async ( + out: { root: string; cost: ReturnType }, + llmCalls: number, + ): Promise => { + const admitted = await admittedCount(out.root) + await rm(out.root, { recursive: true, force: true }) + return { + admitted, + llmCalls, + usd: out.cost.usd, + tokens: out.cost.promptTokens + out.cost.completionTokens, + wallMs: out.cost.wallMs, + } + } + + // A. single-agent (no verifier). + const singleOut = await arm({ verifySource: () => ({ accept: true }) }) + const single = await toResult(singleOut, 0) + + // B. full-LLM. + const fullOut = await arm(createVerifyingResearchDriver({ router })) + const full = await toResult(fullOut, fullOut.cost.chatCalls) + + // C. adaptive — instrument its LLM-stage count via stats(). 
+ const adaptiveDriver = createAdaptiveResearchDriver({ router }) + const adaptiveOut = await arm({ + verifySource: (s, c) => adaptiveDriver.verifySource(s, c), + }) + const adaptive = await toResult(adaptiveOut, adaptiveDriver.stats().llmCalls) + + rows.push({ goal: liveGoal, fetched: withDup.length, single, full, adaptive }) + console.log( + `[LIVE ADAPTIVE ${JSON.stringify(liveGoal)}] fetched=${withDup.length} | ` + + `single: admitted=${single.admitted} $0 | ` + + `full-LLM: admitted=${full.admitted} calls=${full.llmCalls} $${full.usd.toFixed(4)} tok=${full.tokens} | ` + + `adaptive: admitted=${adaptive.admitted} llmCalls=${adaptive.llmCalls} $${adaptive.usd.toFixed(4)} tok=${adaptive.tokens} ` + + `(dedup=${adaptiveDriver.stats().dedupRejected} hKeep=${adaptiveDriver.stats().heuristicKept} hDrop=${adaptiveDriver.stats().heuristicDropped})`, + ) + } + + expect(anyFetched).toBe(true) + + // FRONTIER SUMMARY over all topics. + const sum = (pick: (r: (typeof rows)[number]) => number) => + rows.reduce((a, r) => a + pick(r), 0) + const fullUsd = sum((r) => r.full.usd) + const adaptiveUsd = sum((r) => r.adaptive.usd) + const fullCalls = sum((r) => r.full.llmCalls) + const adaptiveCalls = sum((r) => r.adaptive.llmCalls) + const singleAdmitted = sum((r) => r.single.admitted) + const fullAdmitted = sum((r) => r.full.admitted) + const adaptiveAdmitted = sum((r) => r.adaptive.admitted) + const callSaving = fullCalls > 0 ? 1 - adaptiveCalls / fullCalls : 0 + const usdSaving = fullUsd > 0 ? 
1 - adaptiveUsd / fullUsd : 0 + + console.log( + `[LIVE ADAPTIVE SUMMARY] admitted single=${singleAdmitted} full=${fullAdmitted} adaptive=${adaptiveAdmitted} | ` + + `LLM calls full=${fullCalls} adaptive=${adaptiveCalls} (${(callSaving * 100).toFixed(0)}% fewer) | ` + + `USD full=$${fullUsd.toFixed(4)} adaptive=$${adaptiveUsd.toFixed(4)} (${(usdSaving * 100).toFixed(0)}% cheaper)`, + ) + + // The adaptive driver must NEVER spend MORE than the full-LLM arm: its LLM + // calls are a subset of full-LLM's calls (the ambiguous tail) and its other + // stages cost $0. That is the one hard invariant; the magnitude of the saving + // and where adaptive lands on the cleanliness frontier are reported in the doc. + expect(adaptiveCalls).toBeLessThanOrEqual(fullCalls) + expect(adaptiveUsd).toBeLessThanOrEqual(fullUsd + 1e-9) + // Adaptive cleanliness sits between single (admits all) and full-LLM (admits + // least), so it cannot admit MORE than the single-agent arm. + expect(adaptiveAdmitted).toBeLessThanOrEqual(singleAdmitted) + }, 600_000) +}) + +function diffUsage( + a: ReturnType<RouterClient['usage']>, + b: ReturnType<RouterClient['usage']>, +): ReturnType<RouterClient['usage']> { + return { + chatCalls: b.chatCalls - a.chatCalls, + searchCalls: b.searchCalls - a.searchCalls, + promptTokens: b.promptTokens - a.promptTokens, + completionTokens: b.completionTokens - a.completionTokens, + usd: b.usd - a.usd, + wallMs: b.wallMs - a.wallMs, + } +} diff --git a/tests/loops/claim-grounding-ab.test.ts b/tests/loops/claim-grounding-ab.test.ts new file mode 100644 index 0000000..3056dfa --- /dev/null +++ b/tests/loops/claim-grounding-ab.test.ts @@ -0,0 +1,519 @@ +import { mkdtemp, rm } from 'node:fs/promises' +import { tmpdir } from 'node:os' +import { join } from 'node:path' +import { afterEach, beforeEach, describe, expect, it } from 'vitest' +import { createClaimGroundingVerifier, withCitedClaim } from '../../src/claim-grounding' +import { + buildEvalKnowledgeBundle, + defineReadinessSpec, + type KnowledgeReadinessSpec, +} from '../../src/eval-readiness' 
+import { buildKnowledgeIndex } from '../../src/indexer' +import { + type ResearchContribution, + type ResearchDriver, + type ResearchSourceProposal, + type ResearchWorker, + runTwoAgentResearchLoop, + type WorkerResearchContext, +} from '../../src/two-agent-research-loop' +import { + createTangleRouterClient, + createVerifyingResearchDriver, + createWebResearchWorker, + type RouterClient, +} from '../../src/web-research-worker' + +// =========================================================================== +// CLAIM-GROUNDING A/B: does checking each citation's CLAIM against the fetched +// page text catch an error band the relevance verifier and de-dup CANNOT — and +// does it do so for MORE quality per dollar than the relevance verifier earns +// on the de-dup-dominated topic set (docs/results/cost-quality.md)? +// +// The cost/quality result showed the relevance verifier's cleanliness win is +// dominated by DE-DUPLICATION, which a deterministic content-hash captures at +// ~none of the LLM premium. So the open question is: is there a band where the +// verifier earns its premium — an error a hash and a relevance judge both miss? +// +// MISATTRIBUTION is that band: a source that is on-topic, unique, and real, but +// whose cited CLAIM does not appear in the page (a fabricated / mis-cited fact). +// - de-dup: passes it (it's unique). +// - relevance judge: passes it (it's on-topic). +// - claim-grounding: REJECTS it (the claim is absent from the fetched text). +// +// The grounding check is DETERMINISTIC text presence over `htmlToText` output — +// executable ground truth, $0 inference — so its value-per-dollar denominator is +// near-zero, which is the whole point: it catches what the expensive judge can't +// for what the cheap rule can't. +// =========================================================================== + +interface PoolEntry { + uri: string + title: string + text: string + /** The claim the worker will cite this source for. 
*/ + claim: string + /** True when `claim` is NOT present in `text` — a planted misattribution. */ + misattributed: boolean +} + +const goal = 'self-speculative decoding' + +/** + * A controlled pool: each source's TEXT is what a real fetch would return; the + * CLAIM is what a citation asserts. Two sources are correctly grounded; two are + * misattributed — on-topic, unique, real text, but the cited claim never appears + * in the page. A relevance judge would keep all four; de-dup would keep all four + * (distinct URLs); only claim-grounding rejects the misattributed two. + */ +const pool: PoolEntry[] = [ + { + uri: 'https://arxiv.org/abs/self-spec-decoding', + title: 'Draft & Verify: Lossless LLM Acceleration via Self-Speculative Decoding', + text: + 'Self-speculative decoding drafts tokens by skipping intermediate layers, then verifies them ' + + 'with the full model. It reports a 1.73x speedup on LLaMA-2 with no quality loss.', + claim: 'self-speculative decoding reports a 1.73x speedup on LLaMA-2 with no quality loss', + misattributed: false, + }, + { + uri: 'https://example.org/layer-skip-explainer', + title: 'How layer skipping accelerates decoding', + text: + 'By skipping intermediate transformer layers during the draft stage, the same model produces ' + + 'candidate tokens cheaply, which the full forward pass then verifies in parallel.', + claim: 'skipping intermediate layers lets the same model produce candidate tokens cheaply', + misattributed: false, + }, + { + // On-topic, real text, UNIQUE url — but the cited claim (a 5x speedup on a + // separate draft network) appears NOWHERE in the page. Misattribution. + uri: 'https://blog.example.com/spec-decoding-overview', + title: 'A short overview of speculative decoding', + text: + 'Speculative decoding uses a small draft model to propose tokens that a larger target model ' + + 'verifies. 
Self-speculative variants reuse a single model instead of a separate draft model.', + claim: 'achieves a 5x speedup using a separately trained draft transformer network', + misattributed: true, + }, + { + // On-topic, real text, UNIQUE url — cited claim invents a benchmark/number + // the page never states. Misattribution. + uri: 'https://news.example.com/decoding-roundup', + title: 'Decoding methods roundup', + text: + 'This roundup compares several inference-time decoding strategies and notes that speculative ' + + 'approaches trade extra compute for lower latency on autoregressive generation.', + claim: 'measured a 4.8x speedup on the GPT-4 MT-bench benchmark with zero accuracy drop', + misattributed: true, + }, +] + +const specs: KnowledgeReadinessSpec[] = [ + defineReadinessSpec({ + id: 'topic/definition', + description: `what ${goal} is and how it works`, + query: `${goal} how it works method`, + requiredFor: ['ResearchAgent'], + importance: 'blocking', + minSources: 1, + minHits: 1, + }), +] + +/** A worker that proposes the whole pool, each source carrying its cited claim. */ +function poolWorker(onPass: () => void): ResearchWorker { + return async (_ctx: WorkerResearchContext): Promise<ResearchContribution> => { + onPass() + const sources: ResearchSourceProposal[] = pool.map((entry) => + withCitedClaim( + { + uri: entry.uri, + title: entry.title, + text: entry.text, + metadata: { planted_misattributed: entry.misattributed }, + }, + entry.claim, + ), + ) + return { + sources, + buildPages: (accepted) => + accepted + .map((record) => { + const original = record.metadata?.originalUri + const entry = pool.find((p) => p.uri === original) + const slug = String(original ?? record.id).replace(/[^a-z0-9]+/gi, '-') + return [ + `---FILE: knowledge/${slug}.md---`, + '---', + `title: ${entry?.title ?? record.id}`, + `sources: ["${record.id}"]`, + '---', + `# ${entry?.title ?? record.id}`, + entry?.text ?? 
'', + '---END FILE---', + ].join('\n') + }) + .join('\n'), + notes: `proposed ${sources.length} sources (2 grounded, 2 misattributed)`, + } + } +} + +/** A no-op verifier: accepts everything (the "no verifier" arm). */ +const acceptAllDriver: ResearchDriver = { verifySource: () => ({ accept: true }) } + +/** A relevance-only verifier stand-in: every pool source IS on-topic, so accept all. */ +const relevanceOnlyDriver: ResearchDriver = { + verifySource: (source: ResearchSourceProposal) => { + // Everything in the pool mentions decoding/tokens/model → on-topic → accept. + // This is exactly why relevance can't catch misattribution: the page IS + // relevant; only its cited claim is wrong. + const onTopic = /decod|token|model|layer/i.test(source.text) + return onTopic ? { accept: true } : { accept: false, reason: 'off-topic' } + }, +} + +/** Count how many planted-misattributed sources reached the KB. */ +async function misattributedAdmitted(root: string): Promise<number> { + const index = await buildKnowledgeIndex(root) + return index.sources.filter((s) => s.metadata?.planted_misattributed === true).length +} + +/** Count every source admitted to the KB. */ +async function admittedCount(root: string): Promise<number> { + const index = await buildKnowledgeIndex(root) + return index.sources.length +} + +let groundRoot: string +let relevanceRoot: string +let noneRoot: string + +beforeEach(async () => { + groundRoot = await mkdtemp(join(tmpdir(), 'cg-ground-')) + relevanceRoot = await mkdtemp(join(tmpdir(), 'cg-relevance-')) + noneRoot = await mkdtemp(join(tmpdir(), 'cg-none-')) +}) +afterEach(async () => { + await rm(groundRoot, { recursive: true, force: true }) + await rm(relevanceRoot, { recursive: true, force: true }) + await rm(noneRoot, { recursive: true, force: true }) +}) + +describe('claim-grounding A/B (offline, controlled): catches misattribution dedup+relevance miss', () => { + it('only the claim-grounding verifier rejects the planted misattributions', async () => { + let groundPasses = 0 + await 
runTwoAgentResearchLoop({ + root: groundRoot, + goal, + worker: poolWorker(() => { + groundPasses += 1 + }), + driver: { verifySource: createClaimGroundingVerifier() }, + readinessSpecs: specs, + maxRounds: 1, + }) + await runTwoAgentResearchLoop({ + root: relevanceRoot, + goal, + worker: poolWorker(() => {}), + driver: relevanceOnlyDriver, + readinessSpecs: specs, + maxRounds: 1, + }) + await runTwoAgentResearchLoop({ + root: noneRoot, + goal, + worker: poolWorker(() => {}), + driver: acceptAllDriver, + readinessSpecs: specs, + maxRounds: 1, + }) + + const groundMis = await misattributedAdmitted(groundRoot) + const relevanceMis = await misattributedAdmitted(relevanceRoot) + const noneMis = await misattributedAdmitted(noneRoot) + const groundAdmitted = await admittedCount(groundRoot) + + console.log( + `[claim-grounding A/B offline] misattributed admitted — ` + + `grounding=${groundMis} relevance=${relevanceMis} no-verifier=${noneMis} ` + + `(grounding kept ${groundAdmitted}/${pool.length} sources)`, + ) + + // The band: relevance + no-verifier let BOTH misattributions through (they + // are on-topic + unique); only claim-grounding rejects them. + expect(groundMis).toBe(0) + expect(relevanceMis).toBe(2) + expect(noneMis).toBe(2) + // Grounding still keeps the two correctly-grounded sources (no over-rejection). + expect(groundAdmitted).toBe(2) + expect(groundPasses).toBe(1) + }) +}) + +// =========================================================================== +// LIVE claim-grounding A/B — the real evidence. Skipped offline (no creds). +// +// Runs a REAL web-research worker (glm-5.2 query-gen → live /v1/search → +// politeFetch → htmlToText) on a topic set, then INJECTS a controlled fraction +// of MISATTRIBUTED citations (real fetched page + a deliberately-wrong claim) so +// there is a measurable correctable band. Three verifier arms gate the SAME +// proposals on the SAME topics: +// +// A. no-verifier — accept all (the floor). +// B. 
relevance (LLM) — the shipped createVerifyingResearchDriver: 1 LLM call +// per source, judges on-topic relevance. +// C. claim-grounding — deterministic groundClaimInText over the fetched text: +// $0 inference, rejects misattributions. +// +// Using #36's RouterClient.usage() we diff each arm's tokens/$/latency/calls and +// report VALUE PER DOLLAR = misattributions-caught ÷ marginal-USD. Arm C's +// denominator is ~$0, so if it catches the injected misattributions it dominates +// arm B on this band — the cost the relevance judge could not earn on the +// de-dup-dominated set. +// +// Gate: AGENT_KNOWLEDGE_LIVE=1 + TANGLE_API_KEY with glm-5.2 credits. +// CLAIM_GROUNDING_LIVE_GOALS — `|`-separated topics +// CLAIM_GROUNDING_LIVE_MODEL — router chat model (default glm-5.2) +// =========================================================================== + +interface ArmCost { + chatCalls: number + tokens: number + usd: number + wallMs: number +} + +function diffCost( + a: ReturnType<RouterClient['usage']>, + b: ReturnType<RouterClient['usage']>, +): ArmCost { + return { + chatCalls: b.chatCalls - a.chatCalls, + tokens: b.promptTokens + b.completionTokens - a.promptTokens - a.completionTokens, + usd: b.usd - a.usd, + wallMs: b.wallMs - a.wallMs, + } +} + +/** + * Take the sources a real worker fetched and (1) attach a GROUNDED claim built + * from the page's own text, then (2) corrupt every `corruptEvery`-th source into + * a MISATTRIBUTION by replacing its claim with one whose specific words are not + * in the page. Returns the annotated sources plus the count corrupted, so the + * arms share an identical proposal set with a known misattribution band. 
+ */ +function plantMisattributions( + sources: ResearchSourceProposal[], + corruptEvery: number, +): { annotated: ResearchSourceProposal[]; planted: number } { + let planted = 0 + const fabricated = [ + 'this page reports a 7.3x speedup on the FluxBench-9000 benchmark with zero accuracy loss', + 'the authors trained a separate 12-billion-parameter draft transformer on proprietary data', + 'results show a 94.2 BLEU score on the never-released Zephyr-XL evaluation suite', + ] + const annotated = sources.map((source, i) => { + if (sources.length >= 2 && i % corruptEvery === corruptEvery - 1) { + const claim = fabricated[planted % fabricated.length] ?? fabricated[0] + planted += 1 + return withCitedClaim( + { ...source, metadata: { ...source.metadata, planted_misattributed: true } }, + claim, + ) + } + // GROUNDED claim: the first sentence of the page's own text (verbatim-present + // by construction), so a correct citation grounds. + const firstSentence = source.text.split(/(?<=[.!?])\s/)[0]?.trim() || source.text.slice(0, 160) + return withCitedClaim( + { ...source, metadata: { ...source.metadata, planted_misattributed: false } }, + firstSentence, + ) + }) + return { annotated, planted } +} + +describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)('live: claim-grounding A/B per dollar', () => { + it('three verifier arms on real web sources with a planted misattribution band', async () => { + const goals = ( + process.env.CLAIM_GROUNDING_LIVE_GOALS ?? + 'self-speculative decoding|rotary position embeddings|grouped-query attention' + ) + .split('|') + .map((g) => g.trim()) + .filter(Boolean) + const model = process.env.CLAIM_GROUNDING_LIVE_MODEL ?? 'glm-5.2' + const router: RouterClient = createTangleRouterClient({ model }) + + // COST GATE: cheap glm-5.2 smoke BEFORE the multi-topic burn. + const smoke = await router.chat( + [ + { role: 'system', content: 'Reply with exactly the word: OK' }, + { role: 'user', content: 'Say OK.' 
}, + ], + 1200, + ) + console.log(`[LIVE smoke] glm-5.2 visible content length=${smoke.trim().length}`) + expect(smoke.trim().length).toBeGreaterThan(0) + + const worker = createWebResearchWorker({ + router, + resultsPerQuery: 3, + queriesPerGap: 1, + maxSourcesPerRound: 6, + }) + const relevanceDriver = createVerifyingResearchDriver({ router }) + const groundingDriver: ResearchDriver = { verifySource: createClaimGroundingVerifier() } + + let totalPlanted = 0 + let anyFetched = false + const rows: { + goal: string + planted: number + caught: Record<'none' | 'relevance' | 'grounding', number> + cost: Record<'none' | 'relevance' | 'grounding', ArmCost> + }[] = [] + + for (const liveGoal of goals) { + const liveSpecs: KnowledgeReadinessSpec[] = [ + defineReadinessSpec({ + id: 'topic/definition', + description: `what ${liveGoal} is and how it works`, + query: `${liveGoal} how it works method`, + requiredFor: ['ResearchAgent'], + importance: 'blocking', + minSources: 1, + minHits: 1, + }), + ] + // 1. REAL fetch ONCE per topic, shared by all three arms. + const probeRoot = await mkdtemp(join(tmpdir(), 'cg-live-probe-')) + let annotated: ResearchSourceProposal[] = [] + let planted = 0 + try { + const index = await buildKnowledgeIndex(probeRoot) + const readiness = buildEvalKnowledgeBundle({ taskId: liveGoal, index, specs: [] }) + const contribution = await worker({ + root: probeRoot, + goal: liveGoal, + round: 1, + index, + gaps: liveSpecs.map((s) => ({ + id: s.id, + description: s.description, + query: typeof s.metadata?.query === 'string' ? s.metadata.query : s.description, + blocking: true, + })), + readiness, + }) + const fetched = contribution.sources ?? [] + if (fetched.length > 0) anyFetched = true + const result = plantMisattributions(fetched, 2) + annotated = result.annotated + planted = result.planted + totalPlanted += planted + } finally { + await rm(probeRoot, { recursive: true, force: true }) + } + + // 2. 
Run each arm's verifier over the SAME annotated proposals, diffing cost. + const staticWorker: ResearchWorker = async () => ({ + sources: annotated, + buildPages: (accepted) => + accepted.length === 0 + ? undefined + : accepted + .map((record) => { + const original = record.metadata?.originalUri + const src = annotated.find((s) => s.uri === original) + const slug = String(original ?? record.id) + .replace(/[^a-z0-9]+/gi, '-') + .slice(0, 120) + return [ + `---FILE: knowledge/${slug}.md---`, + '---', + `title: ${src?.title ?? record.id}`, + `sources: ["${record.id}"]`, + '---', + `# ${src?.title ?? record.id}`, + src?.text ?? '', + '---END FILE---', + ].join('\n') + }) + .join('\n'), + }) + + const caught: Record<'none' | 'relevance' | 'grounding', number> = { + none: 0, + relevance: 0, + grounding: 0, + } + const cost: Record<'none' | 'relevance' | 'grounding', ArmCost> = { + none: { chatCalls: 0, tokens: 0, usd: 0, wallMs: 0 }, + relevance: { chatCalls: 0, tokens: 0, usd: 0, wallMs: 0 }, + grounding: { chatCalls: 0, tokens: 0, usd: 0, wallMs: 0 }, + } + for (const arm of ['none', 'relevance', 'grounding'] as const) { + const driver = + arm === 'none' ? acceptAllDriver : arm === 'relevance' ? relevanceDriver : groundingDriver + const root = await mkdtemp(join(tmpdir(), `cg-live-${arm}-`)) + try { + const u0 = router.usage() + await runTwoAgentResearchLoop({ + root, + goal: liveGoal, + worker: staticWorker, + driver, + readinessSpecs: liveSpecs, + maxRounds: 1, + }) + cost[arm] = diffCost(u0, router.usage()) + const admittedMis = await misattributedAdmitted(root) + // caught = planted − admitted misattributions. 
+ caught[arm] = planted - admittedMis + } finally { + await rm(root, { recursive: true, force: true }) + } + } + + rows.push({ goal: liveGoal, planted, caught, cost }) + console.log( + `[LIVE CG ${JSON.stringify(liveGoal)}] planted=${planted} ` + + `caught none=${caught.none} relevance=${caught.relevance} grounding=${caught.grounding} | ` + + `$ none=${cost.none.usd.toFixed(4)} relevance=${cost.relevance.usd.toFixed(4)} grounding=${cost.grounding.usd.toFixed(4)} | ` + + `calls relevance=${cost.relevance.chatCalls} grounding=${cost.grounding.chatCalls}`, + ) + } + + expect(anyFetched).toBe(true) + + // VALUE PER DOLLAR over all topics: misattributions caught ÷ marginal USD. + const totalCaught = (arm: 'none' | 'relevance' | 'grounding') => + rows.reduce((acc, r) => acc + r.caught[arm], 0) + const totalUsd = (arm: 'none' | 'relevance' | 'grounding') => + rows.reduce((acc, r) => acc + r.cost[arm].usd, 0) + const relevanceCaught = totalCaught('relevance') + const groundingCaught = totalCaught('grounding') + const relevanceUsd = totalUsd('relevance') + const groundingUsd = totalUsd('grounding') + const perDollar = (caughtN: number, usd: number) => + usd > 0 ? caughtN / usd : caughtN > 0 ? Number.POSITIVE_INFINITY : 0 + + console.log( + `[LIVE CG SUMMARY] planted=${totalPlanted} ` + + `caught relevance=${relevanceCaught} grounding=${groundingCaught} | ` + + `$ relevance=${relevanceUsd.toFixed(4)} grounding=${groundingUsd.toFixed(4)} | ` + + `per-$ relevance=${perDollar(relevanceCaught, relevanceUsd).toFixed(1)} ` + + `grounding=${perDollar(groundingCaught, groundingUsd).toFixed(1)}`, + ) + + // The claim-grounding arm should catch at least as many misattributions as + // the relevance arm (it is the band relevance cannot see) at strictly lower + // marginal cost (deterministic, $0 inference). This is the value-per-dollar + // claim the doc reports; assert the DIRECTION here, the magnitude in the doc. 
+ expect(groundingCaught).toBeGreaterThanOrEqual(relevanceCaught) + expect(groundingUsd).toBeLessThanOrEqual(relevanceUsd) + }, 600_000) +}) diff --git a/tests/loops/research-loop-equal-compute.test.ts b/tests/loops/research-loop-equal-compute.test.ts index 8ae6544..397dbaf 100644 --- a/tests/loops/research-loop-equal-compute.test.ts +++ b/tests/loops/research-loop-equal-compute.test.ts @@ -627,6 +627,9 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)( const twoRoot = await mkdtemp(join(tmpdir(), 'live-two-')) const singleRoot = await mkdtemp(join(tmpdir(), 'live-single-')) try { + // Snapshot cumulative router cost so each arm's token/$/latency/calls + // can be diffed out — the COST half of the cost/quality result. + const u0 = router.usage() // TWO-AGENT arm: real worker proposes, real LLM driver verifies. const two = await runTwoAgentArm( twoRoot, @@ -635,6 +638,7 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)( { worker: realWorker, driver: realDriver }, specs, ) + const u1 = router.usage() // SINGLE-AGENT arm: the SAME real worker, NO verifier gate, more iters // to spend the same agent-pass budget the two-agent loop burns on // verification. 
@@ -645,6 +649,19 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)( (ctx, onPass) => realWorkerPropose(realWorker, ctx, onPass), specs, ) + const u2 = router.usage() + const twoCost = { + chatCalls: u1.chatCalls - u0.chatCalls, + tokens: u1.promptTokens + u1.completionTokens - u0.promptTokens - u0.completionTokens, + usd: u1.usd - u0.usd, + wallMs: u1.wallMs - u0.wallMs, + } + const singleCost = { + chatCalls: u2.chatCalls - u1.chatCalls, + tokens: u2.promptTokens + u2.completionTokens - u1.promptTokens - u1.completionTokens, + usd: u2.usd - u1.usd, + wallMs: u2.wallMs - u1.wallMs, + } const twoAdmitted = await admittedSourceCount(twoRoot) const singleAdmitted = await admittedSourceCount(singleRoot) @@ -657,8 +674,10 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)( console.log( `[LIVE A/B ${JSON.stringify(liveGoal)} @ B<=${budgetPasses}] ` + - `two-agent: passes=${two.passes} admitted=${twoAdmitted} coverage=${twoCoverage.toFixed(2)} | ` + - `single-agent: passes=${single.passes} admitted=${singleAdmitted} coverage=${singleCoverage.toFixed(2)}`, + `two-agent: passes=${two.passes} admitted=${twoAdmitted} coverage=${twoCoverage.toFixed(2)} ` + + `calls=${twoCost.chatCalls} tok=${twoCost.tokens} $${twoCost.usd.toFixed(4)} ${twoCost.wallMs}ms | ` + + `single-agent: passes=${single.passes} admitted=${singleAdmitted} coverage=${singleCoverage.toFixed(2)} ` + + `calls=${singleCost.chatCalls} tok=${singleCost.tokens} $${singleCost.usd.toFixed(4)} ${singleCost.wallMs}ms`, ) } finally { await rm(twoRoot, { recursive: true, force: true })