diff --git a/docs/results/adaptive.md b/docs/results/adaptive.md
new file mode 100644
index 0000000..8533536
--- /dev/null
+++ b/docs/results/adaptive.md
@@ -0,0 +1,120 @@
+# Adaptive topology: spend the LLM verifier only when it pays
+
+The cost/quality A/B (`docs/results/cost-quality.md`) found the LLM relevance
+verifier's cleanliness win is dominated by **de-duplication** — which a
+deterministic content-hash / canonical-URL check captures at ~none of the LLM
+premium — and that an LLM check only earns its dollar on the off-scope tail. The
+production move it named was: do the cheap deterministic work first, reserve the
+LLM for the ambiguous tail. `createAdaptiveResearchDriver`
+(`src/adaptive-driver.ts`) is that driver, and this is its measurement.
+
+Per candidate source the adaptive driver runs three stages, cheapest first,
+stopping at the first that decides:
+
+1. **Dedup ($0).** Reject a source whose canonical URL (scheme/`www`/trailing
+   slash/tracking params stripped) or normalized-text content hash matches one
+   already accepted this round or in the KB.
+2. **Heuristic triage ($0).** Classify a unique survivor with host/title/length
+   signals only: an authoritative host (arxiv, `*.edu`, `*.gov`, official docs,
+   github, …) with a substantial body is **kept**; an obvious spam/listicle
+   title or a too-thin body is **dropped**; everything else is **ambiguous**.
+3. **LLM escalation ($).** Only ambiguous survivors reach the shipped LLM
+   relevance verifier (`createVerifyingResearchDriver`) — one call each.
+
+## Live frontier, n=5 topics (glm-5.2)
+
+Real web-research worker fetches each topic once; the same fetched proposals
+(plus one planted tracking-decorated mirror of the first source, so the dedup
+stage has a real duplicate to catch) are gated through all three drivers. Cost
+is the per-arm `RouterClient.usage()` diff (#36). Total spend for the run:
+**$0.033**.
+
+| topic | fetched | single admit | full-LLM admit / calls / $ | adaptive admit / LLM calls / $ |
+|---|---|---|---|---|
+| self-speculative decoding | 3 | 3 | 1 / 3 / $0.0027 | 2 / **0** / **$0.0000** |
+| rotary position embeddings | 3 | 3 | 1 / 3 / $0.0031 | 2 / **0** / **$0.0000** |
+| grouped-query attention | 7 | 7 | 3 / 7 / $0.0072 | 6 / 3 / $0.0030 |
+| KV-cache quantization | 5 | 5 | 3 / 5 / $0.0052 | 4 / **0** / **$0.0000** |
+| LoRA fine-tuning | 7 | 7 | 4 / 7 / $0.0079 | 6 / 3 / $0.0037 |
+| **total** | **25** | **25** | **12 / 25 / $0.0261** | **20 / 6 / $0.0068** |
+
+**Cost.** Adaptive cuts LLM verifier calls **76%** (25 → 6) and dollars **74%**
+($0.0261 → $0.0068). On 3 of the 5 topics it spent **zero** LLM calls — every
+unique survivor was on an authoritative host, so the $0 stages decided
+everything.
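+
+As a check on the two headline percentages, the reductions follow directly from
+the totals row of the table above. Nothing here is new data; the last line is
+the share of full-LLM spend that adaptive retains:
+
+```latex
+\begin{aligned}
+\text{call reduction} &= \frac{25 - 6}{25} = 0.76 \\
+\text{dollar reduction} &= \frac{0.0261 - 0.0068}{0.0261} \approx 0.739 \approx 74\% \\
+\text{adaptive share of full-LLM dollars} &= \frac{0.0068}{0.0261} \approx 0.26
+\end{aligned}
+```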
+
+## The honest reading: adaptive is a frontier POINT, not a free lunch
+
+Admitted-source counts (lower = cleaner KB): **single 25, adaptive 20,
+full-LLM 12**. Adaptive sits **between** the two:
+
+- It removes **5 of the 13 sources** the full-LLM judge rejects that the
+  single-agent loop keeps (every one of them a real duplicate caught by the $0
+  dedup stage — exactly the de-dup-dominated win the cost/quality result
+  predicted).
+- It does **NOT** match full-LLM cleanliness. The remaining 8 sources full-LLM
+  rejects, adaptive keeps. The cause is structural and visible in the trace: on
+  this topic set most non-duplicate survivors landed on an authoritative host
+  (arxiv / github / official docs), so the heuristic **kept** them without ever
+  asking the LLM. When full-LLM did ask, the LLM judged several of those same
+  authoritative pages not-quite-on-topic and dropped them. The host prior is
+  coarser than the relevance judge.
+
+So the frontier tradeoff is concrete: **adaptive recovers the deterministic
+de-dup half of full-LLM's cleanliness for ~26% of full-LLM's dollars, and gives
+up the relevance-judgment half.** Whether that is the right point depends on the
+cost of a kept-but-marginal source. If a slightly-off-topic authoritative page
+is cheap to carry, adaptive dominates. If every admitted source must clear a
+relevance bar, the host heuristic is too permissive and you want the full LLM —
+or a tightened heuristic.
+
+## Where the heuristic is weak — stated plainly
+
+The escalation count is the diagnostic. On 3 of 5 topics it was **zero**: the
+heuristic never deferred to the LLM, so on those topics adaptive is a
+**pure host/title/length rule**, and its cleanliness is exactly that rule's
+cleanliness — no smarter than "trust arxiv/github, drop spam." That is fine when
+the worker's sources are dominated by authoritative hosts (as here), but it
+means the LLM's relevance judgment is contributing nothing on those topics, by
+construction. The two topics where adaptive *did* escalate (grouped-query
+attention, LoRA) are where some survivors were on unknown hosts — and there the
+3 LLM calls per topic are the off-scope tail the verifier is actually for.
+
+The heuristic would mis-route in two directions a richer worker would expose,
+neither seen on this authoritative-host-heavy set:
+
+- **False keep:** an authoritative-host page that is off-topic or low-value
+  (an arxiv paper on an unrelated subject) is kept without the LLM ever seeing
+  it. The host prior cannot catch this; only the relevance judge can.
+- **False drop:** a genuinely good source on an unknown blog/host with a
+  spam-shaped title, or under the 400-char body floor, is dropped before the LLM
+  could rescue it.
+
+## What this changes
+
+The deployable recommendation from the cost/quality result was "deterministic
+dedup first, reserve the LLM for the tail." This driver ships that and the
+measurement confirms the **cost** half cleanly (76% fewer calls, 74% cheaper)
+and qualifies the **quality** half honestly: adaptive captures the de-dup
+cleanliness (the dominant, free win) but not the LLM's relevance cleanliness,
+because the host heuristic resolves authoritative survivors without asking. For
+a worker whose sources are mostly authoritative, adaptive is the right frontier
+point. For one whose junk is on-topic-looking pages on unknown hosts, the
+ambiguous tail grows and adaptive converges toward full-LLM cost — which is the
+correct behavior: it pays for the LLM exactly when the cheap signals can't
+decide.
+
+## Run it
+
+```bash
+# offline (controlled wiring + escalates-only-ambiguous proof, no creds)
+pnpm exec vitest run tests/loops/adaptive-ab.test.ts
+
+# live three-topology frontier (needs TANGLE_API_KEY with glm-5.2 credits)
+AGENT_KNOWLEDGE_LIVE=1 \
+ADAPTIVE_LIVE_GOALS="self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA fine-tuning" \
+pnpm exec vitest run tests/loops/adaptive-ab.test.ts -t "three-topology"
+```
+
+A cheap one-call glm-5.2 smoke test gates the multi-topic burn, failing fast if the
+key or the reasoning-token floor is broken, before the multi-topic dollars are spent.
diff --git a/docs/results/claim-grounding.md b/docs/results/claim-grounding.md
new file mode 100644
index 0000000..f262d9f
--- /dev/null
+++ b/docs/results/claim-grounding.md
@@ -0,0 +1,97 @@
+# Claim-grounding: the band where the verifier earns its dollar
+
+`docs/results/cost-quality.md` found the relevance verifier's cleanliness win is
+**dominated by de-duplication** — a deterministic content-hash captures most of it
+at ~none of the LLM premium. So the open question was: is there an error band where
+a verifier earns its cost — something a hash AND a relevance judge both miss?
+
+**Yes: misattributed citations.** These are sources that are on-topic, unique, and
+real, but whose cited claim does not appear in the page (the LLM wrote a plausible
+sentence and hung a real URL off it). De-dup passes them (they are unique). A relevance
+judge passes them (the page is on-topic). Only checking the claim against the fetched
+text catches them, and that check is **deterministic text presence, $0 inference**.
+
+## The mode
+
+Each proposed source now carries the specific claim it is cited for
+(`withCitedClaim` → `metadata.citedClaim`). The verifier
+(`createClaimGroundingVerifier`) runs `groundClaimInText(claim, pageText)` over the
+`htmlToText` output of the page the worker actually fetched — verbatim, normalized
+(punctuation/whitespace-insensitive), or a ≥70% content-word overlap close
+paraphrase. A claim that isn't present is rejected as **misattributed**. The oracle
+is text presence, not a model call, so it composes with the LLM relevance verifier
+(reject off-topic AND misattributed) or runs alone at zero inference cost.
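+
+Read as a formula, the paraphrase branch might look like the following. This is a
+sketch under stated assumptions, not the definition in the source: `C` and `P`
+stand for the sets of content words in the claim and in the page text, and how
+stop words are removed or words are normalized is left unspecified here. Only the
+0.7 threshold and the `minOverlap` name come from this write-up.
+
+```latex
+\mathrm{overlap}(C, P) = \frac{|C \cap P|}{|C|},
+\qquad
+\text{grounded} \iff \text{verbatim} \;\lor\; \text{normalized} \;\lor\;
+\mathrm{overlap}(C, P) \ge \mathtt{minOverlap}, \quad \mathtt{minOverlap} = 0.7
+```
+
+Under this reading, a claim with 10 content words of which 6 appear in the page
+scores 0.6 and is rejected as misattributed, while 7 of 10 scores 0.7 and passes.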
+
+## Live A/B (glm-5.2, real web fetch, planted misattribution band)
+
+Real worker (glm-5.2 query-gen → live `/v1/search` → `politeFetch` → `htmlToText`)
+fetches the sources once per topic; we then plant ONE misattribution per topic (real
+fetched page + a deliberately-wrong claim) and run three verifier arms over the SAME
+proposals. Cost diffed per arm with the #36 `RouterClient.usage()` instrumentation.
+
+| n=5 topics | misattributions caught | marginal $ | $/topic | per-$ caught |
+|---|---|---|---|---|
+| no-verifier | 0 / 5 | $0.0000 | — | — |
+| relevance (LLM judge) | **4 / 5** | $0.0157 | ~$0.0031 | 254 |
+| claim-grounding (text) | **5 / 5** | **$0.0000** | $0 | ∞ |
+
+Per-topic (caught relevance / grounding): self-speculative decoding 1/1, rotary
+position embeddings 1/1, grouped-query attention 1/1, **KV-cache quantization 0/1**,
+LoRA 1/1. (An earlier 3-topic run missed self-speculative decoding instead — the
+miss moves around; it is not a fixed topic.)
+
+**Reading.** Claim-grounding catches every misattribution at **$0**; the relevance
+judge catches most but **misses one in five at ~$0.003/topic**. The miss is the
+point: the relevance verifier only ever sees the page text, never the cited claim,
+so it is *structurally blind* to misattribution. It catches one only by accident —
+when the fabricated claim happens to also make the page read off-topic
+(e.g. a "12-billion-parameter draft transformer" claim on a rotary-embeddings page).
+When the fabrication stays on-topic (the KV-cache case), the judge waves it through.
+
+So on THIS band the verifier-per-dollar comparison inverts the cost/quality result:
+there, the LLM verifier bought a dedup-shaped gain a free hash already captures —
+expensive for what a cheap rule does. Here the cheap, deterministic check
+**dominates** the expensive judge: it catches strictly more (5/5 vs 4/5) at strictly
+less ($0 vs $0.0157). The verifier earns its dollar on misattribution; it does not on
+de-duplication.
+
+## Why this is a real correctable band (not dedup, not relevance)
+
+- **Not de-duplication.** Every planted source has a unique URL and unique text; a
+  content-hash / canonical-URL dedup keeps all of them.
+- **Not generic relevance.** Every planted source is on-topic; the relevance judge
+  (and the offline relevance stand-in) accept them. The error is in the *claim*, not
+  the *topic*.
+- **Executable ground truth.** The check is presence/close-paraphrase of the claim
+  in the fetched text — deployable in production with no oracle and no model call.
+
+The offline arm proves the floor with a controlled 4-source pool (2 grounded, 2
+misattributed): claim-grounding admits **0/2** misattributions and keeps **2/2**
+grounded sources, while relevance and no-verifier both admit **2/2** misattributions.
+
+## Threats to validity
+
+- **n=5 topics, 1 misattribution each.** The direction (grounding ≥ relevance caught,
+  at ≤ cost) is asserted in the test on every run; the magnitude is small-n. The
+  relevance miss-rate (1/5 here, 1/3 earlier) is an existence proof of the blind
+  spot, not a calibrated rate.
+- **Planted misattributions, not naturally-occurring ones.** Like the cost/quality
+  offline floor, the misattribution is injected so the band is measurable. It models
+  the real LLM citation-fabrication failure but does not measure its base rate in the
+  wild — that needs a corpus of model-written citations checked by hand.
+- **The grounding oracle is conservative.** A real paraphrase whose inflected words
+  differ from the page ("drafts" vs "draft") can score below 0.7 and be rejected —
+  a false-positive misattribution flag. `minOverlap` tunes this; the worker should
+  cite the page's own key terms (the `createClaimDecorator` extractor is told to).
+
+## Run it
+
+```bash
+# offline floor (no creds)
+pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "offline"
+
+# live A/B (creds-gated). A cheap glm-5.2 smoke runs BEFORE the multi-topic burn.
+AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=… \
+  CLAIM_GROUNDING_LIVE_GOALS='self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA' \
+  pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "three verifier arms"
+```
diff --git a/docs/results/cost-quality.md b/docs/results/cost-quality.md
new file mode 100644
index 0000000..6a1d7f6
--- /dev/null
+++ b/docs/results/cost-quality.md
@@ -0,0 +1,27 @@
+# Cost/quality: the two-agent loop's inference premium
+
+Live 9-topic A/B (glm-5.2, budget B ≤ 4 passes/arm), measured per arm with the
+router-client instrumentation. The original A/B reported only admitted sources at
+"equal passes", which charged the two-agent verify step as one pass when it is actually
+N `verifySource` LLM calls (one per proposed source). Pricing the calls shows what that hid.
+
+| per topic (mean) | two-agent | single-agent | ratio |
+|---|---|---|---|
+| LLM chat calls | 5.4 | 1.0 | ~5.4× |
+| tokens (in+out) | ~4,900 | ~530 | ~9× |
+| cost (USD) | ~$0.0072 | ~$0.0013 | ~5.5× |
+| latency (wall) | ~37 s | ~11 s | ~3.4× |
+| cleanliness Δ (single − two admitted) | — | — | +1.56, 95% CI [0.33, 2.67] |
+
+Per-topic Δ (single − two admitted) this run: self-speculative decoding +4,
+grouped-query attention 0, rotary position embeddings +1, KV-cache quantization
+**−1**, LoRA +4, ring attention +2, constitutional AI +3, transformer +2, gradient
+descent **−1**. Coverage 1.00 every topic, both arms.
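+
+The mean Δ in the table is the average of these nine per-topic deltas, in the order
+listed above:
+
+```latex
+\bar{\Delta} = \frac{4 + 0 + 1 - 1 + 4 + 2 + 3 + 2 - 1}{9} = \frac{14}{9} \approx 1.56
+```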
+
+**Reading.** The verifier buys ~1.5–2.7 fewer junk sources for roughly **5× the
+dollars, 9× the tokens, and 3× the latency** — and on two topics it admitted *more*
+than the single agent (the cleanliness signal is real but noisier than the +2.3 /
++2.7 of earlier runs). The cleanliness gain is dominated by de-duplication, so the
+honest production move is a deterministic content-hash / canonical-URL dedup, which
+captures most of the cleanliness at ~none of this premium; reserve an LLM check for
+the off-scope tail. This is the cost half the "equal passes" framing left out.
diff --git a/docs/two-agent-research-ab.md b/docs/two-agent-research-ab.md
index 7f45d16..124b597 100644
--- a/docs/two-agent-research-ab.md
+++ b/docs/two-agent-research-ab.md
@@ -1,4 +1,4 @@
-# A verifier agent mostly deduplicates: a controlled A/B on two-agent web research
+# A verifier agent mostly deduplicates: a controlled A/B on two-agent web research, and what its cost buys
 
 *Tangle Network · `agent-knowledge`*
 
@@ -15,10 +15,25 @@ fewer sources per topic at identical coverage** — 95% bootstrap intervals
 effect is real and reproduces. But the mechanism is not the one we set out to
 test: reading the rejection logs, most of the gain is **de-duplication** — the
 same paper fetched from arXiv, OpenReview, and the NeurIPS proceedings — not the
-relevance filtering we expected. The off-topic rejections we hypothesized were the
-minority. Most of the value is therefore recoverable with a content hash; the LLM
-verifier earns its cost only on the long tail, where a source looks on-topic but
-isn't.
+relevance filtering we expected. Pricing the verifier's calls (we added per-arm
+router-usage instrumentation) shows the cleanliness costs roughly **5× the
+dollars, 9× the tokens, and 3× the latency** of the single agent — and that the
+original "equal passes" framing hid this, because it charged the verify step as
+one pass while it is actually one LLM call per proposed source. Since the win is
+de-dup-dominated, a deterministic content hash recovers most of the cleanliness at
+~none of the premium. We then asked the sharper question — is there an error band
+where an LLM verifier *does* earn its dollar? — and found two, on opposite sides of
+the ledger. **Misattributed citations** (an on-topic, unique, real source whose
+cited claim never appears in the page) are caught by a $0 deterministic
+text-presence check that the LLM relevance judge misses 1 in 5 times, because the
+judge structurally never sees the claim. And we built the deployable shape the
+cost result implied — an **adaptive driver** that runs free dedup, then free
+heuristic triage, and escalates to the LLM only on the ambiguous tail: it cuts
+LLM verifier calls 76% and dollars 74%, recovering the de-dup half of the
+verifier's cleanliness while honestly giving up the relevance-judgment half on a
+source pool dominated by authoritative hosts. The verifier earns its dollar on
+misattribution, not on de-duplication; the right production loop spends it only
+where the cheap signals can't decide.
 
 ## 1. Setup
 
@@ -32,7 +47,9 @@ The trap in any "more agents help" claim is compute. Two agents that simply do m
 work will of course produce more — that is a bigger budget, not a finding. So the
 comparison must hold total compute fixed and ask whether the *topology* — splitting
 find from check — beats spending the same compute on a single agent that just finds
-more.
+more. And once topology shows an effect, the second question is what it costs: a
+cleaner base bought at 5× the inference is a different product decision than one
+bought for free.
 
 ## 2. Method
 
@@ -57,7 +74,8 @@ Note what the driver→worker hand-off is and isn't: the driver *steers* the wor
 by handing it the remaining readiness gaps (`foldGaps`), which is a deterministic
 formatting of unmet requirements — not an LLM authoring a fresh instruction. The
 driver's LLM work is in `verifySource` (one call per proposed source) and its own
-`research` pass.
+`research` pass. This matters for §4.2: the verify step is N calls, not one, and
+the cost framing turns on that.
 
 The readiness gate is `scoreKnowledgeReadiness` (from `agent-eval`). It scores
 *pages* (curated `knowledge/*.md`), not raw sources, and only `importance:
@@ -78,7 +96,7 @@ driver (`createVerifyingResearchDriver`) is one glm-5.2 chat call per source.
 The repo *does* ship a real `AgentProfile` for research (`researcherProfile`), and
 the **offline** control arm uses it with a stub harness — but the live arm bypasses
 it for the direct pipeline. This is a deliberate shortcut (no harness to stand up,
-~$0.20 to run) and also the loop's main simplification debt; see §6.
+~$0.20 to run) and also the loop's main simplification debt; see §8.
 
 ### 2.3 Equal compute
 
@@ -91,7 +109,21 @@ per topic, that the two-agent loop spent no more passes than the single-agent lo
 and that both stayed under the ceiling; if that ever fails the comparison has
 drifted to unequal compute and the result is void.
 
-### 2.4 Topics and readiness
+The pass-accounting has a known soft spot, which §4.2 exposes: a "verify pass" is
+not one LLM call, it is one `verifySource` call per *proposed* source that round.
+Charging it as a single pass keeps the topology comparison fair on agent passes but
+understates the verifier's dollar cost. We added explicit per-arm cost
+instrumentation to measure that directly.
+
+### 2.4 Cost instrumentation
+
+The router client now records usage per call (`RouterClient.usage()`,
+`src/web-research-worker.ts`): cumulative chat-completion count, prompt/completion
+tokens, cost at glm-5.2 prices, and wall-clock latency. Each A/B arm reads the accumulator
+before and after its run and diffs, so every reported dollar and token figure is a
+measured per-arm delta, not an estimate.
+
+### 2.5 Topics and readiness
 
 9 topics, each with two blocking requirements (the defining mechanism, and reported
 results / trade-offs). Seven are "narrow-scope-inside-a-broad-space" (e.g.
@@ -99,7 +131,7 @@ results / trade-offs). Seven are "narrow-scope-inside-a-broad-space" (e.g.
 broad space to leak in; two are clean controls (*the transformer architecture*,
 *gradient descent*).
 
-## 3. Results
+## 3. Result 1 — cleanliness: the verifier admits fewer sources at equal coverage
 
 The cleanliness signal is the **admitted-source count**: on live data there is no
 oracle, so "fewer sources admitted at equal coverage" is the measurable proxy for
@@ -125,7 +157,15 @@ are above zero. The effect reproduces; its exact magnitude varies run-to-run wit
 what the web returns (one topic swung Δ = 0→1→3 across separate runs during
 development).
 
-## 4. What the verifier actually does
+A third, cost-instrumented run (the one priced in §4.2) was noisier: mean Δ **+1.56,
+95% CI [0.33, 2.67]**, with two topics where the two-agent loop admitted *more* than
+the single agent (KV-cache quantization −1, gradient descent −1). The interval still
+clears zero, but it is the lower-bound run — a reminder that the magnitude is
+web-variance-bound, while the sign is stable.
+
+## 4. Result 2 — what the verifier does, and what it costs
+
+### 4.1 Mostly de-duplication
 
 We classified each rejection by the verifier's own stated reason:
 
@@ -149,42 +189,198 @@ which a content hash catches for free. The LLM's distinctive contribution is the
 page that *looks* on-topic but isn't — the self-speculative-vs-separate-draft
 distinction a string match would miss.
 
-## 5. Limitations
+### 4.2 The inference premium (and what "equal passes" hid)
+
+`docs/results/cost-quality.md`. The original A/B reported only admitted-sources at
+"equal passes," which charged the verify step as one pass while it is actually N
+`verifySource` LLM calls. Pricing the calls per arm (B ≤ 4 passes/arm, glm-5.2):
+
+| per topic (mean) | two-agent | single-agent | ratio |
+|---|---|---|---|
+| LLM chat calls | 5.4 | 1.0 | ~5.4× |
+| tokens (in+out) | ~4,900 | ~530 | ~9× |
+| cost (USD) | ~$0.0072 | ~$0.0013 | ~5.5× |
+| latency (wall) | ~37 s | ~11 s | ~3.4× |
+| cleanliness Δ (single − two admitted) | — | — | +1.56, 95% CI [0.33, 2.67] |
+
+The verifier buys ~1.5–2.7 fewer junk sources for roughly **5× the dollars, 9× the
+tokens, and 3× the latency**. Since the cleanliness gain is de-dup-dominated
+(§4.1), the honest production move is a deterministic content-hash / canonical-URL
+dedup, which captures most of the cleanliness at ~none of this premium, reserving
+an LLM check only for the off-scope tail. This is the cost half the "equal passes"
+framing left out — and the rest of the paper is what we built once we saw it.
+
+## 5. Result 3 — the two bands where an LLM verifier does, and doesn't, earn its dollar
+
+If de-dup is free and dominates the win, when is the LLM verifier worth its 5×? We
+found two bands, and they cut in opposite directions.
+
+### 5.1 Misattributed citations — the cheap check beats the expensive judge
+
+`docs/results/claim-grounding.md`. A source can be on-topic, unique, and real, yet
+the cited *claim* never appears in the page — the LLM wrote a plausible sentence and
+hung a real URL off it. De-dup passes it (unique). A relevance judge passes it (the
+page is on-topic). Only checking the claim against the fetched text catches it — and
+that check is **deterministic text presence, $0 inference**.
+
+Each proposed source now carries the claim it is cited for (`withCitedClaim` →
+`metadata.citedClaim`). The claim-grounding verifier (`createClaimGroundingVerifier`,
+`src/claim-grounding.ts`) runs `groundClaimInText(claim, pageText)` over the
+`htmlToText` output of the page the worker actually fetched — verbatim, normalized
+(punctuation/whitespace-insensitive), or a ≥70% content-word overlap close
+paraphrase. A claim that isn't present is rejected as misattributed. The oracle is
+text presence, not a model call, so it composes with the LLM relevance verifier or
+runs alone at zero cost.
+
+Live A/B (glm-5.2, real web fetch, one planted misattribution per topic — a real
+fetched page plus a deliberately-wrong claim — over three verifier arms on the same
+proposals):
+
+| n=5 topics | misattributions caught | marginal $ | per-$ caught |
+|---|---|---|---|
+| no-verifier | 0 / 5 | $0.0000 | — |
+| relevance (LLM judge) | 4 / 5 | $0.0157 | 254 |
+| claim-grounding (text) | **5 / 5** | **$0.0000** | ∞ |
+
+The relevance judge catches one only by accident — when the fabricated claim also
+makes the page read off-topic (a "12-billion-parameter draft transformer" claim on a
+rotary-embeddings page). When the fabrication stays on-topic (the KV-cache case), the
+judge waves it through, because the relevance verifier only ever sees the page text,
+never the cited claim — it is **structurally blind** to misattribution. On this band
+the verifier-per-dollar comparison inverts §4.2: the cheap, deterministic check
+catches strictly more (5/5 vs 4/5) at strictly less ($0 vs $0.0157). The offline
+floor confirms the wiring: on a controlled 4-source pool (2 grounded, 2
+misattributed), claim-grounding admits **0/2** misattributions and keeps **2/2**
+grounded, while relevance and no-verifier both admit **2/2** misattributions.
+
+### 5.2 Adaptive topology — pay the LLM only on the ambiguous tail
+
+`docs/results/adaptive.md`. The deployable shape §4.2 implied: do the free
+deterministic work first, reserve the LLM for what the cheap signals can't decide.
+`createAdaptiveResearchDriver` (`src/adaptive-driver.ts`) is that driver. Per
+candidate source it runs three stages, cheapest first, stopping at the first that
+decides:
+
+1. **Dedup ($0).** Reject a source whose canonical URL (scheme / `www` / trailing
+   slash / tracking params stripped) or normalized-text content hash matches one
+   already accepted this round or in the KB.
+2. **Heuristic triage ($0).** Classify a unique survivor with host/title/length
+   signals only: an authoritative host (arxiv, `*.edu`, `*.gov`, official docs,
+   github, …) with a substantial body is **kept**; an obvious spam/listicle title
+   or a too-thin body is **dropped**; everything else is **ambiguous**.
+3. **LLM escalation ($).** Only ambiguous survivors reach the shipped LLM relevance
+   verifier — one call each.
+
+Live frontier, n=5 topics, glm-5.2, same fetched proposals gated through all three
+drivers (plus one planted tracking-decorated mirror of the first source, so the
+dedup stage has a real duplicate to catch). Total spend $0.033:
+
+| topic | fetched | single admit | full-LLM admit / calls / $ | adaptive admit / LLM calls / $ |
+|---|---|---|---|---|
+| self-speculative decoding | 3 | 3 | 1 / 3 / $0.0027 | 2 / **0** / **$0.0000** |
+| rotary position embeddings | 3 | 3 | 1 / 3 / $0.0031 | 2 / **0** / **$0.0000** |
+| grouped-query attention | 7 | 7 | 3 / 7 / $0.0072 | 6 / 3 / $0.0030 |
+| KV-cache quantization | 5 | 5 | 3 / 5 / $0.0052 | 4 / **0** / **$0.0000** |
+| LoRA fine-tuning | 7 | 7 | 4 / 7 / $0.0079 | 6 / 3 / $0.0037 |
+| **total** | **25** | **25** | **12 / 25 / $0.0261** | **20 / 6 / $0.0068** |
+
+Adaptive cuts LLM verifier calls **76%** (25 → 6) and dollars **74%** ($0.0261 →
+$0.0068). On 3 of the 5 topics it spent **zero** LLM calls — every unique survivor
+was on an authoritative host, so the $0 stages decided everything.
+
+It is a frontier point, not a free lunch. Admitted counts (lower = cleaner): single
+25, **adaptive 20**, full-LLM 12. Adaptive removes the 5 real duplicates the $0 dedup
+catches (exactly the de-dup-dominated win) but keeps the 8 sources full-LLM rejects
+on relevance, because on this authoritative-host-heavy set the heuristic kept those
+sources on host signals alone, without ever asking the LLM, and the host prior is
+coarser than the relevance judge. So adaptive **recovers the deterministic de-dup
+half of full-LLM's cleanliness for ~26% of its dollars, and gives up the
+relevance-judgment half**. The escalation count is the diagnostic: on the 3 topics
+where it was zero, adaptive *is* a pure host/title/length rule and the LLM
+contributes nothing by construction; on the 2 topics with unknown-host survivors
+(grouped-query attention, LoRA) it escalated 3 calls each — the off-scope tail the
+verifier is actually for.
+
+## 6. Discussion
+
+The three results compose into one rule. The LLM verifier's headline cleanliness win
+is real (§3) but **de-dup-dominated** (§4.1) and **expensive** (§4.2, ~5×/9×/3×), so
+spending an LLM call on every source is the wrong default: a free content hash buys
+most of it. Verification pays where dedup and host heuristics are blind. On
+**misattribution** (§5.1) a $0 text-presence check beats the LLM judge outright, because
+the judge never sees the claim. On the **off-scope tail** (§5.2), where a page looks
+on-topic, is unique, and isn't fabricated, only an LLM relevance judgment can settle it,
+and there the 5× is earned. The deployable loop therefore stratifies by cost: free dedup,
+free claim-grounding, free heuristic triage, then an LLM call only on what survives —
+which is what the adaptive driver ships.
+
+Two cross-cutting lessons. First, **the accounting unit decides the verdict**:
+charging the verify step as one pass made the topology look near-free; pricing it per
+LLM call (§4.2) is what surfaced the 5× and motivated everything after it. Second,
+**the same verifier inverts in value across bands** — on de-dup the LLM is expensive
+for what a hash does; on misattribution a deterministic check is free for what the LLM
+can't do; on the off-scope tail the LLM is the only thing that works. "Add a verifier"
+is not a setting; it is a cost-stratified decision per error type.
+
+## 7. Limitations
 
 - **The verifier is also the judge.** Admitted-count is a proxy; we have no
   independent oracle for whether a dropped source was genuinely redundant. The
   verifier's stated reasons hold up on inspection, but this is the load-bearing
-  caveat.
+  caveat for §3–§4.
 - **Deltas are conservative.** The single-agent loop stops on the same readiness
   gate, capping its admits; with more iterations it would admit even more junk, so
   the true gap is at least this large.
-- **n = 2 clean controls** is too thin to compare bands with confidence.
-- **glm-5.2-specific.** A weaker or stronger judge would shift rejection rates.
-- **High web variance.** One run per topic; results move with what search returns.
-
-## 6. A simpler loop
-
-Two simplifications fall out of the above.
-
-1. **The worker should be an `AgentProfile`, not a bespoke pipeline.** The live
-   worker is ~500 lines hand-wiring query-generation, search, fetch, and proposal
-   against the router directly. The repo's own pattern is to *author* a profile
-   (`researcherProfile`) and run it on a harness with a web-search tool — reusable
-   and harness-agnostic — rather than re-implement the agent loop. The direct
-   pipeline is cheaper to run today (no harness, no creds beyond the router) but it
-   is the loop's main piece of duplication.
-2. **The driver doesn't need an LLM for most of its work.** Since the win is
-   dominated by de-duplication, the efficient shape is a deterministic dedup
-   (content hash / canonical-URL normalization) followed by a *light* LLM check only
-   for the off-scope tail — not a full glm-5.2 `verifySource` call on every fetched
-   source. Same cleanliness, a fraction of the calls.
-
-Neither is built yet; they are the obvious next step if this loop graduates from
-experiment to production.
-
-## 7. Reproduce
-
-The loop, the worker, the verifier, and this A/B are all in this repository.
+- **Small n.** n = 2 clean controls is too thin to compare bands; the misattribution
+  and adaptive frontiers are n = 5 each. The directions are asserted in the tests on
+  every run; the magnitudes are small-n and web-variance-bound (the §3 third run swung
+  to +1.56 from +2.3/+2.7).
+- **Planted error bands.** The misattributions (§5.1) and the adaptive duplicate
+  (§5.2) are injected so the band is measurable. They model the real LLM
+  citation-fabrication and mirror-host failures but do not measure their base rate in
+  the wild — that needs a hand-checked corpus of model-written citations.
+- **Adaptive's quality is host-prior-bound.** On an authoritative-host-heavy source
+  pool the heuristic resolves everything and the LLM's relevance judgment contributes
+  nothing; a richer worker (good sources on unknown hosts, junk on on-topic-looking
+  pages) would grow the ambiguous tail and push adaptive's cost toward full-LLM.
+- **glm-5.2-specific.** A weaker or stronger judge would shift rejection rates and the
+  relevance miss-rate. The grounding oracle is also conservative: a real paraphrase
+  whose inflected words differ ("drafts" vs "draft") can fall below the 0.7 overlap and
+  be flagged misattributed; `minOverlap` tunes this.
+- **High web variance.** One live run per topic per result; numbers move with what
+  search returns.
+
+## 8. A simpler loop — built, not deferred
+
+The original write-up named two simplifications as future work. Both are now built and
+measured; this is what changed.
+
+1. **Deterministic dedup before the LLM, LLM only on the tail — shipped.** The
+   adaptive driver (`src/adaptive-driver.ts`, §5.2) does exactly this: free
+   canonical-URL / content-hash dedup, free host/title/length triage, LLM relevance
+   only on the ambiguous survivors. Measured: **76% fewer LLM calls, 74% cheaper**,
+   recovering the de-dup half of the verifier's cleanliness. The remaining gap to
+   full-LLM is the relevance-judgment half, kept honest in §5.2 — adaptive is a
+   frontier point you choose by how much a kept-but-marginal source costs you, not a
+   strict improvement.
+2. **A free check the LLM judge can't replicate — shipped.** Claim-grounding
+   (`src/claim-grounding.ts`, §5.1) adds the one verification an LLM relevance judge is
+   structurally blind to: does the cited claim actually appear in the page? It catches
+   5/5 planted misattributions at **$0**, vs the judge's 4/5 at ~$0.003/topic.
+
+What is still **not** built is the worker rewrite: the live worker is a ~500-line
+hand-wired pipeline (query-gen, search, fetch, propose) against the router directly,
+where the repo's own pattern is to *author* an `AgentProfile` (`researcherProfile`)
+and run it on a harness with a web-search tool — reusable and harness-agnostic. The
+direct pipeline is cheaper to run today (no harness, no creds beyond the router) but it
+is the loop's main remaining piece of duplication, and the obvious next step if this
+loop graduates from experiment to production.
+
+## 9. Reproduce
+
+The loop, the worker, the verifier, the claim-grounding mode, the adaptive driver, the
+cost instrumentation, and every A/B are all in this repository. Each live test runs a
+cheap one-call glm-5.2 smoke check before any multi-topic burn.
 
 ```bash
 git clone https://github.com/tangle-network/agent-knowledge
@@ -194,16 +390,39 @@ cd agent-knowledge && pnpm install
 # exercises the same harness against a planted source pool)
 pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts
 
-# the live sweep — real web search + a real glm-5.2 verifier (~$0.20 for 9 topics)
+# offline claim-grounding + adaptive floors (no credentials)
+pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "offline"
+pnpm exec vitest run tests/loops/adaptive-ab.test.ts
+
+# the live cleanliness sweep — real web search + a real glm-5.2 verifier, with
+# per-arm cost reported (~$0.20 for 9 topics)
 export TANGLE_API_KEY=<router key with glm-5.2 credits>
 AGENT_KNOWLEDGE_LIVE=1 \
 AGENT_KNOWLEDGE_LIVE_GOALS="self-speculative decoding|grouped-query attention|rotary position embeddings|KV-cache quantization|LoRA|ring attention|constitutional AI|the transformer architecture|gradient descent" \
   pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts
+
+# live misattribution band — three verifier arms over the same proposals
+AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=<…> \
+  CLAIM_GROUNDING_LIVE_GOALS='self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA' \
+  pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "three verifier arms"
+
+# live adaptive frontier — single / full-LLM / adaptive on the same fetched proposals
+AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=<…> \
+  ADAPTIVE_LIVE_GOALS="self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA fine-tuning" \
+  pnpm exec vitest run tests/loops/adaptive-ab.test.ts -t "three-topology"
 ```
 
-`AGENT_KNOWLEDGE_LIVE_GOALS` takes a `|`-separated topic list; the live arm runs
-both loops on each at equal compute and reports the paired bootstrap.
+`AGENT_KNOWLEDGE_LIVE_GOALS` (and the per-result `*_LIVE_GOALS`) take a `|`-separated
+topic list; the live arms run the loops on each at equal compute and report the paired
+bootstrap and per-arm cost.
 
 **Source:** the loop — [`src/two-agent-research-loop.ts`](../src/two-agent-research-loop.ts);
-the live worker + verifier — [`src/web-research-worker.ts`](../src/web-research-worker.ts);
-the A/B harness — [`tests/loops/research-loop-equal-compute.test.ts`](../tests/loops/research-loop-equal-compute.test.ts).
+the live worker + verifier + cost instrumentation — [`src/web-research-worker.ts`](../src/web-research-worker.ts);
+the misattribution check — [`src/claim-grounding.ts`](../src/claim-grounding.ts);
+the adaptive driver — [`src/adaptive-driver.ts`](../src/adaptive-driver.ts);
+the A/B harnesses — [`tests/loops/`](../tests/loops/).
+Per-result detail: [`docs/results/cost-quality.md`](results/cost-quality.md),
+[`docs/results/claim-grounding.md`](results/claim-grounding.md),
+[`docs/results/adaptive.md`](results/adaptive.md).
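The inflection caveat in the limitations list ("drafts" vs "draft" falling below the 0.7 overlap) is easier to see with numbers. The following is a minimal standalone sketch of content-word overlap scoring, not the module's actual code: `overlapScore`, `normalizeText`, and the short `FILLER` list are illustrative names assumed for this example, and the page sentence is made up.

```typescript
// Standalone sketch of overlap scoring; `overlapScore` and FILLER are
// illustrative names, not exports of src/claim-grounding.ts.
const FILLER = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'are', 'with', 'by'])

// Lowercase, turn punctuation into spaces, collapse whitespace.
function normalizeText(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]+/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim()
}

// Fraction of the claim's content words that appear as whole tokens in the page.
function overlapScore(claim: string, pageText: string, minWordLength = 3) {
  const pageTokens = new Set(normalizeText(pageText).split(' '))
  const words = Array.from(
    new Set(
      normalizeText(claim)
        .split(' ')
        .filter((word) => word.length >= minWordLength && !FILLER.has(word)),
    ),
  )
  const missing = words.filter((word) => !pageTokens.has(word))
  const overlap = words.length === 0 ? 0 : (words.length - missing.length) / words.length
  return { overlap, missing }
}

// Made-up page text: it says "draft", the claim says "drafts".
const page = 'The model produces a draft of tokens, then verifies it.'
const scored = overlapScore('The model drafts tokens', page)
// scored.missing is ['drafts']; overlap is 2/3, below a 0.7 bar but above 0.6.
const groundedAtDefault = scored.overlap >= 0.7
const groundedAtLooser = scored.overlap >= 0.6
```

With only three content words in the claim, one inflected word is enough to drop the score to 2/3, which is why short claims are the most sensitive to the `minOverlap` setting.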
diff --git a/src/adaptive-driver.ts b/src/adaptive-driver.ts
new file mode 100644
index 0000000..3d70e6e
--- /dev/null
+++ b/src/adaptive-driver.ts
@@ -0,0 +1,401 @@
+/**
+ * Adaptive verifier mode for `runTwoAgentResearchLoop`.
+ *
+ * The cost/quality A/B (`docs/results/cost-quality.md`) found the LLM relevance
+ * verifier's cleanliness win is dominated by DE-DUPLICATION — which a
+ * deterministic content-hash / canonical-URL check captures at ~none of the LLM
+ * premium — and that an LLM check only earns its dollar on the off-scope tail.
+ * The honest production move it names is: do the cheap deterministic work first,
+ * spend the LLM only where it pays. This module is that driver.
+ *
+ * Per candidate source the adaptive driver runs THREE stages, cheapest first,
+ * and stops at the first that decides:
+ *
+ *   1. DEDUP ($0, no LLM). Reject a source whose CONTENT (normalized-text hash)
+ *      or whose CANONICAL URL matches one already accepted this round or already
+ *      in the knowledge base. This is the de-dup the relevance judge was being
+ *      paid to do; doing it deterministically is free and exact.
+ *
+ *   2. HEURISTIC TRIAGE ($0, no LLM). For a unique survivor, a cheap host /
+ *      title / length signal classifies it as clearly-keep, clearly-drop, or
+ *      AMBIGUOUS. Clear cases are resolved without a model: an authoritative host
+ *      (arxiv, *.edu, *.gov, official docs) with a substantial readable body is
+ *      kept; an obvious spam/listicle/marketing title or a too-thin body is
+ *      dropped. Only genuinely ambiguous survivors fall through.
+ *
+ *   3. LLM ESCALATION ($, one call). ONLY the ambiguous survivors reach the LLM
+ *      `verifySource` — the shipped `createVerifyingResearchDriver` relevance
+ *      judge. This is where the verifier earns its premium: the off-scope tail a
+ *      cheap rule can't adjudicate.
+ *
+ * The result is the cost/quality frontier point the doc predicted: most of the
+ * cleanliness (dedup + clear drops) at a fraction of the LLM $/calls (only the
+ * ambiguous tail pays). It is a real `ResearchDriver` — same contract the
+ * two-agent loop already gates on — and reuses `sha256`, the relevance verifier,
+ * and the index; it reinvents none of them.
+ */
+
+import { sha256 } from './ids'
+import type {
+  ResearchSourceProposal,
+  SourceVerdict,
+  SourceVerificationContext,
+} from './two-agent-research-loop'
+import {
+  createVerifyingResearchDriver,
+  type RouterClient,
+  type TangleRouterOptions,
+  type VerifyingDriverOptions,
+} from './web-research-worker'
+
+/**
+ * Canonicalize a URL for duplicate detection: lowercase host, strip a leading
+ * `www.`, drop the scheme, the fragment, a trailing slash, and tracking query
+ * params (`utm_*`, `ref`, `fbclid`, `gclid`, …). Two URLs that differ only by
+ * those decorations canonicalize to the same key, so the dedup stage treats them
+ * as the same source. Falls back to the lowercased raw string when the input is
+ * not a parseable absolute URL (so non-http identifiers still dedup by equality).
+ */
+export function canonicalizeUrl(uri: string): string {
+  const trimmed = uri.trim()
+  try {
+    const url = new URL(trimmed)
+    const host = url.hostname.toLowerCase().replace(/^www\./, '')
+    // Keep only non-tracking query params, sorted for stable ordering.
+    const kept: [string, string][] = []
+    for (const [key, value] of url.searchParams) {
+      const lower = key.toLowerCase()
+      if (lower.startsWith('utm_')) continue
+      if (trackingParams.has(lower)) continue
+      kept.push([key, value])
+    }
+    kept.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
+    const query = kept.map(([k, v]) => `${k}=${v}`).join('&')
+    const path = url.pathname.replace(/\/+$/, '') || '/'
+    return `${host}${path}${query ? `?${query}` : ''}`
+  } catch {
+    return trimmed.toLowerCase()
+  }
+}
+
+/** Tracking / referrer query params dropped during URL canonicalization. */
+const trackingParams = new Set([
+  'ref',
+  'ref_src',
+  'source',
+  'fbclid',
+  'gclid',
+  'mc_cid',
+  'mc_eid',
+  'igshid',
+  'spm',
+  '_hsenc',
+  '_hsmi',
+])
+
+/**
+ * A stable content key for a fetched page: the sha256 of its normalized text
+ * (lowercased, punctuation/whitespace collapsed). Two pages whose readable body
+ * is the same modulo formatting collide here, so the dedup stage rejects a
+ * mirror/syndication of an already-accepted source even when the URL differs.
+ */
+export function contentKey(text: string): string {
+  const normalized = text
+    .toLowerCase()
+    .replace(/[^\p{L}\p{N}\s]+/gu, ' ')
+    .replace(/\s+/g, ' ')
+    .trim()
+  return sha256(normalized)
+}
+
+/** Why the deterministic dedup stage rejected a candidate (for audit/notes). */
+export type DedupReason = 'duplicate-url' | 'duplicate-content'
+
+/** The cheap-triage classification of a unique survivor. */
+export type TriageClass = 'keep' | 'drop' | 'ambiguous'
+
+/** One source's adaptive routing decision, for instrumentation and the doc. */
+export interface AdaptiveDecision {
+  uri: string
+  /** The stage that decided this source: dedup | heuristic | llm. */
+  stage: 'dedup' | 'heuristic' | 'llm'
+  accepted: boolean
+  /** The triage class assigned (set once past dedup). */
+  triage?: TriageClass
+  reason?: string
+}
+
+/** Running tally of where the adaptive driver spent its decisions. */
+export interface AdaptiveStats {
+  total: number
+  /** Rejected by deterministic dedup (URL or content). $0. */
+  dedupRejected: number
+  /** Kept by the cheap heuristic without an LLM call. $0. */
+  heuristicKept: number
+  /** Dropped by the cheap heuristic without an LLM call. $0. */
+  heuristicDropped: number
+  /** Escalated to the LLM relevance verifier ($ — the only paid stage). */
+  llmCalls: number
+  /** Of the escalations, how many the LLM accepted. */
+  llmAccepted: number
+  decisions: AdaptiveDecision[]
+}
+
+export interface AdaptiveDriverOptions {
+  /** Router client for the LLM escalation. Defaults to a live client from env. */
+  router?: RouterClient
+  router_options?: TangleRouterOptions
+  /** Passed through to the escalation relevance verifier. */
+  verifying?: Pick<VerifyingDriverOptions, 'acceptOnParseFailure'>
+  /**
+   * Hosts an authoritative source lives on. A unique survivor on one of these,
+   * with a substantial body, is KEPT deterministically (no LLM). Suffix-matched
+   * against the canonical host, so `arxiv.org` matches `export.arxiv.org`. The
+   * defaults cover papers, official docs, and standards bodies.
+   */
+  authoritativeHosts?: string[]
+  /**
+   * Title/snippet patterns that mark obvious spam / listicle / marketing — a
+   * unique survivor matching one is DROPPED deterministically (no LLM).
+   */
+  spamPatterns?: RegExp[]
+  /**
+   * Below this many readable chars a survivor is too thin to be a real reference
+   * and is dropped deterministically. Default 400.
+   */
+  minBodyChars?: number
+  /**
+   * A survivor whose body is at or above this many chars AND on an authoritative
+   * host is kept without an LLM call. Default 600.
+   */
+  substantialBodyChars?: number
+  /** Receives each routing decision as it is made (for live instrumentation). */
+  onDecision?: (decision: AdaptiveDecision) => void
+}
+
+/** Default authoritative host suffixes — papers, official docs, standards. */
+const defaultAuthoritativeHosts = [
+  'arxiv.org',
+  'aclanthology.org',
+  'openreview.net',
+  'dl.acm.org',
+  'ieeexplore.ieee.org',
+  'nature.com',
+  'science.org',
+  'pubmed.ncbi.nlm.nih.gov',
+  'ncbi.nlm.nih.gov',
+  '.edu',
+  '.gov',
+  'docs.python.org',
+  'pytorch.org',
+  'tensorflow.org',
+  'huggingface.co',
+  'github.com',
+  'developer.mozilla.org',
+  'wikipedia.org',
+  'w3.org',
+  'ietf.org',
+  'rfc-editor.org',
+]
+
+/** Default spam/listicle/marketing title patterns. */
+const defaultSpamPatterns = [
+  /\bbuy\b.*\b(cheap|now|deal|sale|discount)\b/i,
+  /\b\d+\s+(things|ways|reasons|tips|tricks|secrets|hacks)\b.*\b(you|that|will)\b/i,
+  /\bshock(ing|ed)?\b/i,
+  /\bclickbait\b/i,
+  /!!!|\$\$\$/,
+  /\b(coupon|promo code|affiliate|sponsored)\b/i,
+  /\bbest .* (of \d{4}|in \d{4})\b/i,
+]
+
+/**
+ * Classify a UNIQUE survivor (already past dedup) with cheap host/title/length
+ * signals only — no LLM. Returns `keep`, `drop`, or `ambiguous`. `ambiguous` is
+ * the residue the LLM is reserved for: on-topic-looking pages on unknown hosts
+ * with a plausible body, which a host/title rule cannot adjudicate.
+ */
+export function triageSource(
+  source: ResearchSourceProposal,
+  options: {
+    authoritativeHosts: string[]
+    spamPatterns: RegExp[]
+    minBodyChars: number
+    substantialBodyChars: number
+  },
+): { triage: TriageClass; reason: string } {
+  const titleAndSnippet = `${source.title ?? ''} ${
+    typeof source.metadata?.snippet === 'string' ? source.metadata.snippet : ''
+  }`.trim()
+  const bodyLen = source.text.trim().length
+
+  // Clear DROP: obvious spam/listicle/marketing title, OR a too-thin body that
+  // can't be a real reference.
+  for (const pattern of options.spamPatterns) {
+    if (pattern.test(titleAndSnippet)) {
+      return { triage: 'drop', reason: `spam/listicle title (${pattern.source})` }
+    }
+  }
+  if (bodyLen < options.minBodyChars) {
+    return { triage: 'drop', reason: `thin body (${bodyLen} < ${options.minBodyChars} chars)` }
+  }
+
+  // Clear KEEP: an authoritative host with a substantial readable body. The host
+  // is a strong prior for a real reference; the length rules out a stub page.
+  const host = hostOf(source.uri)
+  const authoritative =
+    host.length > 0 && options.authoritativeHosts.some((suffix) => hostMatches(host, suffix))
+  if (authoritative && bodyLen >= options.substantialBodyChars) {
+    return { triage: 'keep', reason: `authoritative host ${host} + substantial body (${bodyLen})` }
+  }
+
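+  // Everything else is AMBIGUOUS: an unknown host with a plausible body, or an
+  // authoritative host whose body is below the substantial-body bar. This is
+  // the tail the LLM relevance judge is reserved for.
+  const hostLabel = authoritative ? `authoritative host ${host}` : `unknown host ${host || '(none)'}`
+  return { triage: 'ambiguous', reason: `${hostLabel}, body ${bodyLen} chars` }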
+}
+
+function hostOf(uri: string): string {
+  try {
+    return new URL(uri.trim()).hostname.toLowerCase().replace(/^www\./, '')
+  } catch {
+    return ''
+  }
+}
+
+/** Suffix host match: `.edu` matches `mit.edu`; `arxiv.org` matches `export.arxiv.org`. */
+function hostMatches(host: string, suffix: string): boolean {
+  if (suffix.startsWith('.')) return host.endsWith(suffix) || host === suffix.slice(1)
+  return host === suffix || host.endsWith(`.${suffix}`)
+}
+
+export interface AdaptiveResearchDriver {
+  verifySource(
+    source: ResearchSourceProposal,
+    ctx: SourceVerificationContext,
+  ): Promise<SourceVerdict>
+  /** Live tally of where decisions were spent — the cost/quality instrumentation. */
+  stats(): AdaptiveStats
+}
+
+/**
+ * Build the adaptive verifier. The deterministic stages (dedup + heuristic
+ * triage) cost $0; only AMBIGUOUS survivors escalate to the LLM relevance
+ * verifier. `stats()` exposes where every decision was spent so the A/B can read
+ * the LLM $/calls the adaptive driver saved against the cleanliness it kept.
+ *
+ * Dedup state is kept on the driver instance: it tracks the canonical URLs and
+ * content hashes it has ACCEPTED, plus those it sees in the verification
+ * context's `acceptedThisRound` and the KB index. Use one driver per loop run.
+ */
+export function createAdaptiveResearchDriver(
+  options: AdaptiveDriverOptions = {},
+): AdaptiveResearchDriver {
+  const authoritativeHosts = options.authoritativeHosts ?? defaultAuthoritativeHosts
+  const spamPatterns = options.spamPatterns ?? defaultSpamPatterns
+  const minBodyChars = Math.max(1, options.minBodyChars ?? 400)
+  const substantialBodyChars = Math.max(minBodyChars, options.substantialBodyChars ?? 600)
+
+  // The shipped LLM relevance verifier — the ONLY paid stage, reused, not rebuilt.
+  const relevance = createVerifyingResearchDriver({
+    router: options.router,
+    router_options: options.router_options,
+    acceptOnParseFailure: options.verifying?.acceptOnParseFailure,
+  })
+
+  // Dedup memory across the run: every URL/content key this driver ACCEPTED.
+  const acceptedUrlKeys = new Set<string>()
+  const acceptedContentKeys = new Set<string>()
+
+  const stats: AdaptiveStats = {
+    total: 0,
+    dedupRejected: 0,
+    heuristicKept: 0,
+    heuristicDropped: 0,
+    llmCalls: 0,
+    llmAccepted: 0,
+    decisions: [],
+  }
+
+  function record(decision: AdaptiveDecision): void {
+    stats.decisions.push(decision)
+    options.onDecision?.(decision)
+  }
+
+  return {
+    async verifySource(source, ctx): Promise<SourceVerdict> {
+      stats.total += 1
+      const urlKey = canonicalizeUrl(source.uri)
+      const cKey = contentKey(source.text)
+
+      // ---- STAGE 1: DETERMINISTIC DEDUP ($0) ----------------------------------
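+      // Seed the dup sets from what the loop already accepted this round AND from
+      // the KB index, so a duplicate of an existing source is caught even if this
+      // driver instance hasn't seen it yet. The index contributes URL keys only
+      // (via `metadata.originalUri`); content keys come from this run.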
+      const roundUrlKeys = new Set(
+        ctx.acceptedThisRound.map((accepted) => canonicalizeUrl(accepted.uri)),
+      )
+      const roundContentKeys = new Set(
+        ctx.acceptedThisRound.map((accepted) => contentKey(accepted.text)),
+      )
+      const indexUrlKeys = new Set(
+        ctx.index.sources.flatMap((indexed) =>
+          typeof indexed.metadata?.originalUri === 'string'
+            ? [canonicalizeUrl(indexed.metadata.originalUri)]
+            : [],
+        ),
+      )
+      const dupUrl =
+        acceptedUrlKeys.has(urlKey) || roundUrlKeys.has(urlKey) || indexUrlKeys.has(urlKey)
+      const dupContent = acceptedContentKeys.has(cKey) || roundContentKeys.has(cKey)
+      if (dupUrl || dupContent) {
+        stats.dedupRejected += 1
+        const reason: DedupReason = dupUrl ? 'duplicate-url' : 'duplicate-content'
+        record({ uri: source.uri, stage: 'dedup', accepted: false, reason })
+        return { accept: false, reason: `dedup: ${reason}` }
+      }
+
+      // ---- STAGE 2: CHEAP HEURISTIC TRIAGE ($0) -------------------------------
+      const { triage, reason } = triageSource(source, {
+        authoritativeHosts,
+        spamPatterns,
+        minBodyChars,
+        substantialBodyChars,
+      })
+      if (triage === 'keep') {
+        stats.heuristicKept += 1
+        acceptedUrlKeys.add(urlKey)
+        acceptedContentKeys.add(cKey)
+        record({ uri: source.uri, stage: 'heuristic', accepted: true, triage, reason })
+        return { accept: true }
+      }
+      if (triage === 'drop') {
+        stats.heuristicDropped += 1
+        record({ uri: source.uri, stage: 'heuristic', accepted: false, triage, reason })
+        return { accept: false, reason: `heuristic drop: ${reason}` }
+      }
+
+      // ---- STAGE 3: LLM ESCALATION ($ — ambiguous tail only) ------------------
+      stats.llmCalls += 1
+      const verdict = await relevance.verifySource(source, ctx)
+      if (verdict.accept) {
+        stats.llmAccepted += 1
+        acceptedUrlKeys.add(urlKey)
+        acceptedContentKeys.add(cKey)
+      }
+      record({
+        uri: source.uri,
+        stage: 'llm',
+        accepted: verdict.accept,
+        triage,
+        reason: verdict.accept ? 'llm accepted' : verdict.reason,
+      })
+      return verdict
+    },
+    stats(): AdaptiveStats {
+      return {
+        ...stats,
+        decisions: [...stats.decisions],
+      }
+    },
+  }
+}
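The dedup, triage, escalate ordering of the adaptive driver can be sketched standalone to show how a tracking-decorated mirror and an obvious spam title never reach the paid stage. This is a simplified illustration under stated assumptions, not the module itself: `Candidate`, `canonicalKey`, `isAuthoritative`, and `routeAll` are hypothetical names, the judge is a synchronous stub rather than the real async LLM verifier, the URLs are placeholders, and only URL-based dedup is modeled.

```typescript
// Standalone sketch; Candidate, canonicalKey, isAuthoritative and routeAll are
// illustrative names, not exports of src/adaptive-driver.ts.
type Stage = 'dedup' | 'heuristic' | 'llm'
interface Candidate { uri: string; title: string; text: string }
interface Routed { uri: string; stage: Stage; accepted: boolean }

const TRACKING = new Set(['ref', 'fbclid', 'gclid'])

// Simplified canonical key: host without `www.`, path without trailing slash,
// non-tracking query params sorted. Scheme and fragment are dropped.
function canonicalKey(uri: string): string {
  try {
    const url = new URL(uri.trim())
    const host = url.hostname.toLowerCase().replace(/^www\./, '')
    const kept: string[] = []
    url.searchParams.forEach((value, key) => {
      const lower = key.toLowerCase()
      if (!lower.startsWith('utm_') && !TRACKING.has(lower)) kept.push(`${key}=${value}`)
    })
    kept.sort()
    const path = url.pathname.replace(/\/+$/, '') || '/'
    return `${host}${path}${kept.length ? `?${kept.join('&')}` : ''}`
  } catch {
    return uri.trim().toLowerCase()
  }
}

// Suffix match: '.edu' matches 'mit.edu'; 'arxiv.org' matches 'export.arxiv.org'.
function isAuthoritative(host: string, suffixes: string[]): boolean {
  return suffixes.some((s) =>
    s.startsWith('.') ? host.endsWith(s) : host === s || host.endsWith(`.${s}`),
  )
}

// Route each candidate through the cheapest stage that can decide it.
function routeAll(candidates: Candidate[], judge: (c: Candidate) => boolean) {
  const seen = new Set<string>()
  const routed: Routed[] = []
  let llmCalls = 0
  for (const c of candidates) {
    const key = canonicalKey(c.uri)
    const host = key.split('/')[0]
    const bodyLen = c.text.trim().length
    let stage: Stage
    let accepted: boolean
    if (seen.has(key)) {
      stage = 'dedup'; accepted = false // free: duplicate URL
    } else if (/\b(coupon|sponsored)\b/i.test(c.title) || bodyLen < 400) {
      stage = 'heuristic'; accepted = false // free: spam title or thin body
    } else if (isAuthoritative(host, ['arxiv.org', '.edu']) && bodyLen >= 600) {
      stage = 'heuristic'; accepted = true // free: trusted host, real body
    } else {
      stage = 'llm'; llmCalls += 1; accepted = judge(c) // paid: ambiguous tail
    }
    if (accepted) seen.add(key)
    routed.push({ uri: c.uri, stage, accepted })
  }
  return { routed, llmCalls }
}

const body = 'x'.repeat(700)
const demo: Candidate[] = [
  { uri: 'https://arxiv.org/abs/placeholder', title: 'Paper', text: body },
  { uri: 'http://www.arxiv.org/abs/placeholder/?utm_source=feed', title: 'Paper', text: body },
  { uri: 'https://example.com/blog/post', title: 'Notes', text: body },
  { uri: 'https://example.com/deals', title: 'Sponsored picks', text: body },
]
const result = routeAll(demo, () => true)
// Stages in order: heuristic, dedup, llm, heuristic. result.llmCalls is 1.
```

Of the four candidates only the unknown-host page with a plausible body is escalated, which is the shape behind the reported reduction in LLM calls; the exact savings on a real pool depend on how many sources land in that ambiguous tail.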
diff --git a/src/claim-grounding.ts b/src/claim-grounding.ts
new file mode 100644
index 0000000..244b7d1
--- /dev/null
+++ b/src/claim-grounding.ts
@@ -0,0 +1,351 @@
+/**
+ * Claim-grounding mode for `runTwoAgentResearchLoop`.
+ *
+ * The two-agent loop's existing verifier judges a source's on-topic RELEVANCE
+ * (is this page about the goal?). On the topic sets we have measured, its
+ * cleanliness win is dominated by DE-DUPLICATION — which a deterministic
+ * content-hash / canonical-URL check captures at ~none of the LLM premium (see
+ * `docs/results/cost-quality.md`). That makes the LLM verifier look expensive
+ * for what a cheap rule already does.
+ *
+ * Claim-grounding targets a DIFFERENT, harder error band: a citation that is
+ * relevant and unique but **misattributed** — the page is on-topic, the URL is
+ * real, yet the specific CLAIM the source is cited for does NOT actually appear
+ * in the page. This is the citation-fabrication failure mode of LLM research:
+ * the model writes a plausible sentence and hangs a real URL off it that never
+ * says any such thing. Neither de-dup nor a relevance judge catches it (both can
+ * pass a misattributed-but-on-topic page); only checking the claim against the
+ * fetched text does.
+ *
+ * The check is EXECUTABLE GROUND TRUTH, not another LLM opinion: the worker
+ * attaches the specific claim it is citing the source for, and the verifier
+ * tests whether that claim is PRESENT (verbatim, normalized, or as a sufficient
+ * content-word overlap / close paraphrase) in the `htmlToText` output of the
+ * page the worker actually fetched. A claim that is not grounded is rejected as
+ * misattributed. Because the oracle is deterministic text presence — not a model
+ * call — it is a deployable, non-oracle verifier: it can run in production with
+ * zero inference cost, OR be composed with the LLM relevance verifier so the
+ * loop rejects BOTH off-topic AND misattributed sources.
+ *
+ * This module is content-free and any-topic: it adds (1) a way for a proposal to
+ * carry the claim it is cited for, (2) the `groundClaimInText` oracle, and (3) a
+ * `ResearchDriver` that gates on grounding. It composes the existing
+ * `ResearchDriver` / `ResearchSourceProposal` contracts and the shipped
+ * `htmlToText`; it reinvents none of them.
+ */
+
+import type {
+  ResearchSourceProposal,
+  SourceVerdict,
+  SourceVerificationContext,
+} from './two-agent-research-loop'
+import {
+  createTangleRouterClient,
+  type RouterClient,
+  type TangleRouterOptions,
+} from './web-research-worker'
+
+/**
+ * Metadata key under which a proposal carries the specific claim it is cited
+ * for. The worker sets `metadata[citedClaimKey] = '<the claim>'`; the
+ * claim-grounding driver reads it and checks it against the fetched page text.
+ */
+export const citedClaimKey = 'citedClaim'
+
+/** Read the cited claim a proposal carries, if any. */
+export function citedClaimOf(source: ResearchSourceProposal): string | undefined {
+  const claim = source.metadata?.[citedClaimKey]
+  return typeof claim === 'string' && claim.trim() ? claim.trim() : undefined
+}
+
+/** Attach a cited claim to a proposal (immutably returns a new proposal). */
+export function withCitedClaim(
+  source: ResearchSourceProposal,
+  claim: string,
+): ResearchSourceProposal {
+  return { ...source, metadata: { ...source.metadata, [citedClaimKey]: claim } }
+}
+
+export interface GroundingResult {
+  /** True when the claim is sufficiently present in the page text. */
+  grounded: boolean
+  /** How the claim matched (or why it didn't). For audit/notes. */
+  mode: 'verbatim' | 'normalized' | 'overlap' | 'absent' | 'empty-claim' | 'empty-text'
+  /**
+   * Fraction of the claim's content words found in the page text. 1 for a
+   * verbatim/normalized hit; the measured overlap otherwise.
+   */
+  overlap: number
+  /** Content words present in the claim but NOT in the page text. */
+  missingWords: string[]
+}
+
+export interface GroundClaimOptions {
+  /**
+   * Minimum fraction of the claim's content words that must appear in the page
+   * text to count as a close paraphrase when there is no verbatim/normalized
+   * hit. Default 0.7 — a high bar, because a misattribution is exactly a claim
+   * whose specific words the page does not contain.
+   */
+  minOverlap?: number
+  /**
+   * Content words shorter than this are ignored (drops "the", "of", "is", …)
+   * and never count toward overlap. Default 3.
+   */
+  minWordLength?: number
+}
+
+/**
+ * Stopwords stripped before overlap scoring so the bar measures the claim's
+ * SUBSTANTIVE words (the numbers, nouns, methods it asserts), not filler a
+ * misattributed page would trivially share with the real one.
+ */
+const stopwords = new Set([
+  'the',
+  'a',
+  'an',
+  'and',
+  'or',
+  'but',
+  'of',
+  'to',
+  'in',
+  'on',
+  'for',
+  'with',
+  'as',
+  'by',
+  'at',
+  'from',
+  'that',
+  'this',
+  'these',
+  'those',
+  'it',
+  'its',
+  'is',
+  'are',
+  'was',
+  'were',
+  'be',
+  'been',
+  'being',
+  'has',
+  'have',
+  'had',
+  'can',
+  'will',
+  'would',
+  'should',
+  'may',
+  'might',
+  'not',
+  'no',
+  'than',
+  'then',
+  'over',
+  'under',
+  'about',
+  'into',
+  'their',
+  'they',
+  'them',
+])
+
+/** Normalize for presence checks: lowercase, collapse whitespace + punctuation. */
+function normalize(text: string): string {
+  return text
+    .toLowerCase()
+    .replace(/[^\p{L}\p{N}\s]+/gu, ' ')
+    .replace(/\s+/g, ' ')
+    .trim()
+}
+
+/** The claim's substantive content words (deduped, stopwords + short words removed). */
+function contentWords(claim: string, minWordLength: number): string[] {
+  const words = normalize(claim)
+    .split(' ')
+    .filter((word) => word.length >= minWordLength && !stopwords.has(word))
+  return [...new Set(words)]
+}
+
+/**
+ * THE ORACLE. Is `claim` grounded in `pageText` (the `htmlToText` output of the
+ * page the worker fetched)? Deterministic, no model call:
+ *
+ *   1. verbatim — the claim string appears as-is (case-insensitive).
+ *   2. normalized — the claim appears after collapsing punctuation/whitespace
+ *      on both sides (so "5.4x" vs "5.4 x", smart quotes, etc. still match).
+ *   3. overlap — a close paraphrase: at least `minOverlap` of the claim's
+ *      substantive content words appear in the page text. A misattributed page
+ *      fails here because the SPECIFIC words the claim asserts are absent.
+ *
+ * Returns the match mode, the measured overlap, and the missing content words —
+ * enough for the driver to give a precise rejection reason and for a test/doc to
+ * audit WHY a claim grounded or didn't.
+ */
+export function groundClaimInText(
+  claim: string,
+  pageText: string,
+  options: GroundClaimOptions = {},
+): GroundingResult {
+  const minOverlap = options.minOverlap ?? 0.7
+  const minWordLength = Math.max(1, options.minWordLength ?? 3)
+
+  const claimTrimmed = claim.trim()
+  if (!claimTrimmed) return { grounded: false, mode: 'empty-claim', overlap: 0, missingWords: [] }
+  if (!pageText.trim()) return { grounded: false, mode: 'empty-text', overlap: 0, missingWords: [] }
+
+  const haystackLower = pageText.toLowerCase()
+  if (haystackLower.includes(claimTrimmed.toLowerCase())) {
+    return { grounded: true, mode: 'verbatim', overlap: 1, missingWords: [] }
+  }
+
+  const haystackNorm = normalize(pageText)
+  const claimNorm = normalize(claimTrimmed)
+  if (claimNorm && haystackNorm.includes(claimNorm)) {
+    return { grounded: true, mode: 'normalized', overlap: 1, missingWords: [] }
+  }
+
+  const words = contentWords(claimTrimmed, minWordLength)
+  if (words.length === 0) {
+    // The claim has no substantive content words (all stopwords/short). With no
+    // verbatim/normalized hit there is nothing to ground — treat as absent.
+    return { grounded: false, mode: 'absent', overlap: 0, missingWords: [] }
+  }
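+  // Whole-token presence so "rotary" does not match inside "rotaryxyz". The
+  // normalized haystack is space-separated, so a token set is exact and, unlike
+  // an ASCII-only `\b`, also handles words that start or end in non-ASCII letters.
+  const haystackTokens = new Set(haystackNorm.split(' '))
+  const present = words.filter((word) => haystackTokens.has(word))
+  const missingWords = words.filter((word) => !haystackTokens.has(word))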
+  const overlap = present.length / words.length
+  const grounded = overlap >= minOverlap
+  return { grounded, mode: grounded ? 'overlap' : 'absent', overlap, missingWords }
+}
+
+export interface ClaimGroundingDriverOptions extends GroundClaimOptions {
+  /**
+   * Optional second verifier to compose AFTER grounding passes. When set, a
+   * source must BOTH ground its claim AND pass this verifier (e.g. the LLM
+   * relevance driver's `verifySource`). Lets the loop reject off-topic AND
+   * misattributed sources in one driver. Omit for the pure, zero-inference
+   * grounding gate.
+   */
+  relevanceVerifier?: (
+    source: ResearchSourceProposal,
+    ctx: SourceVerificationContext,
+  ) => Promise<SourceVerdict> | SourceVerdict
+  /**
+   * What to do when a proposal carries NO cited claim. `'reject'` (default) is
+   * fail-closed: in claim-grounding mode every source must declare what it is
+   * cited for, so an un-annotated source is treated as ungrounded. `'accept'`
+   * lets un-annotated sources through to the relevance verifier (if any) —
+   * useful when mixing annotated and legacy proposals.
+   */
+  onMissingClaim?: 'reject' | 'accept'
+}
+
+/**
+ * A `ResearchDriver`-shaped verifier (just the `verifySource` arm) that gates on
+ * CLAIM GROUNDING: it rejects a source whose cited claim is not present in its
+ * fetched page text — a misattributed / fabricated citation — and (optionally)
+ * composes a relevance verifier after grounding passes.
+ *
+ * The returned function matches `ResearchDriver['verifySource']`, so it drops
+ * straight into `runTwoAgentResearchLoop` as `{ verifySource: createClaimGroundingVerifier(...) }`.
+ */
+export function createClaimGroundingVerifier(options: ClaimGroundingDriverOptions = {}) {
+  const onMissingClaim = options.onMissingClaim ?? 'reject'
+  return async function verifySource(
+    source: ResearchSourceProposal,
+    ctx: SourceVerificationContext,
+  ): Promise<SourceVerdict> {
+    const claim = citedClaimOf(source)
+    if (!claim) {
+      if (onMissingClaim === 'reject') {
+        return {
+          accept: false,
+          reason: 'no cited claim: claim-grounding mode requires every source to declare its claim',
+        }
+      }
+      // accept-on-missing: fall through to the relevance verifier (or accept).
+      return options.relevanceVerifier ? options.relevanceVerifier(source, ctx) : { accept: true }
+    }
+
+    const grounding = groundClaimInText(claim, source.text, options)
+    if (!grounding.grounded) {
+      const detail =
+        grounding.mode === 'empty-text'
+          ? 'fetched page has no text'
+          : `claim not found in the fetched page (overlap ${(grounding.overlap * 100).toFixed(0)}%${
+              grounding.missingWords.length
+                ? `, missing: ${grounding.missingWords.slice(0, 6).join(', ')}`
+                : ''
+            })`
+      return {
+        accept: false,
+        reason: `misattributed citation: ${detail}`,
+      }
+    }
+
+    // Claim is grounded. Compose the relevance verifier if one was provided.
+    if (options.relevanceVerifier) return options.relevanceVerifier(source, ctx)
+    return { accept: true }
+  }
+}
+
+export interface WorkerClaimDecorationOptions {
+  router?: RouterClient
+  router_options?: TangleRouterOptions
+  /** Max output tokens for the claim-extraction call. Default 1200 (glm floor). */
+  maxTokens?: number
+}
+
+/**
+ * Ask an LLM to state, for one source, the single specific factual claim a
+ * researcher would cite THIS page for toward the goal. Used to DECORATE the
+ * sources a relevance-only worker produced with the claim a citation would make,
+ * so the claim-grounding verifier has something executable to check. The model
+ * is told to ground the claim in the provided excerpt; the verifier then checks
+ * it against the FULL page text independently — the model does not get to mark
+ * its own homework.
+ *
+ * Returns the proposal annotated via `withCitedClaim`, or the original proposal
+ * unchanged if the model returns nothing parseable (the verifier's
+ * `onMissingClaim` policy then decides).
+ */
+export function createClaimDecorator(options: WorkerClaimDecorationOptions = {}) {
+  const maxTokens = options.maxTokens ?? 1200
+  return async function decorate(
+    source: ResearchSourceProposal,
+    goal: string,
+  ): Promise<ResearchSourceProposal> {
+    const router = options.router ?? createTangleRouterClient(options.router_options)
+    const excerpt = source.text.slice(0, 1500)
+    const system =
+      'You extract the single most important factual claim a researcher would cite a page for. ' +
+      "State ONE concrete, checkable sentence using the page's own key terms and numbers. " +
+      'Do NOT invent facts not in the excerpt. Respond with ONLY the claim sentence, no prose.'
+    const user = [
+      `Research goal: ${goal}`,
+      `Page title: ${source.title ?? '(none)'}`,
+      `Page excerpt:\n${excerpt}`,
+      'The single claim this page should be cited for:',
+    ].join('\n\n')
+    let raw = ''
+    try {
+      raw = await router.chat(
+        [
+          { role: 'system', content: system },
+          { role: 'user', content: user },
+        ],
+        maxTokens,
+      )
+    } catch {
+      return source
+    }
+    const claim = raw.trim().split('\n')[0]?.trim()
+    if (!claim) return source
+    return withCitedClaim(source, claim)
+  }
+}
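
Reviewer note: the verifier above delegates the actual check to `groundClaimInText(claim, source.text, options)` and reads `grounded`, `overlap`, and `missingWords` from the result. As a rough illustration of what a content-word overlap check of that shape could look like, here is a minimal sketch. It is not the repository's implementation: `sketchGroundClaim`, the stopword list, and the 0.7 default threshold are assumptions made for this note, and the real oracle also reports a `mode`, which is omitted here.

```typescript
// Assumed stopword list; the repository's own list is not shown in this diff.
const STOPWORDS = new Set([
  'a', 'an', 'the', 'of', 'and', 'to', 'in', 'on', 'with', 'by', 'then', 'is', 'it', 'for',
])

// Lowercase and collapse every run of non-alphanumerics into a single space.
function normalize(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()
}

interface SketchGrounding {
  grounded: boolean
  overlap: number
  missingWords: string[]
}

// Hypothetical stand-in for the grounding oracle: the share of the claim's
// content words that also occur in the page must reach `minOverlap`.
function sketchGroundClaim(claim: string, pageText: string, minOverlap = 0.7): SketchGrounding {
  const page = normalize(pageText)
  const words = normalize(claim)
    .split(' ')
    .filter((w) => w.length > 0 && !STOPWORDS.has(w))
  // Fail closed on an empty page or a claim with no content words.
  if (page.length === 0 || words.length === 0) {
    return { grounded: false, overlap: 0, missingWords: words }
  }
  const pageWords = new Set(page.split(' '))
  const missingWords = words.filter((w) => !pageWords.has(w))
  const overlap = (words.length - missingWords.length) / words.length
  return { grounded: overlap >= minOverlap, overlap, missingWords }
}
```

The two early returns are the design point worth keeping in any variant: an empty page or a stopword-only claim never grounds, so the gate stays fail-closed, matching the `onMissingClaim: 'reject'` default above.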
diff --git a/src/index.ts b/src/index.ts
index 9487362..db4893b 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -1,6 +1,8 @@
 export * from './adapters'
+export * from './adaptive-driver'
 export * from './changes'
 export * from './chunking'
+export * from './claim-grounding'
 export * from './discovery'
 export * from './eval-readiness'
 export * from './events'
diff --git a/src/web-research-worker.ts b/src/web-research-worker.ts
index ff147a1..4dce7ed 100644
--- a/src/web-research-worker.ts
+++ b/src/web-research-worker.ts
@@ -72,6 +72,24 @@ export interface RouterClient {
     messages: { role: 'system' | 'user'; content: string }[],
     maxTokens?: number,
   ): Promise<string>
+  /** Cumulative chat + search usage since this client was created; `usd` prices chat tokens only. */
+  usage(): RouterUsage
+}
+
+/**
+ * Cumulative router cost — the per-arm signal the A/B reports ALONGSIDE quality,
+ * so "2.3 fewer sources" can be read against the token/$/latency it cost. A
+ * two-agent round is one worker pass plus N `verifySource` LLM calls; counting
+ * each call here is what surfaces that the two-agent loop spends more inference
+ * than its "equal passes" budget implies.
+ */
+export interface RouterUsage {
+  chatCalls: number
+  searchCalls: number
+  promptTokens: number
+  completionTokens: number
+  usd: number
+  wallMs: number
 }
 
 export interface TangleRouterOptions {
@@ -114,8 +132,20 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou
     Authorization: `Bearer ${apiKey}`,
   }
 
+  // glm-5.2 prices (USD per token), used for any configured model; accumulated and read via usage().
+  const price = { prompt: 0.95 / 1_000_000, completion: 3.0 / 1_000_000 }
+  const acc: RouterUsage = {
+    chatCalls: 0,
+    searchCalls: 0,
+    promptTokens: 0,
+    completionTokens: 0,
+    usd: 0,
+    wallMs: 0,
+  }
+
   return {
     async search(query, opts) {
+      const t0 = Date.now()
       const res = await fetch(`${baseUrl}/search`, {
         method: 'POST',
         headers,
@@ -126,6 +156,8 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou
           ...(opts?.maxResults != null ? { maxResults: opts.maxResults } : {}),
         }),
       })
+      acc.searchCalls += 1
+      acc.wallMs += Date.now() - t0
       if (!res.ok) {
         throw new RouterError(res.status, await res.text().catch(() => res.statusText))
       }
@@ -138,6 +170,7 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou
       // Reasoning-model floor: never let glm-5.2 spend the whole budget on
       // hidden reasoning and return empty visible content.
       const max_tokens = Math.max(MIN_MAX_TOKENS, maxTokens ?? MIN_MAX_TOKENS)
+      const t0 = Date.now()
       const res = await fetch(`${baseUrl}/chat/completions`, {
         method: 'POST',
         headers,
@@ -149,9 +182,20 @@ export function createTangleRouterClient(options: TangleRouterOptions = {}): Rou
       }
       const body = (await res.json()) as {
         choices?: { message?: { content?: string } }[]
+        usage?: { prompt_tokens?: number; completion_tokens?: number }
       }
+      const promptTokens = body.usage?.prompt_tokens ?? 0
+      const completionTokens = body.usage?.completion_tokens ?? 0
+      acc.chatCalls += 1
+      acc.promptTokens += promptTokens
+      acc.completionTokens += completionTokens
+      acc.usd += promptTokens * price.prompt + completionTokens * price.completion
+      acc.wallMs += Date.now() - t0
       return body.choices?.[0]?.message?.content ?? ''
     },
+    usage() {
+      return { ...acc }
+    },
   }
 }
 
diff --git a/tests/claim-grounding.test.ts b/tests/claim-grounding.test.ts
new file mode 100644
index 0000000..5d0c0d4
--- /dev/null
+++ b/tests/claim-grounding.test.ts
@@ -0,0 +1,176 @@
+import { describe, expect, it } from 'vitest'
+import {
+  citedClaimKey,
+  citedClaimOf,
+  createClaimGroundingVerifier,
+  groundClaimInText,
+  withCitedClaim,
+} from '../src/claim-grounding'
+import type {
+  ResearchSourceProposal,
+  SourceVerificationContext,
+} from '../src/two-agent-research-loop'
+
+const ctx: SourceVerificationContext = {
+  root: '/tmp/x',
+  goal: 'self-speculative decoding',
+  round: 1,
+  index: {
+    root: '/tmp/x',
+    generatedAt: '',
+    sources: [],
+    pages: [],
+    graph: { nodes: [], edges: [] },
+  },
+  gaps: [],
+  acceptedThisRound: [],
+}
+
+describe('groundClaimInText (the deterministic grounding oracle)', () => {
+  const page =
+    'Self-speculative decoding skips intermediate layers to draft tokens, then verifies them ' +
+    'with the full model. The paper reports a 1.73x speedup on LLaMA-2 with no quality loss.'
+
+  it('grounds a verbatim claim', () => {
+    const r = groundClaimInText('skips intermediate layers to draft tokens', page)
+    expect(r.grounded).toBe(true)
+    expect(r.mode).toBe('verbatim')
+    expect(r.overlap).toBe(1)
+  })
+
+  it('grounds across punctuation/whitespace differences (normalized)', () => {
+    // The page says "a 1.73x speedup"; the claim adds a comma between the two words.
+    const r = groundClaimInText('a 1.73x, speedup', page)
+    expect(r.grounded).toBe(true)
+    expect(['verbatim', 'normalized']).toContain(r.mode)
+  })
+
+  it('grounds a close paraphrase via content-word overlap', () => {
+    // Reworded and reordered, but nearly every substantive word appears in the
+    // page. An inflected form the page does NOT contain verbatim ("skipping" vs
+    // "skips") legitimately lowers the score; that strictness is the point, so
+    // the rest of the paraphrase keeps to words the page actually uses.
+    const r = groundClaimInText(
+      'draft tokens by skipping intermediate layers then verifies with the full model',
+      page,
+    )
+    expect(r.grounded).toBe(true)
+    expect(r.mode).toBe('overlap')
+    expect(r.overlap).toBeGreaterThanOrEqual(0.7)
+  })
+
+  it('REJECTS a misattributed claim — relevant topic, wrong numbers/facts', () => {
+    // On-topic (mentions speculative decoding) but the page never says any of this:
+    // a different speedup, a different model, a different mechanism. A relevance
+    // judge would pass it; grounding must not.
+    const r = groundClaimInText(
+      'achieves a 4.8x speedup on GPT-4 using a separate draft transformer network',
+      page,
+    )
+    expect(r.grounded).toBe(false)
+    expect(r.mode).toBe('absent')
+    expect(r.missingWords).toContain('gpt')
+    expect(r.missingWords).toContain('network')
+  })
+
+  it('rejects an empty page text and an empty claim', () => {
+    expect(groundClaimInText('anything', '').grounded).toBe(false)
+    expect(groundClaimInText('anything', '').mode).toBe('empty-text')
+    expect(groundClaimInText('   ', page).grounded).toBe(false)
+    expect(groundClaimInText('   ', page).mode).toBe('empty-claim')
+  })
+
+  it('does not let a stopword-only claim ground spuriously', () => {
+    const r = groundClaimInText('the of and to', page)
+    expect(r.grounded).toBe(false)
+  })
+
+  it('honours a stricter minOverlap', () => {
+    // ~0.6 overlap claim: grounds at 0.5, fails at 0.9.
+    const claim = 'speedup verifies tokens nonexistentwordzz alsofakewordzz'
+    expect(groundClaimInText(claim, page, { minOverlap: 0.5 }).grounded).toBe(true)
+    expect(groundClaimInText(claim, page, { minOverlap: 0.9 }).grounded).toBe(false)
+  })
+})
+
+describe('citedClaim helpers', () => {
+  const base: ResearchSourceProposal = { uri: 'https://x', text: 't', title: 'T' }
+
+  it('round-trips a claim through metadata', () => {
+    const decorated = withCitedClaim(base, 'the claim')
+    expect(decorated.metadata?.[citedClaimKey]).toBe('the claim')
+    expect(citedClaimOf(decorated)).toBe('the claim')
+  })
+
+  it('returns undefined for a missing/blank claim', () => {
+    expect(citedClaimOf(base)).toBeUndefined()
+    expect(citedClaimOf(withCitedClaim(base, '   '))).toBeUndefined()
+  })
+})
+
+describe('createClaimGroundingVerifier (the driver gate)', () => {
+  const page =
+    'The transformer architecture uses multi-head self-attention. Reported BLEU of 28.4 on WMT14 En-De.'
+
+  it('accepts a grounded source', async () => {
+    const verify = createClaimGroundingVerifier()
+    const source = withCitedClaim(
+      { uri: 'https://a', text: page, title: 'Attention' },
+      'BLEU of 28.4 on WMT14',
+    )
+    expect(await verify(source, ctx)).toEqual({ accept: true })
+  })
+
+  it('REJECTS a misattributed source with a precise reason', async () => {
+    const verify = createClaimGroundingVerifier()
+    const source = withCitedClaim(
+      { uri: 'https://a', text: page, title: 'Attention' },
+      'reports a BLEU of 41.0 on the WMT16 Russian benchmark',
+    )
+    const verdict = await verify(source, ctx)
+    expect(verdict.accept).toBe(false)
+    if (!verdict.accept) expect(verdict.reason).toMatch(/misattributed citation/)
+  })
+
+  it('rejects an un-annotated source by default (fail-closed)', async () => {
+    const verify = createClaimGroundingVerifier()
+    const verdict = await verify({ uri: 'https://a', text: page, title: 'T' }, ctx)
+    expect(verdict.accept).toBe(false)
+    if (!verdict.accept) expect(verdict.reason).toMatch(/no cited claim/)
+  })
+
+  it('composes a relevance verifier AFTER grounding passes', async () => {
+    let relevanceCalled = false
+    const verify = createClaimGroundingVerifier({
+      relevanceVerifier: () => {
+        relevanceCalled = true
+        return { accept: false, reason: 'off-topic per relevance judge' }
+      },
+    })
+    const grounded = withCitedClaim(
+      { uri: 'https://a', text: page, title: 'T' },
+      'multi-head self-attention',
+    )
+    const verdict = await verify(grounded, ctx)
+    expect(relevanceCalled).toBe(true)
+    expect(verdict.accept).toBe(false)
+    if (!verdict.accept) expect(verdict.reason).toMatch(/off-topic/)
+  })
+
+  it('does NOT call the relevance verifier when grounding already fails', async () => {
+    let relevanceCalled = false
+    const verify = createClaimGroundingVerifier({
+      relevanceVerifier: () => {
+        relevanceCalled = true
+        return { accept: true }
+      },
+    })
+    const misattributed = withCitedClaim(
+      { uri: 'https://a', text: page, title: 'T' },
+      'a 99x speedup on a quantum coprocessor',
+    )
+    const verdict = await verify(misattributed, ctx)
+    expect(relevanceCalled).toBe(false)
+    expect(verdict.accept).toBe(false)
+  })
+})
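
Reviewer note: `usage()` returns a copy of a cumulative accumulator, so a per-arm cost has to be computed as the difference between a snapshot taken before the arm runs and one taken after. A minimal sketch of that bookkeeping follows. `diffUsage` and `measureArm` are hypothetical helpers written for this note, not functions in the repository; only the `RouterUsage` shape is copied from this change.

```typescript
// Same shape as the RouterUsage interface added in src/web-research-worker.ts.
interface RouterUsage {
  chatCalls: number
  searchCalls: number
  promptTokens: number
  completionTokens: number
  usd: number
  wallMs: number
}

// Hypothetical helper: field-wise difference of two cumulative snapshots.
function diffUsage(before: RouterUsage, after: RouterUsage): RouterUsage {
  return {
    chatCalls: after.chatCalls - before.chatCalls,
    searchCalls: after.searchCalls - before.searchCalls,
    promptTokens: after.promptTokens - before.promptTokens,
    completionTokens: after.completionTokens - before.completionTokens,
    usd: after.usd - before.usd,
    wallMs: after.wallMs - before.wallMs,
  }
}

// Hypothetical wrapper: snapshot, run one arm, snapshot again, report the delta.
// `usage` would be `() => router.usage()` for a shared RouterClient.
async function measureArm<T>(
  usage: () => RouterUsage,
  arm: () => Promise<T>,
): Promise<{ result: T; cost: RouterUsage }> {
  const before = usage()
  const result = await arm()
  return { result, cost: diffUsage(before, usage()) }
}
```

This only attributes cost correctly when arms sharing one client run sequentially; concurrent arms would interleave their calls into the same accumulator.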
diff --git a/tests/loops/adaptive-ab.test.ts b/tests/loops/adaptive-ab.test.ts
new file mode 100644
index 0000000..c0254b8
--- /dev/null
+++ b/tests/loops/adaptive-ab.test.ts
@@ -0,0 +1,647 @@
+import { mkdtemp, rm } from 'node:fs/promises'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'
+import { afterEach, beforeEach, describe, expect, it } from 'vitest'
+import {
+  type AdaptiveResearchDriver,
+  canonicalizeUrl,
+  contentKey,
+  createAdaptiveResearchDriver,
+  triageSource,
+} from '../../src/adaptive-driver'
+import {
+  buildEvalKnowledgeBundle,
+  defineReadinessSpec,
+  type KnowledgeReadinessSpec,
+} from '../../src/eval-readiness'
+import { buildKnowledgeIndex } from '../../src/indexer'
+import {
+  type ResearchContribution,
+  type ResearchDriver,
+  type ResearchSourceProposal,
+  type ResearchWorker,
+  runTwoAgentResearchLoop,
+  type SourceVerificationContext,
+  type WorkerResearchContext,
+} from '../../src/two-agent-research-loop'
+import {
+  createTangleRouterClient,
+  createVerifyingResearchDriver,
+  createWebResearchWorker,
+  type RouterClient,
+} from '../../src/web-research-worker'
+
+// ===========================================================================
+// ADAPTIVE TOPOLOGY A/B: spend the LLM verifier only when it pays.
+//
+// The cost/quality result (docs/results/cost-quality.md) found the LLM relevance
+// verifier's cleanliness win is dominated by DE-DUPLICATION — captured by a
+// deterministic content-hash / canonical-URL check at ~none of the LLM premium —
+// and that an LLM check only earns its dollar on the off-scope tail. The
+// production move it named is: do the cheap deterministic work first, reserve the
+// LLM for the ambiguous tail. `createAdaptiveResearchDriver` is that driver.
+//
+// This file measures THREE topologies on the cost/quality frontier:
+//   - single-agent : no verifier (the floor — admits everything).
+//   - full-LLM     : one LLM verifySource call per candidate (the ceiling cost).
+//   - adaptive     : $0 dedup → $0 host/title/length triage → LLM ONLY for the
+//                    ambiguous survivors.
+//
+// The question: does adaptive capture most of the cleanliness (dedup + clear
+// drops) at a fraction of the verifier $/calls? The offline arm proves the
+// wiring + that adaptive escalates ONLY ambiguous survivors; the live arm
+// (creds-gated) reports the real frontier with #36's RouterClient.usage().
+// ===========================================================================
+
+// ---------------------------------------------------------------------------
+// Pure-unit coverage of the deterministic stages (no loop, no network).
+// ---------------------------------------------------------------------------
+describe('adaptive driver — deterministic stages', () => {
+  it('canonicalizeUrl collapses scheme/www/trailing-slash/tracking params', () => {
+    const a = canonicalizeUrl('https://www.Example.com/Path/?utm_source=x&ref=y&a=1#frag')
+    const b = canonicalizeUrl('http://example.com/Path?a=1')
+    expect(a).toBe(b)
+    expect(a).toBe('example.com/Path?a=1')
+    // Non-URL identifiers dedup by lowercased equality.
+    expect(canonicalizeUrl('Local-Note-7')).toBe('local-note-7')
+  })
+
+  it('contentKey ignores formatting so a reformatted mirror collides', () => {
+    const k1 = contentKey('Self-speculative decoding skips layers.  It is 1.7x faster.')
+    const k2 = contentKey('self speculative decoding skips layers it is 1 7x faster')
+    expect(k1).toBe(k2)
+  })
+
+  it('triageSource keeps authoritative+substantial, drops spam/thin, else ambiguous', () => {
+    const opts = {
+      authoritativeHosts: ['arxiv.org', '.edu'],
+      spamPatterns: [/\bbuy\b.*\bcheap\b/i, /\d+\s+things/i],
+      minBodyChars: 400,
+      substantialBodyChars: 600,
+    }
+    const body = 'x'.repeat(700)
+    // Authoritative host + substantial body → KEEP, no LLM.
+    expect(
+      triageSource({ uri: 'https://arxiv.org/abs/1', title: 'A paper', text: body }, opts).triage,
+    ).toBe('keep')
+    // Spam title → DROP.
+    expect(
+      triageSource({ uri: 'https://shop.example.com/x', title: 'Buy fans cheap', text: body }, opts)
+        .triage,
+    ).toBe('drop')
+    // Thin body → DROP.
+    expect(
+      triageSource({ uri: 'https://arxiv.org/abs/2', title: 'A paper', text: 'short' }, opts)
+        .triage,
+    ).toBe('drop')
+    // Unknown host, plausible body → AMBIGUOUS (the LLM tail).
+    expect(
+      triageSource({ uri: 'https://blog.unknown.io/x', title: 'Some notes', text: body }, opts)
+        .triage,
+    ).toBe('ambiguous')
+  })
+})
+
+// ---------------------------------------------------------------------------
+// OFFLINE CONTROLLED A/B — proves adaptive escalates ONLY ambiguous survivors
+// and matches the full-LLM cleanliness at a fraction of the LLM calls.
+// ---------------------------------------------------------------------------
+interface PoolEntry {
+  uri: string
+  title: string
+  text: string
+  /** Why it's in the pool: 'authoritative' keep, 'spam'/'thin' drop, 'dup' of an
+   * earlier entry, or 'ambiguous-good'/'ambiguous-bad' (both must reach the LLM). */
+  kind: 'authoritative' | 'spam' | 'thin' | 'dup' | 'ambiguous-good' | 'ambiguous-bad'
+}
+
+const longBody =
+  'Self-speculative decoding drafts tokens by skipping intermediate transformer ' +
+  'layers, then verifies them with the full model in a single forward pass. It reports a ' +
+  '1.73x speedup on LLaMA-2 with no measurable quality loss across the evaluated benchmarks. '.repeat(
+    6,
+  )
+
+const goal = 'self-speculative decoding'
+
+const pool: PoolEntry[] = [
+  // Authoritative + substantial → adaptive KEEPS with no LLM call.
+  {
+    uri: 'https://arxiv.org/abs/2309.08168',
+    title: 'Draft & Verify: Lossless LLM Acceleration via Self-Speculative Decoding',
+    text: longBody,
+    kind: 'authoritative',
+  },
+  // Exact-content mirror of the arxiv paper on a different URL → adaptive DEDUPS.
+  {
+    uri: 'https://mirror.example.org/draft-and-verify',
+    title: 'Draft and Verify (mirror)',
+    text: longBody,
+    kind: 'dup',
+  },
+  // Obvious spam title → adaptive DROPS with no LLM call. Distinct body so the
+  // DROP is earned by the title heuristic, not by content-dedup against the paper.
+  {
+    uri: 'https://shop.example.com/fans',
+    title: '10 things about decoding that will SHOCK you!!!',
+    text: 'Subscribe now for the best decoding deals of 2026! Limited offer, buy cheap fans today. '.repeat(
+      8,
+    ),
+    kind: 'spam',
+  },
+  // Too-thin body → adaptive DROPS with no LLM call.
+  {
+    uri: 'https://stub.example.net/p',
+    title: 'Decoding stub',
+    text: 'self-speculative decoding.',
+    kind: 'thin',
+  },
+  // Unknown host, on-topic plausible body → AMBIGUOUS → reaches the LLM (kept).
+  {
+    uri: 'https://blog.unknown.io/self-spec-explainer',
+    title: 'An explainer on self-speculative decoding',
+    text:
+      'This post walks through how self-speculative decoding reuses a single model to draft and ' +
+      'verify tokens, with worked examples and a discussion of when the speedup holds. '.repeat(8),
+    kind: 'ambiguous-good',
+  },
+  // Unknown host, off-topic plausible body → AMBIGUOUS → reaches the LLM (rejected).
+  {
+    uri: 'https://blog.unknown.io/gardening',
+    title: 'Companion planting for tomatoes',
+    text:
+      'Tomatoes thrive next to basil and marigolds; rotate nightshades yearly and mulch to keep ' +
+      'soil moisture even through the summer. This has nothing to do with language models. '.repeat(
+        8,
+      ),
+    kind: 'ambiguous-bad',
+  },
+]
+
+const specs: KnowledgeReadinessSpec[] = [
+  defineReadinessSpec({
+    id: 'topic/definition',
+    description: `what ${goal} is and how it works`,
+    query: `${goal} how it works method`,
+    requiredFor: ['ResearchAgent'],
+    importance: 'blocking',
+    minSources: 1,
+    minHits: 1,
+  }),
+]
+
+/** A worker that proposes the whole pool once. */
+function poolWorker(): ResearchWorker {
+  return async (_ctx: WorkerResearchContext): Promise<ResearchContribution> => {
+    const sources: ResearchSourceProposal[] = pool.map((entry) => ({
+      uri: entry.uri,
+      title: entry.title,
+      text: entry.text,
+      metadata: { kind: entry.kind, originalUri: entry.uri },
+    }))
+    return {
+      sources,
+      buildPages: (accepted) =>
+        accepted
+          .map((record) => {
+            const original = record.metadata?.originalUri
+            const entry = pool.find((p) => p.uri === original)
+            const slug = String(original ?? record.id)
+              .replace(/[^a-z0-9]+/gi, '-')
+              .slice(0, 120)
+            return [
+              `---FILE: knowledge/${slug}.md---`,
+              '---',
+              `title: ${entry?.title ?? record.id}`,
+              `sources: ["${record.id}"]`,
+              '---',
+              `# ${entry?.title ?? record.id}`,
+              entry?.text ?? '',
+              '---END FILE---',
+            ].join('\n')
+          })
+          .join('\n'),
+      notes: `proposed ${sources.length}`,
+    }
+  }
+}
+
+/**
+ * A deterministic stand-in for the LLM relevance verifier: accepts on-topic text
+ * (matches decod/token/model/layer/speculative), rejects everything else. Counts
+ * its calls so the test can assert how many candidates the full-LLM arm sends to
+ * the verifier. Live, `createVerifyingResearchDriver().verifySource` plays this
+ * role with one LLM call per source it sees.
+ */
+function countingRelevanceVerifier(): {
+  verifySource: (s: ResearchSourceProposal, c: SourceVerificationContext) => { accept: boolean }
+  calls: () => number
+} {
+  let calls = 0
+  return {
+    verifySource(source) {
+      calls += 1
+      const onTopic = /decod|token|\bmodel\b|layer|speculative/i.test(source.text)
+      return onTopic ? { accept: true } : { accept: false }
+    },
+    calls: () => calls,
+  }
+}
+
+async function admittedKinds(root: string): Promise<Set<string>> {
+  const index = await buildKnowledgeIndex(root)
+  return new Set(
+    index.sources.flatMap((s) => (typeof s.metadata?.kind === 'string' ? [s.metadata.kind] : [])),
+  )
+}
+
+async function admittedCount(root: string): Promise<number> {
+  const index = await buildKnowledgeIndex(root)
+  return index.sources.length
+}
+
+/**
+ * An offline router stub for the adaptive driver's LLM-escalation stage. The
+ * relevance verifier sends a chat whose USER message embeds the candidate's
+ * URL/title/excerpt; the stub reads that excerpt and returns a real on-topic
+ * verdict JSON — accept if it mentions decoding/speculative/token/model/layer,
+ * reject otherwise. This is exactly the shape `createVerifyingResearchDriver`
+ * parses, so the offline arm exercises the real escalation path with $0/network.
+ * `search`/`usage` are never reached by the verifier but satisfy the interface.
+ */
+const stubRouter: RouterClient = {
+  async chat(messages) {
+    const user = messages.find((m) => m.role === 'user')?.content ?? ''
+    // Judge ONLY the candidate's excerpt, not the whole prompt — the prompt also
+    // embeds the on-topic goal/gaps, which would make every candidate look
+    // on-topic. The relevance verifier formats the excerpt after `Excerpt:`.
+    const excerpt = user.split(/Excerpt:\n?/)[1] ?? user
+    const onTopic = /decod|speculative|\btoken\b|\bmodel\b|\blayer\b/i.test(excerpt)
+    return JSON.stringify({ accept: onTopic, reason: onTopic ? 'on-topic' : 'off-topic' })
+  },
+  async search() {
+    return []
+  },
+  usage() {
+    return { chatCalls: 0, searchCalls: 0, promptTokens: 0, completionTokens: 0, usd: 0, wallMs: 0 }
+  },
+}
+
+let fullRoot: string
+let adaptiveRoot: string
+let singleRoot: string
+
+beforeEach(async () => {
+  fullRoot = await mkdtemp(join(tmpdir(), 'ad-full-'))
+  adaptiveRoot = await mkdtemp(join(tmpdir(), 'ad-adaptive-'))
+  singleRoot = await mkdtemp(join(tmpdir(), 'ad-single-'))
+})
+afterEach(async () => {
+  await rm(fullRoot, { recursive: true, force: true })
+  await rm(adaptiveRoot, { recursive: true, force: true })
+  await rm(singleRoot, { recursive: true, force: true })
+})
+
+describe('adaptive A/B (offline, controlled): adaptive escalates only the ambiguous tail', () => {
+  it('matches full-LLM cleanliness at a fraction of the LLM calls', async () => {
+    // FULL-LLM arm: an LLM call per candidate (here the counting stand-in).
+    const fullVerifier = countingRelevanceVerifier()
+    const fullDriver: ResearchDriver = { verifySource: fullVerifier.verifySource }
+    await runTwoAgentResearchLoop({
+      root: fullRoot,
+      goal,
+      worker: poolWorker(),
+      driver: fullDriver,
+      readinessSpecs: specs,
+      maxRounds: 1,
+    })
+
+    // ADAPTIVE arm: $0 dedup + $0 triage, LLM ONLY for ambiguous survivors. Its
+    // LLM stage routes to `stubRouter`, which applies a similar on-topic rule to
+    // the full-LLM stand-in, so the two arms' call counts are comparable.
+    const adaptiveDriver: AdaptiveResearchDriver = createAdaptiveResearchDriver({
+      router: stubRouter,
+    })
+    // The adaptive driver calls relevance.verifySource (one stub chat) ONLY for
+    // ambiguous survivors, so the driver's own stats().llmCalls is the observable
+    // escalation count; the stub returns a real on-topic/off-topic verdict so the
+    // good ambiguous source is kept and the off-topic one rejected, exactly as a
+    // live relevance judge would.
+    await runTwoAgentResearchLoop({
+      root: adaptiveRoot,
+      goal,
+      worker: poolWorker(),
+      driver: { verifySource: (s, c) => adaptiveDriver.verifySource(s, c) },
+      readinessSpecs: specs,
+      maxRounds: 1,
+    })
+
+    // SINGLE-AGENT arm: no verifier — admit everything the loop's own exact-uri
+    // dedup lets through.
+    await runTwoAgentResearchLoop({
+      root: singleRoot,
+      goal,
+      worker: poolWorker(),
+      driver: { verifySource: () => ({ accept: true }) },
+      readinessSpecs: specs,
+      maxRounds: 1,
+    })
+
+    const stats = adaptiveDriver.stats()
+    const fullCalls = fullVerifier.calls()
+    const adaptiveLlmCalls = stats.llmCalls
+    const adaptiveAdmitted = await admittedCount(adaptiveRoot)
+    const singleAdmitted = await admittedCount(singleRoot)
+    const adaptiveKinds = await admittedKinds(adaptiveRoot)
+
+    console.log(
+      `[adaptive A/B offline] full-LLM calls=${fullCalls} | ` +
+        `adaptive: llmCalls=${adaptiveLlmCalls} dedup=${stats.dedupRejected} ` +
+        `heuristicKept=${stats.heuristicKept} heuristicDropped=${stats.heuristicDropped} ` +
+        `admitted=${adaptiveAdmitted} | single admitted=${singleAdmitted}`,
+    )
+
+    // The full-LLM arm called the verifier once per candidate that survived the
+    // loop's exact-uri dedup (all 6 distinct uris → 6 calls).
+    expect(fullCalls).toBe(pool.length)
+
+    // ADAPTIVE escalates ONLY the two ambiguous survivors to the LLM. The other
+    // four are decided for $0: arxiv (keep), mirror (dedup by content), spam
+    // (drop), thin (drop).
+    expect(adaptiveLlmCalls).toBe(2)
+    expect(stats.dedupRejected).toBe(1) // the content mirror
+    expect(stats.heuristicKept).toBe(1) // the arxiv paper
+    expect(stats.heuristicDropped).toBe(2) // spam + thin
+
+    // CLEANLINESS: adaptive admits exactly the real, on-topic sources — the
+    // authoritative paper + the on-topic ambiguous explainer — and rejects the
+    // mirror, spam, thin, and off-topic. So 2 admitted, no junk kinds.
+    expect(adaptiveKinds.has('spam')).toBe(false)
+    expect(adaptiveKinds.has('thin')).toBe(false)
+    expect(adaptiveKinds.has('dup')).toBe(false)
+    expect(adaptiveKinds.has('ambiguous-bad')).toBe(false)
+    expect(adaptiveKinds.has('authoritative')).toBe(true)
+    expect(adaptiveKinds.has('ambiguous-good')).toBe(true)
+    expect(adaptiveAdmitted).toBe(2)
+
+    // FRONTIER: adaptive rejects junk the single-agent arm admits, at strictly
+    // fewer LLM calls than full-LLM (2 vs 6, a 3x cut).
+    expect(adaptiveLlmCalls).toBeLessThan(fullCalls)
+    expect(singleAdmitted).toBeGreaterThan(adaptiveAdmitted)
+  })
+})
+
+// ===========================================================================
+// LIVE THREE-TOPOLOGY A/B — the real cost/quality frontier. Skipped offline.
+//
+// Runs the REAL web-research worker (glm-5.2 query-gen → live /v1/search →
+// politeFetch → htmlToText) ONCE per topic, then gates the SAME fetched
+// proposals through three drivers, diffing each arm's cost with #36's
+// RouterClient.usage():
+//
+//   A. single-agent : accept all (no verifier). $0, admits everything.
+//   B. full-LLM     : createVerifyingResearchDriver — one LLM call per source.
+//   C. adaptive     : createAdaptiveResearchDriver — $0 dedup + $0 triage, LLM
+//                     only for the ambiguous tail.
+//
+// Reports per arm: admitted-source count (cleanliness), LLM calls, USD, tokens.
+// The frontier question: does adaptive land near full-LLM cleanliness at a
+// fraction of full-LLM's $/calls? Honest: if adaptive does NOT sit on the
+// frontier, or the host/title/length heuristic mis-routes, the doc says so.
+//
+// Gate: AGENT_KNOWLEDGE_LIVE=1 + TANGLE_API_KEY with glm-5.2 credits.
+//   ADAPTIVE_LIVE_GOALS — `|`-separated topics
+//   ADAPTIVE_LIVE_MODEL — router chat model (default glm-5.2)
+// ===========================================================================
+
+interface ArmResult {
+  admitted: number
+  llmCalls: number
+  usd: number
+  tokens: number
+  wallMs: number
+}
+
+describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)('live: adaptive three-topology frontier', () => {
+  it('single vs full-LLM vs adaptive on real web sources', async () => {
+    const goals = (
+      process.env.ADAPTIVE_LIVE_GOALS ??
+      'self-speculative decoding|rotary position embeddings|grouped-query attention'
+    )
+      .split('|')
+      .map((g) => g.trim())
+      .filter(Boolean)
+    const model = process.env.ADAPTIVE_LIVE_MODEL ?? 'glm-5.2'
+    const router: RouterClient = createTangleRouterClient({ model })
+
+    // COST GATE: cheap glm-5.2 smoke BEFORE the multi-topic burn.
+    const smoke = await router.chat(
+      [
+        { role: 'system', content: 'Reply with exactly the word: OK' },
+        { role: 'user', content: 'Say OK.' },
+      ],
+      1200,
+    )
+    console.log(`[LIVE smoke] ${model} visible content length=${smoke.trim().length}`)
+    expect(smoke.trim().length).toBeGreaterThan(0)
+
+    const worker = createWebResearchWorker({
+      router,
+      resultsPerQuery: 3,
+      queriesPerGap: 2,
+      maxSourcesPerRound: 8,
+    })
+
+    let anyFetched = false
+    const rows: {
+      goal: string
+      fetched: number
+      single: ArmResult
+      full: ArmResult
+      adaptive: ArmResult
+    }[] = []
+
+    for (const liveGoal of goals) {
+      const liveSpecs: KnowledgeReadinessSpec[] = [
+        defineReadinessSpec({
+          id: 'topic/definition',
+          description: `what ${liveGoal} is and how it works`,
+          query: `${liveGoal} how it works method`,
+          requiredFor: ['ResearchAgent'],
+          importance: 'blocking',
+          minSources: 1,
+          minHits: 1,
+        }),
+      ]
+
+      // 1. REAL fetch ONCE per topic, shared by all three arms.
+      const probeRoot = await mkdtemp(join(tmpdir(), 'ad-live-probe-'))
+      let fetched: ResearchSourceProposal[] = []
+      try {
+        const index = await buildKnowledgeIndex(probeRoot)
+        const readiness = buildEvalKnowledgeBundle({ taskId: liveGoal, index, specs: [] })
+        const contribution = await worker({
+          root: probeRoot,
+          goal: liveGoal,
+          round: 1,
+          index,
+          gaps: liveSpecs.map((s) => ({
+            id: s.id,
+            description: s.description,
+            query: typeof s.metadata?.query === 'string' ? s.metadata.query : s.description,
+            blocking: true,
+          })),
+          readiness,
+        })
+        fetched = contribution.sources ?? []
+        if (fetched.length > 0) anyFetched = true
+      } finally {
+        await rm(probeRoot, { recursive: true, force: true })
+      }
+
+      // Add a controlled duplicate of the first fetched source under a different
+      // (tracking-decorated) URL so the dedup stage has something real to catch —
+      // mirrors/syndication are common in live search and are exactly what the $0
+      // stage is for. If nothing was fetched this is a no-op.
+      const withDup =
+        fetched.length > 0
+          ? [
+              ...fetched,
+              {
+                ...fetched[0],
+                uri: `${fetched[0].uri}${fetched[0].uri.includes('?') ? '&' : '?'}utm_source=mirror&ref=feed`,
+                metadata: { ...fetched[0].metadata, planted_dup: true },
+              },
+            ]
+          : fetched
+
+      const staticWorker =
+        (sources: ResearchSourceProposal[]): ResearchWorker =>
+        async () => ({
+          sources,
+          buildPages: (accepted) =>
+            accepted.length === 0
+              ? undefined
+              : accepted
+                  .map((record) => {
+                    const original = record.metadata?.originalUri
+                    const src = sources.find((s) => s.uri === original)
+                    const slug = String(original ?? record.id)
+                      .replace(/[^a-z0-9]+/gi, '-')
+                      .slice(0, 120)
+                    return [
+                      `---FILE: knowledge/${slug}.md---`,
+                      '---',
+                      `title: ${src?.title ?? record.id}`,
+                      `sources: ["${record.id}"]`,
+                      '---',
+                      `# ${src?.title ?? record.id}`,
+                      src?.text ?? '',
+                      '---END FILE---',
+                    ].join('\n')
+                  })
+                  .join('\n'),
+        })
+
+      const arm = async (
+        driver: ResearchDriver,
+      ): Promise<{ root: string; cost: ReturnType<RouterClient['usage']> }> => {
+        const root = await mkdtemp(join(tmpdir(), 'ad-live-arm-'))
+        const u0 = router.usage()
+        await runTwoAgentResearchLoop({
+          root,
+          goal: liveGoal,
+          worker: staticWorker(withDup),
+          driver,
+          readinessSpecs: liveSpecs,
+          maxRounds: 1,
+        })
+        return { root, cost: diffUsage(u0, router.usage()) }
+      }
+
+      const toResult = async (
+        out: { root: string; cost: ReturnType<RouterClient['usage']> },
+        llmCalls: number,
+      ): Promise<ArmResult> => {
+        const admitted = await admittedCount(out.root)
+        await rm(out.root, { recursive: true, force: true })
+        return {
+          admitted,
+          llmCalls,
+          usd: out.cost.usd,
+          tokens: out.cost.promptTokens + out.cost.completionTokens,
+          wallMs: out.cost.wallMs,
+        }
+      }
+
+      // A. single-agent (no verifier).
+      const singleOut = await arm({ verifySource: () => ({ accept: true }) })
+      const single = await toResult(singleOut, 0)
+
+      // B. full-LLM.
+      const fullOut = await arm(createVerifyingResearchDriver({ router }))
+      const full = await toResult(fullOut, fullOut.cost.chatCalls)
+
+      // C. adaptive — instrument its LLM-stage count via stats().
+      const adaptiveDriver = createAdaptiveResearchDriver({ router })
+      const adaptiveOut = await arm({
+        verifySource: (s, c) => adaptiveDriver.verifySource(s, c),
+      })
+      const adaptive = await toResult(adaptiveOut, adaptiveDriver.stats().llmCalls)
+
+      rows.push({ goal: liveGoal, fetched: withDup.length, single, full, adaptive })
+      console.log(
+        `[LIVE ADAPTIVE ${JSON.stringify(liveGoal)}] fetched=${withDup.length} | ` +
+          `single: admitted=${single.admitted} $0 | ` +
+          `full-LLM: admitted=${full.admitted} calls=${full.llmCalls} $${full.usd.toFixed(4)} tok=${full.tokens} | ` +
+          `adaptive: admitted=${adaptive.admitted} llmCalls=${adaptive.llmCalls} $${adaptive.usd.toFixed(4)} tok=${adaptive.tokens} ` +
+          `(dedup=${adaptiveDriver.stats().dedupRejected} hKeep=${adaptiveDriver.stats().heuristicKept} hDrop=${adaptiveDriver.stats().heuristicDropped})`,
+      )
+    }
+
+    expect(anyFetched).toBe(true)
+
+    // FRONTIER SUMMARY over all topics.
+    const sum = (pick: (r: (typeof rows)[number]) => number) =>
+      rows.reduce((a, r) => a + pick(r), 0)
+    const fullUsd = sum((r) => r.full.usd)
+    const adaptiveUsd = sum((r) => r.adaptive.usd)
+    const fullCalls = sum((r) => r.full.llmCalls)
+    const adaptiveCalls = sum((r) => r.adaptive.llmCalls)
+    const singleAdmitted = sum((r) => r.single.admitted)
+    const fullAdmitted = sum((r) => r.full.admitted)
+    const adaptiveAdmitted = sum((r) => r.adaptive.admitted)
+    const callSaving = fullCalls > 0 ? 1 - adaptiveCalls / fullCalls : 0
+    const usdSaving = fullUsd > 0 ? 1 - adaptiveUsd / fullUsd : 0
+
+    console.log(
+      `[LIVE ADAPTIVE SUMMARY] admitted single=${singleAdmitted} full=${fullAdmitted} adaptive=${adaptiveAdmitted} | ` +
+        `LLM calls full=${fullCalls} adaptive=${adaptiveCalls} (${(callSaving * 100).toFixed(0)}% fewer) | ` +
+        `USD full=$${fullUsd.toFixed(4)} adaptive=$${adaptiveUsd.toFixed(4)} (${(usdSaving * 100).toFixed(0)}% cheaper)`,
+    )
+
+    // The adaptive driver must NEVER spend MORE than the full-LLM arm: its LLM
+    // calls are a subset of full-LLM's (the ambiguous tail), plus $0 stages. That
+    // is the one hard invariant; the magnitude of the saving and where adaptive
+    // lands on the cleanliness frontier are reported honestly in the doc.
+    expect(adaptiveCalls).toBeLessThanOrEqual(fullCalls)
+    expect(adaptiveUsd).toBeLessThanOrEqual(fullUsd + 1e-9)
+    // Adaptive cleanliness sits between single (admits all) and full-LLM (admits
+    // least) — it cannot admit MORE than the single-agent floor.
+    expect(adaptiveAdmitted).toBeLessThanOrEqual(singleAdmitted)
+  }, 600_000)
+})
+
+function diffUsage(
+  a: ReturnType<RouterClient['usage']>,
+  b: ReturnType<RouterClient['usage']>,
+): ReturnType<RouterClient['usage']> {
+  return {
+    chatCalls: b.chatCalls - a.chatCalls,
+    searchCalls: b.searchCalls - a.searchCalls,
+    promptTokens: b.promptTokens - a.promptTokens,
+    completionTokens: b.completionTokens - a.completionTokens,
+    usd: b.usd - a.usd,
+    wallMs: b.wallMs - a.wallMs,
+  }
+}
diff --git a/tests/loops/claim-grounding-ab.test.ts b/tests/loops/claim-grounding-ab.test.ts
new file mode 100644
index 0000000..3056dfa
--- /dev/null
+++ b/tests/loops/claim-grounding-ab.test.ts
@@ -0,0 +1,519 @@
+import { mkdtemp, rm } from 'node:fs/promises'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'
+import { afterEach, beforeEach, describe, expect, it } from 'vitest'
+import { createClaimGroundingVerifier, withCitedClaim } from '../../src/claim-grounding'
+import {
+  buildEvalKnowledgeBundle,
+  defineReadinessSpec,
+  type KnowledgeReadinessSpec,
+} from '../../src/eval-readiness'
+import { buildKnowledgeIndex } from '../../src/indexer'
+import {
+  type ResearchContribution,
+  type ResearchDriver,
+  type ResearchSourceProposal,
+  type ResearchWorker,
+  runTwoAgentResearchLoop,
+  type WorkerResearchContext,
+} from '../../src/two-agent-research-loop'
+import {
+  createTangleRouterClient,
+  createVerifyingResearchDriver,
+  createWebResearchWorker,
+  type RouterClient,
+} from '../../src/web-research-worker'
+
+// ===========================================================================
+// CLAIM-GROUNDING A/B: does checking each citation's CLAIM against the fetched
+// page text catch an error band the relevance verifier and de-dup CANNOT — and
+// does it do so for MORE quality per dollar than the relevance verifier earns
+// on the de-dup-dominated topic set (docs/results/cost-quality.md)?
+//
+// The cost/quality result showed the relevance verifier's cleanliness win is
+// dominated by DE-DUPLICATION, which a deterministic content-hash captures at
+// ~none of the LLM premium. So the open question is: is there a band where the
+// verifier earns its premium — an error a hash and a relevance judge both miss?
+//
+// MISATTRIBUTION is that band: a source that is on-topic, unique, and real, but
+// whose cited CLAIM does not appear in the page (a fabricated / mis-cited fact).
+//   - de-dup: passes it (it's unique).
+//   - relevance judge: passes it (it's on-topic).
+//   - claim-grounding: REJECTS it (the claim is absent from the fetched text).
+//
+// The grounding check is DETERMINISTIC text presence over `htmlToText` output —
+// executable ground truth, $0 inference — so its value-per-dollar denominator is
+// near-zero, which is the whole point: it catches what the expensive judge
+// can't, for what the cheap rule costs.
+// ===========================================================================
+
+interface PoolEntry {
+  uri: string
+  title: string
+  text: string
+  /** The claim the worker will cite this source for. */
+  claim: string
+  /** True when `claim` is NOT present in `text` — a planted misattribution. */
+  misattributed: boolean
+}
+
+const goal = 'self-speculative decoding'
+
+/**
+ * A controlled pool: each source's TEXT is what a real fetch would return; the
+ * CLAIM is what a citation asserts. Two sources are correctly grounded; two are
+ * misattributed — on-topic, unique, real text, but the cited claim never appears
+ * in the page. A relevance judge would keep all four; de-dup would keep all four
+ * (distinct URLs); only claim-grounding rejects the misattributed two.
+ */
+const pool: PoolEntry[] = [
+  {
+    uri: 'https://arxiv.org/abs/self-spec-decoding',
+    title: 'Draft & Verify: Lossless LLM Acceleration via Self-Speculative Decoding',
+    text:
+      'Self-speculative decoding drafts tokens by skipping intermediate layers, then verifies them ' +
+      'with the full model. It reports a 1.73x speedup on LLaMA-2 with no quality loss.',
+    claim: 'self-speculative decoding reports a 1.73x speedup on LLaMA-2 with no quality loss',
+    misattributed: false,
+  },
+  {
+    uri: 'https://example.org/layer-skip-explainer',
+    title: 'How layer skipping accelerates decoding',
+    text:
+      'By skipping intermediate transformer layers during the draft stage, the same model produces ' +
+      'candidate tokens cheaply, which the full forward pass then verifies in parallel.',
+    claim: 'skipping intermediate layers lets the same model produce candidate tokens cheaply',
+    misattributed: false,
+  },
+  {
+    // On-topic, real text, UNIQUE url — but the cited claim (a 5x speedup on a
+    // separate draft network) appears NOWHERE in the page. Misattribution.
+    uri: 'https://blog.example.com/spec-decoding-overview',
+    title: 'A short overview of speculative decoding',
+    text:
+      'Speculative decoding uses a small draft model to propose tokens that a larger target model ' +
+      'verifies. Self-speculative variants reuse a single model instead of a separate draft model.',
+    claim: 'achieves a 5x speedup using a separately trained draft transformer network',
+    misattributed: true,
+  },
+  {
+    // On-topic, real text, UNIQUE url — cited claim invents a benchmark/number
+    // the page never states. Misattribution.
+    uri: 'https://news.example.com/decoding-roundup',
+    title: 'Decoding methods roundup',
+    text:
+      'This roundup compares several inference-time decoding strategies and notes that speculative ' +
+      'approaches trade extra compute for lower latency on autoregressive generation.',
+    claim: 'measured a 4.8x speedup on the GPT-4 MT-bench benchmark with zero accuracy drop',
+    misattributed: true,
+  },
+]
+
+const specs: KnowledgeReadinessSpec[] = [
+  defineReadinessSpec({
+    id: 'topic/definition',
+    description: `what ${goal} is and how it works`,
+    query: `${goal} how it works method`,
+    requiredFor: ['ResearchAgent'],
+    importance: 'blocking',
+    minSources: 1,
+    minHits: 1,
+  }),
+]
+
+/** A worker that proposes the whole pool, each source carrying its cited claim. */
+function poolWorker(onPass: () => void): ResearchWorker {
+  return async (_ctx: WorkerResearchContext): Promise<ResearchContribution> => {
+    onPass()
+    const sources: ResearchSourceProposal[] = pool.map((entry) =>
+      withCitedClaim(
+        {
+          uri: entry.uri,
+          title: entry.title,
+          text: entry.text,
+          metadata: { planted_misattributed: entry.misattributed },
+        },
+        entry.claim,
+      ),
+    )
+    return {
+      sources,
+      buildPages: (accepted) =>
+        accepted
+          .map((record) => {
+            const original = record.metadata?.originalUri
+            const entry = pool.find((p) => p.uri === original)
+            const slug = String(original ?? record.id).replace(/[^a-z0-9]+/gi, '-')
+            return [
+              `---FILE: knowledge/${slug}.md---`,
+              '---',
+              `title: ${entry?.title ?? record.id}`,
+              `sources: ["${record.id}"]`,
+              '---',
+              `# ${entry?.title ?? record.id}`,
+              entry?.text ?? '',
+              '---END FILE---',
+            ].join('\n')
+          })
+          .join('\n'),
+      notes: `proposed ${sources.length} sources (2 grounded, 2 misattributed)`,
+    }
+  }
+}
+
+/** A no-op verifier: accepts everything (the "no verifier" arm). */
+const acceptAllDriver: ResearchDriver = { verifySource: () => ({ accept: true }) }
+
+/** A relevance-only verifier stand-in: every pool source IS on-topic, so accept all. */
+const relevanceOnlyDriver: ResearchDriver = {
+  verifySource: (source: ResearchSourceProposal) => {
+    // Everything in the pool mentions decoding/tokens/model → on-topic → accept.
+    // This is exactly why relevance can't catch misattribution: the page IS
+    // relevant; only its cited claim is wrong.
+    const onTopic = /decod|token|model|layer/i.test(source.text)
+    return onTopic ? { accept: true } : { accept: false, reason: 'off-topic' }
+  },
+}
+
+/** Count how many planted-misattributed sources reached the KB. */
+async function misattributedAdmitted(root: string): Promise<number> {
+  const index = await buildKnowledgeIndex(root)
+  return index.sources.filter((s) => s.metadata?.planted_misattributed === true).length
+}
+
+async function admittedCount(root: string): Promise<number> {
+  const index = await buildKnowledgeIndex(root)
+  return index.sources.length
+}
+
+let groundRoot: string
+let relevanceRoot: string
+let noneRoot: string
+
+beforeEach(async () => {
+  groundRoot = await mkdtemp(join(tmpdir(), 'cg-ground-'))
+  relevanceRoot = await mkdtemp(join(tmpdir(), 'cg-relevance-'))
+  noneRoot = await mkdtemp(join(tmpdir(), 'cg-none-'))
+})
+afterEach(async () => {
+  await rm(groundRoot, { recursive: true, force: true })
+  await rm(relevanceRoot, { recursive: true, force: true })
+  await rm(noneRoot, { recursive: true, force: true })
+})
+
+describe('claim-grounding A/B (offline, controlled): catches misattribution dedup+relevance miss', () => {
+  it('only the claim-grounding verifier rejects the planted misattributions', async () => {
+    let groundPasses = 0
+    await runTwoAgentResearchLoop({
+      root: groundRoot,
+      goal,
+      worker: poolWorker(() => {
+        groundPasses += 1
+      }),
+      driver: { verifySource: createClaimGroundingVerifier() },
+      readinessSpecs: specs,
+      maxRounds: 1,
+    })
+    await runTwoAgentResearchLoop({
+      root: relevanceRoot,
+      goal,
+      worker: poolWorker(() => {}),
+      driver: relevanceOnlyDriver,
+      readinessSpecs: specs,
+      maxRounds: 1,
+    })
+    await runTwoAgentResearchLoop({
+      root: noneRoot,
+      goal,
+      worker: poolWorker(() => {}),
+      driver: acceptAllDriver,
+      readinessSpecs: specs,
+      maxRounds: 1,
+    })
+
+    const groundMis = await misattributedAdmitted(groundRoot)
+    const relevanceMis = await misattributedAdmitted(relevanceRoot)
+    const noneMis = await misattributedAdmitted(noneRoot)
+    const groundAdmitted = await admittedCount(groundRoot)
+
+    console.log(
+      `[claim-grounding A/B offline] misattributed admitted — ` +
+        `grounding=${groundMis} relevance=${relevanceMis} no-verifier=${noneMis} ` +
+        `(grounding kept ${groundAdmitted}/${pool.length} sources)`,
+    )
+
+    // The band: relevance + no-verifier let BOTH misattributions through (they
+    // are on-topic + unique); only claim-grounding rejects them.
+    expect(groundMis).toBe(0)
+    expect(relevanceMis).toBe(2)
+    expect(noneMis).toBe(2)
+    // Grounding still keeps the two correctly-grounded sources (no over-rejection).
+    expect(groundAdmitted).toBe(2)
+    expect(groundPasses).toBe(1)
+  })
+})
+
+// ===========================================================================
+// LIVE claim-grounding A/B — the real evidence. Skipped offline (no creds).
+//
+// Runs a REAL web-research worker (glm-5.2 query-gen → live /v1/search →
+// politeFetch → htmlToText) on a topic set, then INJECTS a controlled fraction
+// of MISATTRIBUTED citations (real fetched page + a deliberately-wrong claim) so
+// there is a measurable correctable band. Three verifier arms gate the SAME
+// proposals on the SAME topics:
+//
+//   A. no-verifier      — accept all (the floor).
+//   B. relevance (LLM)  — the shipped createVerifyingResearchDriver: 1 LLM call
+//                         per source, judges on-topic relevance.
+//   C. claim-grounding  — deterministic groundClaimInText over the fetched text:
+//                         $0 inference, rejects misattributions.
+//
+// Using #36's RouterClient.usage() we diff each arm's tokens/$/latency/calls and
+// report VALUE PER DOLLAR = misattributions-caught ÷ marginal-USD. Arm C's
+// denominator is ~$0, so if it catches the injected misattributions it dominates
+// arm B on this band: the premium the relevance judge could not earn on the
+// de-dup-dominated set.
+//
+// Gate: AGENT_KNOWLEDGE_LIVE=1 + TANGLE_API_KEY with glm-5.2 credits.
+//   CLAIM_GROUNDING_LIVE_GOALS  — `|`-separated topics
+//   CLAIM_GROUNDING_LIVE_MODEL  — router chat model (default glm-5.2)
+// ===========================================================================
+
+interface ArmCost {
+  chatCalls: number
+  tokens: number
+  usd: number
+  wallMs: number
+}
+
+function diffCost(
+  a: ReturnType<RouterClient['usage']>,
+  b: ReturnType<RouterClient['usage']>,
+): ArmCost {
+  return {
+    chatCalls: b.chatCalls - a.chatCalls,
+    tokens: b.promptTokens + b.completionTokens - a.promptTokens - a.completionTokens,
+    usd: b.usd - a.usd,
+    wallMs: b.wallMs - a.wallMs,
+  }
+}
+
+/**
+ * Take the sources a real worker fetched and (1) attach a GROUNDED claim built
+ * from the page's own text, then (2) if there are at least two sources, corrupt
+ * every `corruptEvery`-th one into a MISATTRIBUTION by swapping in a claim whose
+ * specific words are not in the page. Returns the annotated sources plus the
+ * count corrupted, so all arms share one proposal set with a known band.
+ */
+function plantMisattributions(
+  sources: ResearchSourceProposal[],
+  corruptEvery: number,
+): { annotated: ResearchSourceProposal[]; planted: number } {
+  let planted = 0
+  const fabricated = [
+    'this page reports a 7.3x speedup on the FluxBench-9000 benchmark with zero accuracy loss',
+    'the authors trained a separate 12-billion-parameter draft transformer on proprietary data',
+    'results show a 94.2 BLEU score on the never-released Zephyr-XL evaluation suite',
+  ]
+  const annotated = sources.map((source, i) => {
+    if (sources.length >= 2 && i % corruptEvery === corruptEvery - 1) {
+      const claim = fabricated[planted % fabricated.length] ?? fabricated[0]
+      planted += 1
+      return withCitedClaim(
+        { ...source, metadata: { ...source.metadata, planted_misattributed: true } },
+        claim,
+      )
+    }
+    // GROUNDED claim: the first sentence of the page's own text (verbatim-present
+    // by construction), so a correct citation grounds.
+    const firstSentence = source.text.split(/(?<=[.!?])\s/)[0]?.trim() || source.text.slice(0, 160)
+    return withCitedClaim(
+      { ...source, metadata: { ...source.metadata, planted_misattributed: false } },
+      firstSentence,
+    )
+  })
+  return { annotated, planted }
+}
+
+describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)('live: claim-grounding A/B per dollar', () => {
+  it('three verifier arms on real web sources with a planted misattribution band', async () => {
+    const goals = (
+      process.env.CLAIM_GROUNDING_LIVE_GOALS ??
+      'self-speculative decoding|rotary position embeddings|grouped-query attention'
+    )
+      .split('|')
+      .map((g) => g.trim())
+      .filter(Boolean)
+    const model = process.env.CLAIM_GROUNDING_LIVE_MODEL ?? 'glm-5.2'
+    const router: RouterClient = createTangleRouterClient({ model })
+
+    // COST GATE: cheap glm-5.2 smoke BEFORE the multi-topic burn.
+    const smoke = await router.chat(
+      [
+        { role: 'system', content: 'Reply with exactly the word: OK' },
+        { role: 'user', content: 'Say OK.' },
+      ],
+      1200,
+    )
+    console.log(`[LIVE smoke] ${model} visible content length=${smoke.trim().length}`)
+    expect(smoke.trim().length).toBeGreaterThan(0)
+
+    const worker = createWebResearchWorker({
+      router,
+      resultsPerQuery: 3,
+      queriesPerGap: 1,
+      maxSourcesPerRound: 6,
+    })
+    const relevanceDriver = createVerifyingResearchDriver({ router })
+    const groundingDriver: ResearchDriver = { verifySource: createClaimGroundingVerifier() }
+
+    let totalPlanted = 0
+    let anyFetched = false
+    const rows: {
+      goal: string
+      planted: number
+      caught: Record<'none' | 'relevance' | 'grounding', number>
+      cost: Record<'none' | 'relevance' | 'grounding', ArmCost>
+    }[] = []
+
+    for (const liveGoal of goals) {
+      const liveSpecs: KnowledgeReadinessSpec[] = [
+        defineReadinessSpec({
+          id: 'topic/definition',
+          description: `what ${liveGoal} is and how it works`,
+          query: `${liveGoal} how it works method`,
+          requiredFor: ['ResearchAgent'],
+          importance: 'blocking',
+          minSources: 1,
+          minHits: 1,
+        }),
+      ]
+      // 1. REAL fetch ONCE per topic, shared by all three arms.
+      const probeRoot = await mkdtemp(join(tmpdir(), 'cg-live-probe-'))
+      let annotated: ResearchSourceProposal[] = []
+      let planted = 0
+      try {
+        const index = await buildKnowledgeIndex(probeRoot)
+        const readiness = buildEvalKnowledgeBundle({ taskId: liveGoal, index, specs: [] })
+        const contribution = await worker({
+          root: probeRoot,
+          goal: liveGoal,
+          round: 1,
+          index,
+          gaps: liveSpecs.map((s) => ({
+            id: s.id,
+            description: s.description,
+            query: typeof s.metadata?.query === 'string' ? s.metadata.query : s.description,
+            blocking: true,
+          })),
+          readiness,
+        })
+        const fetched = contribution.sources ?? []
+        if (fetched.length > 0) anyFetched = true
+        const result = plantMisattributions(fetched, 2)
+        annotated = result.annotated
+        planted = result.planted
+        totalPlanted += planted
+      } finally {
+        await rm(probeRoot, { recursive: true, force: true })
+      }
+
+      // 2. Run each arm's verifier over the SAME annotated proposals, diffing cost.
+      const staticWorker: ResearchWorker = async () => ({
+        sources: annotated,
+        buildPages: (accepted) =>
+          accepted.length === 0
+            ? undefined
+            : accepted
+                .map((record) => {
+                  const original = record.metadata?.originalUri
+                  const src = annotated.find((s) => s.uri === original)
+                  const slug = String(original ?? record.id)
+                    .replace(/[^a-z0-9]+/gi, '-')
+                    .slice(0, 120)
+                  return [
+                    `---FILE: knowledge/${slug}.md---`,
+                    '---',
+                    `title: ${src?.title ?? record.id}`,
+                    `sources: ["${record.id}"]`,
+                    '---',
+                    `# ${src?.title ?? record.id}`,
+                    src?.text ?? '',
+                    '---END FILE---',
+                  ].join('\n')
+                })
+                .join('\n'),
+      })
+
+      const caught: Record<'none' | 'relevance' | 'grounding', number> = {
+        none: 0,
+        relevance: 0,
+        grounding: 0,
+      }
+      const cost: Record<'none' | 'relevance' | 'grounding', ArmCost> = {
+        none: { chatCalls: 0, tokens: 0, usd: 0, wallMs: 0 },
+        relevance: { chatCalls: 0, tokens: 0, usd: 0, wallMs: 0 },
+        grounding: { chatCalls: 0, tokens: 0, usd: 0, wallMs: 0 },
+      }
+      for (const arm of ['none', 'relevance', 'grounding'] as const) {
+        const driver =
+          arm === 'none' ? acceptAllDriver : arm === 'relevance' ? relevanceDriver : groundingDriver
+        const root = await mkdtemp(join(tmpdir(), `cg-live-${arm}-`))
+        try {
+          const u0 = router.usage()
+          await runTwoAgentResearchLoop({
+            root,
+            goal: liveGoal,
+            worker: staticWorker,
+            driver,
+            readinessSpecs: liveSpecs,
+            maxRounds: 1,
+          })
+          cost[arm] = diffCost(u0, router.usage())
+          const admittedMis = await misattributedAdmitted(root)
+          // caught = planted − admitted misattributions.
+          caught[arm] = planted - admittedMis
+        } finally {
+          await rm(root, { recursive: true, force: true })
+        }
+      }
+
+      rows.push({ goal: liveGoal, planted, caught, cost })
+      console.log(
+        `[LIVE CG ${JSON.stringify(liveGoal)}] planted=${planted} ` +
+          `caught none=${caught.none} relevance=${caught.relevance} grounding=${caught.grounding} | ` +
+          `$ none=${cost.none.usd.toFixed(4)} relevance=${cost.relevance.usd.toFixed(4)} grounding=${cost.grounding.usd.toFixed(4)} | ` +
+          `calls relevance=${cost.relevance.chatCalls} grounding=${cost.grounding.chatCalls}`,
+      )
+    }
+
+    expect(anyFetched).toBe(true)
+
+    // VALUE PER DOLLAR over all topics: misattributions caught ÷ marginal USD.
+    const totalCaught = (arm: 'none' | 'relevance' | 'grounding') =>
+      rows.reduce((acc, r) => acc + r.caught[arm], 0)
+    const totalUsd = (arm: 'none' | 'relevance' | 'grounding') =>
+      rows.reduce((acc, r) => acc + r.cost[arm].usd, 0)
+    const relevanceCaught = totalCaught('relevance')
+    const groundingCaught = totalCaught('grounding')
+    const relevanceUsd = totalUsd('relevance')
+    const groundingUsd = totalUsd('grounding')
+    const perDollar = (caughtN: number, usd: number) =>
+      usd > 0 ? caughtN / usd : caughtN > 0 ? Number.POSITIVE_INFINITY : 0
+
+    console.log(
+      `[LIVE CG SUMMARY] planted=${totalPlanted} ` +
+        `caught relevance=${relevanceCaught} grounding=${groundingCaught} | ` +
+        `$ relevance=${relevanceUsd.toFixed(4)} grounding=${groundingUsd.toFixed(4)} | ` +
+        `per-$ relevance=${perDollar(relevanceCaught, relevanceUsd).toFixed(1)} ` +
+        `grounding=${perDollar(groundingCaught, groundingUsd).toFixed(1)}`,
+    )
+
+    // The claim-grounding arm should catch at least as many misattributions as
+    // the relevance arm (it is the band relevance cannot see) at no higher
+    // marginal cost (deterministic, $0 inference). This is the value-per-dollar
+    // claim the doc reports; assert the DIRECTION here, the magnitude in the doc.
+    expect(groundingCaught).toBeGreaterThanOrEqual(relevanceCaught)
+    expect(groundingUsd).toBeLessThanOrEqual(relevanceUsd)
+  }, 600_000)
+})
diff --git a/tests/loops/research-loop-equal-compute.test.ts b/tests/loops/research-loop-equal-compute.test.ts
index 8ae6544..397dbaf 100644
--- a/tests/loops/research-loop-equal-compute.test.ts
+++ b/tests/loops/research-loop-equal-compute.test.ts
@@ -627,6 +627,9 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)(
         const twoRoot = await mkdtemp(join(tmpdir(), 'live-two-'))
         const singleRoot = await mkdtemp(join(tmpdir(), 'live-single-'))
         try {
+          // Snapshot cumulative router cost so each arm's token/$/latency/calls
+          // can be diffed out — the COST half of the cost/quality result.
+          const u0 = router.usage()
           // TWO-AGENT arm: real worker proposes, real LLM driver verifies.
           const two = await runTwoAgentArm(
             twoRoot,
@@ -635,6 +638,7 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)(
             { worker: realWorker, driver: realDriver },
             specs,
           )
+          const u1 = router.usage()
           // SINGLE-AGENT arm: the SAME real worker, NO verifier gate, more iters
           // to spend the same agent-pass budget the two-agent loop burns on
           // verification.
@@ -645,6 +649,19 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)(
             (ctx, onPass) => realWorkerPropose(realWorker, ctx, onPass),
             specs,
           )
+          const u2 = router.usage()
+          const twoCost = {
+            chatCalls: u1.chatCalls - u0.chatCalls,
+            tokens: u1.promptTokens + u1.completionTokens - u0.promptTokens - u0.completionTokens,
+            usd: u1.usd - u0.usd,
+            wallMs: u1.wallMs - u0.wallMs,
+          }
+          const singleCost = {
+            chatCalls: u2.chatCalls - u1.chatCalls,
+            tokens: u2.promptTokens + u2.completionTokens - u1.promptTokens - u1.completionTokens,
+            usd: u2.usd - u1.usd,
+            wallMs: u2.wallMs - u1.wallMs,
+          }
 
           const twoAdmitted = await admittedSourceCount(twoRoot)
           const singleAdmitted = await admittedSourceCount(singleRoot)
@@ -657,8 +674,10 @@ describe.skipIf(!process.env.AGENT_KNOWLEDGE_LIVE)(
 
           console.log(
             `[LIVE A/B ${JSON.stringify(liveGoal)} @ B<=${budgetPasses}] ` +
-              `two-agent: passes=${two.passes} admitted=${twoAdmitted} coverage=${twoCoverage.toFixed(2)} | ` +
-              `single-agent: passes=${single.passes} admitted=${singleAdmitted} coverage=${singleCoverage.toFixed(2)}`,
+              `two-agent: passes=${two.passes} admitted=${twoAdmitted} coverage=${twoCoverage.toFixed(2)} ` +
+              `calls=${twoCost.chatCalls} tok=${twoCost.tokens} $${twoCost.usd.toFixed(4)} ${twoCost.wallMs}ms | ` +
+              `single-agent: passes=${single.passes} admitted=${singleAdmitted} coverage=${singleCoverage.toFixed(2)} ` +
+              `calls=${singleCost.chatCalls} tok=${singleCost.tokens} $${singleCost.usd.toFixed(4)} ${singleCost.wallMs}ms`,
           )
         } finally {
           await rm(twoRoot, { recursive: true, force: true })