tangle-network · drewstone · Jun 25, 2026 · Jun 25, 2026 · Jun 25, 2026 · Jun 25, 2026
diff --git a/docs/results/adaptive.md b/docs/results/adaptive.md
@@ -0,0 +1,120 @@
+# Adaptive topology: spend the LLM verifier only when it pays
+
+The cost/quality A/B (`docs/results/cost-quality.md`) found the LLM relevance
+verifier's cleanliness win is dominated by **de-duplication** — which a
+deterministic content-hash / canonical-URL check captures at ~none of the LLM
+premium — and that an LLM check only earns its dollar on the off-scope tail. The
+production move it named was: do the cheap deterministic work first, reserve the
+LLM for the ambiguous tail. `createAdaptiveResearchDriver`
+(`src/adaptive-driver.ts`) is that driver, and this is its measurement.
+
+Per candidate source the adaptive driver runs three stages, cheapest first,
+stopping at the first that decides:
+
+1. **Dedup ($0).** Reject a source whose canonical URL (scheme/`www`/trailing
+   slash/tracking params stripped) or normalized-text content hash matches one
+   already accepted this round or in the KB.
+2. **Heuristic triage ($0).** Classify a unique survivor with host/title/length
+   signals only: an authoritative host (arxiv, `*.edu`, `*.gov`, official docs,
+   github, …) with a substantial body is **kept**; an obvious spam/listicle
+   title or a too-thin body is **dropped**; everything else is **ambiguous**.
+3. **LLM escalation ($).** Only ambiguous survivors reach the shipped LLM
+   relevance verifier (`createVerifyingResearchDriver`) — one call each.
+
+## Live frontier, n=5 topics (glm-5.2)
+
+Real web-research worker fetches each topic once; the same fetched proposals
+(plus one planted tracking-decorated mirror of the first source, so the dedup
+stage has a real duplicate to catch) are gated through all three drivers. Cost
+is the per-arm `RouterClient.usage()` diff (#36). Total spend for the run:
+**$0.033**.
+
+| topic | fetched | single admit | full-LLM admit / calls / $ | adaptive admit / LLM calls / $ |
+|---|---|---|---|---|
+| self-speculative decoding | 3 | 3 | 1 / 3 / $0.0027 | 2 / **0** / **$0.0000** |
+| rotary position embeddings | 3 | 3 | 1 / 3 / $0.0031 | 2 / **0** / **$0.0000** |
+| grouped-query attention | 7 | 7 | 3 / 7 / $0.0072 | 6 / 3 / $0.0030 |
+| KV-cache quantization | 5 | 5 | 3 / 5 / $0.0052 | 4 / **0** / **$0.0000** |
+| LoRA fine-tuning | 7 | 7 | 4 / 7 / $0.0079 | 6 / 3 / $0.0037 |
+| **total** | **25** | **25** | **12 / 25 / $0.0261** | **20 / 6 / $0.0068** |
+
+**Cost.** Adaptive cuts LLM verifier calls **76%** (25 → 6) and dollars **74%**
+($0.0261 → $0.0068). On 3 of the 5 topics it spent **zero** LLM calls — every
+unique survivor was on an authoritative host, so the $0 stages decided
+everything.
+
+## The honest reading: adaptive is a frontier POINT, not a free lunch
+
+Admitted-source counts (lower = cleaner KB): **single 25, adaptive 20,
+full-LLM 12**. Adaptive sits **between** the two:
+
+- It removes **5 of the 13 sources** the full-LLM judge rejects that the
+  single-agent loop keeps (every one of them a real duplicate caught by the $0
+  dedup stage — exactly the de-dup-dominated win the cost/quality result
+  predicted).
+- It does **NOT** match full-LLM cleanliness. The remaining 8 sources full-LLM
+  rejects, adaptive keeps. The cause is structural and visible in the trace: on
+  this topic set every non-duplicate survivor landed on an authoritative host
+  (arxiv / github / official docs), so the heuristic **kept** it without ever
+  asking the LLM — and the LLM, when full-LLM did ask it, judged several of
+  those same authoritative pages not-quite-on-topic and dropped them. The host
+  prior is coarser than the relevance judge.
+
+So the frontier tradeoff is concrete: **adaptive recovers the deterministic
+de-dup half of full-LLM's cleanliness for ~26% of full-LLM's dollars, and gives
+up the relevance-judgment half.** Whether that is the right point depends on the
+cost of a kept-but-marginal source. If a slightly-off-topic authoritative page
+is cheap to carry, adaptive dominates. If every admitted source must clear a
+relevance bar, the host heuristic is too permissive and you want the full LLM —
+or a tightened heuristic.
+
+## Where the heuristic is weak — stated plainly
+
+The escalation count is the diagnostic. On 3 of 5 topics it was **zero**: the
+heuristic never deferred to the LLM, so on those topics adaptive is a
+**pure host/title/length rule**, and its cleanliness is exactly that rule's
+cleanliness — no smarter than "trust arxiv/github, drop spam." That is fine when
+the worker's sources are dominated by authoritative hosts (as here), but it
+means the LLM's relevance judgment is contributing nothing on those topics, by
+construction. The two topics where adaptive *did* escalate (grouped-query
+attention, LoRA) are where some survivors were on unknown hosts — and there the
+3 LLM calls per topic are the off-scope tail the verifier is actually for.
+
+The heuristic would mis-route in two directions a richer worker would expose,
+neither seen on this authoritative-host-heavy set:
+
+- **False keep:** an authoritative-host page that is off-topic or low-value
+  (an arxiv paper on an unrelated subject) is kept without the LLM ever seeing
+  it. The host prior cannot catch this; only the relevance judge can.
+- **False drop:** a genuinely good source on an unknown blog/host with a
+  spam-shaped title, or under the 400-char body floor, is dropped before the LLM
+  could rescue it.
+
+## What this changes
+
+The deployable recommendation from the cost/quality result was "deterministic
+dedup first, reserve the LLM for the tail." This driver ships that and the
+measurement confirms the **cost** half cleanly (76% fewer calls, 74% cheaper)
+and qualifies the **quality** half honestly: adaptive captures the de-dup
+cleanliness (the dominant, free win) but not the LLM's relevance cleanliness,
+because the host heuristic resolves authoritative survivors without asking. For
+a worker whose sources are mostly authoritative, adaptive is the right frontier
+point. For one whose junk is on-topic-looking pages on unknown hosts, the
+ambiguous tail grows and adaptive converges toward full-LLM cost — which is the
+correct behavior: it pays for the LLM exactly when the cheap signals can't
+decide.
+
+## Run it
+
+```
+# offline (controlled wiring + escalates-only-ambiguous proof, no creds)
+pnpm exec vitest run tests/loops/adaptive-ab.test.ts
+
+# live three-topology frontier (needs TANGLE_API_KEY with glm-5.2 credits)
+AGENT_KNOWLEDGE_LIVE=1 \
+ADAPTIVE_LIVE_GOALS="self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA fine-tuning" \
+pnpm exec vitest run tests/loops/adaptive-ab.test.ts -t "three-topology"
+```
+
+A cheap one-call glm-5.2 smoke gates the multi-topic burn (fails fast if the key
+or the reasoning-token floor is broken) before any dollars are spent.
diff --git a/docs/results/claim-grounding.md b/docs/results/claim-grounding.md
@@ -0,0 +1,97 @@
+# Claim-grounding: the band where the verifier earns its dollar
+
+`docs/results/cost-quality.md` found the relevance verifier's cleanliness win is
+**dominated by de-duplication** — a deterministic content-hash captures most of it
+at ~none of the LLM premium. So the open question was: is there an error band where
+a verifier earns its cost — something a hash AND a relevance judge both miss?
+
+**Yes: misattributed citations.** A source that is on-topic, unique, and real, but
+whose cited CLAIM does not appear in the page (the LLM wrote a plausible sentence
+and hung a real URL off it). De-dup passes it (it's unique). A relevance judge
+passes it (the page is on-topic). Only checking the claim against the fetched text
+catches it — and that check is **deterministic text presence, $0 inference**.
+
+## The mode
+
+Each proposed source now carries the specific claim it is cited for
+(`withCitedClaim` → `metadata.citedClaim`). The verifier
+(`createClaimGroundingVerifier`) runs `groundClaimInText(claim, pageText)` over the
+`htmlToText` output of the page the worker actually fetched — verbatim, normalized
+(punctuation/whitespace-insensitive), or a ≥70% content-word overlap close
+paraphrase. A claim that isn't present is rejected as **misattributed**. The oracle
+is text presence, not a model call, so it composes with the LLM relevance verifier
+(reject off-topic AND misattributed) or runs alone at zero inference cost.
+
+## Live A/B (glm-5.2, real web fetch, planted misattribution band)
+
+Real worker (glm-5.2 query-gen → live `/v1/search` → `politeFetch` → `htmlToText`)
+fetches the sources once per topic; we then plant ONE misattribution per topic (real
+fetched page + a deliberately-wrong claim) and run three verifier arms over the SAME
+proposals. Cost diffed per arm with the #36 `RouterClient.usage()` instrumentation.
+
+| n=5 topics | misattributions caught | marginal $ | $/topic | per-$ caught |
+|---|---|---|---|---|
+| no-verifier | 0 / 5 | $0.0000 | — | — |
+| relevance (LLM judge) | **4 / 5** | $0.0157 | ~$0.0031 | 254 |
+| claim-grounding (text) | **5 / 5** | **$0.0000** | $0 | ∞ |
+
+Per-topic (caught relevance / grounding): self-speculative decoding 1/1, rotary
+position embeddings 1/1, grouped-query attention 1/1, **KV-cache quantization 0/1**,
+LoRA 1/1. (An earlier 3-topic run missed self-speculative decoding instead — the
+miss moves around; it is not a fixed topic.)
+
+**Reading.** Claim-grounding catches every misattribution at **$0**; the relevance
+judge catches most but **misses one in five at ~$0.003/topic**. The miss is the
+point: the relevance verifier only ever sees the page text, never the cited claim,
+so it is *structurally blind* to misattribution. It catches one only by accident —
+when the fabricated claim happens to also make the page read off-topic
+(e.g. a "12-billion-parameter draft transformer" claim on a rotary-embeddings page).
+When the fabrication stays on-topic (the KV-cache case), the judge waves it through.
+
+So on THIS band the verifier-per-dollar comparison inverts the cost/quality result:
+there, the LLM verifier bought a dedup-shaped gain a free hash already captures —
+expensive for what a cheap rule does. Here the cheap, deterministic check
+**dominates** the expensive judge: it catches strictly more (5/5 vs 4/5) at strictly
+less ($0 vs $0.0157). The verifier earns its dollar on misattribution; it does not on
+de-duplication.
+
+## Why this is a real correctable band (not dedup, not relevance)
+
+- **Not de-duplication.** Every planted source has a unique URL and unique text; a
+  content-hash / canonical-URL dedup keeps all of them.
+- **Not generic relevance.** Every planted source is on-topic; the relevance judge
+  (and the offline relevance stand-in) accept them. The error is in the *claim*, not
+  the *topic*.
+- **Executable ground truth.** The check is presence/close-paraphrase of the claim
+  in the fetched text — deployable in production with no oracle and no model call.
+
+The offline arm proves the floor with a controlled 4-source pool (2 grounded, 2
+misattributed): claim-grounding admits **0/2** misattributions and keeps **2/2**
+grounded sources, while relevance and no-verifier both admit **2/2**.
+
+## Threats to validity
+
+- **n=5 topics, 1 misattribution each.** The direction (grounding ≥ relevance caught,
+  at ≤ cost) is asserted in the test on every run; the magnitude is small-n. The
+  relevance miss-rate (1/5 here, 1/3 earlier) is an existence proof of the blind
+  spot, not a calibrated rate.
+- **Planted misattributions, not naturally-occurring ones.** Like the cost/quality
+  offline floor, the misattribution is injected so the band is measurable. It models
+  the real LLM citation-fabrication failure but does not measure its base rate in the
+  wild — that needs a corpus of model-written citations checked by hand.
+- **The grounding oracle is conservative.** A real paraphrase whose inflected words
+  differ from the page ("drafts" vs "draft") can score below 0.7 and be rejected —
+  a false-positive misattribution flag. `minOverlap` tunes this; the worker should
+  cite the page's own key terms (the `createClaimDecorator` extractor is told to).
+
+## Run it
+
+```bash
+# offline floor (no creds)
+pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "offline"
+
+# live A/B (creds-gated). A cheap glm-5.2 smoke runs BEFORE the multi-topic burn.
+AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=… \
+  CLAIM_GROUNDING_LIVE_GOALS='self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA' \
+  pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "three verifier arms"
+```
diff --git a/docs/results/cost-quality.md b/docs/results/cost-quality.md
@@ -0,0 +1,27 @@
+# Cost/quality: the two-agent loop's inference premium
+
+Live 9-topic A/B (glm-5.2, budget B ≤ 4 passes/arm), measured per arm with the
+router-client instrumentation. The original A/B reported only admitted-sources at
+"equal passes" — which charged the two-agent verify step as one pass while it is
+actually N `verifySource` LLM calls. Pricing the calls shows what that hid.
+
+| per topic (mean) | two-agent | single-agent | ratio |
+|---|---|---|---|
+| LLM chat calls | 5.4 | 1.0 | ~5.4× |
+| tokens (in+out) | ~4,900 | ~530 | ~9× |
+| cost (USD) | ~$0.0072 | ~$0.0013 | ~5.5× |
+| latency (wall) | ~37 s | ~11 s | ~3.4× |
+| cleanliness Δ (single − two admitted) | — | — | +1.56, 95% CI [0.33, 2.67] |
+
+Per-topic Δ (single − two admitted) this run: self-speculative decoding +4,
+grouped-query attention 0, rotary position embeddings +1, KV-cache quantization
+**−1**, LoRA +4, ring attention +2, constitutional AI +3, transformer +2, gradient
+descent **−1**. Coverage 1.00 every topic, both arms.
+
+**Reading.** The verifier buys ~1.5–2.7 fewer junk sources for roughly **5× the
+dollars, 9× the tokens, and 3× the latency** — and on two topics it admitted *more*
+than the single agent (the cleanliness signal is real but noisier than the +2.3 /
++2.7 of earlier runs). The cleanliness gain is dominated by de-duplication, so the
+honest production move is a deterministic content-hash / canonical-URL dedup, which
+captures most of the cleanliness at ~none of this premium; reserve an LLM check for
+the off-scope tail. This is the cost half the "equal passes" framing left out.