120 changes: 120 additions & 0 deletions docs/results/adaptive.md
@@ -0,0 +1,120 @@
# Adaptive topology: spend the LLM verifier only when it pays

The cost/quality A/B (`docs/results/cost-quality.md`) found the LLM relevance
verifier's cleanliness win is dominated by **de-duplication** — which a
deterministic content-hash / canonical-URL check captures at ~none of the LLM
premium — and that an LLM check only earns its dollar on the off-scope tail. The
production move it named was: do the cheap deterministic work first, reserve the
LLM for the ambiguous tail. `createAdaptiveResearchDriver`
(`src/adaptive-driver.ts`) is that driver, and this is its measurement.

Per candidate source the adaptive driver runs three stages, cheapest first,
stopping at the first that decides:

1. **Dedup ($0).** Reject a source whose canonical URL (scheme/`www`/trailing
slash/tracking params stripped) or normalized-text content hash matches one
already accepted this round or in the KB.
2. **Heuristic triage ($0).** Classify a unique survivor with host/title/length
signals only: an authoritative host (arxiv, `*.edu`, `*.gov`, official docs,
github, …) with a substantial body is **kept**; an obvious spam/listicle
title or a too-thin body is **dropped**; everything else is **ambiguous**.
3. **LLM escalation ($).** Only ambiguous survivors reach the shipped LLM
relevance verifier (`createVerifyingResearchDriver`) — one call each.
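
The three stages above can be sketched as follows. This is a minimal illustration, not the code in `src/adaptive-driver.ts`: the `Source` shape, the tracking-parameter pattern, the host and spam-title patterns, the `SUBSTANTIAL_BODY` threshold, and the order in which the keep and drop rules are tested are all assumptions. Only the 400-character body floor is taken from this document. The normalized text is used directly as the dedup key where the driver uses a content hash.

```typescript
// Hypothetical shapes and constants, for illustration only.
interface Source { url: string; title: string; text: string }
type Triage = "keep" | "drop" | "ambiguous";

const TRACKING = /^(utm_|fbclid$|gclid$|ref$)/i;           // assumed list
const AUTHORITATIVE = [/(^|\.)arxiv\.org$/, /\.edu$/, /\.gov$/, /(^|\.)github\.com$/];
const SPAM_TITLE = /\b(top \d+|\d+ best|you won't believe)\b/i; // assumed
const MIN_BODY = 400;          // body floor mentioned in this document
const SUBSTANTIAL_BODY = 1000; // assumed threshold for "substantial"

const hostOf = (raw: string): string =>
  new URL(raw).hostname.toLowerCase().replace(/^www\./, "");

// Stage 1 key: drop scheme, `www.`, trailing slash, and tracking params.
function canonicalUrl(raw: string): string {
  const u = new URL(raw);
  const path = u.pathname.replace(/\/+$/, "");
  const params = Array.from(u.searchParams.entries())
    .filter(([k]) => !TRACKING.test(k))
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${k}=${v}`)
    .join("&");
  return hostOf(raw) + path + (params ? `?${params}` : "");
}

const normalizeText = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

// Stage 2: host/title/length signals only, no model call.
function triage(s: Source): Triage {
  const host = hostOf(s.url);
  const authoritative = AUTHORITATIVE.some((re) => re.test(host));
  if (authoritative && s.text.length >= SUBSTANTIAL_BODY) return "keep";
  if (SPAM_TITLE.test(s.title) || s.text.length < MIN_BODY) return "drop";
  return "ambiguous";
}

// Stages 1 to 3 in order; `llmVerify` stands in for the LLM relevance verifier.
async function gate(
  candidates: Source[],
  llmVerify: (s: Source) => Promise<boolean>,
): Promise<{ admitted: Source[]; llmCalls: number }> {
  const seenUrls = new Set<string>();
  const seenText = new Set<string>();
  const admitted: Source[] = [];
  let llmCalls = 0;
  for (const s of candidates) {
    const urlKey = canonicalUrl(s.url);
    const textKey = normalizeText(s.text);
    if (seenUrls.has(urlKey) || seenText.has(textKey)) continue; // stage 1: $0
    const verdict = triage(s);                                   // stage 2: $0
    let keep = verdict === "keep";
    if (verdict === "ambiguous") {                               // stage 3: $
      llmCalls += 1;
      keep = await llmVerify(s);
    }
    if (keep) {
      admitted.push(s);
      seenUrls.add(urlKey);
      seenText.add(textKey);
    }
  }
  return { admitted, llmCalls };
}
```

Only admitted sources are added to the seen sets, which matches "already accepted this round"; seeding the sets from the KB would cover the "or in the KB" case.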

## Live frontier, n=5 topics (glm-5.2)

Real web-research worker fetches each topic once; the same fetched proposals
(plus one planted tracking-decorated mirror of the first source, so the dedup
stage has a real duplicate to catch) are gated through all three drivers. Cost
is the per-arm `RouterClient.usage()` diff (#36). Total spend for the run:
**$0.033**.

| topic | fetched | single admit | full-LLM admit / calls / $ | adaptive admit / LLM calls / $ |
|---|---|---|---|---|
| self-speculative decoding | 3 | 3 | 1 / 3 / $0.0027 | 2 / **0** / **$0.0000** |
| rotary position embeddings | 3 | 3 | 1 / 3 / $0.0031 | 2 / **0** / **$0.0000** |
| grouped-query attention | 7 | 7 | 3 / 7 / $0.0072 | 6 / 3 / $0.0030 |
| KV-cache quantization | 5 | 5 | 3 / 5 / $0.0052 | 4 / **0** / **$0.0000** |
| LoRA fine-tuning | 7 | 7 | 4 / 7 / $0.0079 | 6 / 3 / $0.0037 |
| **total** | **25** | **25** | **12 / 25 / $0.0261** | **20 / 6 / $0.0068** |

**Cost.** Adaptive cuts LLM verifier calls **76%** (25 → 6) and dollars **74%**
($0.0261 → $0.0068). On 3 of the 5 topics it spent **zero** LLM calls — every
unique survivor was on an authoritative host, so the $0 stages decided
everything.
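
The two percentages can be re-derived from the per-topic rows of the table. The snippet below is only a check on that arithmetic; the field names are made up for illustration. Note that the per-topic adaptive costs as printed sum to $0.0067 rather than the reported total of $0.0068, presumably because each row is rounded; either figure rounds to a 74% reduction.

```typescript
// Rows copied from the table above; field names are illustrative.
interface Row { fullCalls: number; fullUsd: number; adaptiveCalls: number; adaptiveUsd: number }

const rows: Row[] = [
  { fullCalls: 3, fullUsd: 0.0027, adaptiveCalls: 0, adaptiveUsd: 0 },      // self-speculative decoding
  { fullCalls: 3, fullUsd: 0.0031, adaptiveCalls: 0, adaptiveUsd: 0 },      // rotary position embeddings
  { fullCalls: 7, fullUsd: 0.0072, adaptiveCalls: 3, adaptiveUsd: 0.0030 }, // grouped-query attention
  { fullCalls: 5, fullUsd: 0.0052, adaptiveCalls: 0, adaptiveUsd: 0 },      // KV-cache quantization
  { fullCalls: 7, fullUsd: 0.0079, adaptiveCalls: 3, adaptiveUsd: 0.0037 }, // LoRA fine-tuning
];

const total = (pick: (r: Row) => number): number =>
  rows.reduce((acc, r) => acc + pick(r), 0);

// Fractional reduction relative to the full-LLM arm.
const callCut = 1 - total((r) => r.adaptiveCalls) / total((r) => r.fullCalls);
const usdCut = 1 - total((r) => r.adaptiveUsd) / total((r) => r.fullUsd);

console.log(`calls cut: ${(callCut * 100).toFixed(0)}%`);
console.log(`dollars cut: ${(usdCut * 100).toFixed(0)}%`);
```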

## The honest reading: adaptive is a frontier POINT, not a free lunch

Admitted-source counts (lower = cleaner KB): **single 25, adaptive 20,
full-LLM 12**. Adaptive sits **between** the two:

- It removes **5 of the 13 sources** the full-LLM judge rejects that the
single-agent loop keeps (every one of them a real duplicate caught by the $0
dedup stage — exactly the de-dup-dominated win the cost/quality result
predicted).
- It does **NOT** match full-LLM cleanliness. The remaining 8 sources full-LLM
rejects, adaptive keeps. The cause is structural and visible in the trace: on
this topic set most non-duplicate survivors (14 of 20) landed on an
authoritative host (arxiv / github / official docs), so the heuristic **kept**
them without ever asking the LLM. When full-LLM did ask it, the LLM judged
several of those same authoritative pages not-quite-on-topic and dropped
them. The host prior is coarser than the relevance judge.

So the frontier tradeoff is concrete: **adaptive recovers the deterministic
de-dup share of full-LLM's cleanliness (5 of 13 rejections) for ~26% of
full-LLM's dollars, and gives up the relevance-judgment share (the other 8).**
Whether that is the right point depends on the
cost of a kept-but-marginal source. If a slightly-off-topic authoritative page
is cheap to carry, adaptive dominates. If every admitted source must clear a
relevance bar, the host heuristic is too permissive and you want the full LLM —
or a tightened heuristic.

## Where the heuristic is weak — stated plainly

The escalation count is the diagnostic. On 3 of 5 topics it was **zero**: the
heuristic never deferred to the LLM, so on those topics adaptive is a
**pure host/title/length rule**, and its cleanliness is exactly that rule's
cleanliness — no smarter than "trust arxiv/github, drop spam." That is fine when
the worker's sources are dominated by authoritative hosts (as here), but it
means the LLM's relevance judgment is contributing nothing on those topics, by
construction. The two topics where adaptive *did* escalate (grouped-query
attention, LoRA) are where some survivors were on unknown hosts — and there the
3 LLM calls per topic are the off-scope tail the verifier is actually for.

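The heuristic mis-routes in two directions that a richer worker would expose
more sharply. The first already appears in mild form in the 8 authoritative
pages full-LLM rejects; the second was not seen on this
authoritative-host-heavy set: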

- **False keep:** an authoritative-host page that is off-topic or low-value
(an arxiv paper on an unrelated subject) is kept without the LLM ever seeing
it. The host prior cannot catch this; only the relevance judge can.
- **False drop:** a genuinely good source on an unknown blog/host with a
spam-shaped title, or under the 400-char body floor, is dropped before the LLM
could rescue it.

## What this changes

The deployable recommendation from the cost/quality result was "deterministic
dedup first, reserve the LLM for the tail." This driver ships that and the
measurement confirms the **cost** half cleanly (76% fewer calls, 74% cheaper)
and qualifies the **quality** half honestly: adaptive captures the de-dup
cleanliness (the dominant, free win) but not the LLM's relevance cleanliness,
because the host heuristic resolves authoritative survivors without asking. For
a worker whose sources are mostly authoritative, adaptive is the right frontier
point. For one whose junk is on-topic-looking pages on unknown hosts, the
ambiguous tail grows and adaptive converges toward full-LLM cost — which is the
correct behavior: it pays for the LLM exactly when the cheap signals can't
decide.
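
A rough way to see the convergence is a linear cost model. This is an illustrative approximation rather than anything the driver computes: $a$ is the fraction of all candidates that reach the LLM, $N$ the number of candidates, and $\bar{c}$ the mean cost per verifier call, here estimated from the full-LLM arm of the table.

```latex
C_{\text{adaptive}} \approx a \cdot N \cdot \bar{c},
\qquad a = \tfrac{6}{25} = 0.24,\quad N = 25,\quad
\bar{c} \approx \tfrac{\$0.0261}{25} \approx \$0.00104
\;\Longrightarrow\;
C_{\text{adaptive}} \approx \$0.0063
```

The measured $0.0068 is slightly higher because the six escalated calls averaged about $0.0011 each. As $a \to 1$ the estimate approaches $N \cdot \bar{c}$, the full-LLM cost, and every duplicate removed at stage 1 keeps $a$ below 1.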

## Run it

```bash
# offline (controlled wiring + escalates-only-ambiguous proof, no creds)
pnpm exec vitest run tests/loops/adaptive-ab.test.ts

# live three-topology frontier (needs TANGLE_API_KEY with glm-5.2 credits)
AGENT_KNOWLEDGE_LIVE=1 \
ADAPTIVE_LIVE_GOALS="self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA fine-tuning" \
pnpm exec vitest run tests/loops/adaptive-ab.test.ts -t "three-topology"
```

A cheap one-call glm-5.2 smoke gates the multi-topic burn (fails fast if the key
or the reasoning-token floor is broken) before any dollars are spent.
97 changes: 97 additions & 0 deletions docs/results/claim-grounding.md
@@ -0,0 +1,97 @@
# Claim-grounding: the band where the verifier earns its dollar

`docs/results/cost-quality.md` found the relevance verifier's cleanliness win is
**dominated by de-duplication** — a deterministic content-hash captures most of it
at ~none of the LLM premium. So the open question was: is there an error band where
a verifier earns its cost — something a hash AND a relevance judge both miss?

**Yes: misattributed citations.** A source that is on-topic, unique, and real, but
whose cited CLAIM does not appear in the page (the LLM wrote a plausible sentence
and hung a real URL off it). De-dup passes it (it's unique). A relevance judge
passes it (the page is on-topic). Only checking the claim against the fetched text
catches it — and that check is **deterministic text presence, $0 inference**.

## The mode

Each proposed source now carries the specific claim it is cited for
(`withCitedClaim` → `metadata.citedClaim`). The verifier
(`createClaimGroundingVerifier`) runs `groundClaimInText(claim, pageText)` over the
`htmlToText` output of the page the worker actually fetched. It accepts a verbatim
match, a normalized (punctuation/whitespace-insensitive) match, or a close paraphrase
with ≥70% content-word overlap. A claim that isn't present is rejected as **misattributed**. The oracle
is text presence, not a model call, so it composes with the LLM relevance verifier
(reject off-topic AND misattributed) or runs alone at zero inference cost.
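
The three acceptance tiers can be sketched as below. This is a hedged illustration, not the shipped `groundClaimInText`: the function name `groundClaimSketch`, the stop-word list, the ASCII-only normalization, and the return labels are assumptions. Only the three tiers and the 0.7 default threshold come from the description above.

```typescript
// Assumed stop-word list; the real content-word filter is not shown in this document.
const STOP = new Set([
  "the", "a", "an", "of", "to", "in", "and", "or", "is", "are",
  "for", "on", "with", "that", "by", "as", "it",
]);

// Lowercase, turn punctuation into spaces, collapse whitespace (ASCII only in this sketch).
const normalize = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();

const contentWords = (s: string): string[] =>
  normalize(s).split(" ").filter((w) => w.length > 0 && !STOP.has(w));

type Grounding = "verbatim" | "normalized" | "paraphrase" | "misattributed";

function groundClaimSketch(claim: string, pageText: string, minOverlap = 0.7): Grounding {
  if (claim.trim() === "") return "misattributed"; // nothing to ground
  if (pageText.includes(claim)) return "verbatim";
  if (normalize(pageText).includes(normalize(claim))) return "normalized";
  const words = contentWords(claim);
  if (words.length === 0) return "misattributed";
  const page = new Set(contentWords(pageText));
  const hits = words.filter((w) => page.has(w)).length;
  // Fraction of the claim's content words that also occur on the page.
  return hits / words.length >= minOverlap ? "paraphrase" : "misattributed";
}

const page =
  "Rotary position embeddings encode relative position by rotating query and key vectors.";
console.log(groundClaimSketch("rotating query and key vectors", page));
console.log(groundClaimSketch("The model uses a 12-billion-parameter draft transformer", page));
```

Because the check reads only the claim and the fetched text, it can run before or after the LLM relevance verifier without changing either verdict.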

## Live A/B (glm-5.2, real web fetch, planted misattribution band)

Real worker (glm-5.2 query-gen → live `/v1/search` → `politeFetch` → `htmlToText`)
fetches the sources once per topic; we then plant ONE misattribution per topic (real
fetched page + a deliberately-wrong claim) and run three verifier arms over the SAME
proposals. Cost diffed per arm with the #36 `RouterClient.usage()` instrumentation.

| verifier arm (n=5 topics) | misattributions caught | marginal $ | $/topic | caught per $ |
|---|---|---|---|---|
| no-verifier | 0 / 5 | $0.0000 | — | — |
| relevance (LLM judge) | **4 / 5** | $0.0157 | ~$0.0031 | 254 |
| claim-grounding (text) | **5 / 5** | **$0.0000** | $0 | ∞ |

Per-topic (caught relevance / grounding): self-speculative decoding 1/1, rotary
position embeddings 1/1, grouped-query attention 1/1, **KV-cache quantization 0/1**,
LoRA 1/1. (An earlier 3-topic run missed self-speculative decoding instead — the
miss moves around; it is not a fixed topic.)

**Reading.** Claim-grounding catches every misattribution at **$0**; the relevance
judge catches most but **misses one in five at ~$0.003/topic**. The miss is the
point: the relevance verifier only ever sees the page text, never the cited claim,
so it is *structurally blind* to misattribution. It catches one only by accident —
when the fabricated claim happens to also make the page read off-topic
(e.g. a "12-billion-parameter draft transformer" claim on a rotary-embeddings page).
When the fabrication stays on-topic (the KV-cache case), the judge waves it through.

So on THIS band the verifier-per-dollar comparison inverts the cost/quality result:
there, the LLM verifier bought a dedup-shaped gain a free hash already captures —
expensive for what a cheap rule does. Here the cheap, deterministic check
**dominates** the expensive judge: it catches strictly more (5/5 vs 4/5) at strictly
less ($0 vs $0.0157). The verifier earns its dollar on misattribution; it does not on
de-duplication.

## Why this is a real correctable band (not dedup, not relevance)

- **Not de-duplication.** Every planted source has a unique URL and unique text; a
content-hash / canonical-URL dedup keeps all of them.
- **Not generic relevance.** Every planted source is on-topic; the relevance judge
(and the offline relevance stand-in) accept them. The error is in the *claim*, not
the *topic*.
- **Executable ground truth.** The check is presence/close-paraphrase of the claim
in the fetched text — deployable in production with no oracle and no model call.

The offline arm proves the floor with a controlled 4-source pool (2 grounded, 2
misattributed): claim-grounding admits **0/2** misattributions and keeps **2/2**
grounded sources, while relevance and no-verifier both admit **2/2** misattributions.

## Threats to validity

- **n=5 topics, 1 misattribution each.** The direction (grounding ≥ relevance caught,
at ≤ cost) is asserted in the test on every run; the magnitude is small-n. The
relevance miss-rate (1/5 here, 1/3 earlier) is an existence proof of the blind
spot, not a calibrated rate.
- **Planted misattributions, not naturally-occurring ones.** Like the cost/quality
offline floor, the misattribution is injected so the band is measurable. It models
the real LLM citation-fabrication failure but does not measure its base rate in the
wild — that needs a corpus of model-written citations checked by hand.
- **The grounding oracle is conservative.** A real paraphrase whose inflected words
differ from the page ("drafts" vs "draft") can score below 0.7 and be rejected —
a false-positive misattribution flag. `minOverlap` tunes this; the worker should
cite the page's own key terms (the `createClaimDecorator` extractor is told to).
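
The inflection problem in the last bullet is easy to reproduce. The sketch below compares exact-word overlap with overlap after a deliberately crude suffix stripper. The stemmer is an assumption added for illustration and is not something the verifier is documented to do; the document's own remedies are tuning `minOverlap` and having the worker cite the page's key terms. Stop-word filtering is omitted to keep the example short.

```typescript
const words = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Crude, assumed stemmer: strips a few English suffixes from longer words.
// It is not a Porter stemmer and will mis-stem many words.
const stem = (w: string): string =>
  w.length > 4 ? w.replace(/(ing|ed|es|s)$/, "") : w;

// Fraction of claim words found on the page, after mapping both through `key`.
function overlap(
  claim: string,
  pageText: string,
  key: (w: string) => string = (w) => w,
): number {
  const claimKeys = words(claim).map(key);
  if (claimKeys.length === 0) return 0;
  const pageKeys = new Set(words(pageText).map(key));
  return claimKeys.filter((w) => pageKeys.has(w)).length / claimKeys.length;
}

const pageText = "The small model drafts tokens and the large model verifies them";
const claim = "small models draft tokens, large models verify";

const exact = overlap(claim, pageText);         // inflected forms count as misses
const stemmed = overlap(claim, pageText, stem); // most inflections now match

console.log({ exact, stemmed, passesExact: exact >= 0.7, passesStemmed: stemmed >= 0.7 });
```

On this pair the exact-word score falls below 0.7 while the stemmed score clears it, which is the false-positive flag described above. Stemming also loosens the oracle, so it trades some false positives for a risk of false negatives.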

## Run it

```bash
# offline floor (no creds)
pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "offline"

# live A/B (creds-gated). A cheap glm-5.2 smoke runs BEFORE the multi-topic burn.
AGENT_KNOWLEDGE_LIVE=1 TANGLE_API_KEY=… \
CLAIM_GROUNDING_LIVE_GOALS='self-speculative decoding|rotary position embeddings|grouped-query attention|KV-cache quantization|LoRA' \
pnpm exec vitest run tests/loops/claim-grounding-ab.test.ts -t "three verifier arms"
```
27 changes: 27 additions & 0 deletions docs/results/cost-quality.md
@@ -0,0 +1,27 @@
# Cost/quality: the two-agent loop's inference premium

Live 9-topic A/B (glm-5.2, budget B ≤ 4 passes/arm), measured per arm with the
router-client instrumentation. The original A/B reported only admitted-sources at
"equal passes" — which charged the two-agent verify step as one pass while it is
actually N `verifySource` LLM calls. Pricing the calls shows what that hid.

| per topic (mean) | two-agent | single-agent | ratio |
|---|---|---|---|
| LLM chat calls | 5.4 | 1.0 | ~5.4× |
| tokens (in+out) | ~4,900 | ~530 | ~9× |
| cost (USD) | ~$0.0072 | ~$0.0013 | ~5.5× |
| latency (wall) | ~37 s | ~11 s | ~3.4× |
| cleanliness Δ (single − two admitted) | — | — | +1.56, 95% CI [0.33, 2.67] |

Per-topic Δ (single − two admitted) this run: self-speculative decoding +4,
grouped-query attention 0, rotary position embeddings +1, KV-cache quantization
**−1**, LoRA +4, ring attention +2, constitutional AI +3, transformer +2, gradient
descent **−1**. Coverage 1.00 every topic, both arms.
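
The mean in the table follows directly from these nine per-topic deltas. The document does not say how the 95% interval was computed; the reported bounds are multiples of 1/9, which would be consistent with a percentile bootstrap over the nine deltas, so the sketch below uses that method as an assumption. The seeded generator and resample count are arbitrary choices, and the interval it prints will vary slightly with them.

```typescript
// Per-topic deltas (single − two admitted), in the order listed above.
const deltas = [4, 0, 1, -1, 4, 2, 3, 2, -1];

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Small seeded PRNG (mulberry32) so the resampling is repeatable.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the mean (assumed method, see lead-in).
function bootstrapCi(xs: number[], resamples = 10000, seed = 1): [number, number] {
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let total = 0;
    for (let i = 0; i < xs.length; i++) {
      total += xs[Math.floor(rand() * xs.length)];
    }
    means.push(total / xs.length);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * resamples)];
  const hi = means[Math.ceil(0.975 * resamples) - 1];
  return [lo, hi];
}

const [ciLo, ciHi] = bootstrapCi(deltas);
console.log(`mean Δ = ${mean(deltas).toFixed(2)}, 95% CI ≈ [${ciLo.toFixed(2)}, ${ciHi.toFixed(2)}]`);
```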

**Reading.** The verifier buys ~1.6 fewer admitted junk sources per topic (95% CI
0.33 to 2.67) for roughly **5× the dollars, 9× the tokens, and 3× the latency**, and
on two topics it admitted *more* than the single agent (the cleanliness signal is
real but noisier than the +2.3 /
+2.7 of earlier runs). The cleanliness gain is dominated by de-duplication, so the
honest production move is a deterministic content-hash / canonical-URL dedup, which
captures most of the cleanliness at ~none of this premium; reserve an LLM check for
the off-scope tail. This is the cost half the "equal passes" framing left out.