diff --git a/README.md b/README.md
index 82ae32e..039c3ce 100644
--- a/README.md
+++ b/README.md
@@ -271,6 +271,12 @@ with its own pass (`driverResearches: true`), and gates on the readiness check.
 Both are yours (no creds) — the loop owns the deterministic mechanics (indexing,
 applying write blocks, scoring readiness) and stops once no blocking gap remains.
 
+Does the verifying driver actually earn its keep? See
+[docs/two-agent-research-ab.md](docs/two-agent-research-ab.md) for an equal-compute
+A/B (9 ML topics, `glm-5.2`): the two-agent loop admits ~2.33 fewer sources per topic
+at identical coverage, though most of that win is de-duplication, not relevance
+filtering. Honest caveats and reproduction steps are included.
+
 ```ts
 import {
   defineReadinessSpec,
diff --git a/docs/two-agent-research-ab.md b/docs/two-agent-research-ab.md
new file mode 100644
index 0000000..47e8bf4
--- /dev/null
+++ b/docs/two-agent-research-ab.md
@@ -0,0 +1,151 @@
+# Two-agent research loop: does a verifying second agent build a cleaner knowledge base?
+
+A small, honest A/B. We ran two research loops side by side at **equal compute** and
+measured what each one wrote into the knowledge base. The two-agent loop admitted
+**2.33 fewer sources per topic at identical coverage**, but most of that win is
+cheap de-duplication, not the LLM verifier earning its keep. The numbers and the
+caveats are below.
+
+## What it is
+
+A **two-agent research loop** that grows one knowledge base:
+
+- A **worker** turns each open knowledge-gap into web search queries, runs a real
+  web search, fetches the top results, and proposes them as sources. It only *adds*.
+- A **verifying driver** (a *second* agent) judges each fetched source **before**
+  it is saved: is it on-topic for the goal, and is it a near-duplicate of something
+  already accepted this round? It rejects the rest. It only *gates*.
+- Both loops stop on the same **readiness spec** (a checklist of what the knowledge
+  base must cover). The driver never gets free compute: every verify pass is
+  charged against the same budget the single-agent loop is free to spend on extra
+  fetches.
+
+We compare this against a **single-agent loop** that does the same research but just
+accumulates every source it finds, with no second agent gating what gets saved.
+
+The real code:
+
+- The loop: [`runTwoAgentResearchLoop`](../src/two-agent-research-loop.ts)
+  (`src/two-agent-research-loop.ts`)
+- The worker + verifying driver: [`createWebResearchWorker` and
+  `createVerifyingResearchDriver`](../src/web-research-worker.ts)
+  (`src/web-research-worker.ts`)
+- The A/B harness: [`tests/loops/research-loop-equal-compute.test.ts`](../tests/loops/research-loop-equal-compute.test.ts)
+
+## The result
+
+Real run: **n = 9 ML topics**, equal compute, `glm-5.2` as both worker and verifier.
+
+The metric is **admitted sources per topic** at equal coverage. Fewer admitted
+sources at the *same* coverage = a cleaner knowledge base (the same facts, less
+redundant clutter). Coverage was a perfect **1.00 for both arms on every topic**, so
+the comparison is apples-to-apples: same coverage, fewer sources kept.
+
+**The two-agent loop admitted 2.33 fewer sources per topic.**
+95% confidence interval **[1.78, 2.89]** (paired bootstrap via agent-eval's
+`pairedBootstrap`). The interval lies well above zero, so the gap is unlikely to be noise.
+**Reproduced on an independent re-run** of the same 9 topics: 2.67 fewer
+sources per topic, 95% CI **[2.22, 3.00]**. Both intervals exclude zero, so the effect
+is robust, though the exact magnitude varies run-to-run with what the web returns.
+
+Per-topic delta (single-agent admitted − two-agent admitted):
+
+| Topic | Fewer sources with the verifier (Δ) |
+| --- | --- |
+| Self-speculative decoding | 3 |
+| Grouped-query attention | 3 |
+| Constitutional AI | 3 |
+| Transformer architecture | 3 |
+| Gradient descent | 3 |
+| Rotary position embeddings | 2 |
+| Ring attention | 2 |
+| KV-cache quantization | 1 |
+| LoRA | 1 |
+
+Mean Δ = 2.33. Every topic moved in the same direction.
+
+## The honest nuance
+
+**The win is mostly de-duplication, not relevance filtering.**
+
+When you web-search an ML topic, the same canonical paper comes back mirrored across
+arxiv, OpenReview, the NeurIPS proceedings, a lab's blog, and so on. The verifier's
+biggest job, in practice, is spotting *"this is the same paper I already accepted"*,
+a near-duplicate rejection that fires on **every** topic regardless of how hard the
+topic is. Genuine off-topic rejection (spam, a marketing page, a tangential result)
+is real, but it's the minority of what the verifier catches.
+
+The blunt implication: **a cheap content-hash or canonical-URL dedup would capture
+most of this value without an LLM verifier at all.** The LLM verifier earns its keep
+only on the off-topic minority: the cases a hash can't see. If you're deciding
+whether to pay for a second agent, that's the honest trade: you're mostly paying for
+de-dup you could do for free, plus a smaller slice of judgement you can't.
+
+## Threats to validity
+
+Read these before quoting the headline number.
+
+- **The verifier is also the judge.** Admitted-source count is a *proxy* for
+  cleanliness. There is no independent oracle saying which sources a knowledge base
+  *should* have kept, so "fewer admitted at equal coverage" is the best signal we
+  have, not ground truth.
+- **The deltas are conservative.** The single-agent loop stops early when readiness
+  is met, so it often doesn't even spend its full budget, meaning the real gap under
+  a fixed budget could be larger than what we measured here.
+- **n = 2 clean controls is thin.** The offline, deterministic version of this A/B
+  (the one that runs with no credentials) uses a planted source pool with just two
+  clean control sources. It proves the harness wiring and a controlled lower bound,
+  not real-world magnitude. The live 9-topic sweep is the real evidence.
+- **It's `glm-5.2`-specific.** Both the worker and the verifier are `glm-5.2`. A
+  different model could be a sharper or a sloppier judge.
+- **Web research is high-variance.** Run-to-run, the same topic can swing: one topic
+  produced Δ of 0, then 1, then 3 across repeats, because the live web returns
+  different result sets each time. The interval above accounts for between-topic
+  spread, not this within-topic search noise.
+
+## How to run
+
+### Offline A/B (deterministic, no credentials)
+
+This proves the harness and the controlled lower bound: no network, no keys. It runs
+both loops on a planted source pool (clean sources plus planted spam) and asserts the
+verifying driver admits no more junk than the single-agent loop at equal compute.
+
+```bash
+pnpm install
+pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts
+```
+
+The test logs the A/B line, e.g.:
+
+```
+[A/B @ B<=6 passes] two-agent: passes=2 junk=0 coverage=1.00 | single-agent: passes=2 junk=2 coverage=1.00
+```
+
+Same coverage, but the two-agent loop keeps the junk out. The live arm in the same file
+is skipped offline.
+
+### Live sweep (the real evidence)
+
+The live arm runs both loops on real topics with the real web-research worker
+(`glm-5.2` query-gen → live web search → fetch → readable text) and a real `glm-5.2`
+verifying driver, then reports the paired comparison via `pairedBootstrap`. It's gated
+on credentials so it never runs by accident.
+
+Set the credentials (a Tangle router key with `glm-5.2` credits) and the topic list, then run the same test file:
+
+```bash
+AGENT_KNOWLEDGE_LIVE=1 \
+TANGLE_API_KEY=sk-... \
+AGENT_KNOWLEDGE_LIVE_GOALS="self-speculative decoding|grouped-query attention|rotary position embeddings|KV-cache quantization|LoRA|ring attention|constitutional AI|transformer architecture|gradient descent" \
+pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts
+```
+
+Knobs (all optional):
+
+- `AGENT_KNOWLEDGE_LIVE_GOALS`: `|`-separated topics. The live arm already supports a
+  list and runs the `pairedBootstrap` across them. Default: a single topic.
+- `AGENT_KNOWLEDGE_LIVE_BUDGET`: agent-pass ceiling per arm (default `4`).
+- `AGENT_KNOWLEDGE_LIVE_MODEL`: router chat model (default `glm-5.2`).
+
+The full 9-topic sweep above costs roughly **$0.20** in router spend.
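To sanity-check the headline interval, the nine per-topic deltas from the table in `docs/two-agent-research-ab.md` can be resampled directly. The sketch below is a plain percentile bootstrap over those deltas with a seeded PRNG. It is not agent-eval's `pairedBootstrap`, whose implementation is not part of this change, and the names `mulberry32` and `bootstrapMeanCI` are illustrative assumptions, so the bounds it prints may differ slightly from the reported ones.

```ts
// Illustrative percentile bootstrap; NOT agent-eval's `pairedBootstrap`.
// All function names below are hypothetical.

// Per-topic deltas (single-agent admitted minus two-agent admitted) from the table.
const deltas: number[] = [3, 3, 3, 3, 3, 2, 2, 1, 1];

// Small seeded PRNG (mulberry32) so repeated runs give the same interval.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

interface BootstrapResult {
  mean: number;
  lower: number;
  upper: number;
}

// Resample the deltas with replacement, then take the alpha/2 and
// 1 - alpha/2 percentiles of the resampled means.
function bootstrapMeanCI(
  xs: number[],
  resamples = 10_000,
  alpha = 0.05,
  seed = 42,
): BootstrapResult {
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < xs.length; i++) {
      sum += xs[Math.floor(rand() * xs.length)];
    }
    means.push(sum / xs.length);
  }
  means.sort((a, b) => a - b);
  const lowerIdx = Math.floor((alpha / 2) * resamples);
  const upperIdx = Math.min(resamples - 1, Math.ceil((1 - alpha / 2) * resamples) - 1);
  return { mean: mean(xs), lower: means[lowerIdx], upper: means[upperIdx] };
}

const result = bootstrapMeanCI(deltas);
console.log(
  `mean=${result.mean.toFixed(2)} 95% CI=[${result.lower.toFixed(2)}, ${result.upper.toFixed(2)}]`,
);
```

Because this resamples topics only, it reflects between-topic spread and says nothing about the within-topic search noise listed under threats to validity.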