Merged
6 changes: 6 additions & 0 deletions README.md
@@ -271,6 +271,12 @@ with its own pass (`driverResearches: true`), and gates on the readiness check.
Both are yours (no creds) — the loop owns the deterministic mechanics (indexing,
applying write blocks, scoring readiness) and stops once no blocking gap remains.

Does the verifying driver actually earn its keep? See
[docs/two-agent-research-ab.md](docs/two-agent-research-ab.md) for an equal-compute
A/B (9 ML topics, `glm-5.2`): the two-agent loop admits ~2.33 fewer sources per topic
at identical coverage — though most of that win is de-duplication, not relevance
filtering. Honest caveats and how to reproduce included.

```ts
import {
defineReadinessSpec,
151 changes: 151 additions & 0 deletions docs/two-agent-research-ab.md
@@ -0,0 +1,151 @@
# Two-agent research loop: does a verifying second agent build a cleaner knowledge base?

A small, honest A/B. We ran two research loops side by side at **equal compute** and
measured what each one wrote into the knowledge base. The two-agent loop admitted
**2.33 fewer sources per topic at identical coverage** — but most of that win is
cheap de-duplication, not the LLM verifier earning its keep. The numbers and the
caveats are below.

## What it is

A **two-agent research loop** that grows one knowledge base:

- A **worker** turns each open knowledge-gap into web search queries, runs a real
web search, fetches the top results, and proposes them as sources. It only *adds*.
- A **verifying driver** — a *second* agent — judges each fetched source **before**
it is saved: is it on-topic for the goal, and is it a near-duplicate of something
already accepted this round? It rejects the rest. It only *gates*.
- Both loops stop on the same **readiness spec** (a checklist of what the knowledge
base must cover). The driver never gets free compute — every verify pass is
charged against the same budget the single-agent loop is free to spend on extra
fetches.

We compare this against a **single-agent loop** that does the same research but just
accumulates every source it finds, with no second agent gating what gets saved.
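
The gating and the shared budget described above can be sketched as follows. This is a minimal, synchronous illustration, not the code in `src/two-agent-research-loop.ts`: `runLoop`, `Source`, `Verdict`, the toy source pool, and the rule that a verify pass costs one budget unit are all assumptions made for the example.

```ts
// Hypothetical types and names for illustration; not the repository's API.
interface Source {
  url: string;
  text: string;
}

type Verdict =
  | { admit: true }
  | { admit: false; reason: "off-topic" | "near-duplicate" };

interface LoopOptions {
  budget: number; // pass ceiling shared by worker and verifier
  research: (round: number) => Source[]; // worker: proposes sources, only adds
  verify?: (candidate: Source, accepted: readonly Source[]) => Verdict; // driver: only gates
  isReady: (kb: readonly Source[]) => boolean; // stand-in for the readiness spec
}

function runLoop(opts: LoopOptions): { kb: Source[]; passesUsed: number } {
  const kb: Source[] = [];
  let passesUsed = 0;
  let round = 0;
  // Assumption: a research pass costs 1 unit, and a verify pass costs 1 more.
  const costPerRound = opts.verify ? 2 : 1;

  while (!opts.isReady(kb) && passesUsed + costPerRound <= opts.budget) {
    const proposed = opts.research(round);
    round += 1;
    passesUsed += costPerRound;
    for (const candidate of proposed) {
      // Without a verifier (the single-agent arm) every proposal is saved.
      const verdict: Verdict = opts.verify ? opts.verify(candidate, kb) : { admit: true };
      if (verdict.admit) kb.push(candidate);
    }
  }
  return { kb, passesUsed };
}

// Toy pool: one paper, a mirror of it, a spam page, then a second paper.
const pool: Source[][] = [
  [
    { url: "https://example.org/paper-a", text: "grouped-query attention" },
    { url: "https://mirror.example.org/paper-a", text: "grouped-query attention" },
    { url: "https://example.org/ad", text: "buy GPUs now" },
  ],
  [{ url: "https://example.org/paper-b", text: "multi-query attention" }],
];
const research = (round: number): Source[] => pool[round] ?? [];
const isReady = (kb: readonly Source[]): boolean =>
  ["paper-a", "paper-b"].every((id) => kb.some((s) => s.url.endsWith(id)));
const verify = (candidate: Source, accepted: readonly Source[]): Verdict => {
  if (!candidate.text.includes("attention")) return { admit: false, reason: "off-topic" };
  if (accepted.some((s) => s.text === candidate.text)) {
    return { admit: false, reason: "near-duplicate" };
  }
  return { admit: true };
};

const single = runLoop({ budget: 6, research, isReady });
const twoAgent = runLoop({ budget: 6, research, isReady, verify });
console.log(single.kb.length, twoAgent.kb.length); // 4 2
```

Both arms reach readiness inside the same ceiling. The single-agent arm keeps the mirror and the spam page, while the gated arm spends part of its budget on verification and keeps only the two papers.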

The real code:

- The loop — [`runTwoAgentResearchLoop`](../src/two-agent-research-loop.ts)
(`src/two-agent-research-loop.ts`)
- The worker + verifying driver — [`createWebResearchWorker` and
`createVerifyingResearchDriver`](../src/web-research-worker.ts)
(`src/web-research-worker.ts`)
- The A/B harness — [`tests/loops/research-loop-equal-compute.test.ts`](../tests/loops/research-loop-equal-compute.test.ts)

## The result

Real run: **n = 9 ML topics**, equal compute, `glm-5.2` as both worker and verifier.

The metric is **admitted sources per topic** at equal coverage. Fewer admitted
sources at the *same* coverage = a cleaner knowledge base (the same facts, less
redundant clutter). Coverage was a perfect **1.00 for both arms on every topic**, so
the comparison is apples-to-apples: same coverage, fewer sources kept.

**The two-agent loop admitted 2.33 fewer sources per topic.**
95% confidence interval **[1.78, 2.89]** (paired bootstrap via agent-eval's
`pairedBootstrap`). The interval sits well above zero, so the gap is unlikely to be
between-topic noise.
**Reproduced on an independent re-run** of the same 9 topics: 2.67 fewer
sources per topic, 95% CI **[2.22, 3.00]**. Both intervals exclude zero, so the
direction of the effect is robust, though the exact magnitude varies run-to-run with
what the web returns.

Per-topic delta (single-agent admitted − two-agent admitted):

| Topic | Fewer sources with the verifier (Δ) |
| --- | --- |
| Self-speculative decoding | 3 |
| Grouped-query attention | 3 |
| Constitutional AI | 3 |
| Transformer architecture | 3 |
| Gradient descent | 3 |
| Rotary position embeddings | 2 |
| Ring attention | 2 |
| KV-cache quantization | 1 |
| LoRA | 1 |

Mean Δ = 2.33. Every topic moved in the same direction.
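
To make the interval concrete, here is a minimal percentile-bootstrap sketch over the nine per-topic deltas from the table. It is not agent-eval's `pairedBootstrap`, whose signature is not shown here. `bootstrapMeanCI` and the seeded `mulberry32` generator are stand-ins chosen for the example, so the bounds it prints should land near the reported interval but will not match it digit for digit. Resampling whole deltas keeps each topic's single-agent and two-agent counts together, which is what makes the bootstrap paired.

```ts
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: readonly number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Hypothetical helper: percentile bootstrap of the mean per-topic delta.
function bootstrapMeanCI(
  deltas: readonly number[],
  resamples = 10000,
  seed = 1,
): { mean: number; lo: number; hi: number } {
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample topics with replacement and record the resampled mean.
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))];
  return { mean: mean(deltas), lo: at(0.025), hi: at(0.975) };
}

// Per-topic deltas (single-agent admitted minus two-agent admitted), from the table.
const deltas = [3, 3, 3, 3, 3, 2, 2, 1, 1];
const ci = bootstrapMeanCI(deltas);
console.log(ci.mean.toFixed(2)); // 2.33
console.log(`95% CI: [${ci.lo.toFixed(2)}, ${ci.hi.toFixed(2)}]`);
```

Because every delta is at least 1, no resample can produce a mean at or below zero. That is why the interval excludes zero here, and it says nothing about within-topic search noise.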

## The honest nuance

**The win is mostly de-duplication, not relevance filtering.**

When you web-search an ML topic, the same canonical paper comes back mirrored across
arXiv, OpenReview, the NeurIPS proceedings, a lab's blog, and so on. The verifier's
biggest job, in practice, is spotting *"this is the same paper I already accepted"* —
a near-duplicate rejection that fires on **every** topic regardless of how hard the
topic is. Genuine off-topic rejection (spam, a marketing page, a tangential result)
is real, but it's the minority of what the verifier catches.

The blunt implication: **a cheap content-hash or canonical-URL dedup would capture
most of this value without an LLM verifier at all.** The LLM verifier earns its keep
only on the off-topic minority — the cases a hash can't see. If you're deciding
whether to pay for a second agent, that's the honest trade: you're mostly paying for
de-dup you could do for free, plus a smaller slice of judgement you can't.
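
As a rough illustration of the cheap alternative, the sketch below de-duplicates by canonical URL. `canonicalKey` and `dedupeByCanonicalUrl` are hypothetical helpers, not part of this repository, and the normalization rules (arXiv ID extraction, stripping `utm_*` parameters, fragments, and trailing slashes) are assumptions picked for the example. URL canonicalization alone cannot tell that an OpenReview page and an arXiv page carry the same paper; catching mirrors across hosts would need a content hash or title match on the fetched text.

```ts
// Hypothetical helpers for illustration; not part of the repository.
function canonicalKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  const host = url.hostname.toLowerCase().replace(/^www\./, "");

  // arXiv serves one paper under /abs/<id>, /abs/<id>v<N>, and /pdf/<id>[.pdf].
  if (host === "arxiv.org") {
    const m = url.pathname.match(/^\/(?:abs|pdf)\/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$/);
    if (m) return `arxiv:${m[1]}`;
  }

  // Elsewhere: drop the fragment, utm_* tracking parameters, and trailing slashes.
  const params: string[] = [];
  url.searchParams.forEach((value, key) => {
    if (!key.toLowerCase().startsWith("utm_")) params.push(`${key}=${value}`);
  });
  params.sort();
  const path = url.pathname.replace(/\/+$/, "");
  return host + path + (params.length > 0 ? `?${params.join("&")}` : "");
}

// Keeps the first URL seen for each canonical key, preserving input order.
function dedupeByCanonicalUrl(urls: readonly string[]): string[] {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const u of urls) {
    const key = canonicalKey(u);
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(u);
  }
  return kept;
}

const fetched = [
  "https://arxiv.org/abs/1706.03762",
  "https://arxiv.org/pdf/1706.03762v5.pdf",
  "http://www.arxiv.org/abs/1706.03762v7?utm_source=x",
  "https://example.com/blog/attention/",
  "https://example.com/blog/attention?utm_source=feed#intro",
];
console.log(dedupeByCanonicalUrl(fetched).length); // 2
```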

## Threats to validity

Read these before quoting the headline number.

- **The verifier is also the judge.** Admitted-source count is a *proxy* for
cleanliness. There is no independent oracle saying which sources a knowledge base
*should* have kept — so "fewer admitted at equal coverage" is the best signal we
have, not ground truth.
- **The deltas are conservative.** The single-agent loop stops early when readiness
is met, so it often doesn't even spend its full budget — meaning the real gap under
a fixed budget could be larger than what we measured here.
- **n = 2 clean controls is thin.** The offline, deterministic version of this A/B
(the one that runs with no credentials) uses a planted source pool with just two
clean control sources. It proves the harness wiring and a controlled lower bound,
not real-world magnitude. The live 9-topic sweep is the real evidence.
- **It's `glm-5.2`-specific.** Both the worker and the verifier are `glm-5.2`. A
different model could be a sharper or a sloppier judge.
- **Web research is high-variance.** Run-to-run, the same topic can swing — one topic
produced Δ of 0, then 1, then 3 across repeats, because the live web returns
different result sets each time. The interval above accounts for between-topic
spread, not this within-topic search noise.

## How to run

### Offline A/B (deterministic, no credentials)

This proves the harness and the controlled lower bound — no network, no keys. It runs
both loops on a planted source pool (clean sources plus planted spam) and asserts the
verifying driver admits no more junk than the single-agent loop at equal compute.

```bash
pnpm install
pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts
```

The test logs the A/B line, e.g.:

```
[A/B @ B<=6 passes] two-agent: passes=2 junk=0 coverage=1.00 | single-agent: passes=2 junk=2 coverage=1.00
```

Coverage is the same in both arms, and the two-agent loop keeps the junk out. The live
arm in the same file is skipped offline.

### Live sweep (the real evidence)

The live arm runs both loops on real topics with the real web-research worker
(`glm-5.2` query-gen → live web search → fetch → readable text) and a real `glm-5.2`
verifying driver, then reports the paired comparison via `pairedBootstrap`. It's gated
on credentials so it never runs by accident.

Set the credentials and the topic list, then run the same test file:

```bash
# TANGLE_API_KEY must be a Tangle router key with glm-5.2 credits.
AGENT_KNOWLEDGE_LIVE=1 \
TANGLE_API_KEY=sk-... \
AGENT_KNOWLEDGE_LIVE_GOALS="self-speculative decoding|grouped-query attention|rotary position embeddings|KV-cache quantization|LoRA|ring attention|constitutional AI|transformer architecture|gradient descent" \
pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts
```

Knobs (all optional):

- `AGENT_KNOWLEDGE_LIVE_GOALS` — `|`-separated topics. The live arm already supports a
list and runs the `pairedBootstrap` across them. Default: a single topic.
- `AGENT_KNOWLEDGE_LIVE_BUDGET` — agent-pass ceiling per arm (default `4`).
- `AGENT_KNOWLEDGE_LIVE_MODEL` — router chat model (default `glm-5.2`).
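
For readers wiring these knobs into their own harness, the sketch below shows one way to read them. `readLiveKnobs` is a hypothetical helper; the test file may parse the variables differently, and the rule that the live arm needs both the opt-in flag and a key is an assumption. The defaults for budget and model mirror the list above, and the default topic is left to the caller because it is not named here.

```ts
// Hypothetical reader for the knobs above; not the repository's parsing code.
type Env = Record<string, string | undefined>;

interface LiveKnobs {
  live: boolean;
  goals: string[];
  budget: number;
  model: string;
}

function readLiveKnobs(env: Env): LiveKnobs {
  // Topics are "|"-separated; blank entries are dropped.
  const goals = (env.AGENT_KNOWLEDGE_LIVE_GOALS ?? "")
    .split("|")
    .map((g) => g.trim())
    .filter((g) => g.length > 0);
  const budget = Number.parseInt(env.AGENT_KNOWLEDGE_LIVE_BUDGET ?? "4", 10);
  return {
    // Assumption: the live arm needs both the opt-in flag and a router key.
    live: env.AGENT_KNOWLEDGE_LIVE === "1" && Boolean(env.TANGLE_API_KEY),
    goals, // empty means the caller falls back to its single default topic
    budget: Number.isFinite(budget) && budget > 0 ? budget : 4,
    model: env.AGENT_KNOWLEDGE_LIVE_MODEL ?? "glm-5.2",
  };
}

const knobs = readLiveKnobs({
  AGENT_KNOWLEDGE_LIVE: "1",
  TANGLE_API_KEY: "sk-placeholder",
  AGENT_KNOWLEDGE_LIVE_GOALS: "LoRA|ring attention|gradient descent",
});
console.log(knobs.goals.length, knobs.budget, knobs.model); // 3 4 glm-5.2
```

In a Node test file the same function would be called as `readLiveKnobs(process.env)`.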

The full 9-topic sweep above costs roughly **$0.20** in router spend.