docs: two-agent research-loop A/B — result + how to run#34

Merged
drewstone merged 2 commits into main from docs/two-agent-research-ab on Jun 24, 2026

@drewstone


A public-facing writeup of the two-agent research-loop A/B, so a Twitter/X post can link to it.

What the doc covers

  • What it is — a two-agent research loop that grows one knowledge base: a worker finds web sources for the open knowledge-gaps; a verifying driver (a second agent) judges each fetched source before it's saved (on-topic relevance + near-duplicate), then gates on a readiness spec. Compared against a single-agent loop that just accumulates. Points at the real code: runTwoAgentResearchLoop (src/two-agent-research-loop.ts), createWebResearchWorker + createVerifyingResearchDriver (src/web-research-worker.ts), and the A/B harness (tests/loops/research-loop-equal-compute.test.ts).
  • The result — real run, n=9 ML topics, equal compute, glm-5.2: the two-agent loop admits 2.33 fewer sources/topic at identical (1.00) coverage, 95% CI [1.78, 2.89] via agent-eval pairedBootstrap. Includes the per-topic delta table.
  • The honest nuance — the win is mostly de-duplication (same canonical paper mirrored across arxiv/openreview/neurips/blogs), which is band-independent and fires on every topic; off-scope rejection is real but the minority. A cheap content-hash/canonical-URL dedup would capture most of the value without an LLM verifier.
  • Threats to validity — verifier is also the judge (admitted-count is a proxy); deltas are conservative (single-agent stops early on readiness); n=2 clean offline controls is thin; glm-5.2-specific; high web run-to-run variance.
  • How to run — the offline A/B (deterministic, no creds) and the live sweep (AGENT_KNOWLEDGE_LIVE=1 + TANGLE_API_KEY + AGENT_KNOWLEDGE_LIVE_GOALS |-separated, ~$0.20 for the full 9-topic sweep).
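The "cheap content-hash/canonical-URL dedup" mentioned in the nuance bullet could look roughly like the sketch below. It is illustrative only: `canonicalUrlKey`, `contentHash`, `dedupeSources`, and the `FetchedSource` shape are hypothetical names, not code from this repository. An exact hash of normalized text only catches mirrors whose extracted text is identical, so cross-host mirrors with different page chrome would still need a fuzzier fingerprint.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of a fetched source; the repository's real type may differ.
export interface FetchedSource {
  url: string;
  text: string;
}

// Assumption: these query parameters never change the page content.
const TRACKING_PARAMS = new Set([
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
]);

// Reduce a URL to a comparison key: no scheme, no "www.", no fragment,
// no tracking parameters, no trailing slash, query parameters sorted by name.
export function canonicalUrlKey(raw: string): string {
  const url = new URL(raw);
  const host = url.hostname.replace(/^www\./, "");
  const path = url.pathname.replace(/\/+$/, "");
  const query = [...url.searchParams.entries()]
    .filter(([key]) => !TRACKING_PARAMS.has(key.toLowerCase()))
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, value]) => `${key}=${value}`)
    .join("&");
  return query ? `${host}${path}?${query}` : `${host}${path}`;
}

// Hash the text after lowercasing and collapsing whitespace.
export function contentHash(text: string): string {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Keep the first source seen for each canonical URL and each content hash.
export function dedupeSources(sources: FetchedSource[]): FetchedSource[] {
  const seen = new Set<string>();
  const kept: FetchedSource[] = [];
  for (const source of sources) {
    const keys = [
      `url:${canonicalUrlKey(source.url)}`,
      `hash:${contentHash(source.text)}`,
    ];
    if (keys.some((key) => seen.has(key))) continue;
    keys.forEach((key) => seen.add(key));
    kept.push(source);
  }
  return kept;
}
```

A filter like this runs without any model call, which is why the writeup treats it as the low-cost baseline to compare the LLM verifier against.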

Verification

  • pnpm run lint — clean (2 pre-existing warnings in wikilinks.ts, unrelated).
  • pnpm exec vitest run tests/loops/research-loop-equal-compute.test.ts: 2 passed, 1 skipped (the live arm). The documented offline command ran green: [A/B @ B<=6 passes] two-agent: passes=2 junk=0 coverage=1.00 | single-agent: passes=2 junk=2 coverage=1.00.
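For readers who want a mental model of what the offline A/B compares, here is a toy sketch of the two arms under an equal pass budget. It is not the harness in tests/loops/research-loop-equal-compute.test.ts and does not use the real runTwoAgentResearchLoop signature. Every name in it (`Candidate`, `Verifier`, `runLoop`) is hypothetical, and the candidate data is invented so that the printed line has the same shape as the log above.

```typescript
// Hypothetical candidate source; onTopic and canonicalId stand in for what a
// real verifier would have to infer from the fetched page.
interface Candidate {
  id: string;
  gap: string;
  onTopic: boolean;
  canonicalId: string;
}

type Verifier = (candidate: Candidate, admitted: Candidate[]) => boolean;

// Single-agent arm: accumulate everything the worker returns.
const acceptAll: Verifier = () => true;

// Two-agent arm: the driver rejects off-topic sources and near-duplicates.
const verifying: Verifier = (candidate, admitted) =>
  candidate.onTopic &&
  !admitted.some((a) => a.canonicalId === candidate.canonicalId);

interface LoopResult {
  passes: number;
  admitted: number;
  junk: number;
  coverage: number;
}

function runLoop(
  gaps: string[],
  batches: Candidate[][],
  verifier: Verifier,
  budget: number,
): LoopResult {
  const admitted: Candidate[] = [];
  let junk = 0;
  let passes = 0;
  const covered = () =>
    new Set(admitted.filter((a) => a.onTopic).map((a) => a.gap));
  for (const batch of batches.slice(0, budget)) {
    passes++;
    for (const candidate of batch) {
      if (!verifier(candidate, admitted)) continue;
      // Junk = an admitted source that is off-topic or duplicates an earlier one.
      const duplicate = admitted.some(
        (a) => a.canonicalId === candidate.canonicalId,
      );
      if (!candidate.onTopic || duplicate) junk++;
      admitted.push(candidate);
    }
    // Readiness gate (simplified): stop once every gap has an on-topic source.
    if (gaps.every((gap) => covered().has(gap))) break;
  }
  const coverage =
    gaps.filter((gap) => covered().has(gap)).length / gaps.length;
  return { passes, admitted: admitted.length, junk, coverage };
}

// Invented toy data: one mirrored paper and one off-topic page.
const gaps = ["a", "b"];
const batches: Candidate[][] = [
  [
    { id: "1", gap: "a", onTopic: true, canonicalId: "p1" },
    { id: "2", gap: "a", onTopic: true, canonicalId: "p1" },
  ],
  [
    { id: "3", gap: "b", onTopic: true, canonicalId: "p2" },
    { id: "4", gap: "b", onTopic: false, canonicalId: "x" },
  ],
];

const twoAgent = runLoop(gaps, batches, verifying, 6);
const singleAgent = runLoop(gaps, batches, acceptAll, 6);
const fmt = (name: string, r: LoopResult) =>
  `${name}: passes=${r.passes} junk=${r.junk} coverage=${r.coverage.toFixed(2)}`;
console.log(`${fmt("two-agent", twoAgent)} | ${fmt("single-agent", singleAgent)}`);
```

Both arms see the same batches and the same budget, so the only difference in the result is what each arm chose to admit.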

Links the doc from the README's two-agent research loop section. Docs-only + one README line; no code changes.

Public-facing writeup of the equal-compute A/B between the two-agent
research loop (worker + verifying driver) and a single-agent
accumulate-only loop. Records the real result (n=9 ML topics, glm-5.2:
2.33 fewer admitted sources/topic at identical coverage, 95% CI
[1.78, 2.89] via pairedBootstrap), the honest nuance (the win is mostly
de-duplication, not relevance filtering), threats to validity, and the
verified offline + live run commands. Links it from the README's
two-agent research loop section.
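
The confidence interval quoted above comes from agent-eval's pairedBootstrap. As a rough illustration of the general technique (a percentile bootstrap over per-topic paired differences), and not a description of how agent-eval implements it, a minimal sketch might look like this. The function name `pairedBootstrapCi`, its parameters, and the seeded generator are all assumptions made for the example.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export interface BootstrapResult {
  meanDelta: number;
  lower: number;
  upper: number;
}

// Hypothetical helper: percentile bootstrap CI for the mean of paired
// differences (single-agent minus two-agent, one pair per topic).
export function pairedBootstrapCi(
  single: number[],
  twoAgentCounts: number[],
  resamples = 2000,
  seed = 1,
  alpha = 0.05,
): BootstrapResult {
  if (single.length === 0 || single.length !== twoAgentCounts.length) {
    throw new Error("inputs must be non-empty and the same length");
  }
  const deltas = single.map((value, i) => value - twoAgentCounts[i]);
  const n = deltas.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample topics with replacement, keeping each pair together.
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += deltas[Math.floor(rand() * n)];
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lowerIdx = Math.floor((alpha / 2) * (resamples - 1));
  const upperIdx = Math.ceil((1 - alpha / 2) * (resamples - 1));
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;
  return { meanDelta, lower: means[lowerIdx], upper: means[upperIdx] };
}
```

Resampling whole topics rather than individual runs is what makes the comparison paired: topic-to-topic variation cancels out of each difference before the interval is computed.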
tangletools previously approved these changes Jun 24, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — c4ae34fe

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T22:50:00Z

@tangletools tangletools left a comment

✅ Auto-approved PR — 367bbe4f

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T23:02:55Z
