feat(research): real web-research worker + genuinely-live A/B arm #32

Merged: drewstone merged 1 commit into main from feat/real-web-research-worker on Jun 24, 2026

@drewstone

Contributor

What

Makes the live A/B arm of the two-agent research loop genuinely live. Previously, the describe.skipIf(!AGENT_KNOWLEDGE_LIVE) block in tests/loops/research-loop-equal-compute.test.ts was a skeleton: it ran the same offline naive proposer over a hardcoded chicken-coop pool with a junk/-prefix verifier, so AGENT_KNOWLEDGE_LIVE=1 proved nothing about real research. The repo had no committed real web-research worker.

New module — src/web-research-worker.ts (general, any-topic, no hardcoded corpus)

  • createWebResearchWorker — given the open knowledge gaps, glm-5.2 forms focused search queries → real web search over the Tangle router (POST /v1/search, the same endpoint tcloud mcp's web_search tool forwards to) → fetches each top hit with the repo's politeFetch → reduces with htmlToText → proposes citing knowledge/*.md pages via buildPages. Conforms to the loop's ResearchWorker contract.
  • createVerifyingResearchDriver: the differentiated driver role. A glm-5.2 verifySource pass judges each fetched source for on-topic relevance to the goal and open gaps, and for near-duplication against the round. It fails closed (rejects) on router or parse failure, so an unverified source never poisons the KB.
  • createTangleRouterClient — dependency-free router client over fetch (search + chat), so it works with or without the tcloud CLI installed. glm-5.2 calls get max_tokens >= 1200 so visible content isn't starved by hidden reasoning_content.
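
The fail-closed rule in the verifying driver can be sketched as a pure parsing step. This is a minimal illustration, not the code in src/web-research-worker.ts: the `Verdict` type, the `onTopic` / `nearDuplicate` field names, and `parseVerdictFailClosed` are all hypothetical, and it assumes the verifier model is asked to reply with a small JSON object.

```typescript
// Hypothetical verdict shape; the real verifySource types are not shown in this PR.
type Verdict = { admit: boolean; reason: string };

// Turn a raw verifier reply into a verdict. Any failure (no reply, invalid
// JSON, missing or mistyped fields) rejects the source: fail-closed.
function parseVerdictFailClosed(raw: string | null): Verdict {
  if (raw === null || raw.trim() === "") {
    return { admit: false, reason: "empty verifier reply" };
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) {
      return { admit: false, reason: "verdict is not an object" };
    }
    const v = parsed as Record<string, unknown>;
    if (typeof v.onTopic !== "boolean" || typeof v.nearDuplicate !== "boolean") {
      return { admit: false, reason: "verdict fields missing or mistyped" };
    }
    // Admit only when the source is on-topic and not a near-duplicate.
    if (!v.onTopic) return { admit: false, reason: "off-topic" };
    if (v.nearDuplicate) return { admit: false, reason: "near-duplicate" };
    return { admit: true, reason: "on-topic and novel" };
  } catch {
    // JSON.parse threw: treat the source as unverified and reject it.
    return { admit: false, reason: "unparseable verifier reply" };
  }
}
```

A router error would be handled the same way by the caller: pass `null` (or catch the error and reject) rather than admitting the source by default.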

Live-arm wiring

The live arm injects the real worker, the real verifier, and topic-relevant readiness specs. It runs both arms at an equal agent-pass budget, cost-gates the run with a cheap glm-5.2 smoke test first, and asserts that the worker actually web-searched (zero sources fails loudly, because it would otherwise pass as a false null). It reports admitted-source count and coverage per arm with agent-eval's pairedBootstrap. The offline arm is unchanged (CI, $0, deterministic); the arm-runners only gain defaulted parameters.
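
The false-null guard can be sketched as below. The `ArmResult` shape and `assertWorkerSearched` are hypothetical names for illustration; the test file's actual assertion is not shown in this PR.

```typescript
// Hypothetical per-arm summary; field names are illustrative only.
type ArmResult = {
  arm: string;
  sourcesFetched: number; // pages the worker actually retrieved from the web
  admitted: number;       // sources that ended up in the KB
  coverage: number;       // fraction of open gaps covered, 0..1
};

// Throw when an arm fetched nothing. A run with zero sources would report
// admitted=0 for both arms and look like "no difference", which is a false null.
function assertWorkerSearched(result: ArmResult): ArmResult {
  if (result.sourcesFetched === 0) {
    throw new Error(
      `${result.arm}: worker fetched zero sources; the run is invalid, not a null result`,
    );
  }
  return result;
}
```

Throwing here, rather than skipping, keeps a silent search outage from being reported as a clean A/B outcome.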

For real research there is no planted-junk oracle, so the live cleanliness signal is admitted-source COUNT: the verifying driver rejects off-topic fetches, so the two-agent KB admits FEWER sources at equal-or-higher coverage — the real-world analogue of the offline junk count.

Verification (all run)

  • pnpm run lint / typecheck / build / offline pnpm test: all green (131 passed, 7 creds-gated skips; offline arm byte-identical: two-agent junk=0 coverage=1.00 | single-agent junk=2 coverage=1.00).
  • Live run (AGENT_KNOWLEDGE_LIVE=1, glm-5.2, goal "self-speculative decoding", B≤4): the worker really web-searched (Google Research / NVIDIA / BentoML pages on speculative decoding) and the A/B reported:
    • two-agent (real worker + LLM verifier): admitted=2, coverage=1.00
    • single-agent (real worker, no verifier): admitted=3, coverage=1.00
    • paired delta (single−two admitted) = 1 → the verifying driver kept the KB cleaner at equal coverage and equal compute.
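
For reference, a paired bootstrap over per-trial admitted-source deltas can be sketched as follows. This is a generic percentile bootstrap written for illustration; it is not agent-eval's pairedBootstrap, whose signature is not shown here, and `pairedBootstrapSketch`, `mulberry32`, and the default resample count are assumptions.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type BootstrapSummary = { meanDelta: number; lo: number; hi: number };

// Percentile bootstrap of the mean paired difference (single - two).
// Each index i is one paired trial: both arms run on the same goal and budget.
function pairedBootstrapSketch(
  single: number[],
  two: number[],
  resamples = 2000,
  seed = 1,
): BootstrapSummary {
  if (single.length !== two.length || single.length === 0) {
    throw new Error("arms must have the same non-zero number of paired trials");
  }
  const deltas = single.map((s, i) => s - two[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample paired deltas with replacement and record the mean.
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const meanDelta = deltas.reduce((x, y) => x + y, 0) / deltas.length;
  return { meanDelta, lo: at(0.025), hi: at(0.975) };
}

// The single live run above: single-agent admitted 3, two-agent admitted 2.
const liveRun = pairedBootstrapSketch([3], [2]);
```

With one paired trial every resample equals that trial, so `liveRun` has `meanDelta`, `lo`, and `hi` all equal to 1; the interval only becomes informative with more paired trials.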

DO NOT MERGE — review first.

Replace the live A/B arm's skeleton (which ran the same offline naive
proposer over a hardcoded pool with a junk/-prefix verifier) with a real,
any-topic web-research worker and a real LLM verifying driver, so
AGENT_KNOWLEDGE_LIVE=1 runs a genuine experiment.

src/web-research-worker.ts (general, no hardcoded corpus):
- createWebResearchWorker — glm-5.2 turns open gaps into search queries,
  runs real web search over the router (/v1/search, the endpoint tcloud
  mcp's web_search forwards to), fetches with politeFetch, reduces with
  htmlToText, proposes citing pages via buildPages. Conforms to the loop's
  ResearchWorker contract.
- createVerifyingResearchDriver — a glm-5.2 verifySource pass judging each
  fetched source's on-topic relevance + near-duplicate against the round,
  fail-closed on parse/router failure.
- createTangleRouterClient — dependency-free router client over fetch
  (search + chat), so it works with or without the tcloud CLI installed.
  Reasoning-model floor: glm-5.2 calls get max_tokens >= 1200 so visible
  content isn't starved by reasoning_content.
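
A minimal sketch of the reasoning-model floor, assuming an OpenAI-style chat request body. `ChatRequest`, `isReasoningModel`, and `withReasoningFloor` are hypothetical names, and the model-prefix check is an assumption; only the 1200-token floor comes from this commit message.

```typescript
const REASONING_MIN_TOKENS = 1200; // floor stated in this commit message

// Hypothetical request shape; the router's real schema is not shown here.
type ChatRequest = {
  model: string;
  max_tokens?: number;
  messages: { role: string; content: string }[];
};

// Assumption: reasoning models are identified by a model-name prefix.
function isReasoningModel(model: string): boolean {
  return model.startsWith("glm-5.2");
}

// Raise max_tokens to the floor for reasoning models so that hidden
// reasoning_content cannot consume the whole budget before visible content.
function withReasoningFloor(req: ChatRequest): ChatRequest {
  if (!isReasoningModel(req.model)) return req;
  return { ...req, max_tokens: Math.max(req.max_tokens ?? 0, REASONING_MIN_TOKENS) };
}
```

The helper only ever raises the limit: a caller that already asks for more than 1200 tokens keeps its own value.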

Live arm: injects the real worker + verifier + topic-relevant readiness
specs, runs both arms at equal agent-pass budget, cost-gates with a cheap
glm-5.2 smoke first, asserts the worker actually web-searched, and reports
admitted-source count + coverage per arm with agent-eval's pairedBootstrap.
The offline arm is unchanged (CI, $0, deterministic) — the arm-runners gain
defaulted parameters only.

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 113f6efc

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T22:16:38Z

@drewstone drewstone merged commit 99e3bc8 into main Jun 24, 2026
1 check passed


2 participants