feat(research): real web-research worker + genuinely-live A/B arm #32
Merged
Replace the live A/B arm's skeleton (which ran the same offline naive proposer over a hardcoded pool with a junk/-prefix verifier) with a real, any-topic web-research worker and a real LLM verifying driver, so AGENT_KNOWLEDGE_LIVE=1 runs a genuine experiment.

src/web-research-worker.ts (general, no hardcoded corpus):
- createWebResearchWorker: glm-5.2 turns open gaps into search queries, runs real web search over the router (/v1/search, the endpoint tcloud mcp's web_search forwards to), fetches with politeFetch, reduces with htmlToText, proposes citing pages via buildPages. Conforms to the loop's ResearchWorker contract.
- createVerifyingResearchDriver: a glm-5.2 verifySource pass judging each fetched source's on-topic relevance + near-duplicate against the round, fail-closed on parse/router failure.
- createTangleRouterClient: dependency-free router client over fetch (search + chat), so it works with or without the tcloud CLI installed.

Reasoning-model floor: glm-5.2 calls get max_tokens >= 1200 so visible content isn't starved by reasoning_content.

Live arm: injects the real worker + verifier + topic-relevant readiness specs, runs both arms at equal agent-pass budget, cost-gates with a cheap glm-5.2 smoke first, asserts the worker actually web-searched, and reports admitted-source count + coverage per arm with agent-eval's pairedBootstrap. The offline arm is unchanged (CI, $0, deterministic); the arm-runners gain defaulted parameters only.
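The fail-closed verification and the reasoning-model token floor described above could look roughly like the sketch below. This is not the PR's code: `Verdict`, `withReasoningFloor`, `parseVerdictFailClosed`, `verifyFailClosed`, and the `{"onTopic": ..., "duplicate": ...}` reply shape are all hypothetical names and formats chosen for illustration. Only the 1200-token floor and the reject-on-failure behaviour come from the description.

```typescript
// Hypothetical sketch: names and the JSON reply shape are assumptions,
// not the actual exports of src/web-research-worker.ts.
type Verdict = { admit: boolean; reason: string };

// Floor stated in the PR description for glm-5.2 calls.
const REASONING_MIN_TOKENS = 1200;

// Raise the caller's max_tokens to the floor so hidden reasoning_content
// cannot use up the whole budget before visible content is produced.
function withReasoningFloor(maxTokens?: number): number {
  return Math.max(maxTokens ?? 0, REASONING_MIN_TOKENS);
}

// Parse a model reply assumed to be JSON such as
// {"onTopic": true, "duplicate": false}. Anything else is a rejection.
function parseVerdictFailClosed(raw: string): Verdict {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) {
      return { admit: false, reason: "verdict is not an object" };
    }
    const fields = parsed as Record<string, unknown>;
    const onTopic = fields["onTopic"];
    const duplicate = fields["duplicate"];
    if (typeof onTopic !== "boolean" || typeof duplicate !== "boolean") {
      return { admit: false, reason: "verdict fields missing or mistyped" };
    }
    if (!onTopic) return { admit: false, reason: "off-topic" };
    if (duplicate) return { admit: false, reason: "near-duplicate" };
    return { admit: true, reason: "on-topic and novel" };
  } catch {
    // JSON.parse threw: treat an unparseable reply as a rejection.
    return { admit: false, reason: "unparseable verdict" };
  }
}

// Wrap any async judge call so that a router error also rejects the source.
async function verifyFailClosed(
  judge: (maxTokens: number) => Promise<string>,
  maxTokens?: number,
): Promise<Verdict> {
  try {
    return parseVerdictFailClosed(await judge(withReasoningFloor(maxTokens)));
  } catch {
    return { admit: false, reason: "router failure" };
  }
}
```

The design point is that every path that is not an explicit, well-formed "on-topic and not a duplicate" answer ends in `admit: false`, so an unverified source is never admitted by default.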
tangletools approved these changes on Jun 24, 2026
tangletools (Contributor) left a comment:
✅ Auto-approved PR — 113f6efc
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T22:16:38Z
What
Makes the two-agent research loop's live A/B arm genuinely live. Before this, tests/loops/research-loop-equal-compute.test.ts's describe.skipIf(!AGENT_KNOWLEDGE_LIVE) block was a skeleton: it ran the SAME offline naive proposer over a hardcoded chicken-coop pool with a junk/-prefix verifier, so AGENT_KNOWLEDGE_LIVE=1 proved nothing about real research. The repo had no committed real web-research worker.

New module: src/web-research-worker.ts (general, any-topic, no hardcoded corpus)

- createWebResearchWorker: given the open knowledge gaps, glm-5.2 forms focused search queries → real web search over the Tangle router (POST /v1/search, the same endpoint tcloud mcp's web_search tool forwards to) → fetches each top hit with the repo's politeFetch → reduces with htmlToText → proposes citing knowledge/*.md pages via buildPages. Conforms to the loop's ResearchWorker contract.
- createVerifyingResearchDriver: the differentiated driver role, a glm-5.2 verifySource pass judging each fetched source's on-topic relevance to the goal + open gaps and near-duplicate against the round. Fail-closed (reject) on router/parse failure so an unverified source never poisons the KB.
- createTangleRouterClient: dependency-free router client over fetch (search + chat), so it works with or without the tcloud CLI installed. glm-5.2 calls get max_tokens >= 1200 so visible content isn't starved by hidden reasoning_content.

Live-arm wiring
The live arm injects the real worker + real verifier + topic-relevant readiness specs, runs both arms at equal agent-pass budget, cost-gates with a cheap glm-5.2 smoke first, asserts the worker actually web-searched (fails loud on zero sources, which would be a false null), and reports admitted-source count + coverage per arm with agent-eval's pairedBootstrap. The offline arm is unchanged (CI, $0, deterministic); the arm-runners only gain defaulted parameters.

For real research there is no planted-junk oracle, so the live cleanliness signal is admitted-source COUNT: the verifying driver rejects off-topic fetches, so the two-agent KB admits FEWER sources at equal-or-higher coverage. This is the real-world analogue of the offline junk count.
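For readers unfamiliar with the statistic, a paired bootstrap over per-run admitted-source counts can be sketched as follows. This is a generic percentile bootstrap written for illustration, not agent-eval's pairedBootstrap, whose signature and options are not shown in this PR. `pairedBootstrapSketch`, `mulberry32`, and the default resample count and seed are assumptions.

```typescript
// Hypothetical sketch of a percentile paired bootstrap on the mean difference.
// It is NOT agent-eval's pairedBootstrap; names and defaults are assumptions.

// Small seeded PRNG (mulberry32) so resampling is reproducible across runs.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type BootstrapResult = { meanDiff: number; ciLow: number; ciHigh: number };

// a[i] and b[i] must come from the same paired unit (for example the same
// goal or run index), with a = two-agent arm and b = single-agent arm.
function pairedBootstrapSketch(
  a: number[],
  b: number[],
  resamples = 2000,
  seed = 1,
): BootstrapResult {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("need equal-length, non-empty paired samples");
  }
  // Work on per-pair differences so the pairing is preserved when resampling.
  const diffs = a.map((x, i) => x - (b[i] as number));
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      // Draw one pair index with replacement.
      sum += diffs[Math.floor(rand() * n)] as number;
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  // Percentile lookup, clamped to the last element.
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] as number;
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / n;
  return { meanDiff, ciLow: at(0.025), ciHigh: at(0.975) };
}
```

Under this sketch, a negative `meanDiff` whose interval excludes zero would correspond to the two-agent arm admitting fewer sources than the single-agent arm. With only a handful of paired runs the interval will be wide, so it should be read alongside the coverage numbers rather than on its own.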
Verification (all run)
- pnpm run lint / typecheck / build / offline pnpm test: all green (131 passed, 7 creds-gated skips; offline arm byte-identical: two-agent junk=0 coverage=1.00 | single-agent junk=2 coverage=1.00).
- Live run (AGENT_KNOWLEDGE_LIVE=1, glm-5.2, goal "self-speculative decoding", B≤4): the worker really web-searched (Google Research / NVIDIA / BentoML pages on speculative decoding) and the A/B reported:

DO NOT MERGE: review first.