feat(research): per-arm cost instrumentation for the two-agent A/B #36
tangletools
left a comment
✅ Auto-approved PR — 9c08b0dc
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:10:47Z
… inference premium
tangletools
left a comment
✅ Auto-approved PR — 884cdc6d
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:22:02Z
…itations the relevance judge can't

The cost/quality A/B (docs/results/cost-quality.md) showed the relevance verifier's cleanliness win is dominated by de-duplication, which a deterministic content hash captures at almost none of the LLM premium. This adds the band where a verifier earns its dollar: misattributed citations, where the source is on-topic, unique, and real, but the CLAIM it is cited for does not appear in the page.

- src/claim-grounding.ts: each proposal carries the specific claim it is cited for (withCitedClaim → metadata.citedClaim); groundClaimInText checks it against the htmlToText output of the fetched page (verbatim / normalized / ≥70% content-word overlap). createClaimGroundingVerifier gates on grounding and composes the LLM relevance verifier. Deterministic, $0 inference: a deployable non-oracle verifier. createClaimDecorator extracts a grounded claim for relevance-only workers.
- tests/claim-grounding.test.ts: 14 unit tests for the oracle + driver gate.
- tests/loops/claim-grounding-ab.test.ts: an offline controlled floor (grounding admits 0/2 misattributions; relevance and no-verifier each admit 2/2), plus a live A/B arm that fetches real pages, plants one misattribution per topic, and reports caught-per-dollar across no-verifier / relevance / grounding using the #36 RouterClient.usage() instrumentation. A cheap glm-5.2 smoke test runs before the burn.
- docs/results/claim-grounding.md: live n=5 result. Grounding catches 5/5 at $0; relevance catches 4/5 at $0.0157 (it is structurally blind to misattribution: it only sees page text, never the claim). The verifier-per-dollar comparison inverts the cost/quality finding on this band.
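The three grounding tiers named above (verbatim / normalized / ≥70% content-word overlap) can be illustrated with a minimal sketch. This is not the code in src/claim-grounding.ts: the function name `groundClaimSketch`, the stopword list, the normalization rule, and the return type are all assumptions made for illustration; only the tier names and the 70% threshold come from the commit message.

```typescript
// Hypothetical sketch of a tiered claim-grounding check.
// Names, stopwords, and normalization are assumptions, not the real module.
type GroundingMatch = "verbatim" | "normalized" | "overlap" | "none";

// Small illustrative stopword list; a real list would be longer.
const STOPWORDS = new Set([
  "a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are",
  "was", "were", "for", "with", "by", "that", "this", "it", "as", "at",
]);

// Lowercase, replace anything that is not a letter, digit, or space, collapse spaces.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

// Normalized words with stopwords removed.
function contentWords(text: string): string[] {
  return normalize(text).split(" ").filter((w) => w.length > 0 && !STOPWORDS.has(w));
}

function groundClaimSketch(claim: string, pageText: string, minOverlap = 0.7): GroundingMatch {
  if (claim.trim().length === 0) return "none";

  // Tier 1: the claim appears character-for-character in the page text.
  if (pageText.includes(claim)) return "verbatim";

  // Tier 2: the claim appears once case and punctuation are normalized away.
  const normClaim = normalize(claim);
  if (normClaim.length > 0 && normalize(pageText).includes(normClaim)) return "normalized";

  // Tier 3: at least minOverlap of the claim's content words occur in the page.
  // Word order is ignored, so this tier is deliberately the loosest.
  const claimWords = contentWords(claim);
  if (claimWords.length === 0) return "none";
  const pageWords = new Set(contentWords(pageText));
  const hits = claimWords.filter((w) => pageWords.has(w)).length;
  return hits / claimWords.length >= minOverlap ? "overlap" : "none";
}
```

Under these assumptions, a planted misattribution (an on-topic page cited for a claim it never makes) falls through all three tiers and returns "none", which is the case a relevance-only judge cannot see because it never receives the claim.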
tangletools
left a comment
✅ Auto-approved PR — 8cf74b66
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:53:44Z
…only for the ambiguous tail

The cost/quality A/B (docs/results/cost-quality.md) showed the LLM relevance verifier's cleanliness win is dominated by de-duplication and that an LLM check only earns its dollar on the off-scope tail. This ships the driver that acts on that finding and measures all three topologies on the cost/quality frontier.

- src/adaptive-driver.ts: createAdaptiveResearchDriver runs three stages per candidate, cheapest first.
  (1) DEDUP ($0): canonical URL (scheme / www / trailing slash / tracking params stripped) plus a normalized-text content hash, checked against the round and the KB index.
  (2) HEURISTIC TRIAGE ($0): host/title/length signals classify each unique survivor as keep (authoritative host + substantial body), drop (spam or thin), or ambiguous.
  (3) LLM ESCALATION ($): only ambiguous survivors reach the shipped createVerifyingResearchDriver relevance judge.
  stats() exposes where every decision was spent. Reuses sha256 + the relevance verifier; reinvents nothing.
- tests/loops/adaptive-ab.test.ts: pure-unit coverage of the deterministic stages; an offline controlled A/B proving adaptive escalates ONLY the ambiguous tail (6 candidates -> 2 LLM calls, matching full-LLM cleanliness on the controlled pool); and a live three-topology frontier (single / full-LLM / adaptive) gating the same fetched proposals, diffing per-arm cost with #36's RouterClient.usage(). A cheap glm-5.2 smoke test gates the burn.
- docs/results/adaptive.md: live n=5 result. Adaptive cuts LLM calls 76% (25->6) and dollars 74% ($0.0261->$0.0068). HONEST quality reading: at 20 admitted, adaptive sits BETWEEN single (25 admitted) and full-LLM (12 admitted). It recovers the deterministic de-dup half of full-LLM's cleanliness (all 5 of its removals are duplicates) but NOT the relevance-judgment half, because on this authoritative-host-heavy set the heuristic keeps arxiv/github survivors that full-LLM would drop (3 of 5 topics escalated zero LLM calls). A frontier point, not a free lunch; the doc states where the heuristic is weak.
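The two $0 stages described for the adaptive driver can be sketched as follows. This is a hedged illustration, not the contents of src/adaptive-driver.ts: `runCheapStages`, the `Candidate` shape, the host list, the spam pattern, and the length thresholds are all hypothetical. To stay dependency-free it also uses the normalized text itself as the dedup key, where the commit message says the shipped driver hashes it with sha256.

```typescript
// Hypothetical sketch of dedup + heuristic triage; all names and thresholds are assumptions.
type Candidate = { url: string; title: string; text: string };
type Triage = "keep" | "drop" | "ambiguous";

const AUTHORITATIVE_HOSTS = new Set(["arxiv.org", "github.com"]); // assumed list
const SPAM_TITLE = /\b(casino|free money)\b/i;                    // assumed pattern
const THIN_BODY_CHARS = 50;                                       // assumed threshold
const SUBSTANTIAL_BODY_CHARS = 200;                               // assumed threshold

function isTrackingParam(key: string): boolean {
  return key.startsWith("utm_") || key === "fbclid" || key === "gclid";
}

// Strip scheme, leading "www.", trailing slashes, and tracking params.
function canonicalUrl(raw: string): string {
  const u = new URL(raw);
  const host = u.hostname.toLowerCase().replace(/^www\./, "");
  const path = u.pathname.replace(/\/+$/, "");
  const kept: string[] = [];
  u.searchParams.forEach((value, key) => {
    if (!isTrackingParam(key)) kept.push(`${key}=${value}`);
  });
  kept.sort(); // stable key regardless of param order
  return host + path + (kept.length > 0 ? "?" + kept.join("&") : "");
}

function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function triage(c: Candidate): Triage {
  const host = new URL(c.url).hostname.toLowerCase().replace(/^www\./, "");
  const bodyLength = c.text.trim().length;
  if (bodyLength < THIN_BODY_CHARS || SPAM_TITLE.test(c.title)) return "drop";
  if (AUTHORITATIVE_HOSTS.has(host) && bodyLength >= SUBSTANTIAL_BODY_CHARS) return "keep";
  return "ambiguous";
}

// Runs the deterministic stages; whatever lands in toEscalate would go to the LLM judge.
function runCheapStages(candidates: Candidate[], seen: Set<string> = new Set()) {
  const stats = { duplicates: 0, kept: 0, dropped: 0, escalated: 0 };
  const admitted: Candidate[] = [];
  const toEscalate: Candidate[] = [];
  for (const c of candidates) {
    const keys = ["url:" + canonicalUrl(c.url)];
    const body = normalizeText(c.text);
    if (body.length > 0) keys.push("text:" + body); // real driver: sha256 of this
    if (keys.some((k) => seen.has(k))) { stats.duplicates++; continue; }
    keys.forEach((k) => seen.add(k));
    const verdict = triage(c);
    if (verdict === "keep") { stats.kept++; admitted.push(c); }
    else if (verdict === "drop") { stats.dropped++; }
    else { stats.escalated++; toEscalate.push(c); }
  }
  return { admitted, toEscalate, stats };
}
```

The ordering matters for cost: a duplicate never reaches triage, and a keep or drop verdict never reaches the paid stage. It also shows the weakness the result doc reports, since any substantial page on an assumed-authoritative host is kept without a relevance check.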
tangletools
left a comment
✅ Auto-approved PR — 5c8e7e07
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T01:10:24Z
… paper

Rewrite docs/two-agent-research-ab.md into the final short paper, folding in all four results:
- the reproduced cleanliness A/B (Δ 2.33/2.67);
- the cost/quality frontier (~5x dollars / 9x tokens / 3x latency, and the correction that 'equal passes' hid the per-source verifier LLM cost);
- the claim-grounding band (misattribution caught 5/5 at $0 vs the relevance judge's 4/5 at ~$0.003/topic);
- the adaptive driver (76% fewer LLM calls, 74% cheaper, recovers the de-dup half of cleanliness).

Rewrite the 'simpler loop' section to reflect what was actually built (adaptive driver + claim-grounding mode), with only the AgentProfile worker still deferred.
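As a quick consistency check, the percentages and the per-topic figure quoted here follow from the totals reported in the earlier commit messages on this PR (25 -> 6 LLM calls, $0.0261 -> $0.0068, and $0.0157 over n=5 topics), taking those numbers as given:

```latex
\begin{align*}
\text{LLM-call reduction} &= \frac{25 - 6}{25} = 0.76 \\
\text{dollar reduction} &= \frac{0.0261 - 0.0068}{0.0261} \approx 0.74 \\
\text{relevance cost per topic} &= \frac{\$0.0157}{5} \approx \$0.003
\end{align*}
```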
tangletools
left a comment
✅ Auto-approved PR — 321e596a
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T01:14:50Z
Phase 1 of the cost/quality study, done by hand (the workflow was blocked by a session usage limit). The router client now accumulates per-call tokens/$/latency plus a chat-call counter, exposed via RouterClient.usage(); the live A/B diffs usage per arm and logs calls/tokens/$/ms next to admitted/coverage. This surfaces that the two-agent verify step is N LLM calls, one per source, so it spends more inference than 'equal passes' implies. Phases 2-5 (live cost/quality run, claim-grounding judgment task, adaptive policy, paper) follow when the limit resets.
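The accumulate-then-diff pattern described above can be sketched as follows. This is an assumption-laden illustration, not the real RouterClient: the `UsageMeter` class, the `Usage` field names, and `diffUsage` are hypothetical; the only thing taken from the description is that a cumulative `usage()` snapshot is read before and after each arm and the two are subtracted.

```typescript
// Hypothetical sketch of per-arm cost accounting via cumulative snapshots.
type Usage = {
  calls: number;      // chat-call counter
  tokens: number;     // prompt + completion tokens
  usd: number;        // accumulated dollars
  latencyMs: number;  // accumulated wall-clock latency
};

class UsageMeter {
  private total: Usage = { calls: 0, tokens: 0, usd: 0, latencyMs: 0 };

  // Called once per completed chat call.
  record(call: { tokens: number; usd: number; latencyMs: number }): void {
    this.total.calls += 1;
    this.total.tokens += call.tokens;
    this.total.usd += call.usd;
    this.total.latencyMs += call.latencyMs;
  }

  // Returns a copy so a saved snapshot is not mutated by later calls.
  usage(): Usage {
    return { ...this.total };
  }
}

// Cost of one arm = snapshot after the arm minus snapshot before it.
function diffUsage(after: Usage, before: Usage): Usage {
  return {
    calls: after.calls - before.calls,
    tokens: after.tokens - before.tokens,
    usd: after.usd - before.usd,
    latencyMs: after.latencyMs - before.latencyMs,
  };
}

// Usage: wrap each arm in a before/after pair.
const meter = new UsageMeter();
const before = meter.usage();
meter.record({ tokens: 1200, usd: 0.001, latencyMs: 800 }); // stand-in for verifier call 1
meter.record({ tokens: 900, usd: 0.002, latencyMs: 650 });  // stand-in for verifier call 2
const armCost = diffUsage(meter.usage(), before);
console.log(armCost.calls, armCost.tokens);
```

Diffing snapshots rather than resetting the counter lets several arms share one client while still attributing cost per arm, provided the arms run sequentially rather than concurrently.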