feat(research): per-arm cost instrumentation for the two-agent A/B #36

Merged

drewstone merged 5 commits into main from feat/cost-quality-loop

Jun 25, 2026

drewstone commented Jun 25, 2026

Contributor

Phase 1 of the cost/quality study, done by hand because the workflow was blocked by a session usage limit. The router client now accumulates per-call tokens, dollars, and latency plus a chat-call counter, all exposed via RouterClient.usage(). The live A/B diffs usage per arm and logs calls/tokens/$/ms next to admitted/coverage. This surfaces that the two-agent verify step costs N LLM calls, so it spends more inference than 'equal passes' implies. Phases 2-5 (live cost/quality run, claim-grounding judgment task, adaptive policy, paper) follow when the limit resets.
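The snapshot-and-diff idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `UsageTotals` shape, the `UsageMeter` class, and `diffUsage` are hypothetical names, and the real return type of RouterClient.usage() is not shown here. The numbers are made up and are not measurements from the study.

```typescript
// Hypothetical shape: the real return type of RouterClient.usage() is not shown in this PR.
interface UsageTotals {
  calls: number;     // chat-call counter
  tokens: number;    // prompt + completion tokens
  costUsd: number;   // accumulated dollars
  latencyMs: number; // accumulated latency
}

// Stand-in for the router client's accumulator (illustrative name).
class UsageMeter {
  private totals: UsageTotals = { calls: 0, tokens: 0, costUsd: 0, latencyMs: 0 };

  // Record one chat call and bump the call counter.
  record(call: { tokens: number; costUsd: number; latencyMs: number }): void {
    this.totals.calls += 1;
    this.totals.tokens += call.tokens;
    this.totals.costUsd += call.costUsd;
    this.totals.latencyMs += call.latencyMs;
  }

  // Return a copy so a caller can keep it as a before/after snapshot.
  usage(): UsageTotals {
    return { ...this.totals };
  }
}

// Per-arm cost is the snapshot after the arm minus the snapshot before it.
function diffUsage(after: UsageTotals, before: UsageTotals): UsageTotals {
  return {
    calls: after.calls - before.calls,
    tokens: after.tokens - before.tokens,
    costUsd: after.costUsd - before.costUsd,
    latencyMs: after.latencyMs - before.latencyMs,
  };
}

// Made-up numbers, not measurements from the study.
const meter = new UsageMeter();
const sources = 3;

const beforeSingle = meter.usage();
meter.record({ tokens: 1000, costUsd: 0.001, latencyMs: 800 }); // one worker pass
const singleArm = diffUsage(meter.usage(), beforeSingle);

const beforeTwoAgent = meter.usage();
meter.record({ tokens: 1000, costUsd: 0.001, latencyMs: 800 }); // same worker pass
for (let i = 0; i < sources; i++) {
  meter.record({ tokens: 600, costUsd: 0.0006, latencyMs: 500 }); // one verifier call per source
}
const twoAgentArm = diffUsage(meter.usage(), beforeTwoAgent);

console.log(singleArm.calls, twoAgentArm.calls); // 1 4
```

Both arms run one worker pass, yet the two-agent arm records four chat calls in this toy setup, which is the gap that 'equal passes' hides.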


feat(research): instrument per-arm router cost (tokens/$/latency/calls) for the A/B

9c08b0d

tangletools previously approved these changes

tangletools left a comment

Contributor

✅ Auto-approved PR — `9c08b0dc`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:10:47Z}


docs(results): live cost/quality numbers — the two-agent loop's ~5-9x inference premium

884cdc6

drewstone dismissed tangletools’s stale review via 884cdc6 on June 25, 2026 00:21

tangletools previously approved these changes

tangletools left a comment

Contributor

✅ Auto-approved PR — `884cdc6d`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:22:02Z}


feat(research): claim-grounding verifier mode — catch misattributed citations the relevance judge can't

8cf74b6

The cost/quality A/B (docs/results/cost-quality.md) showed that the relevance
verifier's cleanliness win is dominated by de-duplication, which a deterministic
content hash captures at almost none of the LLM premium. This commit adds the band
where a verifier earns its dollar: misattributed citations, meaning a source that
is on-topic, unique, and real, but whose cited CLAIM does not appear in the page.

- src/claim-grounding.ts: each proposal carries the specific claim it is cited
  for (withCitedClaim → metadata.citedClaim). groundClaimInText checks that claim
  against the htmlToText output of the fetched page, accepting a verbatim match, a
  normalized match, or ≥70% content-word overlap. createClaimGroundingVerifier
  gates on grounding and composes with the LLM relevance verifier. The grounding
  check is deterministic and costs $0 in inference, so it is a deployable
  non-oracle verifier. createClaimDecorator extracts a grounded claim for
  relevance-only workers.
- tests/claim-grounding.test.ts: 14 unit tests for the oracle + driver gate.
- tests/loops/claim-grounding-ab.test.ts: an offline controlled floor (grounding
  admits 0/2 misattributions; relevance and no-verifier each admit 2/2) plus a live
  A/B arm that fetches real pages, plants one misattribution per topic, and reports
  caught-per-dollar across no-verifier / relevance / grounding using the
  RouterClient.usage() instrumentation from #36. A cheap glm-5.2 smoke test runs
  before the full paid run.
- docs/results/claim-grounding.md: live n=5 result. Grounding catches 5/5 at $0;
  relevance catches 4/5 at $0.0157 and is structurally blind to misattribution
  because it only sees page text, never the claim. On this band the
  verifier-per-dollar comparison inverts the cost/quality finding.
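The three grounding tiers named in the commit message (verbatim, normalized, ≥70% content-word overlap) could look something like the sketch below. It is an illustration under stated assumptions, not the body of groundClaimInText: the function name `groundClaimSketch`, the ASCII-only normalization, and the stopword list are all hypothetical, and the real content-word filter is not shown in this PR.

```typescript
type Grounding = "verbatim" | "normalized" | "overlap" | "ungrounded";

// Assumed stopword list; the real content-word filter is not shown in the PR.
const STOPWORDS = new Set([
  "a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "are",
  "was", "were", "that", "for", "with", "by", "as", "at", "it",
]);

// Lowercase and collapse every non-alphanumeric run to one space (ASCII-only simplification).
function normalize(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Normalized tokens with stopwords removed.
function contentWords(s: string): string[] {
  return normalize(s).split(" ").filter((w) => w.length > 0 && !STOPWORDS.has(w));
}

// Try the cheapest check first: verbatim, then normalized, then word overlap.
function groundClaimSketch(claim: string, pageText: string, minOverlap = 0.7): Grounding {
  if (claim.trim().length === 0) return "ungrounded";
  if (pageText.includes(claim)) return "verbatim";

  // Pad with spaces so the normalized match respects word boundaries.
  const normClaim = normalize(claim);
  if (normClaim.length > 0 && ` ${normalize(pageText)} `.includes(` ${normClaim} `)) {
    return "normalized";
  }

  const words = contentWords(claim);
  if (words.length === 0) return "ungrounded";
  const pageWords = new Set(contentWords(pageText));
  const hits = words.filter((w) => pageWords.has(w)).length;
  return hits / words.length >= minOverlap ? "overlap" : "ungrounded";
}

// Illustrative page text and claims, not data from the study.
const page =
  "The study found that caffeine improves reaction time in adults. Sleep loss reverses the effect.";

const verbatim = groundClaimSketch("caffeine improves reaction time", page);
const normalized = groundClaimSketch("Caffeine improves reaction-time", page);
const overlap = groundClaimSketch("reaction time in adults improves with caffeine", page);
const planted = groundClaimSketch("caffeine causes long-term memory loss", page);

console.log(verbatim, normalized, overlap, planted);
// verbatim normalized overlap ungrounded
```

The last claim is on-topic for the page but shares only 2 of its 6 content words with it, so the check rejects it without any LLM call. A relevance judge that never sees the claim has nothing to reject.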

drewstone dismissed tangletools’s stale review via 8cf74b6 on June 25, 2026 00:53

tangletools previously approved these changes

tangletools left a comment

Contributor

✅ Auto-approved PR — `8cf74b66`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:53:44Z}


feat(research): adaptive verifier — $0 dedup + heuristic triage, LLM only for the ambiguous tail

5c8e7e0

The cost/quality A/B (docs/results/cost-quality.md) showed the LLM relevance
verifier's cleanliness win is dominated by de-duplication and that an LLM check
only earns its dollar on the off-scope tail. This ships the driver that acts on
that finding and measures all three topologies on the cost/quality frontier.

- src/adaptive-driver.ts: createAdaptiveResearchDriver runs three stages per
  candidate, cheapest first. (1) DEDUP ($0): canonical URL (scheme, www, trailing
  slash, and tracking params stripped) plus a normalized-text content hash,
  checked against the round and the KB index. (2) HEURISTIC TRIAGE ($0): host,
  title, and length signals classify each unique survivor as keep (authoritative
  host + substantial body), drop (spam or thin), or ambiguous. (3) LLM ESCALATION
  ($): only ambiguous survivors reach the shipped createVerifyingResearchDriver
  relevance judge. stats() exposes where every decision was spent. Reuses sha256
  and the relevance verifier; reinvents nothing.
- tests/loops/adaptive-ab.test.ts: pure-unit coverage of the deterministic
  stages; an offline controlled A/B showing that adaptive escalates ONLY the
  ambiguous tail (6 candidates -> 2 LLM calls, matching full-LLM cleanliness on the
  controlled pool); and a live three-topology frontier (single / full-LLM /
  adaptive) that gates the same fetched proposals and diffs per-arm cost with
  RouterClient.usage() from #36. A cheap glm-5.2 smoke test gates the full paid run.
- docs/results/adaptive.md: live n=5 result. Adaptive cuts LLM calls by 76%
  (25->6) and dollars by 74% ($0.0261->$0.0068). HONEST quality reading: at 20
  admitted, adaptive sits BETWEEN single (25 admitted) and full-LLM (12 admitted).
  It recovers the deterministic de-dup half of full-LLM's cleanliness (all 5 of
  its removals are duplicates) but NOT the relevance-judgment half: on this
  authoritative-host-heavy set the heuristic keeps arxiv/github survivors that
  full-LLM would drop (3 of 5 topics escalated zero LLM calls). A frontier point,
  not a free lunch; the doc states where the heuristic is weak.
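The two $0 stages described in this commit can be sketched as below, with the LLM stage reduced to a list of candidates that would be escalated. Everything here is an assumption made for illustration: the `Candidate` shape, the tracking-parameter pattern, the host allowlist, the character thresholds, the spam pattern, and the `gate` function are hypothetical and are not taken from src/adaptive-driver.ts. The URLs and texts are placeholders.

```typescript
import { createHash } from "node:crypto";

interface Candidate { url: string; title: string; text: string }
type Triage = "keep" | "drop" | "ambiguous";
interface GateStats { duplicates: number; kept: number; dropped: number; escalated: number }

// Assumed lists and thresholds, chosen only for illustration.
const TRACKING_PARAM = /^(utm_.*|fbclid|gclid)$/;
const AUTHORITATIVE_HOSTS = new Set(["arxiv.org", "github.com"]);
const THIN_CHARS = 200;
const SUBSTANTIAL_CHARS = 1000;

function bareHost(url: URL): string {
  return url.hostname.replace(/^www\./, "");
}

// Stage 1a: drop scheme, www, trailing slash, and tracking params.
function canonicalUrl(raw: string): string {
  const u = new URL(raw);
  const kept: string[] = [];
  u.searchParams.forEach((value, key) => {
    if (!TRACKING_PARAM.test(key)) kept.push(`${key}=${value}`);
  });
  kept.sort();
  const path = u.pathname.replace(/\/+$/, "");
  return bareHost(u) + path + (kept.length > 0 ? `?${kept.join("&")}` : "");
}

// Stage 1b: sha256 over whitespace-collapsed, lowercased text.
function contentHash(text: string): string {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Stage 2: host/title/length signals sort a unique survivor into three classes.
function triage(c: Candidate): Triage {
  if (c.text.length < THIN_CHARS || /\b(casino|buy now)\b/i.test(c.title)) return "drop";
  const authoritative = AUTHORITATIVE_HOSTS.has(bareHost(new URL(c.url)));
  if (authoritative && c.text.length >= SUBSTANTIAL_CHARS) return "keep";
  return "ambiguous";
}

// Stages 1 and 2; stage 3 is represented by `toLlm`, the only candidates a paid judge would see.
function gate(candidates: Candidate[]): { admitted: Candidate[]; toLlm: Candidate[]; stats: GateStats } {
  const seen = new Set<string>();
  const admitted: Candidate[] = [];
  const toLlm: Candidate[] = [];
  const stats: GateStats = { duplicates: 0, kept: 0, dropped: 0, escalated: 0 };
  for (const c of candidates) {
    const urlKey = `url:${canonicalUrl(c.url)}`;
    const hashKey = `hash:${contentHash(c.text)}`;
    if (seen.has(urlKey) || seen.has(hashKey)) { stats.duplicates += 1; continue; }
    seen.add(urlKey);
    seen.add(hashKey);
    const verdict = triage(c);
    if (verdict === "keep") { admitted.push(c); stats.kept += 1; }
    else if (verdict === "drop") { stats.dropped += 1; }
    else { toLlm.push(c); stats.escalated += 1; }
  }
  return { admitted, toLlm, stats };
}

// Placeholder pool, not data from the study.
const longBody = "x".repeat(1200);
const pool: Candidate[] = [
  { url: "https://arxiv.org/abs/0000.00000", title: "A paper", text: longBody },
  { url: "http://www.arxiv.org/abs/0000.00000/?utm_source=feed", title: "A paper", text: longBody + " " },
  { url: "https://example.org/thin", title: "Stub", text: "too short" },
  { url: "https://example.org/post", title: "Blog post", text: "y".repeat(500) },
];
const result = gate(pool);
console.log(result.stats); // { duplicates: 1, kept: 1, dropped: 1, escalated: 1 }
```

In this toy pool only one of four candidates would reach the paid judge. The same sketch also shows the weakness the results doc reports: any long page on an allowlisted host is kept without a relevance check.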

drewstone dismissed tangletools’s stale review via 5c8e7e0 on June 25, 2026 01:10

tangletools previously approved these changes

tangletools left a comment

Contributor

✅ Auto-approved PR — `5c8e7e07`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T01:10:24Z}


docs: fold cost, claim-grounding, and adaptive results into the final paper

321e596

Rewrite docs/two-agent-research-ab.md into the final short paper, folding in all
four results: the reproduced cleanliness A/B (Δ 2.33/2.67), the cost/quality
frontier (~5x dollars / 9x tokens / 3x latency, plus the correction that
'equal passes' hid the per-source verifier LLM cost), the claim-grounding band
(misattribution caught 5/5 at $0 vs the relevance judge's 4/5 at ~$0.003/topic),
and the adaptive driver (76% fewer LLM calls, 74% cheaper, recovers the de-dup
half of cleanliness). Rewrite the 'simpler loop' section to reflect what was
actually built (adaptive driver + claim-grounding mode); only the AgentProfile
worker remains deferred.
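The summary percentages can be cross-checked against the raw figures quoted in the earlier commit messages of this PR. The short check below assumes that the $0.0157 relevance cost is a total across the five topics; `pctCut` is an illustrative helper, not something from the repository.

```typescript
// Figures copied from the commit messages in this PR.
// Percentage reduction from `before` to `after`, rounded to a whole percent.
const pctCut = (before: number, after: number): number =>
  Math.round((1 - after / before) * 100);

const callCut = pctCut(25, 6);               // LLM calls, full-LLM vs adaptive
const dollarCut = pctCut(0.0261, 0.0068);    // dollars, full-LLM vs adaptive
const relevancePerTopic = 0.0157 / 5;        // assumes $0.0157 is the five-topic total

console.log(callCut, dollarCut, relevancePerTopic.toFixed(4));
// 76 74 0.0031
```

The results agree with the "76% fewer LLM calls, 74% cheaper" and "~$0.003/topic" figures in the summary.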

drewstone dismissed tangletools’s stale review via 321e596 on June 25, 2026 01:14

tangletools approved these changes

tangletools left a comment

Contributor

✅ Auto-approved PR — `321e596a`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T01:14:50Z}

drewstone merged commit a137d7d into main

1 check passed

drewstone mentioned this pull request

chore(release): agent-knowledge 1.10.0 — claim-grounding verifier + adaptive driver + cost instrumentation #38

Merged

