feat(research): per-arm cost instrumentation for the two-agent A/B #36

Merged: drewstone merged 5 commits into main from feat/cost-quality-loop on Jun 25, 2026

@drewstone

Phase 1 of the cost/quality study, done by hand (the workflow was blocked by a session usage limit). The router client now accumulates per-call tokens, dollars, and latency, plus a chat-call counter, all exposed via RouterClient.usage(). The live A/B diffs usage per arm and logs calls/tokens/$/ms next to admitted/coverage. This surfaces that the two-agent verify step is N LLM calls rather than one, so it spends more inference than 'equal passes' implies. Phases 2-5 (live cost/quality run, claim-grounding judgment task, adaptive policy, paper) follow when the limit resets.
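The snapshot-and-diff pattern described above can be sketched as follows. This is a minimal illustration, not the shipped code: the PR does not show the fields returned by RouterClient.usage(), so the `Usage` shape, `UsageMeter`, `diffUsage`, and `measureArm` are all hypothetical names, and the arm runs synchronously here for brevity where a real arm would presumably be async.

```typescript
// Hypothetical shape: the PR does not show the real RouterClient.usage() fields.
interface Usage {
  calls: number;     // chat-call counter
  tokens: number;    // prompt + completion tokens
  costUsd: number;   // accumulated dollars
  latencyMs: number; // accumulated latency
}

// Stand-in for the accumulating part of a router client (assumed name).
class UsageMeter {
  private total: Usage = { calls: 0, tokens: 0, costUsd: 0, latencyMs: 0 };

  // Record one completed chat call.
  record(call: { tokens: number; costUsd: number; latencyMs: number }): void {
    this.total.calls += 1;
    this.total.tokens += call.tokens;
    this.total.costUsd += call.costUsd;
    this.total.latencyMs += call.latencyMs;
  }

  // Return a copy so a caller's snapshot is not mutated by later calls.
  usage(): Usage {
    return { ...this.total };
  }
}

// Usage attributable to one arm: the counter delta across that arm's run.
function diffUsage(after: Usage, before: Usage): Usage {
  return {
    calls: after.calls - before.calls,
    tokens: after.tokens - before.tokens,
    costUsd: after.costUsd - before.costUsd,
    latencyMs: after.latencyMs - before.latencyMs,
  };
}

// Snapshot, run the arm, then diff the two snapshots.
function measureArm(meter: UsageMeter, runArm: () => void): Usage {
  const before = meter.usage();
  runArm();
  return diffUsage(meter.usage(), before);
}
```

Diffing cumulative counters keeps the client free of any notion of "arm": the A/B harness decides where an arm starts and ends, and a verify step that issues several calls shows up directly in `calls`.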

tangletools previously approved these changes Jun 25, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 9c08b0dc

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:10:47Z

tangletools previously approved these changes Jun 25, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 884cdc6d

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:22:02Z

…itations the relevance judge can't

The cost/quality A/B (docs/results/cost-quality.md) showed the relevance
verifier's cleanliness win is dominated by de-duplication, which a deterministic
content hash captures at almost none of the LLM premium. This adds the band
where a verifier earns its dollar: misattributed citations, i.e. a source that
is on-topic, unique, and real, but whose cited CLAIM does not appear in the page.

- src/claim-grounding.ts: each proposal carries the specific claim it is cited
  for (withCitedClaim → metadata.citedClaim); groundClaimInText checks it against
  the htmlToText output of the fetched page (verbatim / normalized / ≥70%
  content-word overlap). createClaimGroundingVerifier gates on grounding and
  composes the LLM relevance verifier. Deterministic, $0 inference — a deployable
  non-oracle verifier. createClaimDecorator extracts a grounded claim for
  relevance-only workers.
- tests/claim-grounding.test.ts: 14 unit tests for the oracle + driver gate.
- tests/loops/claim-grounding-ab.test.ts: offline controlled floor (grounding
  admits 0/2 misattributions, relevance + no-verifier admit 2/2) + a live A/B
  arm that fetches real pages, plants one misattribution per topic, and reports
  caught-per-dollar across no-verifier / relevance / grounding using the #36
  RouterClient.usage() instrumentation. A cheap glm-5.2 smoke test runs before
  the expensive live run.
- docs/results/claim-grounding.md: live n=5 result — grounding catches 5/5 at $0,
  relevance catches 4/5 at $0.0157 (structurally blind to misattribution: it only
  sees page text, never the claim). The verifier-per-dollar comparison inverts
  the cost/quality finding on this band.
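The three grounding tiers named above (verbatim, normalized, ≥70% content-word overlap) could look roughly like the sketch below. It is an illustration under assumptions, not the code in src/claim-grounding.ts: `groundClaimSketch`, `withGroundingGate`, the stopword list, the ASCII-only normalization, and the `CitedProposal` shape are all hypothetical.

```typescript
type GroundingMode = "verbatim" | "normalized" | "overlap" | "none";
interface Grounding { grounded: boolean; mode: GroundingMode; overlap: number }

// Assumed stopword list; the real content-word filter is not shown in the PR.
const STOPWORDS = new Set(["the", "and", "for", "was", "with", "that", "this", "are", "from"]);

// Lowercase, replace non-alphanumerics with spaces, collapse whitespace (ASCII only).
function normalize(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

function contentWords(s: string): string[] {
  return normalize(s).split(" ").filter((w) => w.length > 2 && !STOPWORDS.has(w));
}

// Try the cheapest check first; fall through to word overlap last.
function groundClaimSketch(claim: string, pageText: string, threshold = 0.7): Grounding {
  const none: Grounding = { grounded: false, mode: "none", overlap: 0 };
  if (claim.trim().length === 0) return none;
  if (pageText.includes(claim)) return { grounded: true, mode: "verbatim", overlap: 1 };

  const normClaim = normalize(claim);
  if (normClaim.length > 0 && normalize(pageText).includes(normClaim)) {
    return { grounded: true, mode: "normalized", overlap: 1 };
  }

  const words = contentWords(claim);
  if (words.length === 0) return none;
  const pageWords = new Set(contentWords(pageText));
  const overlap = words.filter((w) => pageWords.has(w)).length / words.length;
  return overlap >= threshold ? { grounded: true, mode: "overlap", overlap } : { ...none, overlap };
}

// Hypothetical proposal shape carrying the cited claim and the fetched page text.
interface CitedProposal { citedClaim: string; pageText: string }
type Verifier = (p: CitedProposal) => Promise<boolean>;

// Gate on grounding first, so an ungrounded claim never costs an LLM call.
function withGroundingGate(relevance: Verifier): Verifier {
  return async (p) => {
    if (!groundClaimSketch(p.citedClaim, p.pageText).grounded) return false;
    return relevance(p);
  };
}
```

Because the gate runs before the relevance judge, a planted misattribution is rejected at $0 regardless of how on-topic the page is, which is the property the offline floor test checks.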
tangletools previously approved these changes Jun 25, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 8cf74b66

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T00:53:44Z

…only for the ambiguous tail

The cost/quality A/B (docs/results/cost-quality.md) showed the LLM relevance
verifier's cleanliness win is dominated by de-duplication and that an LLM check
only earns its dollar on the off-scope tail. This ships the driver that acts on
that finding and measures all three topologies on the cost/quality frontier.

- src/adaptive-driver.ts: createAdaptiveResearchDriver runs three stages per
  candidate, cheapest first. (1) DEDUP ($0): canonical URL (scheme, www,
  trailing slash, and tracking params stripped) plus a normalized-text content
  hash, checked against the round + KB index. (2) HEURISTIC TRIAGE ($0):
  host/title/length signals classify each unique survivor as keep
  (authoritative host + substantial body), drop (spam or thin), or ambiguous.
  (3) LLM ESCALATION ($): only ambiguous survivors reach the shipped
  createVerifyingResearchDriver relevance judge. stats() exposes which stage
  made each decision. Reuses sha256 + the relevance verifier; reinvents
  nothing.
- tests/loops/adaptive-ab.test.ts: pure-unit coverage of the deterministic
  stages; an offline controlled A/B proving adaptive escalates ONLY the ambiguous
  tail (6 candidates -> 2 LLM calls, matching full-LLM cleanliness on the
  controlled pool); and a live three-topology frontier (single / full-LLM /
  adaptive) gating the same fetched proposals, diffing per-arm cost with #36's
  RouterClient.usage(). A cheap glm-5.2 smoke test gates the expensive live run.
- docs/results/adaptive.md: live n=5 result. Adaptive cuts LLM calls 76%
  (25->6) and dollars 74% ($0.0261->$0.0068). HONEST quality reading: at 20
  admitted, adaptive sits BETWEEN single (25 admitted) and full-LLM (12
  admitted). It recovers the deterministic de-dup half of full-LLM's cleanliness
  (all 5 of its removals are duplicates) but NOT the relevance-judgment half:
  on this authoritative-host-heavy set the heuristic keeps arxiv/github
  survivors that full-LLM would drop (3 of 5 topics escalated zero LLM calls).
  A frontier point, not a free lunch; the doc states where the heuristic is weak.
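The cheapest-first staging above can be sketched as below. This is a rough illustration, not src/adaptive-driver.ts: the tracking-parameter pattern, the authoritative-host list, the spam pattern, the length thresholds, and the names `canonicalUrl`, `contentHash`, `triage`, and `adaptiveGate` are all assumptions, and the KB index is reduced to two in-memory sets.

```typescript
import { createHash } from "node:crypto";

interface Candidate { url: string; title: string; text: string }
type Triage = "keep" | "drop" | "ambiguous";

// Assumed tracking-parameter pattern; the real strip list is not shown in the PR.
const TRACKING_PARAM = /^(utm_.+|fbclid|gclid)$/i;

// Stage 1a: drop scheme, leading "www.", trailing slashes, and tracking params.
function canonicalUrl(raw: string): string {
  const u = new URL(raw);
  const host = u.hostname.toLowerCase().replace(/^www\./, "");
  const path = u.pathname.replace(/\/+$/, "");
  const params: string[] = [];
  u.searchParams.forEach((value, key) => {
    if (!TRACKING_PARAM.test(key)) params.push(`${key}=${value}`);
  });
  params.sort();
  return host + path + (params.length > 0 ? "?" + params.join("&") : "");
}

// Stage 1b: sha256 over lowercased, whitespace-collapsed text.
function contentHash(text: string): string {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Stage 2: assumed host list, spam pattern, and length thresholds.
const AUTHORITATIVE_HOSTS = new Set(["arxiv.org", "github.com"]);
const SPAM_TITLE = /\b(casino|buy now|free download)\b/i;

function triage(c: Candidate): Triage {
  const host = new URL(c.url).hostname.toLowerCase().replace(/^www\./, "");
  if (SPAM_TITLE.test(c.title) || c.text.length < 200) return "drop";
  if (AUTHORITATIVE_HOSTS.has(host) && c.text.length >= 1000) return "keep";
  return "ambiguous";
}

interface Stats { deduped: number; kept: number; dropped: number; escalated: number }

// Stage 3: only ambiguous survivors reach the paid judge.
async function adaptiveGate(
  candidates: Candidate[],
  judge: (c: Candidate) => Promise<boolean>,
): Promise<{ admitted: Candidate[]; stats: Stats }> {
  const seenUrls = new Set<string>();
  const seenHashes = new Set<string>();
  const stats: Stats = { deduped: 0, kept: 0, dropped: 0, escalated: 0 };
  const admitted: Candidate[] = [];
  for (const c of candidates) {
    const key = canonicalUrl(c.url);
    const hash = contentHash(c.text);
    if (seenUrls.has(key) || seenHashes.has(hash)) { stats.deduped += 1; continue; }
    seenUrls.add(key);
    seenHashes.add(hash);
    const verdict = triage(c);
    if (verdict === "drop") { stats.dropped += 1; continue; }
    if (verdict === "keep") { stats.kept += 1; admitted.push(c); continue; }
    stats.escalated += 1; // the only branch that spends inference
    if (await judge(c)) admitted.push(c);
  }
  return { admitted, stats };
}
```

The sketch also makes the reported weakness visible: any candidate on an assumed-authoritative host with a long body is admitted by `triage` without ever reaching the judge, so off-scope arxiv/github pages pass through.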
tangletools previously approved these changes Jun 25, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 5c8e7e07

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T01:10:24Z

… paper

Rewrite docs/two-agent-research-ab.md into the final short paper, folding in all
four results: the reproduced cleanliness A/B (Δ 2.33/2.67), the cost/quality
frontier (~5x dollars / 9x tokens / 3x latency, and the correction that
'equal passes' hid the per-source verifier LLM cost), the claim-grounding band
(misattribution caught 5/5 at $0 vs the relevance judge's 4/5 at ~$0.003/topic),
and the adaptive driver (76% fewer LLM calls, 74% cheaper, recovers the de-dup
half of cleanliness). Rewrite the 'simpler loop' section to reflect what was
actually built (adaptive driver + claim-grounding mode), with only the
AgentProfile worker still deferred.
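
As a quick consistency check, the headline percentages and the per-topic relevance cost can be recomputed from the raw figures quoted in the earlier commit messages. The snippet is only arithmetic on those quoted numbers; `reduction` is a hypothetical helper.

```typescript
// Inputs are the figures quoted in the commit messages of this PR.
const reduction = (before: number, after: number): number => (before - after) / before;

const llmCallCut = reduction(25, 6);          // full-LLM calls -> adaptive calls
const dollarCut = reduction(0.0261, 0.0068);  // full-LLM dollars -> adaptive dollars
const relevancePerTopic = 0.0157 / 5;         // relevance arm total over n=5 topics
const adaptiveRemoved = 25 - 20;              // single admitted minus adaptive admitted

console.log(`${Math.round(llmCallCut * 100)}% fewer LLM calls`); // 76% fewer LLM calls
console.log(`${Math.round(dollarCut * 100)}% cheaper`);          // 74% cheaper
console.log(`$${relevancePerTopic.toFixed(4)} per topic`);       // $0.0031 per topic
console.log(`${adaptiveRemoved} candidates removed`);            // 5 candidates removed
```

The $0.0157 total in the claim-grounding commit and the ~$0.003/topic figure here are therefore the same measurement at different granularity.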

@tangletools tangletools left a comment

✅ Auto-approved PR — 321e596a

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T01:14:50Z

@drewstone drewstone merged commit a137d7d into main Jun 25, 2026
1 check passed