feat(treatment-gate): treatment-applied validity gate (generalized tool-fired check) by drewstone · Pull Request #281 · tangle-network/agent-eval

drewstone · 2026-06-24T12:33:06Z

What

Lifts a treatment-applied validity gate into the substrate so any eval can verify a treatment arm actually exercised its treatment before crediting the result.

Generalizes "did the search tool fire?" to "did the required tool category fire?" — the tool matcher is a (toolName) => boolean parameter, so it works for any tool-treatment A/B (search, browser, a specific MCP, …), not just search.
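To make the matcher parameter concrete, here is a minimal sketch of what a few `(toolName) => boolean` matchers might look like. The type alias `ToolMatcher`, the helper `anyToolMatches`, and every tool name below are assumptions made for illustration; the PR only states the shape of the parameter.

```typescript
// Hypothetical alias: the PR describes the matcher only as (toolName) => boolean.
type ToolMatcher = (toolName: string) => boolean;

// Illustrative matchers. The tool names and prefixes are invented for this sketch.
const isSearchTool: ToolMatcher = (name) => name === "web_search";
const isBrowserTool: ToolMatcher = (name) => name.startsWith("browser.");

// A matcher factory for "a specific MCP"; the naming convention is an assumption.
const isMcpTool =
  (server: string): ToolMatcher =>
  (name) =>
    name.startsWith(`mcp__${server}__`);

// Hypothetical helper: true if any tool name in the run satisfies the matcher.
function anyToolMatches(toolNames: string[], matches: ToolMatcher): boolean {
  return toolNames.some((name) => matches(name));
}

const firedTools = ["read_file", "browser.open", "mcp__github__list_issues"];
const searchFired = anyToolMatches(firedTools, isSearchTool);
const browserFired = anyToolMatches(firedTools, isBrowserTool);
const githubMcpFired = anyToolMatches(firedTools, isMcpTool("github"));
```

Because the gate only ever sees a predicate, swapping the treatment under test means swapping one function, with no domain literal baked into the check itself.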

- Pure predicate gateTreatmentApplied({ toolHistogram, matches }) plus adapters that consume existing substrate primitives (computeTraceMetrics, ToolSpan); the histogram is never re-derived.
- classifyTreatment maps the verdict onto the existing exclusion-flag pattern (measurable | treatment-not-applied); no new enum.
- Fail-open by default: a run with zero captured tool telemetry is never quarantined; fail-closed is opt-in.
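The three points above can be sketched as a small stand-in. This is not the package's implementation: the names `sketchGate`, `sketchClassify`, `GateVerdict`, the `failClosed` option, and the `reason` strings are all assumptions, and the histogram is assumed to be a plain tool-name-to-count record. Only the input shape `{ toolHistogram, matches }` and the two labels `measurable` and `treatment-not-applied` come from the PR description.

```typescript
type ToolHistogram = Record<string, number>; // assumed shape: tool name -> call count
type ToolMatcher = (toolName: string) => boolean;
type TreatmentClass = "measurable" | "treatment-not-applied";

interface GateInput {
  toolHistogram: ToolHistogram;
  matches: ToolMatcher;
  failClosed?: boolean; // hypothetical name for the opt-in fail-closed mode
}

interface GateVerdict {
  applied: boolean;
  matchedCalls: number;
  reason: "tool-fired" | "tool-not-fired" | "no-telemetry"; // hypothetical labels
}

// Pure predicate: no I/O, no mutation, same output for the same input.
function sketchGate({ toolHistogram, matches, failClosed = false }: GateInput): GateVerdict {
  const entries = Object.entries(toolHistogram);
  const totalCalls = entries.reduce((sum, [, count]) => sum + count, 0);

  // No captured tool telemetry: fail open unless the caller opted into fail-closed.
  if (totalCalls === 0) {
    return { applied: !failClosed, matchedCalls: 0, reason: "no-telemetry" };
  }

  const matchedCalls = entries
    .filter(([name]) => matches(name))
    .reduce((sum, [, count]) => sum + count, 0);

  return matchedCalls > 0
    ? { applied: true, matchedCalls, reason: "tool-fired" }
    : { applied: false, matchedCalls: 0, reason: "tool-not-fired" };
}

// Map the verdict onto the two existing labels instead of adding a new enum.
function sketchClassify(verdict: GateVerdict): TreatmentClass {
  return verdict.applied ? "measurable" : "treatment-not-applied";
}

const isBrowserTool: ToolMatcher = (name) => name.startsWith("browser.");
const fired = sketchGate({ toolHistogram: { "browser.open": 2, read_file: 5 }, matches: isBrowserTool });
const notFired = sketchGate({ toolHistogram: { read_file: 5 }, matches: isBrowserTool });
const noTelemetry = sketchGate({ toolHistogram: {}, matches: isBrowserTool });
const noTelemetryStrict = sketchGate({ toolHistogram: {}, matches: isBrowserTool, failClosed: true });
```

The design point worth noting is that "no telemetry" and "telemetry present but the tool never fired" are distinct cases: only the second is evidence that the treatment was not applied, which is why the first defaults to keeping the run.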

Why

A treatment arm that never used its tool can't speak to that treatment; counting it pollutes the comparison. This makes the validity check a reusable substrate primitive instead of benchmark-local code.

Tests / checks

A unit test exercises a non-search (browser) matcher to prove generality. tsc --noEmit reports 0 errors; vitest passes 9/9.

…rate Generalize a benchmark's search-fired check into a domain-free manipulation gate: did a tool-treatment's tool actually fire this run? Pure predicate over computeTraceMetrics(spans).toolHistogram with a (toolName)=>boolean matcher parameter (no domain literal). Mirrors gateRealness's fail-open/fail-closed contract; an empty histogram fails open so a telemetry gap is never quarantined. classifyTreatment maps the verdict onto the existing objective-exclusion pattern (measurable vs treatment-not-applied), adding no new classification enum. Protects the downstream paired A/B tests (mcnemar, pairedRiskDifference) by filtering treatment-not-applied runs before them.
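To illustrate why filtering matters for the paired tests, the sketch below drops treatment-not-applied pairs and then computes a paired risk difference and an uncorrected McNemar statistic from the discordant counts. It does not use the package's `mcnemar` or `pairedRiskDifference`, whose signatures are not shown in this PR; `PairedRun` and every function here are hypothetical stand-ins, and the data is invented.

```typescript
interface PairedRun {
  taskId: string;
  controlPassed: boolean;
  treatmentPassed: boolean;
  treatmentClass: "measurable" | "treatment-not-applied";
}

// Keep only pairs whose treatment arm actually exercised the treatment.
function keepMeasurable(pairs: PairedRun[]): PairedRun[] {
  return pairs.filter((p) => p.treatmentClass === "measurable");
}

// Discordant pairs: b = treatment passed and control failed, c = the reverse.
function discordantCounts(pairs: PairedRun[]): { b: number; c: number } {
  let b = 0;
  let c = 0;
  for (const p of pairs) {
    if (p.treatmentPassed && !p.controlPassed) b += 1;
    if (!p.treatmentPassed && p.controlPassed) c += 1;
  }
  return { b, c };
}

// Paired risk difference: P(treatment passes) - P(control passes) = (b - c) / n.
function riskDifferenceSketch(pairs: PairedRun[]): number | null {
  if (pairs.length === 0) return null;
  const { b, c } = discordantCounts(pairs);
  return (b - c) / pairs.length;
}

// McNemar chi-square statistic without continuity correction: (b - c)^2 / (b + c).
function mcnemarStatisticSketch(pairs: PairedRun[]): number | null {
  const { b, c } = discordantCounts(pairs);
  return b + c === 0 ? null : (b - c) ** 2 / (b + c);
}

const pairs: PairedRun[] = [
  { taskId: "t1", controlPassed: false, treatmentPassed: true, treatmentClass: "measurable" },
  { taskId: "t2", controlPassed: false, treatmentPassed: false, treatmentClass: "treatment-not-applied" },
  { taskId: "t3", controlPassed: true, treatmentPassed: true, treatmentClass: "measurable" },
  { taskId: "t4", controlPassed: true, treatmentPassed: false, treatmentClass: "measurable" },
  { taskId: "t5", controlPassed: false, treatmentPassed: true, treatmentClass: "measurable" },
];

const measurable = keepMeasurable(pairs);
const unfilteredDiff = riskDifferenceSketch(pairs); // 1 / 5 = 0.2
const filteredDiff = riskDifferenceSketch(measurable); // 1 / 4 = 0.25
const chiSquare = mcnemarStatisticSketch(measurable);
```

In this toy data the excluded pair is one where both arms failed, so it changes the denominator but not the discordant counts; a pair that never used the tool tells us nothing about the tool, and leaving it in dilutes the estimated effect.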

tangletools

✅ Auto-approved PR — `7139541f`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T12:33:13Z

tangletools approved these changes Jun 24, 2026

drewstone merged commit 49d87ff into main Jun 24, 2026
1 check passed

drewstone merged 1 commit into main from lift/treatment-fired-gate-from-benchmark
