feat(treatment-gate): treatment-applied validity gate (generalized tool-fired check) #281
Merged
…rate
Generalize a benchmark's search-fired check into a domain-free manipulation gate: did a tool-treatment's tool actually fire this run? Pure predicate over computeTraceMetrics(spans).toolHistogram with a (toolName)=>boolean matcher parameter (no domain literal). Mirrors gateRealness's fail-open/fail-closed contract; an empty histogram fails open so a telemetry gap is never quarantined. classifyTreatment maps the verdict onto the existing objective-exclusion pattern (measurable vs treatment-not-applied), adding no new classification enum. Protects the downstream paired A/B tests (mcnemar, pairedRiskDifference) by filtering treatment-not-applied runs before them.
tangletools (Contributor) approved these changes on Jun 24, 2026 and left a comment:
✅ Auto-approved PR — 7139541f
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T12:33:13Z
What
Lifts a treatment-applied validity gate into the substrate so any eval can verify a treatment arm actually exercised its treatment before crediting the result.
- Generalizes "did the search tool fire?" to "did the required tool category fire?": the tool matcher is a `(toolName) => boolean` parameter, so it works for any tool-treatment A/B (search, browser, a specific MCP, …), not just search.
- `gateTreatmentApplied({ toolHistogram, matches })` plus adapters that consume existing substrate primitives (`computeTraceMetrics`, `ToolSpan`); it never re-derives the histogram.
- `classifyTreatment` maps onto the existing exclusion-flag pattern (`measurable | treatment-not-applied`), with no new enum.
Why
A treatment arm that never used its tool can't speak to that treatment; counting it pollutes the comparison. This makes the validity check a reusable substrate primitive instead of benchmark-local code.
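To make the contract concrete, here is a minimal sketch of how the gate and its classifier might look. Only the names `gateTreatmentApplied`, `classifyTreatment`, `toolHistogram`, `matches`, and the two labels come from this PR; the histogram type, the verdict shape, the reason codes, and the `browser` prefix matcher are assumptions made for illustration, and treating an all-zero histogram as empty is also an assumption.

```typescript
// Hypothetical shapes: the PR names gateTreatmentApplied, classifyTreatment,
// toolHistogram and matches, but does not show their types or return values.
type ToolHistogram = Record<string, number>; // assumed: tool name -> call count
type ToolMatcher = (toolName: string) => boolean;

interface TreatmentVerdict {
  applied: boolean;
  // Assumed reason codes, for illustration only.
  reason: "tool-fired" | "tool-not-fired" | "no-telemetry";
}

function gateTreatmentApplied(input: {
  toolHistogram: ToolHistogram;
  matches: ToolMatcher;
}): TreatmentVerdict {
  // Names of tools that were called at least once in this run.
  const firedTools = Object.entries(input.toolHistogram)
    .filter(([, count]) => count > 0)
    .map(([name]) => name);

  // Fail open: no recorded tool calls is treated as a telemetry gap,
  // so the run is not quarantined.
  if (firedTools.length === 0) {
    return { applied: true, reason: "no-telemetry" };
  }

  // Fail closed: tools fired, but none belongs to the treatment category.
  return firedTools.some((name) => input.matches(name))
    ? { applied: true, reason: "tool-fired" }
    : { applied: false, reason: "tool-not-fired" };
}

// The two labels are the ones named in the PR; the alias exists only so
// this sketch type-checks on its own.
type TreatmentClassification = "measurable" | "treatment-not-applied";

function classifyTreatment(verdict: TreatmentVerdict): TreatmentClassification {
  return verdict.applied ? "measurable" : "treatment-not-applied";
}

// Usage with a hypothetical non-search matcher.
const isBrowserTool: ToolMatcher = (name) => name.startsWith("browser");

const classify = (toolHistogram: ToolHistogram): TreatmentClassification =>
  classifyTreatment(gateTreatmentApplied({ toolHistogram, matches: isBrowserTool }));

console.log(classify({ browser_open: 2, search: 1 })); // measurable
console.log(classify({ search: 3 }));                  // treatment-not-applied
console.log(classify({}));                             // measurable (fail open)
```

Because the matcher is passed in rather than hard-coded, the same predicate serves a search, browser, or MCP treatment without carrying any domain literal.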
Tests / checks
Unit test exercises a non-search (`browser`) matcher to prove generality; `tsc --noEmit` reports 0 errors, vitest 9/9.
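As a rough illustration of the downstream effect, the sketch below filters paired runs with a browser matcher before counting discordant pairs, which are the inputs a McNemar-style test consumes. The `PairedRun` record, its field names, `treatmentApplied`, `keepMeasurablePairs`, and `discordantCounts` are all hypothetical; the substrate's actual `mcnemar` and `pairedRiskDifference` signatures are not shown in this PR, so they are not called here.

```typescript
// Hypothetical record for one paired A/B item; field names are illustrative.
interface PairedRun {
  itemId: string;
  controlPassed: boolean;
  treatmentPassed: boolean;
  treatmentToolHistogram: Record<string, number>; // assumed: tool name -> call count
}

type ToolMatcher = (toolName: string) => boolean;

// Compact stand-in for the gate: empty histogram fails open,
// a non-empty histogram with no matching tool fails closed.
function treatmentApplied(
  histogram: Record<string, number>,
  matches: ToolMatcher,
): boolean {
  const fired = Object.entries(histogram)
    .filter(([, count]) => count > 0)
    .map(([name]) => name);
  if (fired.length === 0) return true;
  return fired.some((name) => matches(name));
}

// Drop pairs whose treatment arm never exercised the treatment tool.
function keepMeasurablePairs(pairs: PairedRun[], matches: ToolMatcher): PairedRun[] {
  return pairs.filter((p) => treatmentApplied(p.treatmentToolHistogram, matches));
}

// Count discordant pairs: items where exactly one arm passed.
function discordantCounts(pairs: PairedRun[]): { treatmentOnly: number; controlOnly: number } {
  let treatmentOnly = 0;
  let controlOnly = 0;
  for (const p of pairs) {
    if (p.treatmentPassed && !p.controlPassed) treatmentOnly += 1;
    if (p.controlPassed && !p.treatmentPassed) controlOnly += 1;
  }
  return { treatmentOnly, controlOnly };
}

const isBrowser: ToolMatcher = (name) => name.startsWith("browser");

const runs: PairedRun[] = [
  { itemId: "a", controlPassed: false, treatmentPassed: true, treatmentToolHistogram: { browser_navigate: 1 } },
  { itemId: "b", controlPassed: true, treatmentPassed: false, treatmentToolHistogram: { search: 2 } },
  { itemId: "c", controlPassed: true, treatmentPassed: false, treatmentToolHistogram: {} },
];

const measurable = keepMeasurablePairs(runs, isBrowser);
console.log(measurable.map((p) => p.itemId)); // [ 'a', 'c' ]
console.log(discordantCounts(measurable));    // { treatmentOnly: 1, controlOnly: 1 }
```

Run `b` is excluded because its treatment arm only called `search`; without the filter it would have counted against the browser treatment even though the treatment was never exercised. Run `c` is kept because an empty histogram is read as missing telemetry rather than as evidence that the tool did not fire.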