feat(treatment-gate): treatment-applied validity gate (generalized tool-fired check) #281

Merged
drewstone merged 1 commit into main from lift/treatment-fired-gate-from-benchmark
Jun 24, 2026

Conversation

@drewstone

Contributor

What

Lifts a treatment-applied validity gate into the substrate so any eval can verify a treatment arm actually exercised its treatment before crediting the result.

Generalizes "did the search tool fire?" to "did the required tool category fire?" — the tool matcher is a (toolName) => boolean parameter, so it works for any tool-treatment A/B (search, browser, a specific MCP, …), not just search.

  • Pure predicate gateTreatmentApplied({ toolHistogram, matches }) + adapters that consume existing substrate primitives (computeTraceMetrics, ToolSpan) — never re-derives the histogram.
  • classifyTreatment maps onto the existing exclusion-flag pattern (measurable | treatment-not-applied) — no new enum.
  • Fail-open: a run with zero captured tool telemetry is never quarantined (opt-in fail-closed).
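
A minimal sketch of how the predicate and classifier described above could fit together. The PR text does not show the real types, so everything beyond the names `gateTreatmentApplied`, `classifyTreatment`, `toolHistogram`, `matches`, and the two classification labels is an assumption: the histogram shape (tool name to call count), the `failClosed` option name, the `GateVerdict` shape, and the example tool names are all hypothetical.

```typescript
// Assumed shape: tool name -> number of calls observed in the run.
type ToolHistogram = Record<string, number>;
type ToolMatcher = (toolName: string) => boolean;

interface GateInput {
  toolHistogram: ToolHistogram;
  matches: ToolMatcher;
  failClosed?: boolean; // hypothetical name for the opt-in fail-closed switch
}

// Hypothetical verdict shape, for illustration only.
interface GateVerdict {
  applied: boolean;
  reason: "matched-tool-fired" | "no-matching-tool" | "no-telemetry";
}

function gateTreatmentApplied({
  toolHistogram,
  matches,
  failClosed = false,
}: GateInput): GateVerdict {
  // Keep only tools that were actually called at least once.
  const fired = Object.entries(toolHistogram).filter(([, count]) => count > 0);

  if (fired.length === 0) {
    // No captured tool telemetry: fail open unless the caller opted in to fail closed.
    return { applied: !failClosed, reason: "no-telemetry" };
  }

  const hit = fired.some(([name]) => matches(name));
  return { applied: hit, reason: hit ? "matched-tool-fired" : "no-matching-tool" };
}

// Maps the verdict onto the two labels named in the PR description.
function classifyTreatment(verdict: GateVerdict): "measurable" | "treatment-not-applied" {
  return verdict.applied ? "measurable" : "treatment-not-applied";
}

// Usage with a non-search matcher (tool names are made up for the example).
const isBrowserTool: ToolMatcher = (name) => name.startsWith("browser");
const verdict = gateTreatmentApplied({
  toolHistogram: { browser_open: 3, read_file: 1 },
  matches: isBrowserTool,
});
console.log(classifyTreatment(verdict)); // prints "measurable"
```

Because the matcher is an ordinary function parameter, nothing in the predicate names a particular tool, which is what lets one gate serve search, browser, or MCP treatments alike.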

Why

A treatment arm that never used its tool can't speak to that treatment; counting it pollutes the comparison. This makes the validity check a reusable substrate primitive instead of benchmark-local code.

Tests / checks

The unit test exercises a non-search (browser) matcher to prove generality. `tsc --noEmit` reports 0 errors, and vitest passes 9/9.

…rate

Generalize a benchmark's search-fired check into a domain-free manipulation
gate: did a tool-treatment's tool actually fire this run? Pure predicate over
computeTraceMetrics(spans).toolHistogram with a (toolName)=>boolean matcher
parameter (no domain literal). Mirrors gateRealness's fail-open/fail-closed
contract; an empty histogram fails open so a telemetry gap is never
quarantined. classifyTreatment maps the verdict onto the existing
objective-exclusion pattern (measurable vs treatment-not-applied), adding no
new classification enum. Protects the downstream paired A/B tests (mcnemar,
pairedRiskDifference) by filtering treatment-not-applied runs before them.
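
The filtering step in the last sentence can be sketched as follows. This is an illustration under assumptions, not the substrate's code: the `PairedRun` shape and the helper names `measurablePairs` and `discordantCounts` are hypothetical, and the discordant-pair tally stands in for whatever inputs the real `mcnemar` and `pairedRiskDifference` functions take, since their signatures are not shown here.

```typescript
type TreatmentClass = "measurable" | "treatment-not-applied";

// Hypothetical record for one task run under both arms.
interface PairedRun {
  taskId: string;
  controlPassed: boolean;
  treatmentPassed: boolean;
  treatmentClass: TreatmentClass;
}

// Drop pairs whose treatment arm never exercised its tool.
function measurablePairs(runs: PairedRun[]): PairedRun[] {
  return runs.filter((run) => run.treatmentClass === "measurable");
}

// Tally the discordant pairs that a paired test such as McNemar's is built on.
function discordantCounts(pairs: PairedRun[]): { b: number; c: number } {
  let b = 0; // control passed, treatment failed
  let c = 0; // control failed, treatment passed
  for (const pair of pairs) {
    if (pair.controlPassed && !pair.treatmentPassed) b += 1;
    if (!pair.controlPassed && pair.treatmentPassed) c += 1;
  }
  return { b, c };
}

// Made-up data: the third run never fired its tool, so it is excluded.
const runs: PairedRun[] = [
  { taskId: "t1", controlPassed: false, treatmentPassed: true, treatmentClass: "measurable" },
  { taskId: "t2", controlPassed: true, treatmentPassed: false, treatmentClass: "measurable" },
  { taskId: "t3", controlPassed: false, treatmentPassed: false, treatmentClass: "treatment-not-applied" },
  { taskId: "t4", controlPassed: false, treatmentPassed: true, treatmentClass: "measurable" },
];

const kept = measurablePairs(runs);
console.log(kept.length, discordantCounts(kept)); // prints 3 { b: 1, c: 2 }
```

Filtering before the paired test matters because a treatment-not-applied pair would otherwise count as a concordant failure, even though the tool never fired and the pair carries no information about its effect.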

@tangletools left a comment

Contributor

✅ Auto-approved PR — 7139541f

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T12:33:13Z

@drewstone drewstone merged commit 49d87ff into main Jun 24, 2026
1 check passed

2 participants