XY-1156: Implement qmd candidate-replay comparability gate #350

Merged
yvette-carlisle merged 1 commit into main from y/elf-xy-1156 on Jul 3, 2026

@yvette-carlisle (Member)

Summary

Implements the XY-1156 qmd candidate-replay comparability gate for the Docker-owned quantitative benchmark path.

Changes

  • Adds scripts/materialize-qmd-candidate-replay-gate.py with schema elf.qmd_candidate_replay_comparability_gate/v1.
  • Wires scripts/real-world-quantitative-docker.sh to emit qmd-candidate-replay-comparability-gate.json after the freshness manifest.
  • Adds Rust integration tests for pass and blocker behavior, including the P1 false-positive cases found by skeptic review.
  • Documents the gate contract and claim boundary in the quantitative benchmark spec and benchmarking evidence docs.
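For orientation, a materializer of this kind might write its verdict roughly as sketched here. Only the schema string and the output file name come from this PR; the `status` and `blockers` fields and the `write_gate` helper are hypothetical and may not match the real script.

```python
import json
import tempfile
from pathlib import Path

SCHEMA = "elf.qmd_candidate_replay_comparability_gate/v1"  # from this PR
GATE_FILE = "qmd-candidate-replay-comparability-gate.json"  # from this PR


def write_gate(out_dir: Path, blockers: list) -> Path:
    """Write a gate artifact; every field except `schema` is an assumption."""
    payload = {
        "schema": SCHEMA,
        "status": "blocked" if blockers else "pass",  # hypothetical field
        "blockers": sorted(blockers),  # hypothetical field
    }
    path = out_dir / GATE_FILE
    # Sorted keys and a trailing newline keep the artifact diff-friendly.
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = write_gate(Path(tmp), ["freshness"])
        print(written.read_text())
```

Deriving the status from the blocker list, rather than storing it separately, means the two cannot disagree in the emitted file.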

Gate Coverage

The gate blocks unless qmd has:

  • typed product result state and measured pass state;
  • every qmd product/per-query row mapped to the manifest corpus_id;
  • held-out and leakage-audited status plus an audit manifest id;
  • passing per-query replay rows with positive candidate and expected-relevance counts plus explicit qrels;
  • aggregate replay fields that match the per-query replay rows;
  • a matching freshness row with both valid runner image digest and product commit;
  • no unqualified leaderboard claim.
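
The conditions above can be read as a list of independent blocker checks. What follows is a minimal sketch under assumed input shapes: every field name (`result_state`, `per_query`, `qrels`, and so on), the blocker labels, the choice of aggregate fields, and the digest and commit patterns are hypothetical illustrations, not the schema used by the actual script.

```python
import re
from typing import Optional

# Assumed formats; the real validation rules are not shown in this PR.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")
COMMIT_RE = re.compile(r"^[0-9a-f]{7,40}$")


def _positive_int(value) -> bool:
    """True only for real integers greater than zero (bools excluded)."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def gate_blockers(manifest: dict, qmd: dict, freshness: Optional[dict]) -> list:
    """Return blocker labels; an empty list means the gate passes."""
    blockers = []
    rows = qmd.get("per_query", [])

    # Typed product result state and measured pass state.
    if not isinstance(qmd.get("result_state"), str) or qmd.get("measured_pass") is not True:
        blockers.append("result_state")

    # Every row must map to the manifest corpus_id.
    corpus_id = manifest.get("corpus_id")
    if not rows or corpus_id is None or any(r.get("corpus_id") != corpus_id for r in rows):
        blockers.append("corpus_mapping")

    # Held-out and leakage-audited status plus an audit manifest id.
    if not (qmd.get("held_out") is True and qmd.get("leakage_audited") is True
            and qmd.get("audit_manifest_id")):
        blockers.append("audit")

    # Passing rows with positive counts and explicit qrels.
    if not rows or not all(
        r.get("passed") is True
        and _positive_int(r.get("candidate_count"))
        and _positive_int(r.get("expected_relevant_count"))
        and bool(r.get("qrels"))
        for r in rows
    ):
        blockers.append("per_query_replay")

    # Aggregate fields must be recomputable from the per-query rows.
    counts = [r.get("candidate_count") for r in rows]
    expected = {
        "query_count": len(rows),
        "candidate_count": sum(c for c in counts if _positive_int(c)),
    }
    aggregate = qmd.get("aggregate", {})
    if any(aggregate.get(key) != value for key, value in expected.items()):
        blockers.append("aggregate_mismatch")

    # Freshness row needs both a valid digest and a valid commit.
    if (freshness is None
            or not DIGEST_RE.match(str(freshness.get("runner_image_digest", "")))
            or not COMMIT_RE.match(str(freshness.get("product_commit", "")))):
        blockers.append("freshness")

    if qmd.get("unqualified_leaderboard_claim"):
        blockers.append("leaderboard_claim")
    return blockers


if __name__ == "__main__":
    row = {"corpus_id": "corpus-a", "passed": True, "candidate_count": 5,
           "expected_relevant_count": 2, "qrels": {"doc-1": 1}}
    qmd = {"result_state": "ok", "measured_pass": True, "held_out": True,
           "leakage_audited": True, "audit_manifest_id": "audit-1",
           "per_query": [row], "aggregate": {"query_count": 1, "candidate_count": 5}}
    fresh = {"runner_image_digest": "sha256:" + "a" * 64, "product_commit": "0123abc"}
    print(gate_blockers({"corpus_id": "corpus-a"}, qmd, fresh))  # prints []
```

Collecting every blocker instead of returning at the first failure makes false positives easier to audit, since a reviewer sees all unmet conditions in one run.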

Decodex / Manual Intervention

Decodex attempt xy-1156-attempt-1-1783069646 failed with no_effective_diff after hitting the app-server timeout path. A second Decodex run was skipped because of Linear connector backoff. The change was then completed by manual intervention under the goal authorization; this PR keeps the Decodex issue branch and the XY-1156 authority.

Validation

  • python3 -m py_compile scripts/materialize-qmd-candidate-replay-gate.py
  • cargo test -p elf-eval --test real_world_job_benchmark qmd_candidate_replay -- --nocapture: 6 passed
  • python3 scripts/check-docs.py
  • git diff --check
  • cargo make checks (check, clippy, vstyle, and nextest): 437 passed, 92 skipped

Review

  • Initial read-only skeptic: support_with_changes, 4 P1 false-positive findings.
  • Follow-up read-only skeptic: pass, no remaining P0/P1 blockers.

Claim Boundary

This supports only a qualified qmd same-corpus candidate-replay comparability claim when the gate passes. It does not permit a broad product leaderboard or ELF-vs-qmd win claim.
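
A downstream report generator could enforce this boundary along the following lines. This is a sketch only: the claim labels and the `claim_permitted` helper are hypothetical and do not appear in this PR.

```python
# Hypothetical claim labels, invented for illustration.
QUALIFIED_CLAIM = "qmd_same_corpus_candidate_replay_comparability"
BROAD_CLAIMS = {"product_leaderboard", "elf_vs_qmd_win"}


def claim_permitted(claim: str, gate_passed: bool) -> bool:
    """Allow only the qualified claim, and only when the gate passed."""
    if claim in BROAD_CLAIMS:
        return False  # broad claims are never unlocked by this gate
    return gate_passed and claim == QUALIFIED_CLAIM


if __name__ == "__main__":
    print(claim_permitted(QUALIFIED_CLAIM, gate_passed=True))
    print(claim_permitted("elf_vs_qmd_win", gate_passed=True))
```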

yvette-carlisle merged commit 853abc3 into main on Jul 3, 2026
12 checks passed
yvette-carlisle deleted the y/elf-xy-1156 branch on July 3, 2026 at 09:47