
XY-1155: Implement source-backed memory quality benchmark #349

Merged: yvette-carlisle merged 1 commit into main from y/elf-xy-1155 on Jul 3, 2026

@yvette-carlisle

Member

Summary

Implements the XY-1155 executable source-backed memory quality benchmark gate for ELF's final source-backed project memory program.

Changes:

  • adds source_backed_quality JSON report surface and Markdown rendering
  • adds validate-source-backed-quality so the benchmark command fails on non-pass source-backed quality gates
  • adds cargo make source-backed-memory-quality JSON -> validate -> Markdown workflow
  • adds a Context Pack activation/suppression fixture with structured routing decisions
  • hard-fails required non-pass scenarios, required job trap/hard-fail usage, cross-scope leaks, journal-only authority claims, and incorrect/missing Context Pack routing decisions
  • counts private_scope_leak as a scope violation for the source-backed quality gate
  • documents benchmark evidence and review disposition
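The hard-fail conditions listed above can be sketched as a single check over a parsed report. This is a minimal illustration in Python, not the PR's implementation: apart from result_state, every field name here (scenarios, required, cross_scope_leak_count, private_scope_leak_count, journal_only_authority_claim_count, routing_decisions, pack, expected, actual) is a hypothetical stand-in, and the required job trap/hard-fail usage check is left out because the description does not specify its shape.

```python
from typing import Any


def hard_fail_reasons(report: dict[str, Any]) -> list[str]:
    """Collect hard-fail reasons; an empty list means the gate passes.

    All field names are hypothetical, chosen only for this sketch.
    """
    reasons: list[str] = []

    # A required scenario in any state other than "pass" is a hard fail.
    for scenario in report.get("scenarios", []):
        if scenario.get("required") and scenario.get("result_state") != "pass":
            reasons.append(f"required scenario not passing: {scenario.get('id')}")

    # private_scope_leak is counted together with cross-scope leaks.
    leaks = report.get("cross_scope_leak_count", 0) + report.get(
        "private_scope_leak_count", 0
    )
    if leaks > 0:
        reasons.append(f"scope violations: {leaks}")

    # Any claim whose only authority is the journal is a hard fail.
    claims = report.get("journal_only_authority_claim_count", 0)
    if claims > 0:
        reasons.append(f"journal-only authority claims: {claims}")

    # Context Pack routing decisions must be present and match expectations.
    for decision in report.get("routing_decisions", []):
        if decision.get("actual") is None:
            reasons.append(f"missing routing decision: {decision.get('pack')}")
        elif decision["actual"] != decision.get("expected"):
            reasons.append(f"incorrect routing decision: {decision.get('pack')}")

    return reasons
```

Returning a list of reasons rather than a single boolean keeps each failure visible in the rendered report, which is one plausible way to support the review disposition noted above.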

Decodex/manual intervention

Decodex was attempted first for XY-1155. Manual intervention was required after repeated Decodex app-server timeouts and no-effective-diff loops, including a "Timed out while waiting for app-server output" error at apps/decodex/src/agent/json_rpc/connection.rs:287 and retained attention after attempts 2/3. This PR is the manual completion path and should be rebound into the Decodex review handoff.

Benchmark evidence

cargo make source-backed-memory-quality now runs:

  1. JSON generation
  2. validate-source-backed-quality
  3. Markdown publication
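The middle step of this workflow can be pictured as a small function that turns the generated JSON into a process exit code. The sketch below is in Python and is an assumption about the behavior, not the actual validate-source-backed-quality code: it assumes hard_fail_passed sits next to result_state under source_backed_quality, and that a non-zero exit code is what stops the workflow before Markdown publication.

```python
import json


def validate_exit_code(report_text: str) -> int:
    """Return 0 when the source-backed quality gate passes, 1 otherwise.

    The nesting of hard_fail_passed under source_backed_quality is assumed.
    """
    report = json.loads(report_text)
    quality = report.get("source_backed_quality", {})
    passed = (
        quality.get("result_state") == "pass"
        and quality.get("hard_fail_passed") is True
    )
    return 0 if passed else 1
```

A caller would typically pass the result to sys.exit so that a task runner treats a non-pass report as a failed step.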

Latest local artifact (tmp/source-backed-memory-quality/report.json):

  • source_backed_quality.result_state = pass
  • hard_fail_passed = true
  • expected evidence recall: 1.0
  • precision@5: 0.414
  • irrelevant context ratio: 0.0
  • source-ref coverage: 1.0
  • stale suppression: 1.0
  • correction persistence: 1.0
  • delete/tombstone suppression: 1.0
  • unsupported claim rate: 0.0
  • cross-scope leak count: 0
  • journal-only authority claim count: 0
  • Context Pack activation precision/recall/trace coverage: 1.0 / 1.0 / 1.0
  • Context Pack routing decisions: 7 total, 7 traced, 0 incorrect
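To make the Context Pack figures concrete, the sketch below derives activation precision, recall, trace coverage, and the incorrect-decision count from a list of structured routing decisions. It is written in Python and uses the standard precision/recall definitions; the RoutingDecision type, its fields, and the convention of reporting 1.0 for an empty denominator are assumptions for illustration, since the description does not state how the benchmark computes these values.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    """Hypothetical record of one Context Pack routing decision."""

    pack_id: str
    expected_active: bool  # what the fixture says should happen
    actual_active: bool    # what the system actually did
    traced: bool           # whether the decision left a trace


def activation_metrics(decisions: list[RoutingDecision]) -> dict[str, float]:
    """Compute activation precision/recall, trace coverage, and error count."""
    tp = sum(d.expected_active and d.actual_active for d in decisions)
    fp = sum((not d.expected_active) and d.actual_active for d in decisions)
    fn = sum(d.expected_active and (not d.actual_active) for d in decisions)
    traced = sum(d.traced for d in decisions)
    incorrect = sum(d.expected_active != d.actual_active for d in decisions)
    return {
        # Empty denominators are reported as 1.0 (an assumed convention).
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
        "trace_coverage": traced / len(decisions) if decisions else 1.0,
        "incorrect": float(incorrect),
    }
```

Under these definitions, a run in which every decision is traced and matches the fixture yields 1.0 for all three ratios and an incorrect count of 0.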

The aggregate report still preserves typed non-pass competitor/runtime boundaries and does not support an unqualified product-runtime leaderboard claim.

Review

A first read-only skeptic review blocked the change on three P1s: missing executable gate, insufficient hard-fail coverage, and tag-derived Context Pack metrics. The implementation was updated, and a second read-only skeptic review returned pass with no P0/P1 findings.

Validation

  • cargo make checks
  • cargo make source-backed-memory-quality
  • python3 scripts/check-docs.py
  • git diff --check

yvette-carlisle merged commit 96254a7 into main on Jul 3, 2026
12 checks passed
yvette-carlisle deleted the y/elf-xy-1155 branch on July 3, 2026 09:05