diff --git a/README.md b/README.md
index 93ab6b98..9ab8b251 100644
--- a/README.md
+++ b/README.md
@@ -246,12 +246,14 @@ provider-backed ELF evidence was required.
   replay command. Missing anchors remain explicit `not_requested` layers, so the panel
   improves debug ergonomics without turning untested or blocked layers into pass
   claims.
-- Agent Knowledge OS closeout after XY-1023: the June 20 closeout report publishes
-  the full product/scenario matrix for 19 tracked products and six Agent Knowledge OS
+- Historical Agent Knowledge OS closeout after XY-1023: the June 20 closeout report
+  publishes the full product/scenario matrix for 19 tracked products and six
+  then-named Agent Knowledge OS
   layers, after rerunning `cargo make real-world-memory` at 62 jobs, 55 pass,
-  0 wrong_result, and 7 typed blockers. ELF is the strongest measured integrated
-  Agent Knowledge OS product because all six ELF-owned layers have checked-in
-  evidence, but the report preserves qmd
+  0 wrong_result, and 7 typed blockers. Within that historical matrix, ELF had the
+  strongest measured integrated evidence because all six ELF-owned layers had
+  checked-in evidence, but the current product boundary is source-backed project
+  memory for AI agents rather than a generic Knowledge OS. The report preserves qmd
   retrieval/debug ergonomics, OpenViking trajectory, mem0/OpenMemory history and
   UI/export, Letta core/archive, graph/RAG temporal-citation, agentmemory/claude-mem
   capture/viewer, and VectifyAI PageIndex/OpenKB long-document knowledge-library
@@ -488,6 +490,9 @@ Detailed evidence and interpretation:
 - [P4 Production-Readiness Evidence Gates Report - June 23, 2026](docs/evidence/benchmarking/2026-06-23-p4-production-readiness-evidence-gates-report.md)
 - [P4 Quality Hardening and Productization Readiness Report - June 23, 2026](docs/evidence/benchmarking/2026-06-23-p4-quality-hardening-productization-readiness-report.md)
 - [Public Quantitative Competitor Scoreboard Report - June 27, 2026](docs/evidence/benchmarking/2026-06-27-public-quantitative-competitor-scoreboard-report.md)
+- [Source-Backed Memory Quality Benchmark Harness - July 3, 2026](docs/evidence/benchmarking/2026-07-03-source-backed-quality-benchmark-harness.md)
+- [qmd Candidate-Replay Comparability Gate - July 3, 2026](docs/evidence/benchmarking/2026-07-03-qmd-candidate-replay-comparability-gate.md)
+- [Final Source-Backed Project Memory Closeout Report - July 3, 2026](docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md)
 - [Live Baseline Benchmark Runbook](docs/runbook/benchmarking/live_baseline_benchmark.md)
 - [Single-User Production Runbook](docs/runbook/single_user_production.md)
 - Benchmark contract:
@@ -598,7 +603,7 @@ Detailed comparison, mechanism-level analysis, and source map:
 - [Derived Knowledge Page Follow-Up Research](docs/research/derived_knowledge_page_followup.md)
 - [Dreaming Product Surface Follow-Up Research](docs/research/dreaming_product_surface_followup.md)
 
-Latest real-world benchmark report: June 27, 2026. Latest external research refresh:
+Latest real-world benchmark report: July 3, 2026. Latest external research refresh:
 June 11, 2026; June 20 adds the Agent Knowledge OS Closeout Benchmark Report, the
 Graph Topic-Map Report - June 20, 2026, Knowledge Workspace Version-Diff Report -
 June 20, 2026, and the Live Knowledge-Page Rebuild/Lint Report - June 20,
@@ -614,7 +619,9 @@ Dreaming readback, the qmd debug-ergonomics Dreaming retest, the June 17
 competitor-strength closeout, and the June 16 temporal reconciliation, live
 consolidation self-check, proactive-brief, and scheduled-memory scoring evidence.
 June 27 adds the public quantitative competitor scoreboard report with row-level
-comparability gates and no universal leaderboard claim.
+comparability gates and no universal leaderboard claim. July 3 adds the
+source-backed memory quality benchmark harness, the qmd candidate-replay
+comparability gate, and the final source-backed project memory closeout report.
 
 ## Documentation
 
diff --git a/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports.rs b/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports.rs
index 10c588eb..cf3361ff 100644
--- a/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports.rs
+++ b/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports.rs
@@ -1,5 +1,6 @@
 mod closeout_reports_agent_knowledge;
 mod closeout_reports_competitor_strength;
+mod closeout_reports_final_source_backed;
 mod closeout_reports_graph_rag;
 mod closeout_reports_helpers;
 mod closeout_reports_openmemory;
diff --git a/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports_final_source_backed.rs b/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports_final_source_backed.rs
new file mode 100644
index 00000000..42eea78c
--- /dev/null
+++ b/apps/elf-eval/tests/real_world_job_benchmark/closeout_reports_final_source_backed.rs
@@ -0,0 +1,59 @@
+use std::fs;
+
+use color_eyre::Result;
+
+use crate::support;
+
+#[test]
+fn final_source_backed_closeout_report_covers_xy1157_review_surface() -> Result<()> {
+    let report = fs::read_to_string(
+        support::final_source_backed_project_memory_closeout_report_markdown_path()?,
+    )?;
+
+    for required in [
+        "Source Library",
+        "Memory Authority",
+        "Source-to-Memory Authority loop",
+        "Knowledge Workspace",
+        "Work Journal",
+        "Dreaming Review",
+        "Context Pack v1",
+        "Automatic Context Routing",
+        "Recall Engine",
+        "Recall Debug",
+        "benchmark validity",
+        "Competitor And Unsupported Claim Boundaries",
+        "Decodex Status Accuracy",
+        "`decodex status --json`",
+        "Any P0 or P1 finding in those areas remains a blocker",
+    ] {
+        assert!(report.contains(required), "missing closeout coverage for {required}");
+    }
+
+    Ok(())
+}
+
+#[test]
+fn final_source_backed_closeout_report_preserves_claim_boundaries_and_docs_links() -> Result<()> {
+    let report = fs::read_to_string(
+        support::final_source_backed_project_memory_closeout_report_markdown_path()?,
+    )?;
+    let index = fs::read_to_string(support::benchmarking_index_path()?)?;
+    let readme = fs::read_to_string(support::readme_path()?)?;
+
+    for boundary in [
+        "no universal leaderboard",
+        "no broad \"ELF beats every competitor\" claim",
+        "no private-corpus or provider-backed production quality claim",
+        "qmd still has a short local replay/debug ergonomics edge",
+        "Missing, blocked, incomplete, wrong-result, not-tested, public-proxy, local fixture",
+    ] {
+        assert!(report.contains(boundary), "missing claim boundary {boundary}");
+    }
+
+    assert!(index.contains("2026-07-03-final-source-backed-project-memory-closeout-report.md"));
+    assert!(readme.contains("Final Source-Backed Project Memory Closeout Report - July 3, 2026"));
+    assert!(readme.contains("Latest real-world benchmark report: July 3, 2026"));
+
+    Ok(())
+}
diff --git a/apps/elf-eval/tests/real_world_job_benchmark/qmd_debug_retest.rs b/apps/elf-eval/tests/real_world_job_benchmark/qmd_debug_retest.rs
index 8c19bafe..c2d68220 100644
--- a/apps/elf-eval/tests/real_world_job_benchmark/qmd_debug_retest.rs
+++ b/apps/elf-eval/tests/real_world_job_benchmark/qmd_debug_retest.rs
@@ -175,6 +175,6 @@ fn assert_qmd_debug_retest_markdown_and_indexes(
     );
     assert!(readme.contains("qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026"));
     assert!(readme.contains("Temporal and Trajectory Adapter Coverage Report - June 23, 2026"));
-    assert!(readme.contains("Latest real-world benchmark report: June 27, 2026"));
+    assert!(readme.contains("Latest real-world benchmark report: July 3, 2026"));
     assert!(readme.contains("keeps the qmd edge unchanged"));
 }
diff --git a/apps/elf-eval/tests/real_world_job_benchmark/support.rs b/apps/elf-eval/tests/real_world_job_benchmark/support.rs
index 0283e761..7c007938 100644
--- a/apps/elf-eval/tests/real_world_job_benchmark/support.rs
+++ b/apps/elf-eval/tests/real_world_job_benchmark/support.rs
@@ -31,6 +31,7 @@ pub(super) use self::{
     dreaming_competitor_strength_retest_report_markdown_path,
     dreaming_readiness_stage_ledger_json_path, dreaming_readiness_stage_ledger_markdown_path,
     dreaming_review_queue_report_json_path, dreaming_review_queue_report_markdown_path,
+    final_source_backed_project_memory_closeout_report_markdown_path,
     graph_rag_adapter_matrix_report_json_path, graph_rag_adapter_matrix_report_markdown_path,
     graph_rag_citation_navigation_promotion_report_json_path,
     graph_rag_citation_navigation_promotion_report_markdown_path,
diff --git a/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths.rs b/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths.rs
index 5459f51d..4fdc5888 100644
--- a/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths.rs
+++ b/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths.rs
@@ -9,6 +9,7 @@ pub(crate) use self::{
     competitor_strength_adoption_report_path, competitor_strength_matrix_path,
     dreaming_competitor_strength_retest_report_markdown_path,
     dreaming_readiness_stage_ledger_markdown_path, dreaming_review_queue_report_markdown_path,
+    final_source_backed_project_memory_closeout_report_markdown_path,
     graph_rag_adapter_matrix_report_markdown_path,
     graph_rag_citation_navigation_promotion_report_markdown_path,
     graph_topic_map_report_markdown_path, iteration_direction_report_path,
diff --git a/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths_markdown.rs b/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths_markdown.rs
index d32e85ef..56b6e9d0 100644
--- a/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths_markdown.rs
+++ b/apps/elf-eval/tests/real_world_job_benchmark/support/report_paths_markdown.rs
@@ -64,6 +64,11 @@ pub(crate) fn agent_knowledge_os_closeout_benchmark_report_markdown_path() -> Re
     benchmarking_path("2026-06-20-agent-knowledge-os-closeout-benchmark-report.md")
 }
 
+pub(crate) fn final_source_backed_project_memory_closeout_report_markdown_path() -> Result
+{
+    benchmarking_path("2026-07-03-final-source-backed-project-memory-closeout-report.md")
+}
+
 pub(crate) fn p2_knowledge_workspace_pageindex_openkb_closeout_report_markdown_path() -> Result
 {
     benchmarking_path("2026-06-22-p2-knowledge-workspace-pageindex-openkb-closeout-report.md")
diff --git a/docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md b/docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md
new file mode 100644
index 00000000..3309167b
--- /dev/null
+++ b/docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md
@@ -0,0 +1,222 @@
+---
+type: Evidence
+title: "Final Source-Backed Project Memory Closeout Report - July 3, 2026"
+description: "XY-1157 final review-readiness evidence for ELF as source-backed project memory for AI agents."
+resource: docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md
+status: active
+authority: evidence
+owner: benchmarking
+last_verified: 2026-07-03
+tags:
+  - docs
+  - evidence
+  - benchmarking
+  - source-backed-project-memory
+source_refs:
+  - https://linear.app/hack-ink/issue/XY-1157/run-independent-review-and-decodex-closeout-for-final-elf-memory-system
+code_refs:
+  - Makefile.toml
+  - makefiles/benchmark-memory-b.toml
+  - apps/elf-eval/src/bin/real_world_job_benchmark/source_backed_quality.rs
+  - apps/elf-eval/tests/real_world_job_benchmark/source_backed_quality.rs
+related:
+  - docs/spec/agent_memory_knowledge_system_v1.md
+  - docs/spec/system_context_pack_v1.md
+  - docs/spec/system_recall_debug_panel_v1.md
+  - docs/spec/system_work_journal_v1.md
+  - docs/spec/system_knowledge_pages_v1.md
+  - docs/spec/system_consolidation_proposals_v1.md
+  - docs/evidence/benchmarking/2026-07-03-source-backed-quality-benchmark-harness.md
+  - docs/evidence/benchmarking/2026-07-03-qmd-candidate-replay-comparability-gate.md
+  - docs/evidence/benchmarking/2026-06-27-public-quantitative-competitor-scoreboard-report.md
+  - docs/evidence/benchmarking/2026-06-23-p4-quality-hardening-productization-readiness-report.md
+drift_watch:
+  - docs/spec/agent_memory_knowledge_system_v1.md
+  - docs/evidence/benchmarking/
+  - Makefile.toml
+---
+# Final Source-Backed Project Memory Closeout Report - July 3, 2026
+
+Purpose: Record the XY-1157 implementation-to-validation-ready closeout evidence for
+ELF as open-source, source-backed project memory for AI agents.
+Status: evidence
+Read this when: You are reviewing final source-backed memory readiness, claim
+boundaries, independent review coverage, or next optimization tasks.
+Not this document: Low-level service API semantics, fixture schemas, or operational
+setup steps.
+
+## Scope
+
+This closeout is scoped to the accepted XY-1150 product direction and generated issue
+XY-1157. It does not rename ELF into a generic Knowledge OS, broad RAG platform,
+wiki compiler, hosted memory SDK, graph database, Notion clone, or document-search
+replacement.
+
+The final source-backed project memory surface is:
+
+- Source Library
+- Memory Authority
+- Source-to-Memory Authority loop
+- Knowledge Workspace
+- Work Journal
+- Dreaming Review
+- Context Pack v1
+- Automatic Context Routing
+- Recall Engine
+- Recall Debug
+- benchmark harness and comparison evidence
+
+## Changed Implementation And Evidence
+
+XY-1151 through XY-1156 left the tree with the following checked-in implementation
+and report evidence:
+
+| Area | Evidence | Current result |
+| --- | --- | --- |
+| Product boundary | `docs/spec/agent_memory_knowledge_system_v1.md` | ELF is explicitly scoped as source-backed project memory for AI agents, with non-goals against generic Knowledge OS/RAG positioning. |
+| Source Library and Memory Authority | `docs/evidence/benchmarking/2026-06-22-p1-memory-authority-closeout-report.md` | Source capture, memory candidate approval, recall/debug, stale suppression, correction, and rollback are covered by the P1 closeout slice. |
+| Knowledge Workspace | `docs/evidence/benchmarking/2026-06-22-p2-knowledge-workspace-pageindex-openkb-closeout-report.md` | Derived pages, citations, stale-source lint, version diffs, and changed-source watch/rebuild are evidenced without promoting pages to authoritative memory. |
+| Work Journal | `apps/elf-eval/fixtures/real_world_memory/work_continuity/` and `docs/spec/system_work_journal_v1.md` | Journal readback supports continuity while journal-only facts remain non-authoritative unless promoted through memory authority. |
+| Dreaming Review | `docs/evidence/benchmarking/2026-06-20-dreaming-review-queue-report.md` | Proposals expose source refs, affected refs, lint, diff, policy, and review audit; source mutation remains disallowed. |
+| Context Pack v1 and routing | `docs/spec/system_context_pack_v1.md` and `docs/evidence/benchmarking/2026-07-03-source-backed-quality-benchmark-harness.md` | Packs are ephemeral read-time transport artifacts with cited items, routing traces, privacy boundaries, and activation/suppression metrics. |
+| Recall Engine and Recall Debug | `docs/evidence/benchmarking/2026-06-20-recall-debug-panel-report.md` | Recall/debug reports selected, dropped, stale, blocked, and not-requested context with source refs, replay aids, and authority labels. |
+| Benchmark harness | `cargo make source-backed-memory-quality` | The source-backed quality gate validates required scenarios, hard-fail counters, Context Pack routing decisions, latency, and cost. |
+| qmd comparability | `docs/evidence/benchmarking/2026-07-03-qmd-candidate-replay-comparability-gate.md` | qmd candidate replay may be compared only when same-corpus mapping, held-out/leakage audit, replay rows, digest, and product commit gates pass. |
+
+## Benchmark Metrics
+
+The current source-backed quality report records
+`source_backed_quality.result_state = "pass"` for the executable fixture/product
+runtime gate. Its published July 3 run reports:
+
+| Metric | Value |
+| --- | --- |
+| expected evidence recall | `1.0` |
+| precision@5 | `0.414` |
+| irrelevant context ratio | `0.0` |
+| source-ref coverage | `1.0` |
+| stale suppression rate | `1.0` |
+| correction persistence rate | `1.0` |
+| delete/tombstone suppression rate | `1.0` |
+| unsupported claim rate | `0.0` |
+| cross-scope leak count | `0` |
+| journal-only authority claim count | `0` |
+| Context Pack activation precision | `1.0` |
+| Context Pack activation recall | `1.0` |
+| activation trace coverage | `1.0` |
+| mean latency | `2.864ms` |
+
+The benchmark preserves typed non-pass evidence elsewhere in the aggregate reports.
+Missing, blocked, incomplete, wrong-result, not-tested, public-proxy, local fixture,
+or reference-only evidence is not treated as a pass.
+
+## Independent Review Coverage
+
+The final review contract for XY-1157 must explicitly cover:
+
+- Source Library source capture, excerpt hydration, lifecycle, and no implicit memory
+  promotion.
+- Memory Authority note/core-block history, policy decisions, provenance, correction,
+  rollback, active-only recall, and source-of-truth boundaries.
+- Source-to-Memory loop proposal, approval, promotion, correction, rollback, and
+  audit transitions.
+- Knowledge Workspace citation, lint, stale-source, version-diff, and derived-only
+  boundaries.
+- Work Journal continuity readback and journal-only non-authority boundaries.
+- Dreaming Review proposal queue, unsupported-claim lint, source mutation
+  prohibition, and explicit review actions.
+- Context Pack v1 read-time-only assembly, citations, scope/lifecycle eligibility,
+  privacy-safe debug output, and no shadow memory.
+- Automatic Context Routing rationale, layer selection, suppression, stale handling,
+  disabled layers, blocked layers, and pinned-ineligible behavior.
+- Recall Engine typed authority/freshness labels, authoritative revalidation, and
+  non-pass context handling.
+- Recall Debug selected/dropped/stale/blocked/not-requested evidence, replay aids,
+  and privacy boundaries.
+- benchmark validity, including required scenario coverage, hard-fail counters,
+  typed non-pass preservation, qmd replay comparability gates, and no unqualified
+  leaderboard claims.
+- Decodex lifecycle/status accuracy, including the fact that `In Review` is a
+  PR-backed handoff state and not phase acceptance by itself.
+
+Any P0 or P1 finding in those areas remains a blocker for final issue completion.
+
+## Strengths
+
+ELF is strongest in the checked-in evidence on:
+
+- evidence-linked memory writes;
+- deterministic memory authority and policy decisions;
+- source-to-memory promotion, correction, and rollback;
+- Postgres source-of-truth plus rebuildable derived indexes;
+- cited derived knowledge and reviewable proposal surfaces;
+- Context Pack and Recall Debug authority labels, traces, and privacy boundaries;
+- executable source-backed quality benchmarks with hard-fail counters.
+
+These strengths are source-backed project memory claims, not broad product-market or
+generic RAG superiority claims.
+
+## Competitor And Unsupported Claim Boundaries
+
+Competitor strengths remain optimization inputs:
+
+- qmd still has a short local replay/debug ergonomics edge unless ELF emits
+  comparable replay artifacts for the exact same claim.
+- PageIndex/OpenKB tree/wiki artifacts remain reference or typed non-pass until
+  same-corpus source-id-mapped outputs exist.
+- mem0/OpenMemory history, hosted ecosystem, and UI/export surfaces remain separate
+  from local SDK or fixture evidence.
+- Letta core/archive parity remains blocked until exported core block and archival
+  readback artifacts map to ELF source ids.
+- Graphiti/Zep and graph/RAG citation/navigation strengths remain typed blockers or
+  non-comparable rows unless contained same-corpus product-runtime artifacts exist.
+- OpenViking context trajectory remains blocked until staged, hierarchy, and
+  recursive/context expansion artifacts are available.
+
+Unsupported claims for this closeout:
+
+- no universal leaderboard;
+- no broad "ELF beats every competitor" claim;
+- no private-corpus or provider-backed production quality claim from local fixture or
+  public-proxy evidence;
+- no hosted managed-memory, UI/export, graph/RAG, core/archive, PageIndex/OpenKB, or
+  OpenViking parity claim without comparable product-runtime evidence.
+
+## Decodex Status Accuracy
+
+`decodex status --json` is part of the XY-1157 validation evidence. A degraded
+operator snapshot or control-plane environment warning is a Decodex runtime/status
+condition, not proof that the product implementation or benchmark evidence failed.
+The final issue state must still be recorded through Decodex tracker checkpoints,
+repo-native validation, independent review, and PR-backed handoff.
+
+Do not mark XY-1157 complete while any P0/P1 review finding remains unresolved, while
+required validation evidence is missing, or while Decodex lifecycle/status evidence
+contradicts the closeout claim.
+
+## Next Optimization Tasks
+
+Recommended follow-up work remains optimization, not closeout-blocking evidence:
+
+- improve qmd-style local replay/debug ergonomics without weakening source
+  authority;
+- materialize PageIndex/OpenKB same-corpus tree/wiki artifacts;
+- materialize OpenViking staged trajectory, hierarchy, and recursive expansion
+  outputs;
+- deepen mem0/OpenMemory UI/export and hosted-boundary evidence;
+- deepen Letta core/archive export/readback evidence;
+- add contained graph/RAG citation/navigation adapters;
+- run private-corpus and provider-backed production quality gates only with
+  operator-owned manifests and credentials.
+
+## Current Validation For This Closeout
+
+The implement-to-validation-ready lane must run at least:
+
+- `cargo make check-docs`
+- `cargo test -p elf-eval --test real_world_job_benchmark source_backed_quality`
+- `cargo test -p elf-eval --test real_world_job_benchmark closeout_reports`
+- `decodex status --json`
+
+Before PR handoff or any push that refreshes a PR head, run the registered Decodex
+repo gate: `cargo make fmt`, `cargo make lint-fix`, then `cargo make checks`.
diff --git a/docs/evidence/benchmarking/index.md b/docs/evidence/benchmarking/index.md
index 66cb1d65..6e17cb15 100644
--- a/docs/evidence/benchmarking/index.md
+++ b/docs/evidence/benchmarking/index.md
@@ -61,3 +61,4 @@ Routes to: Benchmarking evidence concepts under `docs/evidence/benchmarking/`.
 - `2026-06-27-public-quantitative-competitor-scoreboard-report.md`: Public Quantitative Competitor Scoreboard Report - June 27, 2026; publishes `elf.quality_scoreboard/v1` rows for 20 tracked products, including VectifyAI PageIndex, VectifyAI OpenKB, and plastic-labs Honcho typed rows. Rows expose recall@5, precision@5, MRR, nDCG, lifecycle, source-ref, and latency metrics where measured, and typed blocker, source-provenance, and next-evidence metadata where comparable metrics are not yet available, while preserving zero comparable product-runtime pass claims until held-out, leakage-audited, digest-identified runtime evidence exists.
 - `2026-07-03-source-backed-quality-benchmark-harness.md`: Source-Backed Memory Quality Benchmark Harness - July 3, 2026; adds `cargo make source-backed-memory-quality` and the `elf.source_backed_memory_quality_benchmark/v1` report surface for expected evidence recall, precision@5, source-ref coverage, stale/correction/delete behavior, Context Pack activation, Recall Debug privacy, hard-fail leak counters, latency, and typed scenario coverage.
 - `2026-07-03-qmd-candidate-replay-comparability-gate.md`: qmd Candidate-Replay Comparability Gate - July 3, 2026; adds the Docker-aggregate `qmd-candidate-replay-comparability-gate.json` artifact and the qmd-specific typed pass/blocked gate for complete source-id mapping, held-out/leakage audit evidence, passing per-query replay rows, aggregate replay consistency, row-bound runner image digest plus product commit provenance, and no unqualified leaderboard claim.
+- `2026-07-03-final-source-backed-project-memory-closeout-report.md`: Final Source-Backed Project Memory Closeout Report - July 3, 2026; records XY-1157 review-readiness evidence for Source Library, Memory Authority, Source-to-Memory loop, Knowledge Workspace, Work Journal, Dreaming Review, Context Pack v1, Automatic Context Routing, Recall Engine, Recall Debug, benchmark validity, competitor claim honesty, Decodex lifecycle/status accuracy, unsupported claims, and next optimization tasks.
diff --git a/docs/log.md b/docs/log.md
index 6f7dc6d5..39adbef5 100644
--- a/docs/log.md
+++ b/docs/log.md
@@ -191,3 +191,8 @@ logs.
   candidate-replay comparison. Digest and product commit evidence must be bound to
   the same matching freshness row. The gate remains typed pass/blocked and never
   permits an unqualified product leaderboard claim.
+- Added the XY-1157 final source-backed project memory closeout report, tying
+  Source Library, Memory Authority, Source-to-Memory loop, Knowledge Workspace, Work
+  Journal, Dreaming Review, Context Pack, routing, Recall Engine, Recall Debug,
+  benchmark validity, competitor claim boundaries, and Decodex lifecycle/status
+  accuracy into one review-readiness evidence surface.
diff --git a/docs/spec/agent_memory_knowledge_system_v1.md b/docs/spec/agent_memory_knowledge_system_v1.md
index 495f0b1f..7ab7fafc 100644
--- a/docs/spec/agent_memory_knowledge_system_v1.md
+++ b/docs/spec/agent_memory_knowledge_system_v1.md
@@ -29,9 +29,15 @@ related:
   - docs/spec/system_graph_memory_postgres_v1.md
   - docs/spec/system_memory_summary_v1.md
   - docs/spec/system_work_journal_v1.md
+  - docs/evidence/benchmarking/2026-07-03-source-backed-quality-benchmark-harness.md
+  - docs/evidence/benchmarking/2026-07-03-qmd-candidate-replay-comparability-gate.md
+  - docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md
 drift_watch:
   - docs/spec/agent_memory_knowledge_system_v1.md
   - docs/evidence/benchmarking/2026-06-23-p4-quality-hardening-productization-readiness-report.md
+  - docs/evidence/benchmarking/2026-07-03-source-backed-quality-benchmark-harness.md
+  - docs/evidence/benchmarking/2026-07-03-qmd-candidate-replay-comparability-gate.md
+  - docs/evidence/benchmarking/2026-07-03-final-source-backed-project-memory-closeout-report.md
   - docs/runbook/benchmarking/real_world_agent_memory_benchmark.md
   - Makefile.toml
 ---