Problem
We claim Evolve's recalled guidelines make agent sessions cheaper and shorter, but we have no defensible numbers. #6 already calls for a benchmark on AppWorld + a ReAct agent measuring step counts and pass rate before/after tip generation; this issue is the broader version of that ask: produce results that show how much recall reduces token cost, wall-clock time, and step count on a real benchmark, on at least one of the platforms Evolve supports (Claude Code, Codex, Bob, claw-code, …).
AppWorld is the obvious starting candidate (per #6), but the deliverable is results, not a specific benchmark choice. If a different public benchmark fits better — SWE-bench-verified slices, ToolBench, MultiHopRAG, BrowseComp, something else — pick that and justify briefly.
What "results" means here
For each (benchmark task or task family, platform) pair:
- Token cost — total, input vs. output, cache reads vs. creates.
- Wall-clock per session.
- Steps / turns per session.
- Task success rate.
- The specific guidelines recalled (so a reader can see what knowledge drove the savings).
- N trials per condition (with vs. without recall) so the numbers are defensible, not anecdotal.
Headline output: a table per platform, plus a short writeup explaining the gap on a representative task — including the tool calls each condition made, so the why is visible alongside the what.
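As a rough sketch of how the per-session metrics above could be recorded and rolled up into table rows: the `Trial` type, its field names, and `summarize` are hypothetical and do not come from the existing harness. Treating cache reads and creates as separate from input tokens is also an assumption that depends on how each platform reports usage.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trial:
    """One agent session. All field names here are hypothetical."""
    task_family: str
    platform: str
    recall: bool              # True if guideline recall was enabled
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_create_tokens: int
    wall_clock_s: float
    steps: int
    success: bool

    @property
    def total_tokens(self) -> int:
        # Assumption: the platform reports cache tokens separately from
        # input tokens, so the four counters can simply be summed.
        return (self.input_tokens + self.output_tokens
                + self.cache_read_tokens + self.cache_create_tokens)


def summarize(trials):
    """Group trials by (task family, platform, condition) and average them."""
    groups = {}
    for t in trials:
        groups.setdefault((t.task_family, t.platform, t.recall), []).append(t)

    rows = []
    for (family, platform, recall), ts in sorted(groups.items()):
        rows.append({
            "task_family": family,
            "platform": platform,
            "recall": recall,
            "n": len(ts),
            "mean_total_tokens": mean(t.total_tokens for t in ts),
            "mean_wall_clock_s": mean(t.wall_clock_s for t in ts),
            "mean_steps": mean(t.steps for t in ts),
            "success_rate": sum(t.success for t in ts) / len(ts),
        })
    return rows
```

Each returned row corresponds to one line of the per-platform table. The recalled guidelines and the tool-call traces would still need to be captured separately for the writeup.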
Scope
- At least one public benchmark, multiple distinct task families inside it.
- At least one Evolve-supported platform end-to-end. A second platform is a strong nice-to-have but not required to close.
- Whatever harness exists (e.g. tests/e2e/experiment_token_savings.py) is a starting point, not the deliverable. The deliverable is the numbers and the writeup.
Out of scope
- Building a long-lived measurement framework (separate issue if needed).
- Wiring this into CI as a regression check (separate issue).
- Multi-platform coverage beyond a second platform.
Related
Acceptance
- A writeup (issue comment, doc, or PR) with the comparison tables described above, on at least one benchmark and at least one platform.
- N ≥ 5 trials per condition.
- Concrete examples of recalled guidelines and the tool-call gap they removed.
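One possible way to make the trial-count criterion checkable when computing a headline number is sketched below. `relative_reduction` and `MIN_TRIALS` are hypothetical names, and reporting the gap as a fractional drop in the mean is an assumption about presentation, not something this issue prescribes.

```python
from statistics import mean

MIN_TRIALS = 5  # mirrors the acceptance criterion N >= 5 per condition


def relative_reduction(without_recall, with_recall):
    """Fractional drop in the mean of a metric (positive means recall saved).

    Both arguments are sequences of per-trial values for one metric,
    e.g. total tokens or step counts, for a single task family and platform.
    """
    for name, values in (("without_recall", without_recall),
                         ("with_recall", with_recall)):
        if len(values) < MIN_TRIALS:
            raise ValueError(
                f"{name} has {len(values)} trials, need at least {MIN_TRIALS}"
            )
    baseline = mean(without_recall)
    if baseline == 0:
        raise ValueError("baseline mean is zero; reduction is undefined")
    return (baseline - mean(with_recall)) / baseline
```

A single ratio hides trial-to-trial variance, so the writeup would probably want the spread (or the raw per-trial values) next to it.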