Get token-cost / step-reduction results from recalled guidelines on a real benchmark (e.g. AppWorld) #260

## Problem

We claim Evolve's recalled guidelines make agent sessions cheaper and shorter, but we have no defensible numbers. #6 already calls for a benchmark on AppWorld + a ReAct agent measuring step counts and pass rate before/after tip generation. This issue is the broader version of that ask: **produce results that show how much recall reduces token cost, wall-clock time, and steps on a real benchmark**, on at least one of the platforms Evolve supports (Claude Code, Codex, Bob, claw-code, …).

AppWorld is the obvious starting candidate (per #6), but the deliverable is *results*, not a specific benchmark choice. If a different public benchmark fits better — SWE-bench-verified slices, ToolBench, MultiHopRAG, BrowseComp, something else — pick that and justify briefly.

## What "results" means here

For each (benchmark task or task family, platform) pair:

- **Token cost**: total, input vs. output, cache reads vs. cache creation.
- **Wall-clock time** per session.
- **Steps / turns** per session.
- **Task success rate**.
- The **specific guidelines** recalled (so a reader can see *what knowledge* drove the savings).
- N trials per condition (with vs. without recall; N ≥ 5, see Acceptance) so the numbers are defensible, not anecdotal.
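
To make the list above concrete, here is a minimal sketch of a per-session record and a per-condition summary. Every name in it (`TrialRecord`, `summarize`, the field names, the `"recall"` / `"baseline"` labels) is hypothetical and not taken from Evolve or the existing harness. It also assumes the four token counters are disjoint, which may not hold on every platform.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TrialRecord:
    """One agent session. Field names are illustrative, not Evolve's schema."""
    task_family: str
    platform: str
    condition: str               # assumed labels: "recall" or "baseline"
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_creation_tokens: int
    wall_clock_s: float
    steps: int
    success: bool
    recalled_guidelines: tuple = ()   # guideline texts or ids; empty for baseline

    @property
    def total_tokens(self) -> int:
        # Assumption: the four counters do not overlap. Adjust if the platform
        # already counts cache tokens inside input_tokens.
        return (self.input_tokens + self.output_tokens
                + self.cache_read_tokens + self.cache_creation_tokens)


def summarize(records):
    """Group records by condition and average the metrics listed in this issue."""
    by_condition = {}
    for r in records:
        by_condition.setdefault(r.condition, []).append(r)
    return {
        cond: {
            "n": len(rs),
            "mean_total_tokens": mean(r.total_tokens for r in rs),
            "mean_wall_clock_s": mean(r.wall_clock_s for r in rs),
            "mean_steps": mean(r.steps for r in rs),
            "success_rate": sum(r.success for r in rs) / len(rs),
        }
        for cond, rs in by_condition.items()
    }
```

In practice `summarize` would be called once per (task family, platform) pair, so that each pair gets its own with/without comparison.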

Headline output: a table per platform, plus a short writeup explaining the gap on a representative task — including the tool calls each condition made, so the *why* is visible alongside the *what*.
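
One possible shape for the per-platform table is sketched below as a small Markdown renderer. It assumes per-condition summaries stored as plain dicts with the hypothetical keys `mean_total_tokens`, `mean_wall_clock_s`, `mean_steps`, and `success_rate`. The column layout is a suggestion, not a required format.

```python
# Hypothetical metric keys and display labels; rename to match the real harness.
METRICS = [
    ("mean_total_tokens", "Tokens"),
    ("mean_wall_clock_s", "Wall-clock (s)"),
    ("mean_steps", "Steps"),
]


def pct_change(baseline, recall):
    """Relative change of recall vs. baseline in percent (negative = reduction).

    Assumes baseline is non-zero.
    """
    return (recall - baseline) / baseline * 100.0


def render_table(platform, baseline, recall):
    """Return a Markdown table comparing the two conditions for one platform."""
    lines = [
        f"| {platform} | Baseline | With recall | Change |",
        "| --- | ---: | ---: | ---: |",
    ]
    for key, label in METRICS:
        b, r = baseline[key], recall[key]
        lines.append(f"| {label} | {b:.1f} | {r:.1f} | {pct_change(b, r):+.1f}% |")
    # Success rate is compared in percentage points rather than relative percent.
    pp = (recall["success_rate"] - baseline["success_rate"]) * 100.0
    lines.append(
        f"| Success rate | {baseline['success_rate']:.0%} "
        f"| {recall['success_rate']:.0%} | {pp:+.0f} pp |"
    )
    return "\n".join(lines)
```

Reporting the success-rate change in percentage points keeps it from being confused with the relative reductions in the other rows.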

## Scope

- At least one public benchmark, multiple distinct task families inside it.
- At least one Evolve-supported platform end-to-end. A second platform is a strong nice-to-have but not required to close.
- Whatever harness exists (e.g. `tests/e2e/experiment_token_savings.py`) is a starting point, not the deliverable. The deliverable is the numbers and the writeup.

## Out of scope

- Building a long-lived measurement framework (separate issue if needed).
- Wiring this into CI as a regression check (separate issue).
- Multi-platform coverage beyond a second platform.

## Related

- #6 — Benchmark Tip Generation (AppWorld + ReAct angle — the most direct prior)
- #142 — Cuga demo apps for a memory-focused Evolve demo
- #257 — RFC: .md-first procedural memory (moat thesis this would help validate)

## Acceptance

- A writeup (issue comment, doc, or PR) with the comparison tables described above, on at least one benchmark and at least one platform.
- N ≥ 5 trials per condition.
- Concrete examples of recalled guidelines and the tool-call gap they removed.
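
With only a handful of trials per condition, a point estimate alone is easy to over-read. The sketch below enforces the N ≥ 5 floor and attaches a percentile-bootstrap interval to the difference in means. The bootstrap is a suggestion of this sketch, not something this issue mandates, and `bootstrap_diff_ci` and `MIN_TRIALS` are hypothetical names.

```python
import random
from statistics import mean

MIN_TRIALS = 5  # the N >= 5 acceptance criterion


def bootstrap_diff_ci(baseline, recall, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lo, hi) for mean(recall) - mean(baseline).

    Uses a percentile bootstrap: resample each condition with replacement,
    recompute the difference in means, and take the alpha/2 and 1 - alpha/2
    quantiles of the resampled differences.
    """
    if len(baseline) < MIN_TRIALS or len(recall) < MIN_TRIALS:
        raise ValueError(f"need at least {MIN_TRIALS} trials per condition")
    rng = random.Random(seed)  # fixed seed so a writeup is reproducible
    diffs = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))
        r = rng.choices(recall, k=len(recall))
        diffs.append(mean(r) - mean(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return mean(recall) - mean(baseline), lo, hi
```

Passing per-trial total tokens (or steps, or wall-clock seconds) for each condition yields an interval for that metric. An interval that stays entirely below zero suggests the reduction is unlikely to be resampling noise.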
