Skip to content

Get token-cost / step-reduction results from recalled guidelines on a real benchmark (e.g. AppWorld) #260

@vinodmut

Description

@vinodmut

Problem

We claim Evolve's recalled guidelines make agent sessions cheaper and shorter, but we have no defensible numbers. #6 already calls for a benchmark on AppWorld + a ReAct agent measuring step counts and pass rate before/after tip generation; this issue is the broader version of that ask: produce results that show how much recall reduces token cost, wall-clock, and steps on a real benchmark, on at least one of the platforms Evolve supports (Claude Code, Codex, Bob, claw-code, …).

AppWorld is the obvious starting candidate (per #6), but the deliverable is results, not a specific benchmark choice. If a different public benchmark fits better — SWE-bench-verified slices, ToolBench, MultiHopRAG, BrowseComp, something else — pick that and justify briefly.

What "results" means here

For each (benchmark task or task family, platform) pair:

  • Token cost — total, input vs. output, cache reads vs. creates.
  • Wall-clock per session.
  • Steps / turns per session.
  • Task success rate.
  • The specific guidelines recalled (so a reader can see what knowledge drove the savings).
  • N trials per condition (with vs. without recall) so the numbers are defensible, not anecdotal.

Headline output: a table per platform, plus a short writeup explaining the gap on a representative task — including the tool calls each condition made, so the why is visible alongside the what.

Scope

  • At least one public benchmark, multiple distinct task families inside it.
  • At least one Evolve-supported platform end-to-end. A second platform is a strong nice-to-have but not required to close.
  • Whatever harness exists (e.g. tests/e2e/experiment_token_savings.py) is a starting point, not the deliverable. The deliverable is the numbers and the writeup.

Out of scope

  • Building a long-lived measurement framework (separate issue if needed).
  • Wiring this into CI as a regression check (separate issue).
  • Multi-platform coverage beyond a second platform.

Related

Acceptance

  • A writeup (issue comment, doc, or PR) with the comparison tables described above, on at least one benchmark and at least one platform.
  • N ≥ 5 trials per condition.
  • Concrete examples of recalled guidelines and the tool-call gap they removed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions