SDK benchmark: Ponytail ON → more tool calls & cost, fewer output lines; skillCount ≠ skills read #121

@RicardoCostaGit

Cursor SDK benchmark findings (Ponytail A/B)

Measured with ponytail-benchmark — isolated git worktrees, settingSources: ["project"], same prompt per model, Ponytail ON vs OFF toggling only .cursor/rules/ponytail.mdc.

Date: June 2026
Models: Composer 2.5 Fast, Gemini 3.5 Flash, Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5 High (10 finished runs; GPT-5.5 Medium failed)
Task: Structured test-design prompt with frozen benchmark/context/ copied into each worktree.

Cost figures are estimated from SDK turn-ended usage × public list pricing, not Admin API billing.
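As a rough illustration of how such an estimate can be computed, the sketch below sums per-turn token usage and multiplies by a per-million-token list price. The field names (`inputTokens`, `outputTokens`), the sample token counts, and the prices are all placeholders for illustration; they are not taken from the SDK or from any real price list, and the sketch ignores cache-read or cache-write pricing tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per one million tokens (placeholder values)."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(turns, price):
    """Sum token usage across turns and multiply by the list price."""
    input_tokens = sum(t["inputTokens"] for t in turns)
    output_tokens = sum(t["outputTokens"] for t in turns)
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


# Hypothetical usage records, one per turn-ended event.
turns = [
    {"inputTokens": 120_000, "outputTokens": 4_000},
    {"inputTokens": 80_000, "outputTokens": 6_000},
]
price = Price(input_per_mtok=3.0, output_per_mtok=15.0)  # placeholder prices
cost = estimate_cost(turns, price)
print(f"${cost:.2f}")  # $0.75
```

Because the estimate depends only on reported usage and list prices, it can differ from billed amounts whenever discounts, caching, or rounding apply.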


1. Skills indexed vs skills actually read

The SDK startup index (skillCount / AgentSkillsCursorRulesService) reported ~36–44 skills available in the workspace, but stream analysis showed only 1–4 unique SKILL.md files actually read per run via tool calls.

| Metric | Typical Ponytail ON | Typical Ponytail OFF |
| --- | --- | --- |
| SDK skillCount (startup) | ~40 | ~40 |
| Unique SKILL.md read (tool read) | 2–4 | 2–4 |
| Total SKILL.md read events | 2–6 (re-reads) | 2–6 |

Implication: Reporting “40 skills loaded” does not mean the model used 40 skills. Glob/list events do not count as usage — only read of …/SKILL.md does.
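The counting rule above can be sketched as a small filter over a JSON Lines stream. The event shape used here (`type`, `tool`, `path` keys) and the sample paths are assumptions made for illustration; the real `stream.jsonl` schema may differ, so treat this as a sketch of the rule rather than the benchmark's actual analyzer.

```python
import json


def skill_reads(lines):
    """Return (unique_paths, total_read_events) for SKILL.md reads."""
    unique, total = set(), 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumed shape: {"type": "tool_call", "tool": "read", "path": "..."}
        if event.get("type") != "tool_call" or event.get("tool") != "read":
            continue  # glob/list and non-tool events do not count as usage
        path = event.get("path", "")
        if path.endswith("/SKILL.md"):
            unique.add(path)
            total += 1
    return unique, total


# Hypothetical events: one glob, two reads of skill "a", one of "b", one unrelated read.
sample = [
    '{"type": "tool_call", "tool": "glob", "path": ".cursor/skills/**/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": ".cursor/skills/a/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": ".cursor/skills/a/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": ".cursor/skills/b/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": "README.md"}',
]
unique, total = skill_reads(sample)
print(len(unique), total)  # 2 3
```

Keeping unique paths and total events separate is what lets re-reads show up as a distinct number.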


2. Ponytail ON: higher cost and more tool calls, often fewer lines written

Across several models, Ponytail ON correlated with more SDK tool calls and higher estimated token cost, while git lines added to the draft file were often lower than Ponytail OFF for the same model and prompt.
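One way to obtain a "lines added" figure is to parse the tab-separated output of `git diff --numstat`, which lists added lines, deleted lines, and path per file. The sketch below works on captured output rather than invoking git; the sample rows and file names are hypothetical, and this is not necessarily how ponytail-benchmark measures it.

```python
def lines_added(numstat_output, path_prefix):
    """Sum the 'added' column of `git diff --numstat` output for matching paths."""
    total = 0
    for row in numstat_output.splitlines():
        parts = row.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        added, _deleted, path = parts
        if added == "-":
            continue  # git reports "-" for binary files
        if path.startswith(path_prefix):
            total += int(added)
    return total


# Hypothetical numstat output: a draft file, an unrelated file, and a binary file.
sample = (
    "120\t0\t.cursor/drafts/benchmark-a.md\n"
    "5\t2\tREADME.md\n"
    "-\t-\tlogo.png\n"
)
added = lines_added(sample, ".cursor/drafts/")
print(added)  # 120
```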

Representative per-model results

| Model | Est. cost ON | Est. cost OFF | Δ cost | Tool calls ON | Tool calls OFF | Lines added ON | Lines added OFF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Composer 2.5 Fast | ~$1.07 | ~$0.69 | +55% | higher | lower | lower | higher |
| Gemini 3.5 Flash | ~$1.32 | ~$1.25 | +6% | 22 | lower | 693 | higher |
| Claude Sonnet 4.6 | ~$1.55 | ~$1.48 | +5% | higher | lower | ~626 | ~617 |
| Claude Opus 4.8 | ~$2.35 | ~$2.39 | −2% | similar | similar | ~523 | ~560 |
| GPT-5.5 High | ~$1.80 | ~$2.06 | −13% | lower | higher | ~698 | ~647 |

Pattern (most models): Ponytail ON → more exploration (reads/skills/context) → more tool calls and tokens → higher estimated cost, but leaner draft output (fewer lines in .cursor/drafts/benchmark-*.md).

Exceptions: Claude Opus and GPT-5.5 High did not follow the cost/tool pattern in this matrix; Opus ON was slightly cheaper and faster.
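The Δ cost column is the relative change of ON versus OFF, i.e. (ON − OFF) / OFF. A short check using the rounded dollar figures from the table reproduces the listed percentages; since the inputs are themselves rounded estimates, the results are approximate.

```python
def delta_pct(on, off):
    """Relative cost change of ON vs OFF, as a whole-number percentage."""
    return round((on - off) / off * 100)


# Estimated costs (ON, OFF) in USD, copied from the table above.
costs = {
    "Composer 2.5 Fast": (1.07, 0.69),
    "Gemini 3.5 Flash": (1.32, 1.25),
    "Claude Sonnet 4.6": (1.55, 1.48),
    "Claude Opus 4.8": (2.35, 2.39),
    "GPT-5.5 High": (1.80, 2.06),
}
deltas = {model: delta_pct(on, off) for model, (on, off) in costs.items()}
for model, pct in deltas.items():
    print(f"{model}: {pct:+d}%")
```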

Composer 2.5 Fast (strongest cost gap)

| Metric | Ponytail ON | Ponytail OFF |
| --- | --- | --- |
| Total tokens (input + output) | ~308k | ~187k |
| Estimated cost | ~$1.07 | ~$0.69 |
| Tool calls | more | fewer |
| Lines added (git) | fewer | more |

3. Interpretation (not a product bug report)

These runs used a large, completable task (full test-design draft), not a “blocked MCP / stop early” scenario. In blocked scenarios, Ponytail’s “lean / don’t snowball” behavior can reduce cost dramatically; in completion-forced benchmarks, ON often means more disciplined reading before writing, which shows up as higher tool and token use with shorter written output.


4. Reproduce

git clone https://github.com/RicardoCostaGit/ponytail-benchmark-from-cursor.git
cd ponytail-benchmark-from-cursor
npm install && cp .env.example .env
# set CURSOR_API_KEY, point benchmark.config.json at your repo
bash scripts/prepare-example-target.sh
npm run run && npm run aggregate

Artifacts per run: runs/{run_id}/metrics.json, usage.json, stream.jsonl, skill-usage.json.
