Cursor SDK benchmark findings (Ponytail A/B)
Measured with ponytail-benchmark — isolated git worktrees, settingSources: ["project"], same prompt per model, Ponytail ON vs OFF toggling only .cursor/rules/ponytail.mdc.
Date: June 2026
Models: Composer 2.5 Fast, Gemini 3.5 Flash, Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5 High (10 finished runs; GPT-5.5 Medium failed)
Task: Structured test-design prompt with frozen benchmark/context/ copied into each worktree.
Cost figures are estimated from SDK turn-ended usage × public list pricing, not Admin API billing.
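The estimate described above (token usage multiplied by list price) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the usage field names `inputTokens` and `outputTokens`, the `ListPrice` helper, and the prices are all hypothetical placeholders, and cache-read or cache-write pricing is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListPrice:
    """USD per one million tokens (placeholder values, not real pricing)."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost_usd(usage: dict, price: ListPrice) -> float:
    """Estimate the cost of one turn as token usage times list price."""
    input_tokens = usage.get("inputTokens", 0)    # assumed field name
    output_tokens = usage.get("outputTokens", 0)  # assumed field name
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


def total_cost_usd(turn_usages: list, price: ListPrice) -> float:
    """Sum the per-turn estimates for a whole run."""
    return sum(estimate_cost_usd(u, price) for u in turn_usages)


# Illustrative numbers only; they are not taken from the benchmark runs.
example_price = ListPrice(input_per_mtok=3.0, output_per_mtok=15.0)
example_turns = [
    {"inputTokens": 100_000, "outputTokens": 2_000},
    {"inputTokens": 50_000, "outputTokens": 1_000},
]
print(f"${total_cost_usd(example_turns, example_price):.3f}")
```

Because the estimate uses list prices only, it can differ from billed amounts whenever discounts, caching, or plan-specific rates apply.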
1. Skills indexed vs skills actually read
The SDK startup index (skillCount / AgentSkillsCursorRulesService) reported ~36–44 skills available in the workspace, but stream analysis showed only 1–4 unique SKILL.md files actually read per run via tool calls.
| Metric | Typical Ponytail ON | Typical Ponytail OFF |
| --- | --- | --- |
| SDK skillCount (startup) | ~40 | ~40 |
| Unique SKILL.md read (tool read) | 2–4 | 2–4 |
| Total SKILL.md read events | 2–6 (re-reads) | 2–6 |
Implication: Reporting “40 skills loaded” does not mean the model used 40 skills. Glob/list events do not count as usage — only read of …/SKILL.md does.
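The distinction between indexed and read skills can be checked with a short pass over a run's stream file. The sketch below assumes a JSONL stream where each line looks like `{"type": "tool_call", "tool": "read", "path": "..."}`; that event shape, the field names, and the sample paths are hypothetical, and the real stream.jsonl schema may differ.

```python
import json


def count_skill_reads(stream_lines):
    """Return (unique SKILL.md files read, total SKILL.md read events)."""
    paths = []
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumed event shape; only "read" tool calls count as usage.
        if event.get("type") != "tool_call" or event.get("tool") != "read":
            continue  # glob/list and other events are skipped
        path = event.get("path", "")
        if path.endswith("/SKILL.md"):
            paths.append(path)
    return len(set(paths)), len(paths)


# Hypothetical sample events, for illustration only.
sample = [
    '{"type": "tool_call", "tool": "glob", "path": ".cursor/skills/**/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": ".cursor/skills/testing/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": ".cursor/skills/testing/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": ".cursor/skills/review/SKILL.md"}',
    '{"type": "tool_call", "tool": "read", "path": "README.md"}',
]
unique, total = count_skill_reads(sample)
print(unique, total)  # 2 unique files, 3 read events
```

Keeping the unique count and the total count separate makes re-reads of the same SKILL.md visible instead of inflating the number of skills used.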
2. Ponytail ON: higher cost and more tool calls, often fewer lines written
Across several models, Ponytail ON correlated with more SDK tool calls and higher estimated token cost, while git lines added to the draft file were often lower than Ponytail OFF for the same model and prompt.
Representative per-model results
| Model | Est. cost ON | Est. cost OFF | Δ cost | Tool calls ON | Tool calls OFF | Lines added ON | Lines added OFF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Composer 2.5 Fast | ~$1.07 | ~$0.69 | +55% | higher | lower | lower | higher |
| Gemini 3.5 Flash | ~$1.32 | ~$1.25 | +6% | 22 | lower | 693 | higher |
| Claude Sonnet 4.6 | ~$1.55 | ~$1.48 | +5% | higher | lower | ~626 | ~617 |
| Claude Opus 4.8 | ~$2.35 | ~$2.39 | −2% | similar | similar | ~523 | ~560 |
| GPT-5.5 High | ~$1.80 | ~$2.06 | −13% | lower | higher | ~698 | ~647 |
Pattern (most models): Ponytail ON → more exploration (reads/skills/context) → more tool calls and tokens → higher estimated cost, but leaner draft output (fewer lines in .cursor/drafts/benchmark-*.md).
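Exceptions: Claude Opus 4.8 and GPT-5.5 High did not follow the cost/tool pattern in this matrix. Opus ON was slightly cheaper (−2%) and faster, with similar tool-call counts; GPT-5.5 High ON was cheaper (−13%) with fewer tool calls. On output size, Claude Sonnet 4.6 and GPT-5.5 High added slightly more lines with Ponytail ON, not fewer.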
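The Δ cost column is the relative change of the ON estimate against the OFF estimate, (ON − OFF) / OFF. As a quick check, the sketch below recomputes it from the rounded dollar figures in the per-model results; the function and variable names are illustrative.

```python
def pct_change(on: float, off: float) -> float:
    """Relative change of ON versus OFF, in percent."""
    return (on - off) / off * 100.0


# (estimated cost ON, estimated cost OFF) in USD, from the per-model results.
costs = {
    "Composer 2.5 Fast": (1.07, 0.69),
    "Gemini 3.5 Flash": (1.32, 1.25),
    "Claude Sonnet 4.6": (1.55, 1.48),
    "Claude Opus 4.8": (2.35, 2.39),
    "GPT-5.5 High": (1.80, 2.06),
}

for model, (on, off) in costs.items():
    print(f"{model}: {pct_change(on, off):+.0f}%")
```

Rounded to whole percent, this gives +55%, +6%, +5%, -2%, and -13%, matching the reported column.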
Composer 2.5 Fast (strongest cost gap)
| Metric | Ponytail ON | Ponytail OFF |
| --- | --- | --- |
| Total tokens (input + output) | ~308k | ~187k |
| Estimated cost | ~$1.07 | ~$0.69 |
| Tool calls | more | fewer |
| Lines added (git) | fewer | more |
3. Interpretation (not a product bug report)
These runs used a large, completable task (full test-design draft), not a “blocked MCP / stop early” scenario. In blocked scenarios, Ponytail’s “lean / don’t snowball” behavior can reduce cost dramatically; in completion-forced benchmarks, ON often means more disciplined reading before writing, which shows up as higher tool and token use with shorter written output.
4. Reproduce
git clone https://github.com/RicardoCostaGit/ponytail-benchmark-from-cursor.git
cd ponytail-benchmark-from-cursor
npm install && cp .env.example .env
# set CURSOR_API_KEY, point benchmark.config.json at your repo
bash scripts/prepare-example-target.sh
npm run run && npm run aggregate
Artifacts per run: runs/{run_id}/metrics.json, usage.json, stream.jsonl, skill-usage.json.