SDK benchmark: Ponytail ON → more tool calls & cost, fewer output lines; skillCount ≠ skills read

# Cursor SDK benchmark findings (Ponytail A/B)

Measured with [ponytail-benchmark](https://github.com/RicardoCostaGit/ponytail-benchmark-from-cursor), using isolated git worktrees, `settingSources: ["project"]`, and the same prompt per model. Ponytail ON vs OFF toggles only `.cursor/rules/ponytail.mdc`.

**Date:** June 2026  
**Models:** Composer 2.5 Fast, Gemini 3.5 Flash, Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5 High (10 finished runs; GPT-5.5 Medium failed)  
**Task:** Structured test-design prompt with frozen `benchmark/context/` copied into each worktree.

Cost figures are **estimated** from SDK `turn-ended` usage × [public list pricing](https://cursor.com/docs/models-and-pricing), not Admin API billing.
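
The estimate described above amounts to summing token usage across `turn-ended` events and multiplying by a per-token rate. A minimal sketch, assuming each usage record exposes `input_tokens` and `output_tokens` (hypothetical field names, not confirmed against the SDK) and using placeholder prices rather than real list pricing:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (placeholder values, not real list prices)."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(turns, price):
    """Sum token usage over turn records and convert the totals to USD."""
    input_tokens = sum(t.get("input_tokens", 0) for t in turns)
    output_tokens = sum(t.get("output_tokens", 0) for t in turns)
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


# Hypothetical usage records and prices, for illustration only.
example_turns = [
    {"input_tokens": 120_000, "output_tokens": 4_000},
    {"input_tokens": 60_000, "output_tokens": 3_000},
]
example_price = Price(input_per_mtok=2.0, output_per_mtok=10.0)
example_cost = estimate_cost(example_turns, example_price)
print(f"estimated cost: ${example_cost:.2f}")
```

If the usage payload reports cached tokens separately, they would need their own rate; this sketch ignores that case.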

---

## 1. Skills indexed vs skills actually read

The SDK startup index (`skillCount` / `AgentSkillsCursorRulesService`) reported **~36–44 skills** available in the workspace, but stream analysis showed only **1–4 unique `SKILL.md` files** actually read per run via tool calls.

| Metric | Typical Ponytail ON | Typical Ponytail OFF |
|--------|---------------------|----------------------|
| SDK `skillCount` (startup) | ~40 | ~40 |
| Unique `SKILL.md` read (tool `read`) | 2–4 | 2–4 |
| Total `SKILL.md` read events | 2–6 (re-reads) | 2–6 |

**Implication:** Reporting “40 skills loaded” does not mean the model used 40 skills. Glob and list events only discover files; a skill counts as used only when the run issues a `read` on `…/SKILL.md`.
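
The counting rule can be sketched as a filter over `stream.jsonl`. The event shape used here (`{"tool": ..., "path": ...}`) and the `.cursor/skills/...` paths are assumptions for illustration; the real stream schema may differ and the field names would need adjusting:

```python
import json


def skill_reads(jsonl_text):
    """Return (set of unique SKILL.md paths read, total SKILL.md read events)."""
    unique, total = set(), 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumed shape: {"tool": "read", "path": "..."}.
        if event.get("tool") != "read":
            continue  # glob/list events are discovery, not usage
        path = str(event.get("path", ""))
        if path.endswith("/SKILL.md"):
            unique.add(path)
            total += 1
    return unique, total


# Hypothetical events: one glob, two reads of skill "a", one read of "b", one unrelated read.
sample = "\n".join([
    json.dumps({"tool": "glob", "path": ".cursor/skills/a/SKILL.md"}),
    json.dumps({"tool": "read", "path": ".cursor/skills/a/SKILL.md"}),
    json.dumps({"tool": "read", "path": ".cursor/skills/a/SKILL.md"}),
    json.dumps({"tool": "read", "path": ".cursor/skills/b/SKILL.md"}),
    json.dumps({"tool": "read", "path": "README.md"}),
])
unique_paths, total_reads = skill_reads(sample)
print(len(unique_paths), total_reads)
```

Keeping unique paths and total events separate mirrors the two rows in the table: re-reads raise the event count without raising the number of skills used.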

---

## 2. Ponytail ON: higher cost and more tool calls, often fewer lines written

Across several models, **Ponytail ON** correlated with **more SDK tool calls** and **higher estimated token cost**, while **git lines added** to the draft file were often **lower** than Ponytail OFF for the same model and prompt.

### Representative per-model results

| Model | Est. cost ON | Est. cost OFF | Δ cost | Tool calls ON | Tool calls OFF | Lines added ON | Lines added OFF |
|-------|--------------|---------------|--------|---------------|----------------|----------------|-----------------|
| Composer 2.5 Fast | ~$1.07 | ~$0.69 | **+55%** | higher | lower | lower | higher |
| Gemini 3.5 Flash | ~$1.32 | ~$1.25 | +6% | 22 | lower | 693 | higher |
| Claude Sonnet 4.6 | ~$1.55 | ~$1.48 | +5% | higher | lower | ~626 | ~617 |
| Claude Opus 4.8 | ~$2.35 | ~$2.39 | −2% | similar | similar | ~523 | ~560 |
| GPT-5.5 High | ~$1.80 | ~$2.06 | −13% | lower | higher | ~698 | ~647 |

**Pattern (most models):** Ponytail ON → more exploration (reads/skills/context) → **more tool calls and tokens** → **higher estimated cost**, but **leaner draft output** (fewer lines in `.cursor/drafts/benchmark-*.md`).

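**Exceptions:** Claude Opus 4.8 and GPT-5.5 High did not follow the cost/tool pattern in this matrix; Opus ON was slightly cheaper and faster. The leaner-output half of the pattern also did not hold for Claude Sonnet 4.6 (~626 vs ~617 lines) or GPT-5.5 High (~698 vs ~647 lines), where ON added more lines than OFF.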
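
The Δ cost column is the relative change of ON against OFF. As a quick check, it can be recomputed from the rounded estimates in the table above; because the inputs are already rounded, the results only agree with the table to the nearest whole percent:

```python
def pct_change(on, off):
    """Relative change of ON versus OFF, rounded to a whole percent."""
    return round((on - off) / off * 100)


# (ON, OFF) estimated costs in USD, copied from the per-model table.
costs = {
    "Composer 2.5 Fast": (1.07, 0.69),
    "Gemini 3.5 Flash": (1.32, 1.25),
    "Claude Sonnet 4.6": (1.55, 1.48),
    "Claude Opus 4.8": (2.35, 2.39),
    "GPT-5.5 High": (1.80, 2.06),
}

deltas = {model: pct_change(on, off) for model, (on, off) in costs.items()}
for model, delta in deltas.items():
    print(f"{model}: {delta:+d}%")
```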

### Composer 2.5 Fast (strongest cost gap)

| | Ponytail ON | Ponytail OFF |
|--|-------------|--------------|
| Total tokens (input + output) | ~308k | ~187k |
| Estimated cost | ~$1.07 | ~$0.69 |
| Tool calls | more | fewer |
| Lines added (git) | fewer | more |

---

## 3. Interpretation (not a product bug report)

These runs used a **large, completable task** (full test-design draft), not a “blocked MCP / stop early” scenario. In blocked scenarios, Ponytail’s “lean / don’t snowball” behavior can reduce cost dramatically; in **completion-forced** benchmarks, ON often means **more disciplined reading** before writing, which shows up as **higher tool and token use** with **shorter written output**.

---

## 4. Reproduce

```bash
git clone https://github.com/RicardoCostaGit/ponytail-benchmark-from-cursor.git
cd ponytail-benchmark-from-cursor
npm install && cp .env.example .env
# set CURSOR_API_KEY, point benchmark.config.json at your repo
bash scripts/prepare-example-target.sh
npm run run && npm run aggregate
```
Artifacts per run: `runs/{run_id}/metrics.json`, `usage.json`, `stream.jsonl`, `skill-usage.json`.
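
Before aggregating, it can help to confirm that every run directory actually contains the four artifacts. A small sketch, assuming all four files sit directly under `runs/{run_id}/` as the list above suggests (the helper name `missing_artifacts` is illustrative and not part of the benchmark repo):

```python
from pathlib import Path

# Artifact file names as listed in this issue.
ARTIFACTS = ("metrics.json", "usage.json", "stream.jsonl", "skill-usage.json")


def missing_artifacts(runs_dir):
    """Map each run directory name to the artifact files it lacks."""
    report = {}
    for run in sorted(Path(runs_dir).iterdir()):
        if run.is_dir():
            report[run.name] = [n for n in ARTIFACTS if not (run / n).is_file()]
    return report


if __name__ == "__main__":
    runs = Path("runs")
    if runs.is_dir():
        for run_id, missing in missing_artifacts(runs).items():
            status = "complete" if not missing else f"missing: {', '.join(missing)}"
            print(run_id, status)
```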