fix(skill-creator): run_eval trigger detection misses real skill name and bails on first non-Skill tool by Polluelo978 · Pull Request #1323 · anthropics/skills

Polluelo978 · 2026-06-16T10:04:30Z

Problem

scripts/run_eval.py::run_single_query fails to detect that a skill triggered, causing the description-optimization loop (run_loop.py) to report recall=0% for every should-trigger query. With all candidates tied at 0, the loop just returns the original description and never actually optimizes.
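One way to see why a scorer stuck at 0 leaves the description unchanged is sketched below. This is a toy model, not the code in run_loop.py: `pick_best`, the `score` callable, and the strict `>` comparison are all assumptions made for illustration.

```python
def pick_best(original, candidates, score):
    """Return the highest-scoring description, keeping the incumbent on ties.

    Hypothetical selection rule: a candidate must strictly beat the
    current best to replace it.
    """
    best, best_score = original, score(original)
    for candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score > best_score:  # ties never displace the incumbent
            best, best_score = candidate, candidate_score
    return best


def broken_score(description):
    """Models the bug: every description scores 0, whatever its text."""
    return 0.0


candidates = ["candidate A", "a longer candidate B"]
chosen_broken = pick_best("original", candidates, broken_score)
chosen_working = pick_best("original", candidates, len)  # len as a stand-in scorer
print(chosen_broken)
print(chosen_working)
```

Under these assumptions, any scorer that cannot tell candidates apart turns the loop into a no-op, which matches the symptom described above.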

Two root causes in the stream detection:

1. Only the registered command name is matched. The function registers a temp command named {skill}-skill-{uuid} and checks only whether that random name (clean_name) appears in the tool input. An installed skill fires under its real name (e.g. cve-intel), so the random name never matches and detection always returns False.
2. Detection bails out with False on the first non-Skill/Read tool. Real Claude Code runs often call Bash, WebFetch, etc. before consulting the skill. The content_block_start handler returns False on the first such tool, so it misses a later Skill/Read call.
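The two root causes can be reproduced with a deliberately simplified model of the pre-fix logic. The event dictionaries below are a hypothetical, flattened stand-in for the real stream events, and `detect_trigger_old` is an illustrative name rather than a function in run_eval.py.

```python
def detect_trigger_old(events, clean_name):
    """Simplified model of the pre-fix detection (hypothetical event shape)."""
    for event in events:
        if event["type"] == "result":
            return False
        if event["type"] == "tool_use":
            if event["name"] not in ("Skill", "Read"):
                return False  # cause 2: gives up at the first other tool
            # cause 1: only the random temp command name is checked
            return clean_name in event["input"]
    return False


clean_name = "cve-intel-skill-1a2b3c4d"  # made-up temp command name
events = [
    {"type": "tool_use", "name": "Bash", "input": "ls"},
    {"type": "tool_use", "name": "Skill", "input": '{"skill": "cve-intel"}'},
    {"type": "result"},
]

missed_after_bash = detect_trigger_old(events, clean_name)
missed_by_name = detect_trigger_old(events[1:], clean_name)
print(missed_after_bash, missed_by_name)
```

In this model both calls report no trigger: the first stops at the Bash call, and the second reaches the Skill call but looks only for the temp name.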

Fix

- Match the real skill_name in addition to the registered clean_name, in both the streaming path and the full-assistant-message fallback.
- Do not return False on the first non-Skill/Read tool. Reset the pending state and keep scanning until type == "result".

Returning early on a positive match keeps positives fast; negatives run until the result event (a short Q&A exchange) and are bounded by --timeout.
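A minimal sketch of the fixed scanning behaviour follows, assuming a flattened event shape where each tool call carries its full input as a string. The real change works on streamed events and partial JSON, so `detect_trigger` and the event dictionaries here are illustrative assumptions only.

```python
def detect_trigger(events, clean_name, skill_name):
    """Scan until the result event; report True on the first matching call.

    Hypothetical event shape: {"type": ..., "name": ..., "input": ...}.
    """
    names = (clean_name, skill_name)
    for event in events:
        if event["type"] == "result":
            return False  # run finished without a matching Skill/Read call
        if event["type"] != "tool_use":
            continue
        is_skill_tool = event["name"] in ("Skill", "Read")
        if is_skill_tool and any(name in event["input"] for name in names):
            return True  # early return keeps positive cases fast
        # Any other tool, or a non-matching call: keep scanning.
    return False


clean_name = "cve-intel-skill-1a2b3c4d"  # made-up temp command name
events = [
    {"type": "tool_use", "name": "Bash", "input": "ls"},
    {"type": "tool_use", "name": "Skill", "input": '{"skill": "cve-intel"}'},
    {"type": "result"},
]
no_skill_events = [
    {"type": "tool_use", "name": "WebFetch", "input": "https://example.com"},
    {"type": "result"},
]

triggered = detect_trigger(events, clean_name, "cve-intel")
not_triggered = detect_trigger(no_skill_events, clean_name, "cve-intel")
print(triggered, not_triggered)
```

The negative case only returns once the result event arrives, which is why the run time of negatives depends on the length of the exchange and on --timeout.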

Evidence (same skill + eval set, 3 runs/query, `--model claude-sonnet-4-6 --timeout 120`)

| metric | before | after (held-out test) |
| --- | --- | --- |
| recall | 0% | 100% |
| precision | 100% | 100% |
| accuracy | 50% | 100% |

After the fix the loop can measure description quality as intended (held-out test 8/8).
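The reported figures are consistent with a detector that always answers False on a balanced eval set. The sketch below works through that arithmetic. The 4/4 split of should-trigger and should-not-trigger queries, and the convention that precision is 100% when nothing is predicted positive, are assumptions and are not stated in this PR.

```python
def metrics(expected, predicted):
    """Return (recall, precision, accuracy) for boolean labels."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)
    fp = sum(1 for e, p in pairs if not e and p)
    fn = sum(1 for e, p in pairs if e and not p)
    tn = sum(1 for e, p in pairs if not e and not p)
    # Assumed convention: an empty denominator counts as a perfect score.
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    accuracy = (tp + tn) / len(pairs)
    return recall, precision, accuracy


# Hypothetical balanced set: 4 should-trigger, 4 should-not-trigger queries.
expected = [True] * 4 + [False] * 4

before = metrics(expected, [False] * 8)  # detector never reports a trigger
after = metrics(expected, expected)      # every query classified correctly
print(before)
print(after)
```

With these assumptions the always-False detector gives recall 0.0, precision 1.0, and accuracy 0.5, and a perfect detector gives 1.0 on all three.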

Notes

No behavioral change for skills not installed locally: clean_name matching is preserved; skill_name matching is additive.
Single-file change in skills/skill-creator/scripts/run_eval.py.

…n-Skill tool in run_eval

run_eval.py::run_single_query failed to detect skill triggering, making run_loop.py report recall=0% for every should-trigger query. With all candidates tied at 0, the loop returns the original description and never actually optimizes.

Two root causes in the stream detection:

1. Only the registered temp command name (clean_name) was matched, but an installed skill fires under its real name (skill_name) -> never matched.
2. The first non-Skill/Read tool_use caused an immediate `return False`, missing a later Skill/Read call (real runs often call Bash/WebFetch first).

Fix:

- Match skill_name in addition to clean_name (streaming path and fallback).
- Don't return False on the first non-Skill/Read tool; reset pending state and keep scanning until type == "result".

Verified on a skill + eval set (3 runs/query, --model claude-sonnet-4-6 --timeout 120): recall 0% -> 100%, precision 100%, held-out test 8/8.

Polluelo978 wants to merge 1 commit into anthropics:main from Polluelo978:fix/run-eval-trigger-detection
