fix(skill-creator): run_eval trigger detection misses real skill name and bails on first non-Skill tool #1323

Open
Polluelo978 wants to merge 1 commit into anthropics:main from Polluelo978:fix/run-eval-trigger-detection

@Polluelo978

Problem

scripts/run_eval.py::run_single_query fails to detect that a skill triggered, causing the description-optimization loop (run_loop.py) to report recall=0% for every should-trigger query. With all candidates tied at 0, the loop just returns the original description and never actually optimizes.

Two root causes in the stream detection:

  1. Only the registered command name is matched. The function registers a temporary command named {skill}-skill-{uuid} and checks only whether that randomized name (clean_name) appears in the tool input. An installed skill, however, fires under its real name (e.g. cve-intel), so the randomized name never matches and the result is always False.
  2. Detection bails out with False on the first non-Skill/Read tool. Real Claude Code runs often call Bash, WebFetch, or other tools before consulting the skill. The content_block_start handler returns False on the first such tool and so misses any later Skill/Read call.

Fix

  • Match the real skill_name in addition to the registered clean_name, in both the streaming path and the full-assistant-message fallback.
  • Do not return False on the first non-Skill/Read tool. Reset the pending state and keep scanning until type == "result".

Returning early on a positive match keeps positive cases fast; negative cases run until the result event (typically a short Q&A exchange) and are bounded by --timeout.
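The two changes can be illustrated with a minimal sketch. It is not the code in run_eval.py: the function name `skill_triggered`, the flattened event dictionaries, and the `partial_json` field are assumptions made for illustration. Only the names skill_name, clean_name, content_block_start, and type == "result" come from this PR.

```python
import json
from typing import Iterable


def skill_triggered(events: Iterable[dict], skill_name: str, clean_name: str) -> bool:
    """Return True if a Skill/Read tool call references either name.

    Hypothetical helper: the event layout is a simplified assumption,
    not the exact stream format consumed by run_eval.py.
    """
    names = (skill_name, clean_name)  # match the real name AND the temp name
    pending = False   # True while inside a Skill/Read tool_use block
    buffered = ""     # tool input JSON accumulated for that block

    for event in events:
        etype = event.get("type")

        if etype == "content_block_start":
            block = event.get("content_block", {})
            if block.get("type") == "tool_use":
                # Old behavior: a non-Skill/Read tool here returned False.
                # New behavior: reset pending state and keep scanning.
                pending = block.get("name") in ("Skill", "Read")
                buffered = ""

        elif etype == "content_block_delta" and pending:
            buffered += event.get("delta", {}).get("partial_json", "")
            if any(name in buffered for name in names):
                return True  # early return keeps positive cases fast

        elif etype == "content_block_stop":
            pending = False
            buffered = ""

        elif etype == "assistant":
            # Fallback path: a complete assistant message.
            for block in event.get("message", {}).get("content", []):
                if block.get("type") == "tool_use" and block.get("name") in ("Skill", "Read"):
                    if any(name in json.dumps(block.get("input", {})) for name in names):
                        return True

        elif etype == "result":
            return False  # end of run with no match

    return False
```

In this sketch a run that calls Bash first and Skill second is still detected, because the Bash block only resets the pending state instead of ending the scan.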

Evidence (same skill + eval set, 3 runs/query, --model claude-sonnet-4-6 --timeout 120)

| metric    | before | after (held-out test) |
|-----------|--------|-----------------------|
| recall    | 0%     | 100%                  |
| precision | 100%   | 100%                  |
| accuracy  | 50%    | 100%                  |

After the fix the loop can measure description quality as intended (held-out test 8/8).
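The "before" row may look odd, since precision is 100% while recall is 0%. The sketch below shows one way those numbers can arise, under two assumptions that the PR does not state: precision is reported as 1.0 when nothing is predicted positive, and the eval set is evenly split between should-trigger and should-not-trigger queries. The counts (4 and 4) and the `metrics` function are hypothetical.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float]:
    """Return (precision, recall, accuracy) from confusion-matrix counts.

    Assumed convention: precision and recall default to 1.0 when their
    denominator is zero. This is an illustrative choice, not confirmed
    to be what run_loop.py does.
    """
    predicted_positive = tp + fp
    actual_positive = tp + fn
    total = tp + fp + fn + tn
    precision = tp / predicted_positive if predicted_positive else 1.0
    recall = tp / actual_positive if actual_positive else 1.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical balanced set: 4 should-trigger, 4 should-not-trigger queries.
before = metrics(tp=0, fp=0, fn=4, tn=4)  # detector never reports a trigger
after = metrics(tp=4, fp=0, fn=0, tn=4)   # detector reports every trigger

print(before)  # (1.0, 0.0, 0.5)
print(after)   # (1.0, 1.0, 1.0)
```

Under these assumptions, a detector that always answers False gets every should-not-trigger query right and every should-trigger query wrong, which matches the 50% accuracy and 0% recall reported before the fix.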

Notes

  • No behavioral change for skills not installed locally: clean_name matching is preserved; skill_name matching is additive.
  • Single-file change in skills/skill-creator/scripts/run_eval.py.

…n-Skill tool in run_eval

run_eval.py::run_single_query failed to detect skill triggering, making
run_loop.py report recall=0% for every should-trigger query. With all
candidates tied at 0, the loop returns the original description and never
actually optimizes.

Two root causes in the stream detection:
1. Only the registered temp command name (clean_name) was matched, but an
   installed skill fires under its real name (skill_name) -> never matched.
2. The first non-Skill/Read tool_use caused an immediate `return False`,
   missing a later Skill/Read call (real runs often call Bash/WebFetch first).

Fix:
- Match skill_name in addition to clean_name (streaming path and fallback).
- Don't return False on the first non-Skill/Read tool; reset pending state
  and keep scanning until type == "result".

Verified on a skill + eval set (3 runs/query, --model claude-sonnet-4-6
--timeout 120): recall 0% -> 100%, precision 100%, held-out test 8/8.