Environment
- OS: Windows 11
- Python 3.14.4
- skill-creator from anthropics/skills (example-skills plugin), run via python -m scripts.run_loop
Two Windows bugs make the eval / description-optimization harness unusable; the second is the dangerous one because it fails silently and reports plausible-looking but meaningless metrics.
Bug 1 — parse_skill_md crashes on any non-cp1252 SKILL.md (blocking)
scripts/utils.py:9 reads SKILL.md with no explicit encoding:
content = (skill_path / "SKILL.md").read_text()
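On Windows, read_text() without an explicit encoding uses the locale code page (cp1252 here), which breaks any SKILL.md with non-ASCII UTF-8 content: emoji (✅ 🟡), arrows (→), em-dashes (—), middots (·). Where the UTF-8 bytes all happen to be defined in cp1252, the text is silently mis-decoded; where they include a byte that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), the whole run aborts: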
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2363: character maps to <undefined>
File ".../scripts/utils.py", line 9, in parse_skill_md
content = (skill_path / "SKILL.md").read_text()
Same latent issue at many other sites: scripts/quick_validate.py:22, generate_report.py:314, and numerous open() / write_text() calls in aggregate_benchmark.py, run_eval.py, run_loop.py, improve_description.py.
Fix: pass encoding="utf-8" to every read_text() / write_text() / open() in scripts/. Minimal blocker fix:
content = (skill_path / "SKILL.md").read_text(encoding="utf-8")
Workaround: run with the environment variable PYTHONUTF8=1 set (equivalently, python -X utf8), which enables Python's UTF-8 mode.
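A minimal sketch of the failure and the fix, runnable on any OS because it forces cp1252 explicitly instead of relying on the Windows default. The helper name read_skill_md and the sample text are hypothetical, not taken from the repository. The sample contains U+FE0F (the emoji variation selector), whose UTF-8 encoding ends in byte 0x8F, one of the bytes Python's cp1252 codec leaves undefined.

```python
import tempfile
from pathlib import Path


def read_skill_md(skill_path: Path) -> str:
    """Hypothetical helper: read SKILL.md as UTF-8 regardless of the OS locale."""
    return (skill_path / "SKILL.md").read_text(encoding="utf-8")


def demo() -> tuple[bool, bool]:
    """Return (cp1252_read_failed, utf8_read_round_trips)."""
    # Warning sign + variation selector, then an arrow; all non-ASCII.
    text = "Status: \u26a0\ufe0f done \u2192 next"
    with tempfile.TemporaryDirectory() as tmp:
        skill_path = Path(tmp)
        (skill_path / "SKILL.md").write_text(text, encoding="utf-8")

        # Simulate the Windows default by decoding as cp1252 explicitly.
        try:
            (skill_path / "SKILL.md").read_text(encoding="cp1252")
            cp1252_failed = False
        except UnicodeDecodeError:
            cp1252_failed = True

        utf8_ok = read_skill_md(skill_path) == text
    return cp1252_failed, utf8_ok


if __name__ == "__main__":
    print(demo())
```

The same encoding="utf-8" argument applies to write_text() and open(), so reads and writes stay symmetric across platforms.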
Bug 2 — nested claude -p fails on Windows and the failure is scored as "no trigger" (critical)
After working around Bug 1 with PYTHONUTF8=1, run_loop exits with code 0, but every query (should-trigger and should-not-trigger alike) returns trigger_rate: 0.0, and stderr is flooded (~40 warnings per iteration) with:
Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket
run_eval.py runs each triggering check by spawning a nested claude -p via ProcessPoolExecutor + subprocess.Popen (≈ lines 16 / 85). On Windows every subprocess call fails with WinError 10038, so no triggering verdict is ever obtained.
The harness treats a failed query as "did not trigger," so should-not-trigger queries pass by default and the run reports a confident-looking result (accuracy 50%, precision 100%, recall 0%) plus a best_description — when in fact nothing was measured. This is easy to mistake for a real benchmark.
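The reported numbers follow directly from scoring every query as a non-trigger. The sketch below reproduces the arithmetic on a hypothetical balanced eval set (20 should-trigger, 20 should-not-trigger; the real set size is not stated here). The function name confusion_metrics is illustrative and is not the harness's own code. Precision is mathematically undefined (0/0) in this case; the harness reporting it as 100% is presumably a convention for "no positive predictions", which is an assumption on my part.

```python
def confusion_metrics(expected: list[bool], predicted: list[bool]) -> dict:
    """Compute accuracy, precision, and recall from paired boolean labels."""
    pairs = list(zip(expected, predicted))
    tp = sum(e and p for e, p in pairs)              # expected trigger, triggered
    fp = sum((not e) and p for e, p in pairs)        # expected none, triggered
    tn = sum((not e) and (not p) for e, p in pairs)  # expected none, none
    fn = sum(e and (not p) for e, p in pairs)        # expected trigger, none
    return {
        "accuracy": (tp + tn) / len(pairs),
        # None marks an undefined ratio (zero denominator).
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical balanced eval set in which every query errored and was
# therefore scored as "did not trigger".
expected = [True] * 20 + [False] * 20
predicted = [False] * 40
metrics = confusion_metrics(expected, predicted)
print(metrics)
```

Accuracy comes out at 0.5 and recall at 0.0 purely because half the queries are should-not-trigger, which matches the reported figures without any query having been evaluated.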
Suggested fix: abort (or loudly fail the iteration) when a high fraction of queries error, instead of scoring errored queries as non-triggers. Separately, the nested-claude -p subprocess path needs a Windows-compatible execution strategy.
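One way the first half of this suggestion could look, as a sketch only: errored queries carry no verdict (None) rather than False, and scoring refuses to produce metrics once the error fraction passes a threshold. QueryOutcome, TooManyQueryErrors, score_outcomes, and the 0.2 default are all hypothetical names and values, not part of the existing harness.

```python
from dataclasses import dataclass
from typing import Optional


class TooManyQueryErrors(RuntimeError):
    """Raised when too many queries errored for the metrics to mean anything."""


@dataclass
class QueryOutcome:
    should_trigger: bool
    triggered: Optional[bool]  # None = query errored, no verdict obtained


def score_outcomes(outcomes: list[QueryOutcome],
                   max_error_fraction: float = 0.2) -> dict:
    """Score only queries that produced a verdict; fail loudly otherwise."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    errored = sum(o.triggered is None for o in outcomes)
    valid = [o for o in outcomes if o.triggered is not None]
    if not valid or errored / len(outcomes) > max_error_fraction:
        raise TooManyQueryErrors(
            f"{errored}/{len(outcomes)} queries errored; refusing to report metrics"
        )
    correct = sum(o.triggered == o.should_trigger for o in valid)
    return {
        "accuracy": correct / len(valid),
        "scored": len(valid),
        "errored": errored,
    }
```

With this shape, a run in which every nested query fails raises TooManyQueryErrors instead of returning a best_description, and the errored count stays visible in the result even when the run is allowed to proceed.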
Repro:
PYTHONUTF8=1 python -m scripts.run_loop --eval-set eval.json --skill-path <skill> --model claude-opus-4-8 --num-workers 4