Windows: eval/optimize harness broken — cp1252 read_text() crash + nested claude -p WinError 10038 silently zeroes all results #1304

@nhosler

Environment

  • OS: Windows 11
  • Python 3.14.4
  • skill-creator from anthropics/skills (example-skills plugin), run via python -m scripts.run_loop

Two Windows bugs make the eval / description-optimization harness unusable; the second is the dangerous one because it fails silently and reports plausible-looking but meaningless metrics.

Bug 1 — parse_skill_md crashes on any non-cp1252 SKILL.md (blocking)

scripts/utils.py:9 reads SKILL.md with no explicit encoding:

content = (skill_path / "SKILL.md").read_text()

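On Windows, Python's default text encoding is the locale's ANSI code page (cp1252 on most Western installs). A SKILL.md containing non-ASCII UTF-8, such as emoji (✅ 🟡), arrows (→), em-dashes (—), or middots (·), is therefore misdecoded, and the whole run aborts as soon as the file contains a byte that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, or 0x9D):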

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2363: character maps to <undefined>
  File ".../scripts/utils.py", line 9, in parse_skill_md
    content = (skill_path / "SKILL.md").read_text()

The same latent issue exists at many other call sites: scripts/quick_validate.py:22, generate_report.py:314, and numerous open() / write_text() calls in aggregate_benchmark.py, run_eval.py, run_loop.py, improve_description.py.

Fix: pass encoding="utf-8" to every read_text() / write_text() / open() in scripts/. Minimal blocker fix:

content = (skill_path / "SKILL.md").read_text(encoding="utf-8")
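
A small, self-contained illustration of the difference, assuming nothing about the harness itself: the file contents and temporary directory are invented for the example, and the cp1252 read is forced explicitly so the sketch behaves the same on any OS. U+270F (✏) is used because its UTF-8 encoding ends in byte 0x8F, one of the bytes cp1252 cannot decode.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    skill_md = Path(tmp) / "SKILL.md"
    # U+270F encodes to E2 9C 8F in UTF-8; 0x8F is undefined in cp1252.
    skill_md.write_text("status: \u270f draft \u2192 done\n", encoding="utf-8")

    # What an implicit read_text() amounts to on a cp1252 Windows locale.
    try:
        skill_md.read_text(encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        print(f"cp1252 read failed: {exc}")

    # The proposed fix: always decode as UTF-8.
    content = skill_md.read_text(encoding="utf-8")
    print(content)
```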

Workaround: run under PYTHONUTF8=1.

Bug 2 — nested claude -p fails on Windows and the failure is scored as "no trigger" (critical)

After working around Bug 1 with PYTHONUTF8=1, run_loop completes with exit code 0, but every query (should-trigger and should-not-trigger alike) returns trigger_rate: 0.0, and stderr is flooded with about 40 copies per iteration of:

Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket

run_eval.py runs each triggering check by spawning a nested claude -p via ProcessPoolExecutor + subprocess.Popen (around lines 16 and 85). On Windows every one of these calls fails with WinError 10038, so no triggering verdict is ever obtained.
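
This report does not pin down which call raises WinError 10038. One known source of that error is `select.select()` on a pipe: on Windows, `select` only accepts sockets. If that is what the harness does (an assumption, not verified against run_eval.py), a portable alternative is to read the child's stdout on a background thread and enforce the timeout through a queue. The function name `read_lines_with_timeout` and the stand-in child command are hypothetical.

```python
import queue
import subprocess
import sys
import threading
import time


def read_lines_with_timeout(cmd: list[str], timeout: float) -> list[str]:
    """Collect a child's stdout lines without select(); raise TimeoutError on overrun."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        encoding="utf-8",
    )
    lines_q = queue.Queue()

    def pump() -> None:
        # Blocking reads are fine here because they happen off the main thread.
        for line in proc.stdout:
            lines_q.put(line.rstrip("\n"))
        lines_q.put(None)  # sentinel: stdout closed

    threading.Thread(target=pump, daemon=True).start()

    lines = []
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError(f"child exceeded {timeout}s")
            try:
                item = lines_q.get(timeout=remaining)
            except queue.Empty:
                raise TimeoutError(f"child exceeded {timeout}s") from None
            if item is None:
                break
            lines.append(item)
    finally:
        if proc.poll() is None:
            proc.kill()  # do not leave a timed-out child running
        proc.wait()
    return lines


if __name__ == "__main__":
    # Stand-in for the nested CLI call; any command that prints lines works.
    print(read_lines_with_timeout([sys.executable, "-c", "print('hello')"], timeout=30))
```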

The harness treats a failed query as "did not trigger," so should-not-trigger queries pass by default and the run reports a confident-looking result (accuracy 50%, precision 100%, recall 0%) plus a best_description — when in fact nothing was measured. This is easy to mistake for a real benchmark.
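
The reported numbers are what an all-"no trigger" prediction produces on a balanced eval set. The sketch below reproduces the arithmetic; the `score` function and the 10/10 split are illustrative, and defining precision as 1.0 when nothing is predicted positive is an assumed convention chosen to match the reported output, not a confirmed detail of the harness.

```python
def score(expected: list[bool], predicted: list[bool]) -> dict[str, float]:
    """Accuracy, precision, and recall for boolean trigger predictions."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)
    fp = sum(1 for e, p in pairs if not e and p)
    fn = sum(1 for e, p in pairs if e and not p)
    tn = sum(1 for e, p in pairs if not e and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Assumed convention: no positive predictions means no false positives.
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
    }


# 10 should-trigger and 10 should-not-trigger queries; every query errored
# and was therefore recorded as "did not trigger".
expected = [True] * 10 + [False] * 10
predicted = [False] * 20
metrics = score(expected, predicted)
print(metrics)  # {'accuracy': 0.5, 'precision': 1.0, 'recall': 0.0}
```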

Suggested fix: abort (or loudly fail the iteration) when a high fraction of queries error, instead of scoring errored queries as non-triggers. Separately, the nested-claude -p subprocess path needs a Windows-compatible execution strategy.
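
A minimal sketch of the first half of that suggestion, assuming per-query results can carry an explicit "errored" state. All names here (`QueryResult`, `EvalAborted`, `check_error_rate`, the 20% threshold) are hypothetical and not taken from run_eval.py.

```python
from dataclasses import dataclass
from typing import Optional


class EvalAborted(RuntimeError):
    """Raised when too many queries errored for the metrics to mean anything."""


@dataclass
class QueryResult:
    query: str
    triggered: Optional[bool]  # None means the query errored; it is not a "no"
    error: Optional[str] = None


def check_error_rate(
    results: list[QueryResult], max_error_rate: float = 0.2
) -> list[QueryResult]:
    """Return only the scorable results, or abort if too many queries errored."""
    if not results:
        raise EvalAborted("no query results to score")
    errored = [r for r in results if r.triggered is None]
    rate = len(errored) / len(results)
    if rate > max_error_rate:
        raise EvalAborted(
            f"{len(errored)}/{len(results)} queries errored "
            f"({rate:.0%} > {max_error_rate:.0%}); first error: {errored[0].error}"
        )
    return [r for r in results if r.triggered is not None]
```

With this shape, the run in this report would stop on the first iteration with a message naming WinError 10038 instead of emitting metrics and a best_description.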

Repro:

PYTHONUTF8=1 python -m scripts.run_loop --eval-set eval.json --skill-path <skill> --model claude-opus-4-8 --num-workers 4
