Windows: eval/optimize harness broken — cp1252 read_text() crash + nested claude -p WinError 10038 silently zeroes all results

## Environment
- OS: Windows 11
- Python 3.14.4
- `skill-creator` from `anthropics/skills` (example-skills plugin), run via `python -m scripts.run_loop`

Two Windows bugs make the eval / description-optimization harness unusable; the second is the dangerous one because it **fails silently and reports plausible-looking but meaningless metrics**.

## Bug 1 — `parse_skill_md` crashes on any non-cp1252 SKILL.md (blocking)

`scripts/utils.py:9` reads SKILL.md with no explicit encoding:

```python
content = (skill_path / "SKILL.md").read_text()
```

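On Windows, `read_text()` defaults to the locale's ANSI code page (cp1252 on most Western-locale installs), so a UTF-8 SKILL.md containing non-ASCII characters such as emoji (✅ 🟡), arrows (→), em-dashes (—), or middots (·) is decoded with the wrong codec. Most such bytes only turn into mojibake, but as soon as the file contains one of the five bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), the whole run aborts: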

```
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2363: character maps to <undefined>
  File ".../scripts/utils.py", line 9, in parse_skill_md
    content = (skill_path / "SKILL.md").read_text()
```

The same latent issue exists at many other call sites: `scripts/quick_validate.py:22`, `generate_report.py:314`, and numerous `open()` / `write_text()` calls in `aggregate_benchmark.py`, `run_eval.py`, `run_loop.py`, `improve_description.py`.

**Fix:** pass `encoding="utf-8"` to every `read_text()` / `write_text()` / `open()` in `scripts/`. Minimal blocker fix:

```python
content = (skill_path / "SKILL.md").read_text(encoding="utf-8")
```

Workaround: enable Python's UTF-8 mode, either by setting `PYTHONUTF8=1` in the environment or by running `python -X utf8`.
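To locate the remaining call sites named in the fix above, a small AST scan can list `read_text()` / `write_text()` / `open()` calls that pass no `encoding=`. This is a heuristic sketch, not part of the repository: `find_missing_encoding` is a hypothetical helper, it matches on method name only, and it guesses the position of the `mode` argument (second for builtin `open`, first for `Path.open`).

```python
import ast

TEXT_IO_CALLS = {"read_text", "write_text", "open"}


def _is_binary_open(call: ast.Call) -> bool:
    """Heuristic: True if the call's mode argument is a literal containing 'b'."""
    mode = None
    for kw in call.keywords:
        if kw.arg == "mode":
            mode = kw.value
    if mode is None:
        # Builtin open(file, mode) vs. Path.open(mode): an assumption based on call shape.
        index = 1 if isinstance(call.func, ast.Name) else 0
        if len(call.args) > index:
            mode = call.args[index]
    return (
        isinstance(mode, ast.Constant)
        and isinstance(mode.value, str)
        and "b" in mode.value
    )


def find_missing_encoding(source: str, filename: str = "<string>") -> list[tuple[int, str]]:
    """Return sorted (line, call_name) pairs for text I/O calls without encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name not in TEXT_IO_CALLS:
            continue
        if any(kw.arg == "encoding" for kw in node.keywords):
            continue
        if name == "open" and _is_binary_open(node):
            continue
        hits.append((node.lineno, name))
    return sorted(hits)


sample = (
    "from pathlib import Path\n"
    "p = Path('SKILL.md')\n"
    "a = p.read_text()\n"
    "b = p.read_text(encoding='utf-8')\n"
    "with open('x.bin', 'rb') as f:\n"
    "    pass\n"
    "with open('y.txt') as f:\n"
    "    pass\n"
)
for line, name in find_missing_encoding(sample):
    print(f"line {line}: {name}() has no encoding=")
```

Python 3.10+ can also flag these at runtime: `python -X warn_default_encoding` (or `PYTHONWARNDEFAULTENCODING=1`) emits an `EncodingWarning` whenever the locale default is used implicitly.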

## Bug 2 — nested `claude -p` fails on Windows and the failure is scored as "no trigger" (critical)

After working around Bug 1 with `PYTHONUTF8=1`, `run_loop` completes with exit code 0, but **every** query (should-trigger and should-not-trigger alike) returns `trigger_rate: 0.0`, and stderr is flooded with ~40 copies per iteration of:

```
Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket
```

`run_eval.py` runs each triggering check by spawning a nested `claude -p` via `ProcessPoolExecutor` + `subprocess.Popen` (≈ lines 16 / 85). On Windows every subprocess call fails with WinError 10038, so no triggering verdict is ever obtained.
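WinError 10038 is Winsock's WSAENOTSOCK. One well-known way to hit it from Python is calling `select.select()` on a pipe, since on Windows `select` only accepts sockets. Whether `run_eval.py` polls the child's stdout that way is an assumption here, not something verified against the source. If it does, a portable alternative is to drain the pipe on a background thread and apply the timeout to a queue instead. The sketch below uses a hypothetical `stream_lines` helper and a trivial Python child process as a stand-in for `claude -p`.

```python
import queue
import subprocess
import sys
import threading
import time


def stream_lines(cmd: list[str], timeout: float = 30.0) -> list[str]:
    """Run cmd and collect its stdout lines without select(), with an overall timeout."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        encoding="utf-8",
    )
    lines_q = queue.Queue()

    def pump() -> None:
        # Blocking reads happen here, so the main thread never touches the pipe.
        for line in proc.stdout:
            lines_q.put(line.rstrip("\n"))
        lines_q.put(None)  # sentinel: the child closed stdout

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    deadline = time.monotonic() + timeout
    lines = []
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError(f"no EOF from child after {timeout}s")
            try:
                item = lines_q.get(timeout=remaining)
            except queue.Empty:
                raise TimeoutError(f"no EOF from child after {timeout}s") from None
            if item is None:
                break
            lines.append(item)
    finally:
        if proc.poll() is None:
            proc.kill()  # timeout or error: stop the child so the reader sees EOF
        proc.wait()
        reader.join(timeout=1.0)
        if not reader.is_alive():
            proc.stdout.close()
    return lines


# Stand-in child process; the real harness would pass its claude command here.
demo = stream_lines([sys.executable, "-c", "print('first'); print('second')"])
print(demo)
```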

The harness treats a failed query as "did not trigger," so should-not-trigger queries pass by default and the run reports a confident-looking result (`accuracy 50%, precision 100%, recall 0%`) plus a `best_description` — when in fact nothing was measured. This is easy to mistake for a real benchmark.
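The reported numbers follow mechanically from "every query counted as not triggered". The arithmetic below is a sketch under two assumptions that are not confirmed from the harness source: the eval set is balanced (which the 50% accuracy suggests), and precision is reported as 1.0 when there are no positive predictions (which the `precision 100%` suggests). The `score` function is illustrative, not the harness's own.

```python
def score(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """results holds (should_trigger, triggered) pairs; returns basic metrics."""
    tp = sum(1 for should, got in results if should and got)
    fp = sum(1 for should, got in results if not should and got)
    tn = sum(1 for should, got in results if not should and not got)
    fn = sum(1 for should, got in results if should and not got)
    return {
        "accuracy": (tp + tn) / len(results),
        # Assumption: with zero positive predictions, precision defaults to 1.0.
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
    }


# Hypothetical balanced eval set: 10 should-trigger and 10 should-not-trigger
# queries, all recorded as "not triggered" because every subprocess call failed.
all_failed = [(True, False)] * 10 + [(False, False)] * 10
metrics = score(all_failed)
print(metrics)  # accuracy 0.5, precision 1.0, recall 0.0
```

None of these values depend on the description being evaluated, so every candidate description scores identically.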

**Suggested fix:** abort (or loudly fail the iteration) when a high fraction of queries error, instead of scoring errored queries as non-triggers. Separately, the nested-`claude -p` subprocess path needs a Windows-compatible execution strategy.
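One possible shape for the first half of that fix is to record errors as a third outcome and refuse to score an iteration when too many queries errored. This is a sketch only: `QueryResult`, `HarnessError`, `check_error_rate`, and the 20% threshold are all hypothetical names and values, not existing harness code.

```python
from dataclasses import dataclass
from typing import Optional


class HarnessError(RuntimeError):
    """Raised when too many queries errored for the metrics to mean anything."""


@dataclass
class QueryResult:
    query: str
    should_trigger: bool
    triggered: Optional[bool]  # None means the query errored, not "did not trigger"
    error: Optional[str] = None


def check_error_rate(results: list[QueryResult], max_error_rate: float = 0.2) -> list[QueryResult]:
    """Return only the results that produced a verdict, or raise if too many errored."""
    errored = [r for r in results if r.triggered is None]
    rate = len(errored) / len(results) if results else 0.0
    if rate > max_error_rate:
        first = errored[0].error or "unknown error"
        raise HarnessError(
            f"{len(errored)}/{len(results)} queries errored "
            f"(limit {max_error_rate:.0%}); first error: {first}"
        )
    return [r for r in results if r.triggered is not None]


# Every query failed the way the issue describes, so the iteration is rejected.
broken_run = [
    QueryResult(f"query {i}", should_trigger=(i % 2 == 0), triggered=None,
                error="[WinError 10038] An operation was attempted on something that is not a socket")
    for i in range(4)
]
try:
    check_error_rate(broken_run)
except HarnessError as exc:
    print("aborting iteration:", exc)
```

Only the results returned by `check_error_rate` would then be passed to scoring, so an errored query can never count as a correct non-trigger.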

**Repro** (POSIX-shell syntax, e.g. Git Bash; in PowerShell, set `$env:PYTHONUTF8 = "1"` first):
```
PYTHONUTF8=1 python -m scripts.run_loop --eval-set eval.json --skill-path <skill> --model claude-opus-4-8 --num-workers 4
```

