Make live-test gate non-negotiable + run-without-asking before releases

phuryn · phuryn · commit ea4f8d4f4cbe · 2026-06-15T08:18:34.000+02:00
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -49,9 +49,9 @@ There are **three** kinds of tests, and it matters which is which:
 
 1. **`npm test` — grok-free unit/DOM/integration suite (414 tests).** Pure logic, happy-dom tests that drive the real `media/chat.js`, a real-`/bin/sh` TerminalManager smoke, and a fake-CLI ACP integration suite (`test/fixtures/fake-grok-acp.cjs`). **Never spawns the real `grok` binary.** Runs in <2s with no network, no login, no subscription. This is the floor — every change keeps it green.
 2. **CI — the *same* suite.** `.github/workflows/ci.yml` runs `npm ci && npm test && npm run package` on a clean Ubuntu box. **CI ≡ layer 1, verbatim** — there is no separate CI-only set. CI has no `grok` binary, no auth, no SuperGrok subscription, so it *cannot* run anything that touches the real CLI. That's the whole reason layer 1 is grok-free.
-3. **`npm run test:live` — on-demand pre-release suite against REAL grok (`scripts/live-tests.cjs`).** Spawns the actual `grok agent stdio` and exercises the surfaces layers 1–2 can't: the real ACP handshake, a prompt round-trip, session restore, plan-mode enforcement, and the v1.4.x generative features (image gen, video gen; the subagent path is exercised opportunistically and SKIPs when grok doesn't delegate — it's deferred/research-only). It **reuses the real compiled modules** (`out/acp-dispatch.js`, `out/plan-gate.js`, `media/webview-helpers.js`) — it feeds genuine wire output through the same `isMediaGenToolCall`/`extractGeneratedMediaPaths`/`isSubagentToolCall`/`shouldBlockWrite` the extension uses, not a re-implementation. **Run it manually before every release-to-`main`** (the user triggers it; it needs a logged-in grok + subscription and burns credits, so it must never be in `npm test` or CI). Flags: `--quick` (skip the slow generative tests), `--only=`, `--skip=`, `GROK_BIN=…`. A SKIP (no subscription, grok chose not to delegate, etc.) does not fail the gate — only a FAIL does. Real-grok **diagnostic probes** (`research/*.cjs`) remain manual one-offs for capturing wire shapes; the live suite is the repeatable gate.
+3. **`npm run test:live` — on-demand pre-release suite against REAL grok (`scripts/live-tests.cjs`).** Spawns the actual `grok agent stdio` and exercises the surfaces layers 1–2 can't: the real ACP handshake, a prompt round-trip, session restore, plan-mode enforcement, and the v1.4.x generative features (image gen, video gen; the subagent path is exercised opportunistically and SKIPs when grok doesn't delegate — it's deferred/research-only). It **reuses the real compiled modules** (`out/acp-dispatch.js`, `out/plan-gate.js`, `media/webview-helpers.js`) — it feeds genuine wire output through the same `isMediaGenToolCall`/`extractGeneratedMediaPaths`/`isSubagentToolCall`/`shouldBlockWrite` the extension uses, not a re-implementation. **Always run it before every release-to-`main` — it's a non-negotiable, standing part of the release gate; run it without asking** (it needs a logged-in grok + subscription and burns credits, so it must never be in `npm test` or CI, but it is mandatory before any tag/release). Flags: `--quick` (skip the slow generative tests), `--only=`, `--skip=`, `GROK_BIN=…`. A SKIP (no subscription, grok chose not to delegate, etc.) does not fail the gate — only a FAIL does. Real-grok **diagnostic probes** (`research/*.cjs`) remain manual one-offs for capturing wire shapes; the live suite is the repeatable gate.
 
-**So:** local == CI (both grok-free). The real-grok tests are a separate, manual, pre-release gate — run on request, not on commit.
+**So:** local == CI (both grok-free). The real-grok tests are a separate, mandatory pre-release gate — always run before a tag/release (no need to ask), never on every commit.
 
 ### grok CLI version + updating
 
@@ -122,7 +122,7 @@ It reads the version from `package.json`, runs the gate, builds the vsix, commit
 What the script encodes, step by step:
 
 1. Bump `version` in `package.json` (user-initiated) and add the dated section to `CHANGELOG.md`.
-2. `npm test` (414-test floor, all green) + `tsc -p . --noEmit` clean.
+2. `npm test` (414-test floor, all green) + `tsc -p . --noEmit` clean, **and `npm run test:live` against real grok — mandatory, run without asking** (the `release.*` scripts don't run it, so run it by hand before invoking them).
 3. Commit + push to `main` (direct-to-main, no feature branches).
 4. **Annotated git tag** `vX.Y.Z` at the release commit → `git tag -a vX.Y.Z -m "Release vX.Y.Z"` → `git push origin vX.Y.Z`.
 5. **GitHub Release** for that tag → `gh release create vX.Y.Z --title "Release vX.Y.Z" --notes-file <notes> <vsix>` (notes = the new changelog section(s); include any earlier version that was bumped but never released). **Always attach the built `grok-vscode-phuryn-X.Y.Z.vsix` as a release asset** so the exact installable build is downloadable from the release.
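
Step 2 now has three checks that must all pass before the `release.*` scripts are invoked, and the live suite has to be started by hand. A minimal sketch of a wrapper that runs them in order and stops at the first failure might look like the following. It is not part of this commit or of the repo's release scripts: the names `GATE_STEPS`, `runInShell`, and `runGate` are hypothetical, invoking `tsc` through `npx` is an assumption, and it assumes `npm run test:live` exits non-zero only on a FAIL (not on a SKIP).

```javascript
// Hypothetical pre-release wrapper; NOT part of the repo's release.* scripts.
const { spawnSync } = require("node:child_process");

// The three checks from step 2, in order.
// Assumption: tsc is invoked through npx; the source only says `tsc -p . --noEmit`.
const GATE_STEPS = [
  "npm test",
  "npx tsc -p . --noEmit",
  "npm run test:live",
];

// Default runner: run one command through the shell and return its exit status.
// A null status (process killed by a signal) is treated as a failure.
function runInShell(command) {
  const result = spawnSync(command, { shell: true, stdio: "inherit" });
  return result.status === null ? 1 : result.status;
}

// Run each step in order and stop at the first non-zero exit status.
// `run` is injectable so the ordering logic can be exercised without npm or grok.
function runGate(steps = GATE_STEPS, run = runInShell) {
  for (const step of steps) {
    const status = run(step);
    if (status !== 0) {
      return { ok: false, failedStep: step };
    }
  }
  return { ok: true, failedStep: null };
}
```

Calling `runGate()` with no arguments would run the real commands, so the live step still needs a logged-in grok and a subscription and still burns credits. Keeping the live suite as the last step means the free, fast checks fail first.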