Three layers:
1. Grok-free automated tests (Vitest): pure-logic unit tests plus happy-dom DOM tests that drive the real `media/chat.js`, plus a fast TerminalManager suite that spawns real `/bin/sh` children. 409 tests, all passing in a few seconds. The per-file counts below predate several feature releases (voice, ask-question, plan-mode, v1.4.0 media/subagent/logout) and are indicative, not exact; `npm test` is the source of truth. None of them spawn the `grok` binary, so the whole suite runs in CI on a clean Ubuntu box (`.github/workflows/ci.yml` runs `npm ci && npm test && npm run package` and never installs grok). CI runs this exact suite: `npm test` locally ≡ CI, verbatim.
2. Real-grok pre-release suite (`npm run test:live`, `scripts/live-tests.cjs`): an on-demand, run-on-request gate that spawns the real `grok` binary and drives it over ACP end-to-end: handshake, prompt round-trip, session restore, plan-mode gate, image gen, video gen (subagent delegation is exercised opportunistically and SKIPs when grok doesn't delegate; it's deferred/research-only). It reuses the real compiled modules (`out/acp-dispatch.js`, `out/plan-gate.js`, `media/webview-helpers.js`) so it tests shipped logic, not re-implementations. Non-deterministic / entitlement-gated outcomes SKIP (don't fail the gate); only a real regression FAILS. It is never run by `npm test` or CI: it needs an authenticated `grok` + network + subscription, and it's the human's pre-release checklist, not a commit gate. Flags: `--quick`, `--only=<name>`, `--skip=<name>`, `GROK_BIN=<path>`. See CLAUDE.md § Test taxonomy.
3. VS Code integration tests (deferred to v0.2 with `@vscode/test-electron`): cover command registration, view lifecycle, settings reads, and the diff editor. Deferred because they require a headed VS Code, are slow, and the modules already cover the bug-prone surface.
Separately, grok-dependent probes live as standalone scripts under research/*.cjs. They exercise the real CLI's ACP behavior (e.g. confirming exit_plan_mode treats any client reply as approval, or capturing the native-Windows media/subagent wire shapes) and are run manually — Vitest's include glob is test/**/*.test.ts, so it never collects them. They're non-destructive (ACK writes without touching disk and run in a temp cwd) and require a grok binary on PATH; CI doesn't run them. The probes are the discovery tool (capture an undocumented shape once); layer 2 is the regression tool (re-verify the shapes still hold before each release).
The goal of layers (1)+(2) is to make the protocol surface and UI logic regression-proof. Layer 1 catches logic regressions on every commit; layer 2 catches CLI-contract drift (a new grok version changing a wire shape) before each release.
Includes v1.4.0 generated-media extraction: isMediaGenToolCall / extractGeneratedMediaPaths covering both wire forms — the Linux/macOS JSON-in-text (image_gen, image_to_video) and the native-Windows prose-in-text (Image/Video generated and saved to \\?\C:\…, tool names image_gen / video_gen, variants ImageGen / VideoGen) — with image-vs-video classification, \\?\ extended-path stripping, the trailing-period-not-swallowed guard, and the collapsed-resume shape. Plus the ACP-standard extractImageContent/collectToolImages fallback.
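The native-Windows prose form can be illustrated with a minimal sketch. It is not the shipped `extractGeneratedMediaPaths`: the function names, the return shape, and the single-line regex are all assumptions, and it handles only the prose form (not the JSON-in-text form or the collapsed-resume shape).

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the shipped API.
type MediaKindSketch = "image" | "video";
interface GeneratedMediaSketch { kind: MediaKindSketch; path: string }

// The four characters \\?\ that prefix a Windows extended-length path.
const EXTENDED_PREFIX = "\\\\?\\";

function stripExtendedPrefixSketch(p: string): string {
  return p.indexOf(EXTENDED_PREFIX) === 0 ? p.slice(EXTENDED_PREFIX.length) : p;
}

// Assumes the prose is a single line ending in the path, optionally followed by a period.
function extractProseMediaPathSketch(text: string): GeneratedMediaSketch | null {
  // The lazy capture plus the optional trailing "\." keeps a sentence-ending
  // period out of the path while preserving the extension's own dot.
  const m = /(Image|Video) generated and saved to (\S.*?)\.?\s*$/.exec(text);
  if (!m) return null;
  const kind: MediaKindSketch = m[1] === "Video" ? "video" : "image";
  return { kind, path: stripExtendedPrefixSketch(m[2] ?? "") };
}
```

Fed `Image generated and saved to \\?\C:\out\cat.png.`, the sketch classifies the result as an image and yields `C:\out\cat.png` with neither the prefix nor the final period.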
The wire format is the highest-value test surface: ACP changes break everything else if we miss them.
- `parseAcpLine`
  - Returns `null` for empty / whitespace-only input
  - Flags non-JSON lines as `{kind:"non-json"}` so the host can log them
  - Recognizes responses by `id` + missing `method`
  - Carries `error` through on error responses
  - Recognizes `session/update` notifications by method name
  - Recognizes server→client requests with both `method` and `id`
- `routeSessionUpdate`: every documented update tag has an explicit route
  - `agent_message_chunk` → `{event:"messageChunk", text}`
  - `agent_thought_chunk` → `{event:"thoughtChunk", text}`
  - `tool_call` and `tool_call_update`
  - `current_mode_update` → carries `modeId` (drives the bottom-toolbar mode button)
  - `available_commands_update` → carries `commands`
  - Unknown tags fall through to `{event:"update"}` (forward-compat)
  - Missing `content.text` defaults to empty string (defensive)
- `extractPromptMeta`: pulls token counts out of `_meta` for the donut and handles missing `_meta` gracefully
- Response builders: `makePermissionResponse`, `makeExitPlanResponse`, `makeAckResponse`, `makeRequest`. These encode the exact shapes the agent expects. Bugs here are silent.
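The classification rules for `parseAcpLine` can be sketched roughly as follows. Only the `null` and `{kind:"non-json"}` results come from the list above; the other `kind` values, the field names, and the treatment of a JSON object with neither `id` nor `method` are assumptions made for illustration.

```typescript
// Hypothetical result shapes; the shipped module may differ.
type ParsedLineSketch =
  | null
  | { kind: "non-json"; raw: string }
  | { kind: "response"; id: number | string; result?: unknown; error?: unknown }
  | { kind: "notification"; method: string; params?: unknown }
  | { kind: "request"; id: number | string; method: string; params?: unknown };

function parseAcpLineSketch(line: string): ParsedLineSketch {
  const trimmed = line.trim();
  if (trimmed === "") return null; // empty / whitespace-only input

  let msg: unknown;
  try {
    msg = JSON.parse(trimmed);
  } catch {
    return { kind: "non-json", raw: trimmed }; // the host can log it
  }
  if (typeof msg !== "object" || msg === null || Array.isArray(msg)) {
    return { kind: "non-json", raw: trimmed };
  }

  const m = msg as Record<string, unknown>;
  const id = m["id"];
  const method = m["method"];

  if (typeof method === "string") {
    // method + id: server→client request; method alone: notification.
    if (typeof id === "number" || typeof id === "string") {
      return { kind: "request", id, method, params: m["params"] };
    }
    return { kind: "notification", method, params: m["params"] };
  }
  if (typeof id === "number" || typeof id === "string") {
    // id without method: a response; `error` is carried through untouched.
    return { kind: "response", id, result: m["result"], error: m["error"] };
  }
  // Assumption: an object with neither field is treated like a non-JSON line.
  return { kind: "non-json", raw: trimmed };
}
```

A `session/update` line carries a `method` but no `id`, so it lands in the notification branch, which is where a router such as `routeSessionUpdate` would pick it up.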
- Implicit chips have stable ids (so the active-editor watcher can replace them)
- Explicit chips have unique ids even when created in the same millisecond (regression: original `Date.now()` impl collided)
- `removeChip` / `toggleChip` are pure (don't mutate inputs)
- `clearImplicitChips` leaves explicit chips intact
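One way to get collision-free explicit ids while keeping implicit ids stable is to pair the timestamp with a monotonic counter. The sketch below is illustrative only: the id formats, the `ChipSketch` shape, and the function names are assumptions, not the shipped implementation.

```typescript
// Hypothetical chip shape and id formats.
interface ChipSketch { id: string; relPath: string; implicit: boolean }

let chipSeq = 0;

// Two calls in the same millisecond still differ because the counter advances.
function nextExplicitChipIdSketch(now: number = Date.now()): string {
  chipSeq += 1;
  return `chip-${now}-${chipSeq}`;
}

// Implicit chips derive their id from the path, so a watcher can replace them in place.
function implicitChipIdSketch(relPath: string): string {
  return `implicit:${relPath}`;
}

// Pure: returns a new array and leaves the input untouched.
function removeChipSketch(chips: readonly ChipSketch[], id: string): ChipSketch[] {
  return chips.filter((c) => c.id !== id);
}

function clearImplicitChipsSketch(chips: readonly ChipSketch[]): ChipSketch[] {
  return chips.filter((c) => !c.implicit);
}
```

Passing the same `now` twice is the deterministic equivalent of two chips created in one millisecond, which is how a regression test can pin the collision without racing the clock.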
- Bare text passes through
- File-only chip → `@relPath` reference
- Selection chip → fenced code block with the right language tag and line range
- Hidden chips are skipped
- Falls back to `@ref` when the file can't be read
- Multiple chips concatenate cleanly
- Files without extensions get an empty fence language
- `getSlashQuery` only activates after `/` at line-start or newline (no false positives on `path/foo/bar`)
- Empty query returns the full command list
- Prefix filter is case-insensitive
- `applySlashPick` replaces only the slash token, preserves trailing text, returns the new caret position
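The slash-menu behavior listed above can be approximated with the sketch below. The signatures (text plus caret offset in, query or `null` out) and the rule that whitespace ends the token are assumptions made for illustration; the shipped helpers may be shaped differently.

```typescript
// Returns the text typed after a line-leading "/", or null when the menu should stay closed.
function getSlashQuerySketch(text: string, caret: number): string | null {
  const before = text.slice(0, caret);
  const lineStart = before.lastIndexOf("\n") + 1; // 0 when there is no newline
  const token = before.slice(lineStart);
  if (token.charAt(0) !== "/") return null; // "path/foo/bar" never activates
  const query = token.slice(1);
  if (/\s/.test(query)) return null; // assumption: whitespace ends the slash token
  return query;
}

// Case-insensitive prefix filter; an empty query matches every command.
function filterSlashCommandsSketch(commands: readonly string[], query: string): string[] {
  const q = query.toLowerCase();
  return commands.filter((c) => c.toLowerCase().indexOf(q) === 0);
}

// Replaces only the slash token on the caret's line and keeps the text after the caret.
function applySlashPickSketch(
  text: string,
  caret: number,
  command: string
): { text: string; caret: number } {
  const lineStart = text.slice(0, caret).lastIndexOf("\n") + 1;
  const replacement = `/${command}`;
  return {
    text: text.slice(0, lineStart) + replacement + text.slice(caret),
    caret: lineStart + replacement.length,
  };
}
```

For example, with the composer holding `/he rest` and the caret after `/he`, picking `help` produces `/help rest` with the caret at offset 5.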
These actually spawn real /bin/sh children — fast enough to keep in the unit suite.
- Captures stdout from a quick command + exit code
- Captures stderr and nonzero exit
- Honors `outputByteLimit` and sets the `truncated` flag
- Returns `exitStatus: null` while still running
- Injects env from ACP-style `[{name, value}]` pairs
- Honors `cwd`
- `waitForExit` resolves on repeated calls after exit
- Throws on unknown terminalId
- `kill` / `release` on missing id is a no-op
- `disposeAll` kills outstanding terminals
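The env-injection case turns ACP-style `[{name, value}]` pairs into the flat map a child process expects. A minimal sketch follows; the helper name and the choice that request pairs override inherited variables are assumptions, not confirmed TerminalManager behavior.

```typescript
// ACP-style environment entry as described in the list above.
interface AcpEnvVarSketch { name: string; value: string }

type EnvMapSketch = Record<string, string | undefined>;

// Hypothetical helper: copies the base environment, then layers the pairs on top.
function envFromAcpPairsSketch(
  base: EnvMapSketch,
  pairs: readonly AcpEnvVarSketch[] = []
): EnvMapSketch {
  const env: EnvMapSketch = { ...base }; // never mutate the caller's environment
  for (const { name, value } of pairs) {
    env[name] = value; // assumption: later pairs and request pairs win
  }
  return env;
}
```

The result can be handed to a spawn call as its environment, while the base map (for example `process.env`) stays unchanged.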
- Configured path wins if it exists
- Returns `undefined` when configured path is missing
- Falls back to PATH lookup; `~/.grok/bin/grok` is also accepted when present
- Returns `undefined` when nothing is found
- `extensionWasUpgraded`: true on any version change (incl. a downgrade), false on a fresh install / unchanged version / empty stored version; gates the silent `grok update` the extension runs once when its own version changes
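The `extensionWasUpgraded` truth table is small enough to state directly. This is a sketch of the rule as described, with an assumed parameter order (stored version first, current version second).

```typescript
// Hypothetical signature; the rule itself follows the bullet above.
function extensionWasUpgradedSketch(
  storedVersion: string | undefined,
  currentVersion: string
): boolean {
  // Fresh install or empty stored version: nothing to compare against.
  if (!storedVersion) return false;
  // Any difference counts, including a downgrade.
  return storedVersion !== currentVersion;
}
```

Because the comparison is plain inequality rather than a semver ordering, a downgrade triggers the one-time `grok update` just like an upgrade does.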
- Lists sessions from grok's on-disk layout (`~/.grok/sessions/<urlencoded-cwd>/<id>/`) for the current cwd only
- Display name falls back to the first user message, then to the id, when no customName override exists
- customName overrides (stored in VS Code `globalState`) win over the disk-derived name
- Sorts by most-recently-updated; tolerates malformed/missing session files without throwing
- Delete removes the right entry and leaves others intact
The pure heart of client-side plan enforcement. No spawn, no fs — just the classification logic the two choke points call.
- Workspace-write containment: a write path that resolves inside the workspace cwd is blocked; grok's own `~/.grok/sessions/<…>/plan.md` (outside the workspace) is allowed; relative paths, `..` traversal, and symlink-style escapes are normalized before the containment check
- Read-only command allowlist: `isReadOnlyCommand` passes only when every `|`-separated stage is on the read-only head list (`cat`, `ls`, `grep`, `head`, PowerShell `get-childitem`/`gci`/`get-content`/`select-object`/`test-path`/…); a single mutating stage fails the whole pipeline
- Shell-metachar rejection: redirection (`>`), chaining (`;`, `&&`, `||`), background (`&`), command substitution (`$(…)`, backticks), process substitution (`<(…)`), and script-block braces (`{}`) are rejected outright, so a read-only head can't smuggle a side effect
- Permission / plan-file classification: recognizes grok's plan-file write so it can be allowed-and-snooped rather than blocked
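The allowlist and metachar rules combine into a short predicate. The sketch below is an approximation under stated assumptions: the head list contains only the commands named above (the real list is longer), and head matching is assumed to be case-insensitive so that PowerShell cmdlets match however they are cased.

```typescript
// Only the heads named in the list above; the shipped list is longer.
const READ_ONLY_HEADS_SKETCH = new Set([
  "cat", "ls", "grep", "head",
  "get-childitem", "gci", "get-content", "select-object", "test-path",
]);

// Redirection, chaining, background, command/process substitution, script-block braces.
// "||" is checked here so the pipe split below only ever sees single "|" separators.
const FORBIDDEN_METACHARS_SKETCH = /[>;&`{}]|\|\||\$\(|<\(/;

function isReadOnlyCommandSketch(command: string): boolean {
  if (command.trim() === "") return false;
  if (FORBIDDEN_METACHARS_SKETCH.test(command)) return false;
  // Every pipeline stage must start with an allowlisted head.
  return command.split("|").every((stage) => {
    const head = stage.trim().split(/\s+/)[0] ?? "";
    return READ_ONLY_HEADS_SKETCH.has(head.toLowerCase());
  });
}
```

Rejecting metacharacters before splitting is the important ordering: `ls > out.txt` has an allowlisted head, so only the up-front check stops the redirect.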
Includes the deferred/research-only subagent classifier isSubagentToolCall / subagentLabel (the forward-compat spawn_subagent + subagent_type shape, name/kind/rawInput fallbacks, and the regression guard that grok's get_command_or_subagent_output poller is NOT carded — its name contains "subagent" but it's a background-task output reader, not a delegation). The classifier is kept tested as forward-compat scaffolding, but grok 0.2.x doesn't emit spawn_subagent over ACP so the card rarely fires; see research/subagents.md.
Shared between the shipped webview and the tests (media/webview-helpers.js).
- File-ref detection: recognizes `@path` mentions and bare path-looking tokens, ignores prose
- Relative-time formatting: "just now" / "Nm" / "Nh" / "Nd" buckets, singular/plural, far-future and far-past edges
happy-dom test (see Webview DOM tests below). Drives the shipped media/chat.js, dispatches the messages sidebar.ts posts, clicks the rendered buttons, asserts on the postMessage payload that goes back to the host.
- Renders the card with plan body, feedback textarea, and three buttons: Approve & implement / Reject / Cancel
- All three verdicts carry the trimmed `comment` when the textarea has text, and omit the `comment` key when it's empty: "Reject" → `verdict:"rejected"`, "Approve & implement" → `verdict:"approved"`, "Cancel" → `verdict:"abandoned"`
- A click resolves the card, highlights the chosen button (`.chosen`), shows the verdict label, and disables the buttons and the textarea (no double-submit)
- The plan body's plan-link opens the plan snapshot without resolving the approval card (live and restored-plan variants)
- `planNotice` / `planBlocked` (command + write variants) render a `.plan-notice` with the right text
- Read-only plan-history card renders with the persisted verdict label
- Request timer lifecycle: a resolved `request()` clears its timeout (no leaked timer).
- Spawn argv: `buildGrokAgentArgs()` returns `["agent", "stdio"]` with no effort, and `["agent", "--reasoning-effort", <value>, "stdio"]` (flag before the subcommand) for a valid effort.
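The argv contract can be sketched as below. The two output shapes are taken from the bullet above; the set of valid effort values and the fallback for an invalid value are assumptions, since the source only specifies the "no effort" and "valid effort" cases.

```typescript
// Assumption: illustrative effort values; the real validation list may differ.
const ASSUMED_EFFORTS_SKETCH: readonly string[] = ["low", "medium", "high"];

function buildGrokAgentArgsSketch(effort?: string): string[] {
  if (effort !== undefined && ASSUMED_EFFORTS_SKETCH.indexOf(effort) !== -1) {
    // The flag goes between "agent" and the "stdio" subcommand.
    return ["agent", "--reasoning-effort", effort, "stdio"];
  }
  // No effort (or, by assumption, an unrecognized one): plain invocation.
  return ["agent", "stdio"];
}
```

Dropping an unrecognized value instead of forwarding it keeps a bad setting from ever reaching the CLI, which matters because the fake CLI in the integration suite exits 2 on unexpected argv.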
Spawns the fake grok agent stdio from test/fixtures/fake-grok-acp.cjs (a ~190-line ACP server encoding only what the protocol requires, not grok version quirks), and drives src/acp.ts AcpClient against it over real JSON-RPC stdio. Cross-platform: .cmd wrapper on Windows, .sh wrapper elsewhere; subprocess startup adds ~50–100ms per test (same order as terminal-manager).
- Lifecycle: spawn → initialize → session/new succeeds; a basic prompt round-trips with `_meta.totalTokens`.
- Startup effort forwarding: with a valid `effort` configured, the fake CLI (which exits 2 on any unexpected argv) accepts `agent --reasoning-effort <value> stdio` and the session starts, proving the forwarded arg shape.
- Plan-snoop: grok's plan.md write (outside the workspace) is allowed AND emits `planFileContent` with the snooped text; the host's `exitPlanRequest` event fires with that content; the file actually lands on disk.
- Workspace-write gate: with `planActive=true`, `fs/write_text_file` for a path inside the workspace is refused with PLAN_BLOCKED, emits `mutationBlocked`, no file lands.
- Workspace-write gate (off): with `planActive=false`, the same write succeeds end-to-end.
- Terminal-create gate (mutating): with `planActive=true`, `terminal/create` for `rm -rf` is refused; the host's terminal handler is never called.
- Terminal-create gate (read-only): with `planActive=true`, `terminal/create` for `ls -la` is allowed and reaches the terminal handler.
Pure helpers extracted into src/plan-restore.ts specifically for unit testing: no vscode, no fs, no ACP client to mock.
- `appendPlanEntry`: chronological append; creates a new list from `undefined`; doesn't mutate input; preserves plan text verbatim (regression: `lastPlanText` was being wiped before persist, so saved entries showed `"(empty plan)"`); tolerates legacy entries with no `afterUserMessage`
- `decideRestoreState`: given the saved log, returns whether to raise the gate and what mode to set on the CLI. Last verdict `rejected` → restore Plan mode; `approved` / `abandoned` → Agent mode; no log / undefined → Agent mode (legacy session, safe default)
- End-to-end scenarios: user rejects then closes VS Code → restore in Plan mode; rejects then approves → Agent mode; rejects then cancels → Agent mode (the regression where Cancel kept restoring into Plan mode); legacy session → Agent mode with no surprise gate
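The restore decision depends only on the last persisted verdict, which a small sketch makes concrete. The entry and result shapes, the `"plan"` / `"agent"` mode ids, and the rule that the gate is raised exactly when Plan mode is restored are assumptions; the verdict-to-mode mapping follows the list above.

```typescript
type VerdictSketch = "approved" | "rejected" | "abandoned";

// Hypothetical entry shape; afterUserMessage is optional for legacy entries.
interface PlanEntrySketch { text: string; verdict: VerdictSketch; afterUserMessage?: number }

interface RestoreStateSketch { raiseGate: boolean; mode: "plan" | "agent" }

// Pure append: builds a new list even from undefined and never mutates the input.
function appendPlanEntrySketch(
  log: readonly PlanEntrySketch[] | undefined,
  entry: PlanEntrySketch
): PlanEntrySketch[] {
  return (log ?? []).concat([entry]);
}

function decideRestoreStateSketch(
  log: readonly PlanEntrySketch[] | undefined
): RestoreStateSketch {
  const last = log && log.length > 0 ? log[log.length - 1] : undefined;
  // Only a trailing rejection means the user was still planning.
  if (last !== undefined && last.verdict === "rejected") {
    return { raiseGate: true, mode: "plan" };
  }
  // approved, abandoned, empty log, or legacy session: safe default.
  return { raiseGate: false, mode: "agent" };
}
```

Reading only the final entry is what makes the "rejects then cancels" scenario come out as Agent mode: the earlier rejection no longer matters once a later verdict exists.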
happy-dom test driving the shipped webview through a planHistoryQueue + session/load replay sequence. This is the visual side of the state machine — what actually renders, in what order, after the host sends saved plans plus a stream of replayed messages.
- Empty queue → no plan-history cards
- Positioned plan (`afterUserMessage: N`) → interleaved at the right user-message boundary, not dumped at the bottom
- Plan positioned after the last replayed user message → flushed at end of replay
- Legacy plans without `afterUserMessage` → always flushed at end (back-compat with sessions saved before per-plan persistence)
- Multiple plans at distinct positions → each lands at its boundary
- Multiple plans at the same position → drain together before the next user message
- Live user message after restore → still drains queued plans inline (no replay required)
- Fresh session edge case (queue arrives without `historyReplay` toggle) → drained on the first live message
- `clearMessages` → queue + counter reset
- All three verdict buttons (Approve / Reject / Cancel) → produce matching status labels + `.chosen` highlight
- `agentReset` removes the in-flight agent bubble
- Subsequent `messageChunk` after `agentReset` creates a fresh bubble (the false-approval text doesn't leak through)
happy-dom test locking in the native-Windows regressions this build fixed (plus later busy/version/dedup behavior), so they can't silently come back:
- History popover — opens on the history button (and requests the session list), toggles closed on re-click, closes on an outside click but stays open on a click inside it
- Session rows: whole row resumes (clicking the meta area, not just the label, posts `resumeSession`); the delete and rename action buttons `stopPropagation` so they don't also resume
- Mode picker: offers Agent / Plan / YOLO, posts `setMode` with the chosen id, closes on select, toggles closed on re-click
- Reasoning trace: a thought chunk renders a collapsed thinking block whose header click toggles the body open/closed (chevron ▶/▼)
- Gear settings lock: the model button shows the friendly name (not the raw id); model + effort controls are disabled while busy/priming and re-enable when busy clears
- User-message dedup: a `user_message_chunk` echoed live (grok ≥0.2.33) never doubles the optimistic bubble; only a `session/load` replay drives user bubbles
- Welcome version lifecycle: flips to "Connected · v" only when session start finishes, not at the bare ACP handshake; later busy toggles don't overwrite it
- Gear menu: the Other group's About sub-view (extension + CLI versions, update check) and Config & debug sub-view render and route correctly
- `parseFileRef` parses `path#L<n>` / `path#L<a>-<b>` open-file refs (single line + range), tolerating a bare path
- `shouldReadFileInline` guards against inlining a too-large file, so a huge file is referenced by `@path` instead of pasted into the prompt
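The `path#L<n>` / `path#L<a>-<b>` grammar is compact enough to sketch with one regular expression. The result field names are hypothetical, and treating a single-line ref as a one-line range (`endLine === startLine`) is an assumption.

```typescript
// Hypothetical result shape; a bare path carries no line fields.
interface FileRefSketch { path: string; startLine?: number; endLine?: number }

function parseFileRefSketch(ref: string): FileRefSketch {
  // Path, then "#L", a line number, and an optional "-<end>" suffix.
  const m = /^(.*)#L(\d+)(?:-(\d+))?$/.exec(ref);
  if (!m) return { path: ref }; // tolerate a bare path
  const startLine = Number(m[2]);
  const endLine = m[3] !== undefined ? Number(m[3]) : startLine;
  return { path: m[1] ?? "", startLine, endLine };
}
```

Anchoring the pattern at both ends means a `#L` fragment in the middle of a longer string is left alone and the whole input is returned as the path.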
- STT request/response/error shaping for the batch (REST) and streaming (WebSocket) endpoints
- Per-platform `ffmpeg` arg construction (DirectShow/dshow on Windows, others elsewhere) + DirectShow device-list parsing
- API-key resolution order (`grok.voiceApiKey` → `GROK_VOICE_API_KEY` → `XAI_API_KEY`)
- `parseVoiceCommand` / trailing send-phrase detection: the two-word "grok send", tolerant of the "send"→"sent" mishearing, with trailing punctuation kept-not-doubled
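The key-resolution order reads naturally as a first-non-empty scan. In this sketch the function name and parameters are hypothetical, and treating a blank or whitespace-only value as unset is an assumption; only the precedence (`grok.voiceApiKey` → `GROK_VOICE_API_KEY` → `XAI_API_KEY`) comes from the list above.

```typescript
// `settingValue` stands in for the grok.voiceApiKey setting; `env` for the process environment.
function resolveVoiceApiKeySketch(
  settingValue: string | undefined,
  env: Record<string, string | undefined>
): string | undefined {
  const candidates = [settingValue, env["GROK_VOICE_API_KEY"], env["XAI_API_KEY"]];
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed; // first non-empty source wins
  }
  return undefined; // nothing configured
}
```

Passing the environment as a parameter rather than reading `process.env` inside the function keeps the helper pure, so each precedence case can be tested without touching real environment variables.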
- The mic-button state machine (idle → connecting → listening → stopped), animated waves, and the brief "connecting…" spinner
- A live partial transcript accumulates into the composer; the trailing send-phrase is highlighted via the backdrop overlay
- "grok send" submits and flushes messages dictated while Grok was responding (hands-free continuous listening)
- `isPrimerText` matches the marker at the start of a message for any primer version (v1, v2, …), tolerates leading whitespace, and rejects normal text / a marker pasted mid-message; used on restore to hide the lazily-sent primer and keep it out of the plan-position count
- `planReviewFileBaseName` / `sanitizePlanReviewFilePart` generate a safe Markdown filename for the "open plan as an editor tab" action (strips path-hostile chars, bounds length)
- `addGeneratedMedia` renders an image as `<img>` and a video as `<video controls>` from the host's `media` message, wires the Copy-path / Open-in-VS-Code hover actions (pinned to the media), and falls back to an open-link button for a remote URL
- The (deferred) subagent classifier renders a Subagent: <type> card when fed a delegation shape
- Renders each question's options (single-question single-select resolves on one click; multi → pick-then-Submit; Skip → cancel), replies `{outcome:"accepted", answers, annotations}` (or cancelled), collapses to the question + a green `✓ <choice>`, and rebuilds a read-only "You answered" card from the resume replay
test/plan-card.dom.test.ts and test/webview-ui.dom.test.ts run the real shipped media/chat.js inside a happy-dom Window, via the shared test/webview-harness.ts. The trick: happy-dom doesn't execute inline <script> text synchronously, but window.eval(src) runs in the window's realm and shares its globals — so the harness evals webview-helpers.js then chat.js, stubs acquireVsCodeApi to capture postMessage payloads, and dispatches MessageEvents exactly as the extension host would. This tests the webview logic (event wiring, payload shapes, and show/hide state) without VS Code; it does not replace real GUI click-through, CSS, or the live acquireVsCodeApi bridge — those wait for the @vscode/test-electron suite (roadmap item #1).
- `AcpClient.spawn` and child process I/O. This is exercised by the manual probes under `research/*.cjs` (hit the real `grok` binary) and is what the v0.2 `@vscode/test-electron` integration tests will cover.
- `sidebar.ts` end-to-end. It's mostly glue between VS Code APIs and the modules above; the modules carry the logic. A regression-prone area here is the diff editor invocation, which is better tested with `@vscode/test-electron` than with mocks.
- Real VS Code rendering & CSS. The happy-dom tests cover webview logic, but pixel/layout regression on the cards is better caught by manual smoke + the future integration suite.

    npm test            # layer 1: grok-free, what CI runs
    npm run test:watch  # TDD loop
    npm run test:live   # layer 2: real grok, on-demand pre-release gate (run on request)

Layer 1 runs in a few seconds with no network, no grok binary, and no fixtures, so it's suitable for pre-commit hooks and CI. Layer 2 needs an authenticated `grok` on PATH (or `GROK_BIN=<path>`), network, and a subscription for the media tests; it's the pre-release checklist, run on request, never on commit.
1. `@vscode/test-electron` suite: open the test workspace, activate the extension, assert the Grok view is registered, send a fake `permissionRequest` through the webview message channel, verify a permission card renders.
2. ~~AcpClient integration test: fixture script pretending to be `grok agent stdio`.~~ Done: shipped as `test/acp-integration.test.ts` (driven by `test/fixtures/fake-grok-acp.cjs`) and now runs in layer 1; see its section above.
3. Webview snapshot test: Playwright loads the webview HTML in isolation, sends representative messages, snapshots the DOM. Catches CSS/layout regressions.
4. Permission round-trip: fake permission request from a fixture, click card button, assert correct `respondPermission` JSON written to fixture's stdin.