diff --git a/examples/README.md b/examples/README.md index 084429f..0efd3bf 100644 --- a/examples/README.md +++ b/examples/README.md @@ -44,6 +44,7 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha |---|---|---| | 8 | [`researcher-loop/`](./researcher-loop/) | You want the canonical `runLoop` + inline fanout driver, with a validator that hard-fails a namespace leak so the kernel prunes the bad candidate (needs the optional `@tangle-network/agent-knowledge` peer). | | 9 | [`ui-audit/`](./ui-audit/) | You want the smallest end-to-end `runLoop` over a real client (Playwright + stub judge), persisting findings. | +| 9b | [`coding-benchmark/`](./coding-benchmark/) | You want a scientifically-rigorous coding benchmark across harnesses: `runProfileMatrix` over harness × baseline-profile × scenario, a one-line tool knob (websearch / webfetch / MCP), a held-out-test-execution anti-cheat (the agent is graded on hidden tests it never saw, so it can't hardcode), a secondary quality judge, and paired-bootstrap + Wilson + BH stats (offline by default; `--live` for real harness boxes). | ## Tier 3 — the production runtime, deeper @@ -105,6 +106,8 @@ TANGLE_API_KEY=... 
pnpm tsx examples/delegate/e2e-delegate-real.ts # delegate(in # Tier 2 — the runLoop kernel pnpm tsx examples/researcher-loop/researcher-loop.ts pnpm dlx tsx examples/ui-audit/ui-audit.ts /tmp/ui-audit-demo https://example.com +pnpm tsx examples/coding-benchmark/benchmark.ts # harness × profile × scenario (offline) +pnpm tsx examples/coding-benchmark/benchmark.ts --ensemble --reps 5 # 3-model judge panel + more reps # Tier 3 — production runtime, deeper pnpm tsx examples/knowledge-gating/knowledge-gating.ts diff --git a/examples/coding-benchmark/README.md b/examples/coding-benchmark/README.md new file mode 100644 index 0000000..7d8f152 --- /dev/null +++ b/examples/coding-benchmark/README.md @@ -0,0 +1,137 @@ +# coding-benchmark + +**Run the same coding task across coding agents — fairly, honestly, with real statistics — as thin composition over `agent-runtime` / `agent-eval` primitives.** The anti-cheat is **held-out test execution** (SWE-bench / HumanEval style): the agent develops against a few visible example tests, then is graded on a **hidden test suite it never saw and cannot hardcode**. A real solution passes; a cheat (memorize the visible examples, fake the hard part) fails. The verifier, the stats, and the judges are all substrate calls, not reimplemented. The glue this example owns is small and named (an in-process offline box, the per-round refine loop, the leaderboard render); the load-bearing scoring and statistics are not hand-rolled. + +```bash +# offline — no creds, no network. Runs the whole pipeline against an in-process box +# with a deterministic mock judge. The held-out tests run for real (node --test). 
+pnpm tsx examples/coding-benchmark/benchmark.ts + +# pick a tool surface, add the 3-model judge panel, run more reps +pnpm tsx examples/coding-benchmark/benchmark.ts --tools web +pnpm tsx examples/coding-benchmark/benchmark.ts --ensemble --reps 5 + +# live — real harness boxes + a real judge model (see "Going live" for the exact reqs) +TANGLE_API_KEY=sk-... SANDBOX_BASE_URL=https://... \ + pnpm tsx examples/coding-benchmark/benchmark.ts --live +``` + +## What it measures + +One coding task, run across a **matrix** of three axes, scored, and compared with real stats: + +| Axis | What varies | Where | +|---|---|---| +| **harness** | claude-code / opencode / codex / cli, each on its **baseline default profile** (no skills, no injected prompt — we measure the harness, not our scaffolding) | `profiles.ts` | +| **scenario** | the coding tasks (a token-bucket rate limiter, an RFC-4180 CSV parser, an LRU cache whose only passing path is the real eviction algorithm) — each carries a few **visible** example tests and a **held-out** grading suite | `scenarios.ts` | +| **tool surface** | `none` / `web` / `search-mcp` — folded in as a one-line knob (`--tools`) | `profiles.ts` | + +The agent gets up to **3 refine rounds** in **one persistent box**: round N+1's prompt is built from round N's *visible-test failures* (and nothing else — see the firewall). It stops the moment the dev checks pass. + +The output is a leaderboard with confidence bands and a significance matrix: + +``` +Harness leaderboard (mean composite, 95% CI; pass-rate, Wilson CI): + claude-code-baseline composite 0.944 [0.944, 0.944] pass 100% [44%, 100%] (n=3) + ... +Pairwise (paired delta + bootstrap CI; paired-test p, BH-corrected): + opencode-baseline − claude-code-baseline: Δ=0.000 [0.000, 0.000] p=1.000 n.s. (underpowered) + + NOTE: n=3 scenarios — below the power floor (6). 
The paired tests above cannot defensibly + reach significance at this corpus size, so the SIGNIFICANT tag is suppressed (they + demonstrate the wiring). Use 20-50 tasks for a real harness comparison. +``` + +> **Offline, every harness writes the same scripted solution and is scored by the same deterministic mock judge, so all deltas are 0.000** — the honest no-variance result, not a bug. The whole pipeline (matrix, verifier, held-out test execution, judge wiring, stats, firewall) runs for real; only the agent and the judge model are stubbed offline. **Offline the `--ensemble` panel is degenerate too: all three cross-family models share the one mock transport and return the identical verdict — cross-family independence is a live-only property.** `--live` swaps in real harness boxes, a real judge model, and (with `--ensemble`) three genuinely independent models, and the harnesses separate. + +### The offline "agent" is a scripted stand-in + +Offline there is no model, so each scenario's box writes a **canned solution** instead of calling a coding agent — a deterministic stand-in so the example runs with no creds. The scripts are honest: `rate-limiter` **improves across rounds** — round 0 is a **hardcode-the-visible cheat** (it memorizes the visible example answers, no bucket math) and round 1+ is the real token-bucket. The smoke test runs both against the real held-out suite and asserts the cheat **passes the visible test but fails the held-out** (it never saw those inputs), while the real impl passes the held-out outright. Offline `node` is present, so the held-out execution is genuine; `tsc`/`biome` usually aren't, so the typecheck-gated dev checks never fully pass and all 3 rounds run — which is exactly when refinement shows. + +## How a tool swap works (one line) + +A tool surface is a **preset**, not forked code. 
Each preset authors the same two fields onto the profile — native web tools on/off (`profile.tools`) and an optional mounted MCP (`profile.mcp`) — and the sandbox substrate materializes them into each harness's real config: + +```ts +withTools(profile, 'none') // baseline: no web tools, no MCP +withTools(profile, 'web') // native websearch + webfetch on +withTools(profile, 'search-mcp') // mount a search MCP instead +``` + +On the CLI it's `--tools none|web|search-mcp`. **Honesty caveat:** a preset only takes effect for a `(harness, lever)` pair the sandbox actually materializes — if a harness has no native `webfetch`, `--tools web` is a no-op *there*. That's a substrate fact, not something this example papers over. Check `@tangle-network/sandbox` for the materialization matrix before trusting a tool swap on a given harness. + +## The anti-cheat: held-out test execution (the firewall) + +**The agent cannot game tests it never saw.** That is the whole anti-cheat, and it is *execution truth*, not a text scan: + +- Each scenario carries two test files. The **visible** test (a few example cases) is *seeded into the box during the turn* — the agent develops against it, exactly like real TDD. The **held-out** test (the same behavior, with **different inputs and extra edge cases** the visible examples don't cover) is **never seeded during the turn**. +- During the turn, the box has only: the task prompt + the visible example test. The held-out suite never enters the box while the agent is working — **that is the firewall**. +- At grading (after the refine loop), the harness copies the held-out suite into the box and runs it (`node --experimental-transform-types --test`). The **held-out pass rate is the PRIMARY, ungameable correctness score.** +- A solution that hardcoded the visible examples' exact values passes the visible test but **fails the held-out inputs** (e.g. the rate-limiter held-out uses capacities `7/6/5/2`, not the visible `10/3/10`). 
A solution that faked the hard part fails them too. Only real behavior passes both. + +You can **see the firewall in one place** in `dispatch.ts` (`THE FIREWALL LIVES HERE`): the only thing the agent's context gets is `scenario.prompt`, the only test seeded during the loop is `scenario.visibleTest`, and `runHeldout` (the held-out seed + run) is called *after* the loop closes. The LLM-judge rubric note is read later by `eval.ts` and is never written into the box. + +## How it scores (held-out correctness first, judge second) + +Scoring runs in strict order, cheapest and most objective first — an `agent-eval` primitive at each layer: + +1. **Dev checks (first, in the box, ~$0, advisory for the grade).** An ordered **`MultiLayerVerifier`** pipeline: `typecheck → test(visible) → lint`, with dependency-based skip (test never runs on a type error) and a blended score. These pass/fail booleans are the only thing that steers the next refine round — they tell the agent it's on track, but passing the visible examples does **not** prove correctness. The test layer runs `node --experimental-transform-types --test`, not plain `node --test`: the test imports the solution as a `.ts` file, and Node's default type-*stripping* throws on constructor parameter properties (`constructor(private x: number)`) — the exact style the canonical impl uses — so a correct solution would otherwise score as a test failure. (`eval.ts` · `runChecks`) +2. **Held-out test execution (the PRIMARY anti-cheat).** After the loop, the hidden suite is seeded and run in the same box; the **held-out pass rate** is the primary correctness number. It is ungameable: the agent never saw these inputs, so a hardcode-the-visible cheat or a faked impl fails. (`eval.ts` · `runHeldout`, `scenarios.ts` · `heldoutTest`) +3. **LLM judge (last, SECONDARY quality signal).** A 4-dimension weighted rubric — correctness 0.40 · completeness 0.25 · code_quality 0.20 · robustness 0.15. 
The rubric text + anchors live **with the judge**, never in the workdir. The judge scores code *quality*; it does not decide correctness. (`eval.ts`) + +**The composite** the leaderboard ranks on is **`0.7 · held-out-pass-rate + 0.3 · judge-quality`** — held-out correctness is load-bearing, the judge is a tie-breaker on style. On the rate-limiter task the round-0 hardcode-the-visible cheat scores held-out 2/4 → composite **0.59**; the real token-bucket scores held-out 4/4 → composite **0.94** (with the judge held equal at 0.80). (`eval.ts` · `composeScore`) + +**How many judges:** +- **Default: 1** — `singleCodeJudge`, built from `llmJudge` (one model call). Cheap, for the leaderboard sweep. +- **`--ensemble`: 3** — `ensembleCodeJudge`, built from `ensembleJudge`, three **cross-family**, snapshot-dated models (deepseek + openai + google). `crossFamily: true` rejects a same-family panel at construction, so the ensemble is genuinely independent **live**. The panel sees the **same full context** (code + check results + held-out pass rate + rubric note) the single judge does. Use it only for a ship/no-ship claim. (Offline, all three share the mock transport — see the offline note above.) + +## How the stats are real (`stats.ts`) + +Every number is one `agent-eval` primitive call — **no hand-rolled statistics and no fake p-values**: + +- per-harness **mean composite + bootstrap CI** (`confidenceInterval`) +- per-harness **pass-rate + Wilson binomial CI** (`wilson`) — the correct interval for a proportion +- every harness **pair** compared on **matched scenarios** with a **real paired test** (`pairedTTest`, or `wilcoxonSignedRank` for the non-parametric path) for the p-value, and a **paired bootstrap** (`pairedBootstrap`) for the effect size + CI, then **BH-corrected** across all pairs (`benjaminiHochberg`) so running many comparisons doesn't manufacture a false winner. 
+- **Reps don't fake independent n — anywhere.** The paired unit is the *scenario*, and **the leaderboard uses the same unit**: with `--reps > 1`, a harness produces several records per scenario, so BOTH the leaderboard CI/Wilson AND the pairing collapse reps to **one mean per (harness, scenario)** before computing anything. Reps tighten the per-cell *estimate*; they are not independent samples, so they never narrow the interval out of zero new information. The reported `n` is the number of distinct scenarios, not the record count. (A regression test asserts identical reps leave the CI unchanged.) +- A record missing its `scenarioId` is a **loud throw**, not a silent merge — averaging distinct scenarios into one `''` bucket would corrupt the pairing, so it fails fast instead. + +> **Power caveat.** The example corpus is **3 tasks** — far below what these tests need to separate harnesses. The Wilcoxon path returns `p=1` for fewer than 6 non-zero diffs, and the paired t-test has ~1 degree of freedom. Below the power floor (`n < 6`) `renderStats` **suppresses the `SIGNIFICANT` tag entirely** (a near-constant gap on a few scenarios can return `p<0.05` and still mean nothing — the small-n mirage), and a zero-variance pair (a collapsed bootstrap CI) never reads as a real effect either. At this corpus size the example demonstrates the *wiring*, not a defensible claim. A real harness comparison wants **20-50 tasks**. + +The leaderboard labels are the readable harness names, not the matrix's internal profile hashes. 
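The rep-collapse and the Wilson band described in the stats section can be sketched as follows. This is a minimal illustration under stated assumptions, not the `agent-eval` implementation: `CellRecord`, `collapseReps`, and `wilsonInterval` are hypothetical names introduced here, and the real `RunRecord` and `wilson` signatures may differ.

```typescript
// Hypothetical record shape for illustration only; the real RunRecord carries more fields.
interface CellRecord {
  harness: string
  scenarioId: string
  composite: number // blended score in [0, 1]
}

interface CellMean {
  harness: string
  scenarioId: string
  composite: number // mean over reps for this (harness, scenario) cell
}

/** Collapse reps to ONE mean per (harness, scenario), so n counts scenarios, not records. */
function collapseReps(records: CellRecord[]): CellMean[] {
  const cells = new Map<string, { harness: string; scenarioId: string; sum: number; count: number }>()
  for (const r of records) {
    // A missing scenarioId would merge distinct scenarios into one bucket, so fail fast.
    if (!r.scenarioId) throw new Error(`record for ${r.harness} is missing scenarioId`)
    const key = JSON.stringify([r.harness, r.scenarioId])
    const cell = cells.get(key) ?? { harness: r.harness, scenarioId: r.scenarioId, sum: 0, count: 0 }
    cell.sum += r.composite
    cell.count += 1
    cells.set(key, cell)
  }
  return [...cells.values()].map((c) => ({
    harness: c.harness,
    scenarioId: c.scenarioId,
    composite: c.sum / c.count,
  }))
}

/** Wilson score interval for k successes out of n trials (z = 1.96 for roughly 95%). */
function wilsonInterval(k: number, n: number, z = 1.96): [number, number] {
  if (n === 0) return [0, 1]
  const p = k / n
  const z2 = z * z
  const denom = 1 + z2 / n
  const center = (p + z2 / (2 * n)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom
  return [Math.max(0, center - half), Math.min(1, center + half)]
}

// Two identical reps on s1 plus one run on s2 still yield n = 2 scenarios.
const collapsed = collapseReps([
  { harness: 'claude-code-baseline', scenarioId: 's1', composite: 0.9 },
  { harness: 'claude-code-baseline', scenarioId: 's1', composite: 0.9 },
  { harness: 'claude-code-baseline', scenarioId: 's2', composite: 0.5 },
])
console.log(collapsed.length)

// 3 passes out of 3 scenarios: the lower bound is about 0.44 and the upper bound is 1,
// which lines up with the `[44%, 100%]` band in the sample leaderboard.
console.log(wilsonInterval(3, 3))
```

Because the interval is computed on the collapsed cells, adding identical reps changes neither `n` nor the band, which is the property the regression test in the smoke suite is described as asserting.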
+ +## The files + +| File | What it owns | +|---|---| +| `scenarios.ts` | the 3-task corpus + the firewall-as-a-type (`prompt` vs `visibleTest` vs the held-out `heldoutTest` vs the judge rubric) + the seeded visible tests + the held-out grading suites + the check commands | +| `profiles.ts` | the harness axis (one bare baseline `AgentProfile` per harness) **and** the one-line tool knob (`withTools` + presets) | +| `eval.ts` | the scoring stack: `runChecks` (`MultiLayerVerifier`) + `runHeldout` (the held-out execution) + `composeScore` (held-out × judge blend) + `singleCodeJudge` (`llmJudge`) / `ensembleCodeJudge` (`ensembleJudge`) | +| `dispatch.ts` | renders one matrix cell: persistent box + multi-round refine + held-out grading + token metering. **The firewall lives here.** | +| `offline-box.ts` | an in-process `SandboxClient` so the whole thing runs with no creds | +| `stats.ts` | leaderboard + `pairedTTest` / `pairedBootstrap` / `benjaminiHochberg` / `confidenceInterval` / `wilson`, with the small-n SIGNIFICANT-suppression guard | +| `benchmark.ts` | the entrypoint: build the axes, hand the matrix the dispatch + judges, run, print stats | +| `coding-benchmark.test.ts` | offline smoke — the matrix produces `harnesses × scenarios × reps` records; a hardcode-the-visible cheat fails the held-out suite while the real solution passes (by execution); the held-out test is never seeded during the turn (firewall); reps don't narrow the CI | + +## Primitives composed + +- **matrix:** `runProfileMatrix({ profiles, scenarios, dispatch, judges, reps, integrity, costCeiling })` (`@tangle-network/agent-eval/campaign`) with a `ProfileDispatchFn` rendering each cell +- **box + multi-round:** `openSandboxRun(client, opts, deliverable)` → `.start()` / `.resume()` over one persistent, resumable session (`@tangle-network/agent-runtime/loops`) +- **dev layer:** `MultiLayerVerifier` — ordered `typecheck → test → lint` with dependency-based skip and a blended score 
(`@tangle-network/agent-eval`) +- **held-out execution:** the hidden suite is seeded after the loop and run with `node --experimental-transform-types --test`; the pass rate is the primary score (`eval.ts` · `runHeldout`) +- **token metering:** `extractLlmCallEvent` (`@tangle-network/agent-runtime/loops`) — reads usage off **every** backend event shape (`done` / `result` / `llm_call` / `usage`) so the integrity guard sees a real run +- **judges:** `llmJudge` (single model call → canonical `JudgeConfig`, imported from `@tangle-network/agent-eval/campaign` so it resolves across the whole peer range) and `ensembleJudge` for the cross-family panel (`@tangle-network/agent-eval`); the judge transport is a `ChatClient` (`createChatClient` — a `mock` handler offline, the `router` live) +- **integrity:** `integrity: 'assert'` on the matrix proves a real backend ran (no stubbed cell) — `'off'` only for the offline mock +- **stats:** `pairedTTest`, `wilcoxonSignedRank`, `pairedBootstrap`, `benjaminiHochberg`, `confidenceInterval`, `wilson` + +## Going live + +`--live` is not "flip a flag and nothing else changes" — it swaps two stubs for real infra. To run it you need: + +1. **`TANGLE_API_KEY` + `SANDBOX_BASE_URL`** — the dispatch lazily `import()`s `@tangle-network/sandbox` (behind the live flag, so the offline path never needs the SDK) and creates a real harness box per cell. +2. **A real judge model** — the judge's `ChatClient` becomes `createChatClient({ transport: 'router', apiKey })`; set `JUDGE_MODEL` (and optionally `TANGLE_ROUTER_URL`) to point it at your router. `--ensemble` then calls three real cross-family models. +3. The matrix runs with `integrity: 'assert'`, so a cell that produced no real token usage fails loudly instead of reporting a clean stub leaderboard. +4. **The harness box image must provide the toolchain on `PATH`** — the checks invoke bare `tsc`, `biome`, and `node --experimental-transform-types`. 
The test layer **and the held-out grading** need **Node >= 22.6** (for `--experimental-transform-types` and `.ts`-import test execution); on an older Node a correct param-property solution would fail with no hint why. A missing **advisory** tool (`biome`) folds to 0.5 and doesn't gate; a missing **`tsc`** fails the dev checks — so sanity-check your box image before trusting a live leaderboard. (Offline, `tsc`/`biome` are absent so the dev checks fail fast, but `node` is present so the held-out grading still runs for real.) + +Everything else — the dispatch, the verifier, the held-out execution, the stats — is identical between offline and live. That's the point: only the agent and the judge model change. + +**Note on codex:** codex emits no structured tool calls, so per-tool progress is unavailable there. It still runs and scores; that's a harness property, not a gap in this example. diff --git a/examples/coding-benchmark/benchmark.ts b/examples/coding-benchmark/benchmark.ts new file mode 100644 index 0000000..5987220 --- /dev/null +++ b/examples/coding-benchmark/benchmark.ts @@ -0,0 +1,280 @@ +/** + * coding-benchmark — run ONE coding task across harnesses × baseline profiles × + * scenarios, with controlled tool use, validators-before-judge, real stats, and a + * no-cheat firewall. Every moving part is an agent-runtime / agent-eval primitive. + * + * # offline (no creds — uses the in-process box + a mock judge transport) + * pnpm tsx examples/coding-benchmark/benchmark.ts + * + * # one tool preset / ensemble / more reps + * pnpm tsx examples/coding-benchmark/benchmark.ts --tools web --ensemble --reps 5 + * + * # live (real harness boxes + a real judge model) + * TANGLE_API_KEY=sk-... SANDBOX_BASE_URL=https://... \ + * pnpm tsx examples/coding-benchmark/benchmark.ts --live + * + * The wiring below is the whole thing: build the profile axis, hand the matrix the + * dispatch + the judge(s), run it, then compute pairwise stats. ~40 lines of glue. 
+ */ + +import { mkdtempSync, rmSync } from 'node:fs' +import { tmpdir } from 'node:os' +import { join } from 'node:path' +import { + agentProfileId, + type ChatClient, + type ChatResponse, + createChatClient, +} from '@tangle-network/agent-eval' +import { + inMemoryCampaignStorage, + type JudgeConfig, + runProfileMatrix, +} from '@tangle-network/agent-eval/campaign' +import type { AgentProfile } from '@tangle-network/agent-interface' +import type { SandboxClient } from '@tangle-network/agent-runtime/loops' +import { codingDispatch } from './dispatch' +import { ensembleCodeJudge, type RubricDim, type RunArtifact, singleCodeJudge } from './eval' +import { type OfflineScript, offlineSandboxClient } from './offline-box' +import { harnessProfiles, type ToolPreset } from './profiles' +import { type CodingScenario, scenarios } from './scenarios' +import { pairwiseStats, renderStats } from './stats' + +export interface BenchmarkOptions { + live?: boolean + ensemble?: boolean + toolPreset?: ToolPreset + reps?: number +} + +// ── flags ─────────────────────────────────────────────────────────────────── +function parseArgs(argv: string[]): BenchmarkOptions { + const flag = (name: string) => argv.includes(`--${name}`) + // A value is the token AFTER `--name`, but only when it is not itself a flag — so + // `--reps --live` does NOT consume `--live` as reps' value (which would yield NaN); + // it falls back instead. `opt` never swallows a following flag. + const opt = (name: string, fallback: string) => { + const i = argv.indexOf(`--${name}`) + if (i < 0) return fallback + const v = argv[i + 1] + return v && !v.startsWith('--') ? v : fallback + } + // Clamp reps to a positive integer — a non-numeric or <1 value is a usage error, not + // a silent 0/NaN rep count that produces an empty matrix. + const repsRaw = Math.floor(Number(opt('reps', '1'))) + const reps = Number.isFinite(repsRaw) && repsRaw >= 1 ? 
repsRaw : 1 + return { + live: flag('live'), + ensemble: flag('ensemble'), + toolPreset: opt('tools', 'none') as ToolPreset, + reps, + } +} + +// ── the offline "agent": a scripted, REFINING solution per scenario ─────────── +// Offline we don't have a model, so each scenario's box writes a canned solution. +// `rate-limiter` IMPROVES across rounds (round 0 = a HARDCODE-THE-VISIBLE cheat that +// only answers the visible example inputs; round 1+ = the real token-bucket). The cheat +// passes the visible tests but FAILS the held-out suite (different inputs it never saw) — +// the anti-cheat demo fires on the benchmark's OWN data, by execution, not a regex. +// `csv-parser` writes its real implementation from round 0. +export const offlineSolutions: Record<string, OfflineScript> = { + 'rate-limiter': { + path: 'src/rate-limiter.ts', + solutionFor: (round) => + round === 0 + ? // round 0 — a HARDCODE-THE-VISIBLE cheat: it replays the exact visible example + // calls (cap 10/3/10, the specific draws + their call order) and returns canned + // answers, with NO bucket math. It PASSES the visible tests but FAILS the + // held-out suite (cap 7/6/5/2, different draws + edge cases it never saw), + // caught by EXECUTION on inputs the cheat never memorized. + `export class RateLimiter {\n` + + ` private cap: number\n private refill: number\n private call = 0\n` + + ` constructor(capacity: number, refillPerSec: number) { this.cap = capacity; this.refill = refillPerSec }\n` + + ` tryRemove(_n: number): boolean {\n` + + ` // hardcoded to the visible examples only — keyed on the exact (cap, refill)\n` + + ` // pairs the visible tests use; no real bucket math.\n` + + ` this.call++\n` + + ` if (this.cap === 3) return false // visible (3,1): draw 4 -> false\n` + + ` if (this.cap === 10 && this.refill === 0) return this.call === 1 // visible (10,0): T,F\n` + + ` return true // visible (10,1): T,T\n }\n}\n` + : // round 1+ — the real token-bucket with continuous time-based refill. 
+ `export class RateLimiter {\n private tokens: number\n private last = Date.now()\n` + + ` constructor(private capacity: number, private refillPerSec: number) { this.tokens = capacity }\n` + + ` tryRemove(n: number): boolean {\n const now = Date.now()\n` + + ` this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` + + ` this.last = now\n if (n > this.tokens) return false\n this.tokens -= n\n return true\n }\n}\n`, + }, + 'csv-parser': { + path: 'src/csv.ts', + solutionFor: () => + `export function parseCsv(input: string): string[][] {\n const rows: string[][] = []\n` + + ` let row: string[] = []\n let field = ''\n let inQuotes = false\n` + + ` for (let i = 0; i < input.length; i++) {\n const c = input.charAt(i)\n` + + ` if (inQuotes) {\n if (c === '"' && input.charAt(i + 1) === '"') { field += '"'; i++ }\n` + + ` else if (c === '"') inQuotes = false\n else field += c\n } else if (c === '"') inQuotes = true\n` + + ` else if (c === ',') { row.push(field); field = '' }\n` + + ` else if (c === '\\n') { row.push(field); rows.push(row); row = []; field = '' }\n` + + ` else field += c\n }\n row.push(field); rows.push(row)\n return rows\n}\n`, + }, + 'lru-cache': { + path: 'src/lru.ts', + // Writes the real insertion-ordered-Map LRU from round 0 (the eviction logic is the + // whole point; there is no honest hollow stub for this task). Passes both the visible + // and the held-out eviction suites. 
+ solutionFor: () => + `export class LruCache<K, V> {\n private map = new Map<K, V>()\n` + + ` constructor(private capacity: number) {}\n` + + ` get(key: K): V | undefined {\n if (!this.map.has(key)) return undefined\n` + + ` const v = this.map.get(key) as V\n this.map.delete(key)\n this.map.set(key, v)\n return v\n }\n` + + ` set(key: K, value: V): void {\n if (this.map.has(key)) this.map.delete(key)\n` + + ` else if (this.map.size >= this.capacity) this.map.delete(this.map.keys().next().value as K)\n` + + ` this.map.set(key, value)\n }\n}\n`, + }, +} + +// ── the box client: live (real harness) or offline (in-process) ─────────────── +function clientFor( + live: boolean, + RealClient: (new (opts: { apiKey: string; baseUrl: string }) => unknown) | undefined, +): (scenario: CodingScenario) => (profile: AgentProfile) => SandboxClient { + return (scenario) => { + if (live) { + const apiKey = process.env.TANGLE_API_KEY + const baseUrl = process.env.SANDBOX_BASE_URL + if (!apiKey || !baseUrl) throw new Error('--live needs TANGLE_API_KEY + SANDBOX_BASE_URL') + if (!RealClient) throw new Error('@tangle-network/sandbox not loaded') + return () => new RealClient({ apiKey, baseUrl }) as unknown as SandboxClient + } + const script = offlineSolutions[scenario.id] + if (!script) throw new Error(`no offline script for scenario ${scenario.id}`) + return () => offlineSandboxClient(script) + } +} + +// ── the judge transport: a real router (live) or a deterministic mock (offline) ─ +// Offline the mock handler returns a fixed rubric verdict so the pipeline runs with +// no creds. Live, `createChatClient({ transport: 'router', apiKey })` calls the real +// router. The SAME `singleCodeJudge` / `ensembleCodeJudge` wiring runs either way. 
+function judgeChat(live: boolean): ChatClient { + if (live) { + const apiKey = process.env.TANGLE_API_KEY + if (!apiKey) throw new Error('--live needs TANGLE_API_KEY for the judge router') + return createChatClient({ + transport: 'router', + apiKey, + ...(process.env.TANGLE_ROUTER_URL ? { baseUrl: process.env.TANGLE_ROUTER_URL } : {}), + defaultModel: process.env.JUDGE_MODEL ?? 'openai/gpt-4.1-2025-04-14', + }) + } + const verdict = JSON.stringify({ + dimensions: { correctness: 0.85, completeness: 0.8, code_quality: 0.8, robustness: 0.75 }, + notes: 'offline mock judge', + }) + return createChatClient({ + transport: 'mock', + defaultModel: 'mock-judge', + handler: async (): Promise<ChatResponse> => ({ + content: verdict, + usage: { promptTokens: 0, completionTokens: 0, totalTokens: 0 }, + costUsd: 0, + model: 'mock-judge', + durationMs: 0, + raw: {}, + }), + }) +} + +function judges( + opts: BenchmarkOptions, + chat: ChatClient, +): JudgeConfig[] { + if (opts.ensemble) { + // The ensemble scores each panel model through the SAME chat transport — offline + // that is the mock, live it is the router. It sees the SAME full context the + // single judge does. + const scoreOne = async (model: string, context: string): Promise<Record<RubricDim, number>> => { + const res = await chat.chat({ model, messages: [{ role: 'user', content: context }] }) + const parsed = JSON.parse(res.content) as { dimensions: Record<RubricDim, number> } + return parsed.dimensions + } + return [ensembleCodeJudge(scoreOne)] + } + return [singleCodeJudge(chat)] +} + +// ── the sweep ───────────────────────────────────────────────────────────────── +export async function main(argv: string[] = process.argv.slice(2)): Promise<RunArtifactSummary> { + const opts = parseArgs(argv) + const live = opts.live ?? false + const reps = opts.reps ?? 1 + const toolPreset = opts.toolPreset ?? 'none' + const runDir = mkdtempSync(join(tmpdir(), 'coding-benchmark-')) + + // Lazy dynamic import so the offline path never needs the SDK or its creds. 
(This + // is an ESM "type":"module" package — a top-level `require` would throw.) + let RealClient: (new (o: { apiKey: string; baseUrl: string }) => unknown) | undefined + if (live) { + const sdk = (await import('@tangle-network/sandbox')) as { + SandboxClient: new (o: never) => unknown + } + RealClient = sdk.SandboxClient as never + } + + console.log( + `coding-benchmark · ${live ? 'LIVE' : 'OFFLINE'} · tools=${toolPreset} · ` + + `judges=${opts.ensemble ? '3 (ensemble)' : '1'} · reps=${reps} · ` + + `harnesses=${harnessProfiles.length} · scenarios=${scenarios.length}`, + ) + + const chat = judgeChat(live) + const resolveClient = clientFor(live, RealClient) + + try { + // The matrix runs one campaign per profile. The dispatch is per-scenario only in + // its CLIENT (offline scripts differ by scenario), so run each scenario's matrix + // and merge the records. (Live, one client serves all scenarios — collapse this.) + const allRecords = [] + for (const scenario of scenarios) { + const result = await runProfileMatrix({ + profiles: harnessProfiles, // axis: harness × baseline + scenarios: [scenario], // axis: tasks (one at a time so the offline client matches) + dispatch: codingDispatch(toolPreset, resolveClient(scenario)), + judges: judges(opts, chat), + reps, + integrity: live ? 'assert' : 'off', // offline mock has no real backend; live proves it + costCeiling: 5, + runDir, + commitSha: process.env.GIT_SHA ?? 'example', + storage: inMemoryCampaignStorage(), + }) + allRecords.push(...result.records) + } + + // Map the matrix's hashed profileId → the readable harness name for the leaderboard. + const nameById = new Map(harnessProfiles.map((p) => [agentProfileId(p), p.name ?? 'unknown'])) + const nameOf = (id: string) => nameById.get(id) ?? 
id + const report = pairwiseStats(allRecords, nameOf) + + console.log(`\nrecords: ${allRecords.length}\n`) + console.log(renderStats(report)) + return { records: allRecords.length, leaderboard: report.leaderboard.length } + } finally { + // The matrix writes its run artifacts under `runDir`; tear the temp tree down so + // repeated runs don't leak `/tmp/coding-benchmark-*` directories. + rmSync(runDir, { recursive: true, force: true }) + } +} + +export interface RunArtifactSummary { + records: number + leaderboard: number +} + +// Run only when invoked directly (not when imported by the smoke test). +if (import.meta.url === `file://${process.argv[1]}`) { + main().catch((err) => { + console.error(err instanceof Error ? (err.stack ?? err.message) : String(err)) + process.exit(1) + }) +} diff --git a/examples/coding-benchmark/coding-benchmark.test.ts b/examples/coding-benchmark/coding-benchmark.test.ts new file mode 100644 index 0000000..3034010 --- /dev/null +++ b/examples/coding-benchmark/coding-benchmark.test.ts @@ -0,0 +1,230 @@ +/** + * Offline smoke test — proves the whole pipeline runs with no creds and that the + * load-bearing honesty claims hold, BY EXECUTION (not a text scan): + * 1. the matrix produces exactly `harnesses × scenarios × reps` records and a + * defined leaderboard (the wiring is real, not a stub that returns nothing); + * 2. THE ANTI-CHEAT, end-to-end: a hardcode-the-visible CHEAT passes the VISIBLE + * tests but FAILS the HELD-OUT suite (low pass rate) → LOW composite; the REAL + * solution PASSES the held-out suite → HIGH composite. Run for real against an + * in-process box (`node --test`), no creds. This is the whole point of the example. + * 3. the held-out test is NEVER seeded into the box during the agent turn — the + * firewall — only at grading; + * 4. reps tighten the per-cell estimate HONESTLY — identical reps do NOT narrow the + * leaderboard CI vs reps=1 (reps are not independent samples). 
+ */ + +import { exec as execCb } from 'node:child_process' +import { mkdtempSync, rmSync } from 'node:fs' +import { mkdir, writeFile } from 'node:fs/promises' +import { tmpdir } from 'node:os' +import { dirname, join } from 'node:path' +import { promisify } from 'node:util' +import type { RunRecord } from '@tangle-network/agent-eval' +import { describe, expect, it } from 'vitest' +import { main, offlineSolutions } from './benchmark' +import { type CheckBox, composeScore, runChecks, runHeldout } from './eval' +import { harnessProfiles } from './profiles' +import { type CodingScenario, checkCmds, scenarios } from './scenarios' +import { pairwiseStats } from './stats' + +const execAsync = promisify(execCb) + +/** A real in-process `CheckBox` over a fresh temp dir — `fs.write` + `exec` only, the + * exact surface `runChecks` / `runHeldout` use. `node --test` runs for real here, so the + * held-out execution is genuine (no creds, no network). */ +function tempBox(): { box: CheckBox; dir: string } { + const dir = mkdtempSync(join(tmpdir(), 'coding-bench-test-')) + const box: CheckBox = { + fs: { + async write(path: string, content: string) { + const abs = join(dir, path) + await mkdir(dirname(abs), { recursive: true }) + await writeFile(abs, content, 'utf8') + }, + }, + async exec(command: string) { + try { + const { stdout, stderr } = await execAsync(command, { cwd: dir, timeout: 30_000 }) + return { exitCode: 0, stdout, stderr } + } catch (err) { + const e = err as { code?: number; stdout?: string; stderr?: string; message?: string } + return { + exitCode: e.code ?? 1, + stdout: e.stdout ?? '', + stderr: e.stderr ?? e.message ?? '', + } + } + }, + } + return { box, dir } +} + +/** Write a solution, run the VISIBLE example test + the HELD-OUT suite against it in one + * box, and return both pass rates. This is exactly what the dispatch does (minus the + * agent turn): seed the visible test during "the turn", then the held-out suite at + * grading. 
We run each test command directly (not the typecheck-gated `runChecks` + * pipeline) so the result reflects the TESTS, not the absence of `tsc` offline — the + * whole point is to compare visible-pass vs held-out-pass by execution. */ +async function gradeSolution( + scenario: CodingScenario, + solution: string, +): Promise<{ visiblePassRate: number; heldoutPassRate: number; heldoutNotes: string }> { + const { box, dir } = tempBox() + try { + const cmds = checkCmds(scenario) + await box.fs?.write(scenario.solutionPath, solution) + // "During the turn": the visible example test is seeded + run. + const visible = await runHeldout( + box, + { ...scenario, heldoutTest: scenario.visibleTest }, + cmds.dev, + ) + // "At grading": the held-out suite is seeded + run (the real anti-cheat). + const heldout = await runHeldout(box, scenario, cmds.heldout) + return { + visiblePassRate: visible.passRate, + heldoutPassRate: heldout.passRate, + heldoutNotes: heldout.notes, + } + } finally { + rmSync(dir, { recursive: true, force: true }) + } +} + +describe('coding-benchmark (offline)', () => { + // Integration smoke: runs the real matrix end-to-end (real box.exec on the offline + // toolchain, all refine rounds since the dev checks can't pass without tsc). + it('runs the full matrix and returns a defined leaderboard', async () => { + const reps = 1 + const summary = await main(['--reps', String(reps)]) + expect(summary.records).toBe(harnessProfiles.length * scenarios.length * reps) + expect(summary.leaderboard).toBe(harnessProfiles.length) + }, 180_000) + + // THE ANTI-CHEAT, proven by execution: the round-0 hardcode-the-visible cheat PASSES + // the visible test but FAILS the held-out suite (it never saw those inputs), and the + // refined real impl PASSES the held-out suite. Composite ranks the real one far above. 
+ it('a hardcode-the-visible cheat FAILS the held-out tests; the real solution PASSES', async () => { + const rl = scenarios.find((s) => s.id === 'rate-limiter') as CodingScenario + const script = offlineSolutions['rate-limiter'] + expect(script).toBeDefined() + const cheat = (script as NonNullable<typeof script>).solutionFor(0) // round-0 cheat + const real = (script as NonNullable<typeof script>).solutionFor(1) // refined real impl + + const cheatGrade = await gradeSolution(rl, cheat) + const realGrade = await gradeSolution(rl, real) + + // The cheat memorizes the visible example answers, so it passes the visible test... + expect( + cheatGrade.visiblePassRate, + `cheat should pass the visible test: ${cheatGrade.heldoutNotes}`, + ).toBe(1) + // ...but it FAILS the held-out suite (different inputs it never saw). + expect( + cheatGrade.heldoutPassRate, + `cheat should NOT fully pass held-out: ${cheatGrade.heldoutNotes}`, + ).toBeLessThan(1) + + // The real implementation passes the held-out suite outright. + expect( + realGrade.heldoutPassRate, + `real solution should pass held-out: ${realGrade.heldoutNotes}`, + ).toBe(1) + + // Composite ranks the real solution strictly above the cheat (held-out is primary). + // Hold the secondary judge score equal so the gap is purely the held-out term. + const judgeQuality = 0.8 + const cheatComposite = composeScore(cheatGrade.heldoutPassRate, judgeQuality) + const realComposite = composeScore(realGrade.heldoutPassRate, judgeQuality) + expect(realComposite).toBeGreaterThan(cheatComposite) + }, 60_000) + + // Every scenario's REAL offline solution passes its held-out suite (the suites are not + // accidentally impossible) — run for real against the in-process box. 
+ it.each( + scenarios, + )('the real offline solution passes the held-out suite for $id', async (scenario: CodingScenario) => { + const script = offlineSolutions[scenario.id] + expect(script, `no offline solution for ${scenario.id}`).toBeDefined() + const solution = (script as NonNullable<typeof script>).solutionFor(99) // settled round + const grade = await gradeSolution(scenario, solution) + expect( + grade.heldoutPassRate, + `real ${scenario.id} failed held-out: ${grade.heldoutNotes}`, + ).toBe(1) + }, 60_000) + + // FIREWALL: the held-out test is never seeded into the box during the agent turn — + // only the visible test is. After running the dev checks (which seed the visible test), + // the held-out file must NOT exist in the box; it appears only after `runHeldout`. + it.each( + scenarios, + )('does NOT seed the held-out test during the turn for $id', async (scenario: CodingScenario) => { + const { box, dir } = tempBox() + try { + const cmds = checkCmds(scenario) + const script = offlineSolutions[scenario.id] as NonNullable<(typeof offlineSolutions)[string]> + await box.fs?.write(scenario.solutionPath, script.solutionFor(99)) + // "The turn": run the dev checks, which seed ONLY the visible test. + await runChecks(box, scenario, cmds) + // The visible test exists; the held-out test must NOT (the firewall). + const visibleExists = await box.exec(`test -f '${scenario.visibleTest.path}'`) + const heldoutExists = await box.exec(`test -f '${scenario.heldoutTest.path}'`) + expect(visibleExists.exitCode, 'visible test should be seeded during the turn').toBe(0) + expect( + heldoutExists.exitCode, + `held-out test for ${scenario.id} leaked into the box during the turn`, + ).not.toBe(0) + // Only AFTER grading does the held-out file appear. 
+ await runHeldout(box, scenario, cmds.heldout) + const afterGrading = await box.exec(`test -f '${scenario.heldoutTest.path}'`) + expect(afterGrading.exitCode, 'held-out should be seeded at grading').toBe(0) + } finally { + rmSync(dir, { recursive: true, force: true }) + } + }, 60_000) + + it('reps do NOT fake independent n — identical reps leave the CI unchanged', () => { + // Two harnesses, two scenarios, identical scores. Build records for reps=1 and + // reps=3 (the extra reps are exact duplicates → zero new information). The honest + // leaderboard collapses reps to one mean per (harness, scenario), so the CI width + // and the n must be IDENTICAL across reps — duplicating a sample cannot tighten it. + const mk = (harness: string, scenarioId: string, s: number): RunRecord => + ({ + candidateId: harness, + scenarioId, + outcome: { searchScore: s }, + }) as unknown as RunRecord + const base: Array<[string, string, number]> = [ + ['a', 's1', 0.9], + ['a', 's2', 0.4], + ['b', 's1', 0.8], + ['b', 's2', 0.5], + ] + const nameOf = (id: string) => id + const reps1 = base.map(([h, s, v]) => mk(h, s, v)) + const reps3 = base.flatMap(([h, s, v]) => [mk(h, s, v), mk(h, s, v), mk(h, s, v)]) + + const r1 = pairwiseStats(reps1, nameOf) + const r3 = pairwiseStats(reps3, nameOf) + + for (const harness of ['a', 'b']) { + const row1 = r1.leaderboard.find((r) => r.harness === harness) + const row3 = r3.leaderboard.find((r) => r.harness === harness) + expect(row1).toBeDefined() + expect(row3).toBeDefined() + const r1Row = row1 as NonNullable<typeof row1> + const r3Row = row3 as NonNullable<typeof row3> + // Same honest n (= distinct scenarios), same mean, and the CI must NOT narrow. + expect(r3Row.n).toBe(r1Row.n) + expect(r3Row.meanComposite).toBeCloseTo(r1Row.meanComposite, 10) + const width1 = r1Row.ci.upper - r1Row.ci.lower + const width3 = r3Row.ci.upper - r3Row.ci.lower + expect(width3).toBeCloseTo(width1, 10) + // The pass-rate Wilson interval likewise must not tighten. 
+ const pw1 = r1Row.passCi.upper - r1Row.passCi.lower + const pw3 = r3Row.passCi.upper - r3Row.passCi.lower + expect(pw3).toBeCloseTo(pw1, 10) + } + }) +}) diff --git a/examples/coding-benchmark/dispatch.ts b/examples/coding-benchmark/dispatch.ts new file mode 100644 index 0000000..9ef243e --- /dev/null +++ b/examples/coding-benchmark/dispatch.ts @@ -0,0 +1,200 @@ +/** + * The DISPATCH — renders one (profile, scenario) matrix cell: it runs the coding + * agent on the profile's harness, MULTI-ROUND, in ONE persistent box, then hands + * back the `RunArtifact` the judges score. + * + * This file composes four primitives and nothing bespoke: + * - `offlineSandboxClient` (offline) or `new SandboxClient(...)` (live) give the box. + * - `openSandboxRun(client, opts, deliverable)` opens ONE persistent, resumable box. + * `.start(prompt)` = round 1; `.resume(prompt)` = round N over the SAME session. + * That IS the "each round builds on the prior output" loop — no extra combinator. + * - `runChecks` (eval.ts) runs the deterministic `MultiLayerVerifier` pipeline each round. + * - `extractLlmCallEvent` (the runtime's own metering seam) reads token usage off the + * stream — across ALL backend event shapes — and reports it so the backend-integrity + * guard sees a real run. + * + * ┌─────────────────────────────────────────────────────────────────────────┐ + * │ THE FIREWALL LIVES HERE — and it is EXECUTION-BASED, not a text scan. │ + * │ The ONLY scenario field that reaches the agent's CONTEXT is │ + * │ `scenario.prompt` (the `taskToPrompt` below, and `nextPrompt` built ONLY │ + * │ from check output). The LLM-judge rubric note is read later by eval.ts — │ + * │ never written into the box. │ + * │ │ + * │ The VISIBLE example test is seeded into the box DURING the turn (so │ + * │ `node --test` has a file to run) and a multi-round agent with native file │ + * │ tools CAN read it — intentional, the same as real TDD. 
│ + * │ │ + * │ The HELD-OUT grading suite is NEVER seeded during the turn. It is copied │ + * │ in ONLY at grading (after the loop, `runHeldout` below) and run; the │ + * │ score is its pass rate. The agent cannot game tests it never saw — a │ + * │ hardcode-the-visible cheat fails the held-out inputs. THAT is the │ + * │ anti-cheat: execution truth on hidden tests. │ + * └─────────────────────────────────────────────────────────────────────────┘ + */ + +import type { DispatchContext, ProfileDispatchFn } from '@tangle-network/agent-eval/campaign' +import type { AgentProfile } from '@tangle-network/agent-interface' +import { + type AgentRunSpec, + extractLlmCallEvent, + openSandboxRun, + type SandboxClient, +} from '@tangle-network/agent-runtime/loops' +import type { SandboxEvent } from '@tangle-network/sandbox' +import { type CheckBox, layerOutput, type RunArtifact, runChecks, runHeldout } from './eval' +import { harnessOf, type ToolPreset, withTools } from './profiles' +import { type CodingScenario, checkCmds } from './scenarios' + +/** Max refine rounds. Round N+1's prompt is built from round N's CHECK output only. */ +const maxRounds = 3 + +/** Build the next-round prompt from the checks the AGENT is allowed to see — the + * pass/fail + output of the VISIBLE example tests. NEVER from the held-out suite, the + * rubric, or the judge. This is the firewall in action: the agent steers on the visible + * example failures, nothing else, and is GRADED on the held-out suite it never saw. + * + * typecheck/test are gating (a failure blocks `allPass`); lint is advisory (it never + * gates) but its warnings are still surfaced here so the agent can fix style — visible + * to the agent is decoupled from gates-allPass. Advisory warnings ride along as a + * separate, clearly-labeled section. 
*/ +function nextPrompt(report: RunArtifact['checks']): string { + const fails: string[] = [] + const advisories: string[] = [] + for (const layer of ['typecheck', 'test'] as const) { + const c = layerOutput(report, layer) + if (!c.passed) fails.push(`${layer} failed:\n${c.output.slice(0, 1200)}`) + } + // lint is advisory: report its warnings (not "clean") without treating them as a + // gating failure, so style issues can actually be refined. + const lint = layerOutput(report, 'lint') + if (!lint.clean && lint.output) advisories.push(`lint warnings:\n${lint.output.slice(0, 1200)}`) + + const sections = [`Your solution did not pass these checks. Fix the file and try again.`] + if (fails.length > 0) sections.push(fails.join('\n\n')) + if (advisories.length > 0) + sections.push(`Advisory (does not block, but improve if you can):\n${advisories.join('\n\n')}`) + return sections.join('\n\n') +} + +/** + * The dispatch factory. Curry the tool preset + the sandbox client; return a + * `ProfileDispatchFn` the matrix calls once per cell. + * + * @param clientFor Resolve a `SandboxClient` for a profile's harness. Offline: + * return `offlineSandboxClient(...)`. Live: `new SandboxClient(...)`. + */ +export function codingDispatch( + toolPreset: ToolPreset, + clientFor: (profile: AgentProfile) => SandboxClient, +): ProfileDispatchFn { + return async ( + profile: AgentProfile, + scenario: CodingScenario, + ctx: DispatchContext, + ): Promise<RunArtifact> => { + const harness = harnessOf(profile) + // Author the tool surface onto the profile (one line). The substrate + // materializes it into the harness's real config. + const equippedProfile = withTools(profile, toolPreset) + const cmds = checkCmds(scenario) + + const agentRun: AgentRunSpec = { + profile: equippedProfile, + // FIREWALL: the prompt is the WHOLE of what the agent sees. Only scenario.prompt. 
+ taskToPrompt: (task: string) => task, + sandboxOverrides: { backend: { type: harness } }, + } + + // Read the produced solution file off the box after each turn (the deliverable). + const run = await openSandboxRun<{ solution: string }>( + clientFor(profile), + { agentRun, signal: ctx.signal, runId: ctx.cellId, scenarioId: scenario.id }, + { + kind: 'artifact', + path: scenario.solutionPath, + fromArtifact: (raw: string) => ({ solution: raw }), + }, + ) + + try { + let checks = blankReport() + let solution = '' + let finalText = '' + + for (let round = 0; round < maxRounds; round += 1) { + const prompt = round === 0 ? scenario.prompt : nextPrompt(checks) + const turn = round === 0 ? await run.start(prompt) : await run.resume(prompt) + solution = turn.out.solution + finalText = turn.events.map(eventText).filter(Boolean).join(' ').slice(0, 2000) + + // Report usage so the integrity guard sees a real backend (not a stub). + // `extractLlmCallEvent` reads usage off EVERY backend event shape — the live + // sandbox's `done`/`result`/`llm_call` events all sum correctly here. + const usage = sumTokens(turn.events) + if (usage.input || usage.output) ctx.cost.observeTokens(usage) + + // Dev checks (visible example tests), IN THE BOX, this round. These (and only + // these) steer the next round — the firewall keeps the held-out suite + rubric + // out of the loop. `run.box` is a `SandboxInstance`; `CheckBox` is the minimal + // `exec`(+optional `fs.write`) subset the checks use — a structural narrowing. + checks = await runChecks(run.box as CheckBox, scenario, cmds) + if (checks.allPass) break // stop on worker-observable green only + } + + // HELD-OUT TEST EXECUTION — the anti-cheat. Runs AFTER the loop in the SAME box: + // the held-out suite (never seeded during the turn) is copied in and run against + // the agent's real solution. Its pass rate is the PRIMARY correctness score (the + // judge blends it as the recorded composite). 
A solution that hardcoded the visible + // examples fails the held-out inputs it never saw — execution truth, not a regex. + const heldout = await runHeldout(run.box as CheckBox, scenario, cmds.heldout) + await ctx.artifacts.writeJson(`heldout/${ctx.cellId}.json`, heldout) + + return { solution, finalText, checks, heldout } + } finally { + await run.close() + } + } +} + +/** An empty verifier report for the pre-loop state (no layer has run yet). */ +function blankReport(): RunArtifact['checks'] { + const now = new Date().toISOString() + return { + layers: [], + passCount: 0, + failCount: 0, + skippedCount: 0, + errorCount: 0, + allPass: false, + blendedScore: 0, + valid: false, + score: 0, + durationMs: 0, + startedAt: now, + finishedAt: now, + } +} + +/** Pull the agent's text out of a stream event (best-effort, for judge context). The + * text payload isn't on `SandboxEvent`'s typed surface, so we read `data` defensively. */ +function eventText(ev: SandboxEvent): string { + const e = ev as { data?: { finalText?: string; text?: string; delta?: string } } + return e.data?.finalText ?? e.data?.text ?? e.data?.delta ?? '' +} + +/** Sum token usage across the turn's events into the `{ input, output }` shape + * `ctx.cost.observeTokens` expects, using the runtime's own metering extractor so + * EVERY backend event shape (`done`/`result`/`llm_call`/`usage`) is counted. + * `events` is the turn's real `SandboxEvent[]` — `extractLlmCallEvent` takes it directly. */ +function sumTokens(events: SandboxEvent[]): { input: number; output: number } { + let input = 0 + let output = 0 + for (const ev of events) { + const call = extractLlmCallEvent(ev, 'agent') + if (call) { + input += call.tokensIn ?? 0 + output += call.tokensOut ?? 
0 + } + } + return { input, output } +} diff --git a/examples/coding-benchmark/eval.ts b/examples/coding-benchmark/eval.ts new file mode 100644 index 0000000..34cda1c --- /dev/null +++ b/examples/coding-benchmark/eval.ts @@ -0,0 +1,391 @@ +/** + * The SCORING stack, in the order it runs — cheapest and most objective first. + * + * 1. DEV CHECKS (in the box, ~$0) — an ordered `MultiLayerVerifier` pipeline: + * typecheck → test(visible) → lint, with dependency-based skip (test never runs on + * a type error) and a blended score. These pass/fail booleans steer the refine loop + * (see the firewall in dispatch.ts). They are ADVISORY for the final grade: passing + * the visible examples does not prove correctness, it just tells the agent it's on + * track. + * 2. HELD-OUT TEST EXECUTION (in the box, after the loop, ~$0) — the PRIMARY, + * ungameable correctness score. The hidden test suite (never seeded during the turn) + * is copied in and run with `node --experimental-transform-types --test`; the score + * is the held-out PASS RATE. A real solution passes; a cheat that hardcoded the + * visible examples or faked the hard part FAILS (it never saw these inputs). This is + * execution truth, not a text scan. + * 3. LLM JUDGE (last) — a SECONDARY code-QUALITY signal. One `llmJudge` model call for + * the leaderboard, or a cross-family `ensembleJudge` panel for a ship/no-ship claim. + * Both see the SAME full context (code + rubric + check results); the rubric anchors + * live HERE, never in the agent's workdir. + * + * Composite = held-out correctness (PRIMARY) + judge quality (secondary). The anti-cheat + * is the held-out execution — a hidden suite the agent never saw — not any text scan. + * + * Every layer is a published agent-eval primitive — `MultiLayerVerifier`, `llmJudge`, + * `ensembleJudge`. No hand-rolled scorer. 
+ */ + +import { + type ChatClient, + ensembleJudge, + type Layer, + MultiLayerVerifier, + type VerificationReport, +} from '@tangle-network/agent-eval' +// `llmJudge` is imported from the `/campaign` subpath, not the main index: it is +// exported from `/campaign` across the entire declared peer range (>=0.97), whereas the +// main-index re-export is newer — so a consumer pinned to the peer floor still compiles. +import { type JudgeConfig, type JudgeScore, llmJudge } from '@tangle-network/agent-eval/campaign' +import type { CodingScenario, TestFile } from './scenarios' + +// ── the composite weighting ─────────────────────────────────────────────────── +// Held-out correctness is the PRIMARY, ungameable score; the judge is a secondary +// quality signal. composite = heldoutWeight·heldout + judgeWeight·judge. +export const heldoutWeight = 0.7 +export const judgeWeight = 0.3 + +// ── the judge rubric (4 weighted dimensions, total 1.0) ─────────────────────── +// The rubric text + anchors live HERE, with the judge — never in the workdir. The +// agent is graded against criteria it cannot read. 
+export const rubric = { + correctness: { + weight: 0.4, + description: 'Does the code correctly implement the spec for all stated cases?', + }, + completeness: { + weight: 0.25, + description: 'Are all required behaviors and edge cases handled, nothing stubbed?', + }, + code_quality: { + weight: 0.2, + description: 'Is it clear, idiomatic, dependency-free as required, and maintainable?', + }, + robustness: { + weight: 0.15, + description: 'Does it handle malformed / boundary input without crashing or misbehaving?', + }, +} as const + +export type RubricDim = keyof typeof rubric +const dimKeys = Object.keys(rubric) as RubricDim[] +const weights = Object.fromEntries(dimKeys.map((k) => [k, rubric[k].weight])) as Record< + RubricDim, + number +> +const dimensions = dimKeys.map((k) => ({ key: k, description: rubric[k].description })) + +// ── the held-out result ──────────────────────────────────────────────────────── +export interface HeldoutResult { + /** Held-out tests that passed. */ + passed: number + /** Total held-out tests run. */ + total: number + /** Pass rate (0..1) — the PRIMARY correctness score. 0 when the suite errored + * (typecheck failure, import failure, or no tests ran). */ + passRate: number + /** Captured runner output (record only). */ + notes: string +} + +// ── the artifact the dispatch produces and the judges score ─────────────────── +export interface RunArtifact { + /** The solution file's content. */ + solution: string + /** The agent's final chat text for the round (judge context). */ + finalText: string + /** The deterministic dev-check report from the LAST round (visible tests). */ + checks: VerificationReport + /** The held-out test execution result, run AFTER the loop. The PRIMARY score. */ + heldout: HeldoutResult +} + +// ── layer 1: the deterministic check pipeline (visible tests) ────────────────── + +/** The minimal box surface the checks need — a subset of the real `SandboxInstance`. 
+ * The live sandbox satisfies it; the offline in-process box implements it too. `fs.write` + * is the structured write seam (both boxes expose it); we prefer it over a shell write so + * seeding never interpolates a path into a command string. */ +export interface CheckBox { + exec(command: string): Promise<{ exitCode: number; stdout: string; stderr: string }> + fs?: { write(path: string, content: string): Promise<void> } +} + +/** Seed a test file into the box. Prefers the structured `fs.write` seam so the path/ + * content is never interpolated into a shell command (no injection surface for partners + * who later load scenario paths from config). Falls back to a base64 shell write with + * SINGLE-QUOTED path words on a box that only exposes `exec`. The file's CONTENT is never + * described to the agent in the prompt — this is write-only scaffold (the firewall). */ +async function seedFile(box: CheckBox, file: TestFile): Promise<void> { + if (box.fs) { + await box.fs.write(file.path, file.content) + return + } + const b64 = Buffer.from(file.content, 'utf8').toString('base64') + const dir = file.path.includes('/') ? file.path.slice(0, file.path.lastIndexOf('/')) : '.' + await box.exec(`mkdir -p '${dir}' && printf %s '${b64}' | base64 -d > '${file.path}'`) +} + +/** One check command → a `Layer`. Pass/fail comes from the exit code. `advisory` + * layers always report `pass` (they ran) and fold their cleanliness into the + * blended score without gating `allPass` — that is how lint stays advisory. */ +function checkLayer( + name: string, + command: string, + opts: { + dependsOn?: string[] + advisory?: boolean + }, +): Layer { + return { + name, + ...(opts.dependsOn ? { dependsOn: opts.dependsOn } : {}), + async run({ env: box }) { + const r = await box.exec(command) + const ok = r.exitCode === 0 + const output = `${r.stdout}\n${r.stderr}`.trim() + const findings = ok + ? 
[] + : [ + { + severity: 'major' as const, + message: `${name} failed`, + evidence: output.slice(0, 1200), + }, + ] + if (opts.advisory) { + // Always "ran"; cleanliness folds into the blended score, never gates allPass. + return { + layer: name, + status: 'pass' as const, + score: ok ? 1 : 0.5, + durationMs: 0, + findings, + detail: { output }, + } + } + return { + layer: name, + status: ok ? ('pass' as const) : ('fail' as const), + score: ok ? 1 : 0, + durationMs: 0, + findings, + detail: { output }, + } + }, + } +} + +/** + * Run the scenario's dev checks in the box as an ordered pipeline. Seeds the VISIBLE + * example test first (the agent may read it, TDD-style), then typecheck → test → lint. + * `report.allPass` is true only when typecheck AND test pass (lint is advisory). The + * `report.layers[*].detail.output` is what the refine loop reads to build the next + * prompt. The HELD-OUT test is NOT seeded here — that is the firewall. + */ +export async function runChecks( + box: CheckBox, + scenario: CodingScenario, + cmds: { typecheck: string; dev: string; lint: string }, +): Promise<VerificationReport> { + await seedFile(box, scenario.visibleTest) + const verifier = new MultiLayerVerifier([ + checkLayer('typecheck', cmds.typecheck, {}), + checkLayer('test', cmds.dev, { dependsOn: ['typecheck'] }), + checkLayer('lint', cmds.lint, { dependsOn: ['typecheck'], advisory: true }), + ]) + return verifier.run({ env: box, overallCapMs: 120_000 }) +} + +/** Pull one check layer's captured output (for the refine prompt). `passed` is the + * gating status (advisory layers always report `pass`); `clean` is the layer's real + * cleanliness (score === 1) — so the refine prompt can surface advisory lint warnings + * (clean === false) without those warnings gating `allPass`. 
*/ +export function layerOutput( + report: VerificationReport, + layer: string, +): { passed: boolean; clean: boolean; output: string } { + const r = report.layers.find((l) => l.layer === layer) + return { + passed: r?.status === 'pass', + clean: r ? r.score === 1 : false, + output: typeof r?.detail?.output === 'string' ? r.detail.output : '', + } +} + +// ── layer 2: held-out test execution (the PRIMARY anti-cheat) ────────────────── + +/** + * Seed the held-out suite into the box AFTER the loop and run it. The score is the + * held-out PASS RATE — the primary, ungameable correctness number. The agent never saw + * these tests during the turn (the firewall), so a solution that hardcoded the visible + * examples or faked the hard part fails them; only real behavior passes. + * + * `node --test` prints a TAP-ish summary (`# tests N`, `# pass N`, `# fail N`). We parse + * those counts. A non-zero exit with no parseable counts (a typecheck/import error before + * any test ran) is a 0/0 → passRate 0 — the honest "did not even run" signal, never a + * spurious pass. This runs in the SAME box, so it sees the agent's real solution file. + */ +export async function runHeldout( + box: CheckBox, + scenario: CodingScenario, + heldoutCmd: string, +): Promise<HeldoutResult> { + await seedFile(box, scenario.heldoutTest) + const r = await box.exec(heldoutCmd) + const output = `${r.stdout}\n${r.stderr}`.trim() + const counts = parseTestCounts(output) + // No parseable counts means the suite never ran (e.g. the solution didn't typecheck or + // import) — that is a 0 pass rate, the honest "did not even run" result. + const total = counts.total + const passed = counts.pass + const passRate = total > 0 ? passed / total : 0 + return { + passed, + total, + passRate, + notes: + total > 0 + ? `held-out ${passed}/${total} pass` + : `held-out suite did not run (exit ${r.exitCode})`, + } +} + +/** Parse `node --test`'s summary counts from its output. 
Reads the `tests`, `pass`, and + * `fail` summary lines, which `node --test` prefixes with either `ℹ` (its default + * reporter) or `#` (the TAP reporter) and may wrap in ANSI colour. We strip ANSI and + * accept both markers. When `tests` is absent we fall back to pass+fail. Returns + * {total:0,pass:0} when nothing parseable (the suite never ran) — never guesses a pass. */ +function parseTestCounts(output: string): { total: number; pass: number } { + // biome-ignore lint/suspicious/noControlCharactersInRegex: stripping terminal ANSI escapes + const clean = output.replace(/\u001b\[[0-9;]*m/g, '') + const read = (label: string): number | undefined => { + const m = clean.match(new RegExp(`(?:#|\\u2139)\\s*${label}\\s+(\\d+)`)) + return m ? Number(m[1]) : undefined + } + const pass = read('pass') ?? 0 + const fail = read('fail') ?? 0 + const tests = read('tests') + const total = tests ?? pass + fail + return { total, pass } +} + +// ── layer 3: the LLM judge(s) — SECONDARY quality signal ─────────────────────── + +/** The judge instructions — the rubric anchors, kept with the judge ONLY. */ +const judgePrompt = [ + 'You are a senior code reviewer scoring a candidate solution to a coding task.', + 'Score each dimension from 0 to 1 (1 = excellent), using the criteria provided.', +].join(' ') + +/** The full context every judge sees: the code + the deterministic check results + + * the held-out pass rate + the eval-only rubric note. Shared by the single judge AND + * the ensemble so the panel never grades on less information than the leaderboard judge. 
*/ +function renderForJudge(artifact: RunArtifact, scenario: CodingScenario): string { + return [ + `Task intent: ${scenario.prompt}`, + `Grading note: ${scenario.rubricNote}`, + `Dev checks — typecheck:${layerOutput(artifact.checks, 'typecheck').passed} ` + + `visible-test:${layerOutput(artifact.checks, 'test').passed} ` + + `lint:${layerOutput(artifact.checks, 'lint').passed}`, + `Held-out correctness: ${artifact.heldout.passed}/${artifact.heldout.total} ` + + `(${(artifact.heldout.passRate * 100).toFixed(0)}%)`, + '', + 'Candidate solution:', + '```ts', + artifact.solution.slice(0, 8000), + '```', + ].join('\n') +} + +/** ── ONE judge ────────────────────────────────────────────────────────────── + * `llmJudge` builds a campaign `JudgeConfig` whose `score()` makes ONE model call + * against the rubric and reduces it to a canonical `{ dimensions, composite, notes }`. + * The judge's composite is the SECONDARY quality signal; we wrap it with `blendHeldout` + * so the composite the matrix RECORDS is the PRIMARY-weighted blend (held-out pass rate + * + judge quality). */ +export function singleCodeJudge(chat: ChatClient): JudgeConfig { + const base = llmJudge('code-quality', judgePrompt, { + chat, + dimensions, + weights, + scale: 'unit', + appliesTo: (s) => s.kind === 'coding', + renderUser: ({ artifact, scenario }) => renderForJudge(artifact, scenario), + }) + return blendHeldout(base) +} + +/** ── THREE judges ──────────────────────────────────────────────────────────── + * `ensembleJudge` fans the artifact across N cross-family models in parallel and + * reduces surviving verdicts to one `JudgeScore`. A model that throws is excluded, + * never folded into a zero. `crossFamily: true` rejects a same-family panel at + * construction. The panel sees the SAME full context as the single judge. 
*/ +export function ensembleCodeJudge( + scoreOne: (model: string, context: string) => Promise<Record<string, number>>, +): JudgeConfig { + const base = ensembleJudge({ + name: 'code-quality-ensemble', + dimensions: dimKeys, + // Snapshot-dated, cross-family panel — the SAME reproducibility rule profiles.ts + // enforces on harness models (a bare alias isn't reproducible: "which gpt-4o-mini?"). + models: [ + 'deepseek/deepseek-chat-2025-08-21', + 'openai/gpt-4o-mini-2024-07-18', + 'google/gemini-2.0-flash-2025-02-05', + ], + crossFamily: true, + weights, + scoreWith: async (model, input) => { + const artifact = input.artifact as RunArtifact + const scenario = input.scenario as CodingScenario + const perDimension = await scoreOne(model, renderForJudge(artifact, scenario)) + return { model, perDimension } + }, + }) as JudgeConfig + return blendHeldout(base) +} + +// ── the composite: held-out correctness (PRIMARY) + judge quality (secondary) ── + +/** + * Blend the PRIMARY held-out pass rate with the SECONDARY judge composite into the final + * score the leaderboard ranks on. This is what makes held-out execution the load-bearing + * grade: a solution that fails the held-out suite is capped low no matter how the judge + * felt about its style, and a stylistically-mediocre but CORRECT solution still scores + * the bulk of the points. + */ +export function composeScore(heldoutPassRate: number, judgeComposite: number): number { + return heldoutWeight * heldoutPassRate + judgeWeight * judgeComposite +} + +/** Wrap a judge so the composite it REPORTS is the held-out-weighted blend. The judge + * still scores its quality dimensions (recorded, secondary), but the composite the + * matrix stamps as the run's score is `composeScore(heldoutPassRate, judgeComposite)` — + * so the leaderboard ranks on execution truth first, style second.
The artifact is in + * scope at score time, so the held-out pass rate (computed before the judge runs) is + * read directly off it; no separate stats-side blend is needed. */ +function blendHeldout( + judge: JudgeConfig, +): JudgeConfig { + return { + ...judge, + async score(input: { + artifact: RunArtifact + scenario: CodingScenario + signal: AbortSignal + }): Promise<JudgeScore> { + const base = await judge.score(input) + const heldout = input.artifact.heldout + const composite = composeScore(heldout.passRate, base.composite) + return { + ...base, + composite, + notes: + `composite=${composite.toFixed(3)} ` + + `(held-out ${(heldout.passRate * 100).toFixed(0)}% × ${heldoutWeight} + ` + + `quality ${base.composite.toFixed(3)} × ${judgeWeight})` + + (base.notes ? ` — ${base.notes}` : ''), + } + }, + } +} diff --git a/examples/coding-benchmark/offline-box.ts b/examples/coding-benchmark/offline-box.ts new file mode 100644 index 0000000..8956ba6 --- /dev/null +++ b/examples/coding-benchmark/offline-box.ts @@ -0,0 +1,111 @@ +/** + * The OFFLINE seam — an in-process `SandboxClient` so the WHOLE benchmark runs + * with no creds and no network, exactly like `examples/ui-audit/` does. + * + * The offline "agent" is a SCRIPTED STAND-IN for a real coding agent: it writes a + * canned solution per round instead of calling a model. That is the only thing + * stubbed — the matrix, the verifier, the held-out test execution, the judge wiring, + * and the stats all run for real. `--live` swaps this client for `new SandboxClient(...)` + * and the same dispatch runs each round in a real harness box. + * + * It implements only what `openSandboxRun` actually calls on a box: + * - `streamPrompt(prompt, opts)` — the "agent" turn. Writes the round's scripted + * solution into a real temp workspace and emits one terminal `done` event — the + * SAME shape a live box emits, carrying `tokenUsage` so the run meters honestly + * and `extractLlmCallEvent` reads it.
+ * - `fs.read` / `fs.write` — over the temp workspace (the `artifact` deliverable + + * the seeded test files live here). + * - `exec(cmd)` — runs the check + seed commands. Offline, `node` IS present so the + * test commands (`node --experimental-transform-types --test`) run for real — which + * is what lets the HELD-OUT execution genuinely grade the solution with no creds. But + * `tsc`/`biome` usually aren't installed, so the typecheck (and the test layer that + * `dependsOn` it) read as a FAIL — the honest offline signal. The dev checks never + * fully pass offline, so all `maxRounds` run, which is when refinement shows. + * - `delete()` — tears the temp dir down. + */ + +import { exec as execCb } from 'node:child_process' +import { mkdtempSync, rmSync } from 'node:fs' +import { mkdir, readFile, writeFile } from 'node:fs/promises' +import { tmpdir } from 'node:os' +import { dirname, join } from 'node:path' +import { promisify } from 'node:util' +import type { SandboxClient } from '@tangle-network/agent-runtime/loops' +import type { CreateSandboxOptions, SandboxEvent, SandboxInstance } from '@tangle-network/sandbox' + +const execAsync = promisify(execCb) + +/** A scripted offline solution: which file, and what content to write on a given + * round. `solutionFor(round)` lets round N differ from round N-1 — a REAL refine + * demo, not a constant. */ +export interface OfflineScript { + path: string + solutionFor: (round: number) => string +} + +function instanceMethods(workdir: string, script: OfflineScript) { + let round = 0 + return { + id: `offline-${Math.random().toString(36).slice(2, 8)}`, + // The "agent" turn. Writes the scripted solution, emits one terminal event. 
+ async *streamPrompt(_message: string | unknown[]): AsyncGenerator<SandboxEvent> { + const content = script.solutionFor(round) + round += 1 + const abs = join(workdir, script.path) + await mkdir(dirname(abs), { recursive: true }) + await writeFile(abs, content, 'utf8') + // The real sandbox terminal event shape: `done` with `data.tokenUsage` + + // top-level `totalCostUsd`. `extractLlmCallEvent` reads exactly this. The cast is + // structural: this is one member of the wide `SandboxEvent` union, written out + // literally; we don't reconstruct the whole union just to emit one done event. + yield { + type: 'done', + data: { + tokenUsage: { inputTokens: 600, outputTokens: 400 }, + totalCostUsd: 0, + finalText: `wrote ${script.path} (offline round ${round})`, + }, + } as unknown as SandboxEvent + }, + fs: { + async read(path: string): Promise<string> { + return readFile(join(workdir, path), 'utf8') + }, + async write(path: string, content: string): Promise<void> { + const abs = join(workdir, path) + await mkdir(dirname(abs), { recursive: true }) + await writeFile(abs, content, 'utf8') + }, + }, + async exec(command: string): Promise<{ exitCode: number; stdout: string; stderr: string }> { + try { + const { stdout, stderr } = await execAsync(command, { cwd: workdir, timeout: 30_000 }) + return { exitCode: 0, stdout, stderr } + } catch (err) { + const e = err as { code?: number; stdout?: string; stderr?: string; message?: string } + return { + exitCode: e.code ?? 1, + stdout: e.stdout ?? '', + stderr: e.stderr ?? e.message ?? '', + } + } + }, + async delete(): Promise<void> { + rmSync(workdir, { recursive: true, force: true }) + }, + } +} + +/** An in-process `SandboxClient`. Each `create()` mints a fresh temp workspace box.
*/ +export function offlineSandboxClient(script: OfflineScript): SandboxClient { + return { + async create(_options?: CreateSandboxOptions): Promise<SandboxInstance> { + const workdir = mkdtempSync(join(tmpdir(), 'coding-bench-')) + // The offline box implements only the members `openSandboxRun` actually calls + // (streamPrompt / fs / exec / delete), not the full `SandboxInstance`. The cast is + // a deliberate subset-as-superset for the offline seam; the live path uses the + // real SDK client. We don't stub the ~40 unused members to satisfy the type. + return instanceMethods(workdir, script) as unknown as SandboxInstance + }, + } +} diff --git a/examples/coding-benchmark/profiles.ts b/examples/coding-benchmark/profiles.ts new file mode 100644 index 0000000..bf24da5 --- /dev/null +++ b/examples/coding-benchmark/profiles.ts @@ -0,0 +1,123 @@ +/** + * The HARNESS axis + the TOOL knob — the agent-config side of the matrix. + * + * We measure the HARNESS on its default behavior, so each profile is deliberately + * bare: a name, the model it runs, and NOTHING else (no skills, no injected system + * prompt). Adding scaffolding here would measure our scaffolding, not the harness. + * The tool surface is a separate, orthogonal knob authored onto the profile in one + * line (`withTools`), so harness × tool is a clean cartesian. + * + * Two things to know about the shape: + * - `runProfileMatrix` takes `AgentProfile[]` from `@tangle-network/agent-interface`. + * That type has NO `harness` field (harness is a SANDBOX concept, not a profile + * concept), so we carry the harness selector on `metadata.harness`. `dispatch.ts` + * reads it to pick `backend.type`. `harnessOf()` below is the one reader. + * - The matrix REQUIRES a snapshot-dated `model.default` (it stamps it onto every + * run record). For a real harness the agent uses the harness's own default model; + * we still name a model id here so the record is honest about what ran.
+ */ + +import type { AgentProfile, AgentProfileMcpServer } from '@tangle-network/agent-interface' +import type { BackendType } from '@tangle-network/sandbox' + +/** The harnesses we sweep. `cli-base` is the plain-CLI baseline (no agent harness). */ +export const harnesses = [ + 'claude-code', + 'opencode', + 'codex', + 'cli-base', +] as const satisfies readonly BackendType[] + +/** Read the harness a profile targets. The ONE place metadata.harness is decoded. */ +export function harnessOf(profile: AgentProfile): BackendType { + const h = profile.metadata?.harness + if (typeof h !== 'string') { + throw new Error(`profile "${profile.name}" is missing metadata.harness — see profiles.ts`) + } + return h as BackendType +} + +/** The default model each harness runs. Override per-harness via env if you like; + * the value is stamped onto the run record, so keep it truthful. + * + * IMPORTANT — the model id MUST carry a SNAPSHOT DATE. `runProfileMatrix` rejects a + * bare alias and requires the snapshot form (`provider/name-YYYY-MM-DD`), because a + * run record without the exact model snapshot is not reproducible ("which gpt-4.1 + * was that?"). This is the substrate keeping the benchmark paper-grade — keep the + * date current. */ +const harnessModel: Record<BackendType, string> = { + 'claude-code': process.env.CLAUDE_CODE_MODEL ?? 'anthropic/claude-sonnet-4-5-2025-09-29', + opencode: process.env.OPENCODE_MODEL ?? 'anthropic/claude-sonnet-4-5-2025-09-29', + codex: process.env.CODEX_MODEL ?? 'openai/gpt-5-codex-2025-09-15', + 'cli-base': process.env.CLI_BASE_MODEL ??
'openai/gpt-4.1-2025-04-14', + // unreached by this example, but BackendType is a closed union — name them all + 'kimi-code': 'moonshot/kimi-k2-2025-07-11', + amp: 'anthropic/claude-sonnet-4-5-2025-09-29', + 'factory-droids': 'anthropic/claude-sonnet-4-5-2025-09-29', + pi: 'openai/gpt-4.1-2025-04-14', + hermes: 'openai/gpt-4.1-2025-04-14', + forge: 'openai/gpt-4.1-2025-04-14', + openclaw: 'anthropic/claude-sonnet-4-5-2025-09-29', + nanoclaw: 'anthropic/claude-sonnet-4-5-2025-09-29', + acp: 'openai/gpt-4.1-2025-04-14', + cursor: 'anthropic/claude-sonnet-4-5-2025-09-29', +} + +/** + * One bare baseline profile per harness. The agent's behavior here is the + * harness's OUT-OF-THE-BOX behavior — exactly what a partner gets on day one. + */ +export const harnessProfiles: AgentProfile[] = harnesses.map((harness) => ({ + name: `${harness}-baseline`, + model: { default: harnessModel[harness] }, + // NO prompt, NO resources — measure the harness, not our scaffolding. + metadata: { harness }, +})) + +// ── the tool knob ───────────────────────────────────────────────────────────── +// +// A tool surface is a PRESET, not forked code. Each preset authors the SAME two +// fields onto a profile — native tools on/off (`profile.tools`) and an optional +// mounted MCP server (`profile.mcp`) — and the sandbox substrate materializes them +// into each harness's real shape (`.claude/`, `opencode.json`, codex config, ...). +// We never hand-write a per-harness config file. +// +// withTools(profile, 'web') // turn on the native web tools +// withTools(profile, 'search-mcp') // mount a search MCP instead +// withTools(profile, 'none') // baseline: no web, no MCP +// +// Honesty note for partners: a preset only takes effect for a (harness, lever) pair +// the sandbox actually materializes. If a harness has no native `webfetch`, +// `withTools(p,'web')` is a no-op THERE — a substrate fact, not something this +// example silently patches over. See `@tangle-network/sandbox` for the matrix. 
+ +/** Where a search MCP lives, when the `search-mcp` preset is selected. */ +const searchMcpUrl = process.env.TANGLE_SEARCH_MCP ?? 'https://search-mcp.tangle.tools/mcp' + +export type ToolPreset = 'none' | 'web' | 'search-mcp' + +interface ToolSurface { + /** Native harness tools, by name → enabled. Maps to `profile.tools`. */ + tools?: Record<string, boolean> + /** A mounted MCP server, by name. Maps to `profile.mcp`. */ + mcp?: Record<string, AgentProfileMcpServer> +} + +const presets: Record<ToolPreset, ToolSurface> = { + none: { tools: { websearch: false, webfetch: false } }, + web: { tools: { websearch: true, webfetch: true } }, + 'search-mcp': { + tools: { websearch: false, webfetch: false }, + mcp: { search: { transport: 'http', url: searchMcpUrl, enabled: true } }, + }, +} + +/** Author a tool surface onto a profile. Returns a NEW profile (pure). */ +export function withTools(profile: AgentProfile, preset: ToolPreset): AgentProfile { + const surface = presets[preset] + return { + ...profile, + ...(surface.tools ? { tools: surface.tools } : {}), + ...(surface.mcp ? { mcp: surface.mcp } : {}), + } +} diff --git a/examples/coding-benchmark/scenarios.ts b/examples/coding-benchmark/scenarios.ts new file mode 100644 index 0000000..46698ce --- /dev/null +++ b/examples/coding-benchmark/scenarios.ts @@ -0,0 +1,365 @@ +/** + * The held-out coding-task corpus — and the GRADING-CRITERIA FIREWALL, expressed as + * a type. + * + * Every scenario splits into four layers by where each field flows: + * - `prompt` — the only field copied into the agent's CONTEXT. The dispatch + * copies it (and next-round prompts built only from check output) + * into the worker; nothing else reaches the worker's context. + * - `visibleTest` — the example tests, SEEDED into the box workspace during the turn + * (so `node --test` has a file to run) and readable by a multi-round + * agent with native file tools — this is intentional, the same as + * real TDD: a few example cases the agent develops against. + * - `heldoutTest` — the HIDDEN grading suite.
Same behavior, MORE cases and DIFFERENT + * inputs/edge cases the visible examples don't cover. It is NEVER + * seeded into the box during the turn — that is the anti-cheat + * firewall. At grading (after the loop) the harness copies it in and + * runs it; the score is the held-out pass rate. A solution that + * hardcoded the visible examples FAILS these; only real behavior + * passes. This is execution truth, not a text scan. + * - rubricNote — the LLM-judge rubric note. Never written into the box at all; + * eval.ts reads it AFTER the loop to score CODE QUALITY (secondary). + * + * The firewall is a property of which field flows where — you can SEE it in one place + * (dispatch.ts seeds `visibleTest` into the box but never `heldoutTest`; see the + * `// FIREWALL` comment there). The honest claim is the precise one: the held-out test + * suite and the rubric never touch the box during the turn; the visible example tests are + * deliberately readable by the agent. + */ + +import type { Scenario } from '@tangle-network/agent-eval/campaign' + +/** A test file the harness writes into the box. `visibleTest` is seeded DURING the turn + * (the agent may read it); `heldoutTest` is seeded ONLY at grading, after the turn. */ +export interface TestFile { + path: string + content: string +} + +/** One held-out coding task. Extends the substrate `Scenario` ({ id, kind, tags }). */ +export interface CodingScenario extends Scenario { + /** ── AGENT-VISIBLE ────────────────────────────────────────────────────── + * The task as the agent reads it. A clean scaffold description + the ask. + * This is the WHOLE of what reaches the worker's context. */ + prompt: string + + /** Path (relative to the workspace root) the agent is asked to produce. The + * checks read this file off the box AFTER the turn; the judge scores it. 
*/ + solutionPath: string + + /** ── DEVELOP-AGAINST (seeded during the turn, TDD-style) ───────────────── + * A few example tests, seeded into the box so the agent can run/read them. */ + visibleTest: TestFile + + /** ── GRADING-ONLY (the agent NEVER sees this during the turn) ──────────── + * The held-out suite — same behavior, different inputs + edge cases the visible + * examples don't cover. Seeded ONLY at grading; the held-out pass rate is the + * PRIMARY, ungameable correctness score. Catches a hardcode-the-visible cheat. */ + heldoutTest: TestFile + + /** Extra grading context for the JUDGE only (design intent, edge cases to + * reward). Lives with the judge, never in the workdir. Secondary signal. */ + rubricNote: string +} + +// The deterministic check commands. Invoked directly (NOT via `npx -y`, which forces +// a registry round-trip every run): a real harness box has `tsc`/`biome`/`node` on +// PATH, so these run for real there; offline the missing tool fails FAST with a +// non-zero exit (the honest offline signal), not a 20s network stall. +/** A typecheck shell command for one solution file. */ +const typecheckCmd = (path: string) => `tsc --noEmit --strict --skipLibCheck ${path}` +/** A `node --test` command for one test file. The test imports the solution as a `.ts` + * file, so we run with `--experimental-transform-types`: Node's DEFAULT type-stripping + * is strip-only and throws `ERR_UNSUPPORTED_TYPESCRIPT_SYNTAX` on TS that emits runtime + * code — including constructor PARAMETER PROPERTIES (`constructor(private x: number)`), + * the exact style the canonical token-bucket impl uses. Without the flag a CORRECT + * solution would exit 1 and score as a test failure. The flag transforms (not just + * strips) the types so param properties run. 
+ * + * NODE FLOOR: `--experimental-transform-types` and `.ts`-import test execution need + * Node >= 22.6 (the package's `engines.node` floor covers the offline path, which + * degrades gracefully when the toolchain is absent, but the test LAYER itself — live + * or when copied — requires Node >= 22.6). On an older Node a correct solution would + * fail with no hint why. */ +const testCmd = (testPath: string) => `node --experimental-transform-types --test ${testPath}` +/** A lint shell command for one solution file. */ +const lintCmd = (path: string) => `biome check ${path}` + +/** + * A 3-task corpus. Real benchmarks carry 20-50; three keeps the example readable. + * Each is a self-contained "write one module that passes these checks" task — the + * shape that has a CORRECTABLE MIDDLE BAND (build-passes-but-quality-varies), which + * is what makes a benchmark able to separate harnesses at all. + * + * THE ANTI-CHEAT is the held-out suite, not a text scan. Each `heldoutTest` covers the + * SAME behavior as its `visibleTest` with DIFFERENT inputs and extra edge cases, so a + * solution that hardcoded the visible examples' exact values passes `visibleTest` but + * FAILS `heldoutTest`. Execution truth: a real implementation passes both; a cheat that + * fakes the hard part or memorizes the visible cases fails the held-out one (exit 1). + * + * POWER CAVEAT: three scenarios is far below the n the significance machinery needs to + * separate harnesses — the paired tests demonstrate the WIRING, not a defensible claim. + * A real run wants 20-50 tasks. At this n a near-constant gap can SHOW significance (the + * small-n mirage); `renderStats` flags that and prints the caveat when n < 6. 
+ */ +export const scenarios: CodingScenario[] = [ + { + id: 'rate-limiter', + kind: 'coding', + tags: ['algorithms', 'concurrency'], + prompt: [ + 'Implement a token-bucket rate limiter in TypeScript at `src/rate-limiter.ts`.', + 'Export `class RateLimiter` with a constructor `(capacity: number, refillPerSec: number)`', + 'and a method `tryRemove(tokens: number): boolean` that returns true and consumes the', + 'tokens if enough are available (refilling continuously over elapsed wall-clock time),', + 'and false otherwise. No external dependencies.', + ].join(' '), + solutionPath: 'src/rate-limiter.ts', + visibleTest: { + path: 'test/rate-limiter.test.ts', + content: `import { test } from 'node:test' +import assert from 'node:assert/strict' +import { RateLimiter } from '../src/rate-limiter.ts' + +test('consumes when tokens available', () => { + const rl = new RateLimiter(10, 1) + assert.equal(rl.tryRemove(5), true) + assert.equal(rl.tryRemove(5), true) +}) + +test('rejects when over capacity', () => { + const rl = new RateLimiter(3, 1) + assert.equal(rl.tryRemove(4), false) +}) + +test('rejects a second draw that exceeds the remaining bucket', () => { + const rl = new RateLimiter(10, 0) + assert.equal(rl.tryRemove(8), true) + assert.equal(rl.tryRemove(8), false) +}) +`, + }, + // HELD-OUT: same token-bucket behavior, DIFFERENT capacities/draws + extra edge + // cases (exact-capacity draw, zero-token draw). A solution that hardcoded the + // visible numbers (cap 10/3/10, draws 5/4/8) cannot satisfy these. 
+ heldoutTest: { + path: 'test/rate-limiter.heldout.test.ts', + content: `import { test } from 'node:test' +import assert from 'node:assert/strict' +import { RateLimiter } from '../src/rate-limiter.ts' + +test('consumes within a different capacity', () => { + const rl = new RateLimiter(7, 1) + assert.equal(rl.tryRemove(4), true) + assert.equal(rl.tryRemove(3), true) +}) + +test('allows a draw exactly equal to the remaining bucket', () => { + const rl = new RateLimiter(6, 0) + assert.equal(rl.tryRemove(6), true) + assert.equal(rl.tryRemove(1), false) +}) + +test('rejects a draw over a different capacity', () => { + const rl = new RateLimiter(5, 1) + assert.equal(rl.tryRemove(6), false) +}) + +test('a zero-token draw always succeeds without consuming', () => { + const rl = new RateLimiter(2, 0) + assert.equal(rl.tryRemove(0), true) + assert.equal(rl.tryRemove(2), true) +}) +`, + }, + rubricNote: + 'Reward continuous (not discrete-tick) refill, integer-safe token accounting, and ' + + 'correct behavior when tokens requested exceeds capacity (must return false, never block).', + }, + { + id: 'csv-parser', + kind: 'coding', + tags: ['parsing', 'edge-cases'], + prompt: [ + 'Implement an RFC-4180 CSV parser in TypeScript at `src/csv.ts`.', + 'Export `function parseCsv(input: string): string[][]`. 
It must handle quoted fields,', + 'escaped double-quotes inside quotes (""), and embedded newlines within quoted fields.', + 'No external dependencies.', + ].join(' '), + solutionPath: 'src/csv.ts', + visibleTest: { + path: 'test/csv.test.ts', + content: `import { test } from 'node:test' +import assert from 'node:assert/strict' +import { parseCsv } from '../src/csv.ts' + +test('parses a plain row', () => { + assert.deepEqual(parseCsv('a,b,c'), [['a', 'b', 'c']]) +}) + +test('keeps a comma inside a quoted field', () => { + assert.deepEqual(parseCsv('"a,b",c'), [['a,b', 'c']]) +}) + +test('keeps a newline inside a quoted field', () => { + assert.deepEqual(parseCsv('"line1\\nline2",b'), [['line1\\nline2', 'b']]) +}) + +test('unescapes a doubled quote', () => { + assert.deepEqual(parseCsv('"she said ""hi"""'), [['she said "hi"']]) +}) +`, + }, + // HELD-OUT: same RFC-4180 behaviors, DIFFERENT strings + extra edge cases (a multi-row + // input with two records, an empty field). A parser that hardcoded the visible inputs + // cannot satisfy these. 
+ heldoutTest: { + path: 'test/csv.heldout.test.ts', + content: `import { test } from 'node:test' +import assert from 'node:assert/strict' +import { parseCsv } from '../src/csv.ts' + +test('parses a different plain row', () => { + assert.deepEqual(parseCsv('x,y,z,w'), [['x', 'y', 'z', 'w']]) +}) + +test('keeps a comma inside a different quoted field', () => { + assert.deepEqual(parseCsv('p,"q,r,s"'), [['p', 'q,r,s']]) +}) + +test('parses two records separated by a newline', () => { + assert.deepEqual(parseCsv('a,b\\nc,d'), [['a', 'b'], ['c', 'd']]) +}) + +test('keeps a newline inside a different quoted field', () => { + assert.deepEqual(parseCsv('"alpha\\nbeta",gamma'), [['alpha\\nbeta', 'gamma']]) +}) + +test('unescapes a different doubled quote', () => { + assert.deepEqual(parseCsv('"a ""b"" c"'), [['a "b" c']]) +}) + +test('keeps an empty field between commas', () => { + assert.deepEqual(parseCsv('a,,c'), [['a', '', 'c']]) +}) +`, + }, + rubricNote: + 'Reward a single-pass state machine over naive splitting; correct handling of a quoted ' + + 'field containing a comma, a literal newline, and an escaped quote.', + }, + { + // The "only the real algorithm passes" task: a capacity-bounded LRU cache. There is + // no shortcut that satisfies the eviction tests — a bare `Map` (or `extends Map`) + // grows without bound and fails the at-capacity held-out test. + id: 'lru-cache', + kind: 'coding', + tags: ['data-structures', 'eviction'], + prompt: [ + 'Implement a capacity-bounded LRU (least-recently-used) cache in TypeScript at', + '`src/lru.ts`. Export `class LruCache<K, V>` with a constructor `(capacity: number)`,', + 'a `get(key: K): V | undefined`, and a `set(key: K, value: V): void`. On `set` past', + 'capacity, evict the least-recently-used entry; a `get` or a re-`set` counts as a use', + '(refreshes recency).
No external dependencies.', + ].join(' '), + solutionPath: 'src/lru.ts', + visibleTest: { + path: 'test/lru.test.ts', + content: `import { test } from 'node:test' +import assert from 'node:assert/strict' +import { LruCache } from '../src/lru.ts' + +test('evicts the least-recently-used entry at capacity', () => { + const c = new LruCache(2) + c.set('a', 1) + c.set('b', 2) + c.set('c', 3) + assert.equal(c.get('a'), undefined) + assert.equal(c.get('b'), 2) + assert.equal(c.get('c'), 3) +}) + +test('a get refreshes recency so the other key is evicted', () => { + const c = new LruCache(2) + c.set('a', 1) + c.set('b', 2) + assert.equal(c.get('a'), 1) + c.set('c', 3) + assert.equal(c.get('b'), undefined) + assert.equal(c.get('a'), 1) +}) + +test('returns undefined for a missing key', () => { + const c = new LruCache(2) + assert.equal(c.get('x'), undefined) +}) +`, + }, + // HELD-OUT: same eviction behavior, DIFFERENT capacity + key sequence + extra edge + // cases (a re-set updates value AND refreshes recency). A cache that hardcoded the + // visible keys/order cannot satisfy these. 
+ heldoutTest: { + path: 'test/lru.heldout.test.ts', + content: `import { test } from 'node:test' +import assert from 'node:assert/strict' +import { LruCache } from '../src/lru.ts' + +test('evicts the LRU entry at a different capacity', () => { + const c = new LruCache(3) + c.set('p', 1) + c.set('q', 2) + c.set('r', 3) + c.set('s', 4) + assert.equal(c.get('p'), undefined) + assert.equal(c.get('q'), 2) + assert.equal(c.get('s'), 4) +}) + +test('a get refreshes recency for a different sequence', () => { + const c = new LruCache(2) + c.set('m', 1) + c.set('n', 2) + assert.equal(c.get('m'), 1) + c.set('o', 3) + assert.equal(c.get('n'), undefined) + assert.equal(c.get('m'), 1) +}) + +test('a re-set updates the value and refreshes recency', () => { + const c = new LruCache(2) + c.set('a', 1) + c.set('b', 2) + c.set('a', 9) + c.set('c', 3) + assert.equal(c.get('b'), undefined) + assert.equal(c.get('a'), 9) + assert.equal(c.get('c'), 3) +}) +`, + }, + rubricNote: + 'Reward O(1) get/set with correct LRU eviction and recency refresh on read; an ' + + 'insertion-ordered Map with delete+re-set is the idiomatic dependency-free approach.', + }, +] + +/** The deterministic check commands for a scenario — derived from its paths. + * + * `dev` runs the VISIBLE example tests (seeded during the turn, what steers the refine + * loop). `heldout` runs the HIDDEN grading suite (seeded only at grading, never during + * the turn — the firewall). Eval config: the agent is told WHAT to build (the prompt) + * and develops against the visible tests, but is GRADED on the held-out suite it never + * saw, so it cannot fit the grade. 
*/ +export function checkCmds(scenario: CodingScenario): { + typecheck: string + dev: string + heldout: string + lint: string +} { + return { + typecheck: typecheckCmd(scenario.solutionPath), + dev: testCmd(scenario.visibleTest.path), + heldout: testCmd(scenario.heldoutTest.path), + lint: lintCmd(scenario.solutionPath), + } +} diff --git a/examples/coding-benchmark/stats.ts b/examples/coding-benchmark/stats.ts new file mode 100644 index 0000000..9ec73c8 --- /dev/null +++ b/examples/coding-benchmark/stats.ts @@ -0,0 +1,230 @@ +/** + * The STATS — turn the matrix's `RunRecord[]` into an honest leaderboard: + * - per-harness mean composite + a bootstrap CONFIDENCE INTERVAL (`confidenceInterval`) + * - per-harness PASS-RATE with a binomial Wilson interval (`wilson`) — the correct + * CI for a proportion (the continuous CI assumes the wrong distribution) + * - every harness PAIR compared on MATCHED scenarios with a REAL paired significance + * test (`pairedTTest`, or `wilcoxonSignedRank` for the non-parametric path), then + * BH-corrected across all pairs (`benjaminiHochberg`) so running many comparisons + * doesn't manufacture a false winner. The paired delta + its bootstrap CI + * (`pairedBootstrap`) is reported as the effect size. + * + * Every number here is one agent-eval primitive call. No hand-rolled statistics, + * and no fake p-values: BH is fed the actual paired-test p, not a CI proxy. + * + * Pairing discipline: the paired unit is the SCENARIO. With `reps > 1` a harness + * produces several records per scenario; we average them to ONE score per + * (harness, scenario) before pairing, so the paired arrays line up scenario-for- + * scenario and reps tighten the per-cell estimate instead of corrupting the pairing. 
+ */ + +import { + benjaminiHochberg, + confidenceInterval, + pairedBootstrap, + pairedTTest, + type RunRecord, + wilcoxonSignedRank, + wilson, +} from '@tangle-network/agent-eval' + +/** A composite at or above this counts as "green" for the pass-rate proportion. */ +const greenThreshold = 0.6 + +/** Which paired test to run. Parametric `t` by default; `wilcoxon` for skewed scores. */ +export type PairedTest = 't' | 'wilcoxon' + +interface HarnessRow { + harness: string + n: number + meanComposite: number + ci: { lower: number; upper: number } + passRate: number + passCi: { lower: number; upper: number } +} + +interface PairResult { + a: string + b: string + /** median paired delta (b − a) and its bootstrap CI */ + delta: number + low: number + high: number + /** the paired-test p-value (before correction) */ + p: number + /** BH-significant after correcting across all pairs */ + significant: boolean +} + +export interface StatsReport { + leaderboard: HarnessRow[] + pairs: PairResult[] +} + +/** Per-record composite — the score the judges produced. */ +function score(r: RunRecord): number { + return r.outcome.searchScore ?? r.outcome.holdoutScore ?? 0 +} + +/** Group records by harness profile. The matrix stamps the profile id (a hash) as + * `candidateId`; we resolve it to the readable harness name via `nameOf`. */ +function byHarness(records: RunRecord[], nameOf: (id: string) => string): Map<string, RunRecord[]> { + const m = new Map<string, RunRecord[]>() + for (const r of records) { + const key = nameOf(r.agentProfile?.profileId ?? r.candidateId) + const list = m.get(key) ?? [] + list.push(r) + m.set(key, list) + } + return m +} + +/** ONE mean score per scenario for a harness — collapses reps so the paired unit is + * the scenario, in a stable scenario order.
Fails LOUD on a record missing its + * `scenarioId`: a record with no scenario id cannot be paired honestly, and an empty + * '' fallback would silently merge DISTINCT scenarios into one bucket (corrupting both + * the leaderboard n and the pairing). No silent default — throw. */ +function meanByScenario(records: RunRecord[]): Map<string, number> { + const sums = new Map<string, { total: number; n: number }>() + for (const r of records) { + const id = r.scenarioId + if (!id) { + throw new Error( + `RunRecord (candidate ${r.candidateId ?? 'unknown'}) is missing scenarioId — ` + + 'cannot pair or average it. The matrix stamps scenarioId on every record; a ' + + 'missing one means an upstream bug, not something to silently merge.', + ) + } + const acc = sums.get(id) ?? { total: 0, n: 0 } + acc.total += score(r) + acc.n += 1 + sums.set(id, acc) + } + const out = new Map<string, number>() + for (const [id, acc] of sums) out.set(id, acc.n ? acc.total / acc.n : 0) + return out +} + +/** Scores for harness A and B on the SAME scenarios, aligned for pairing (one + * averaged score per scenario, in shared scenario order). */ +function pairedScores(a: RunRecord[], b: RunRecord[]): { aScores: number[]; bScores: number[] } { + const aMean = meanByScenario(a) + const bMean = meanByScenario(b) + const aScores: number[] = [] + const bScores: number[] = [] + for (const scenarioId of [...aMean.keys()].sort()) { + const bv = bMean.get(scenarioId) + if (bv !== undefined) { + aScores.push(aMean.get(scenarioId) as number) + bScores.push(bv) + } + } + return { aScores, bScores } +} + +export function pairwiseStats( + records: RunRecord[], + nameOf: (id: string) => string, + test: PairedTest = 't', +): StatsReport { + const groups = byHarness(records, nameOf) + const harnesses = [...groups.keys()].sort() + + const leaderboard: HarnessRow[] = harnesses.map((harness) => { + const rs = groups.get(harness) ?? [] + // Collapse reps to ONE mean per scenario BEFORE the CI/Wilson — the SAME unit the + // pairing path uses.
Reps tighten the per-(harness,scenario) estimate; they are NOT + // independent samples, so feeding every raw rep record into the CI would let + // identical reps fake a narrower interval out of zero new information. The honest n + // is the number of distinct scenarios, not records. + const scores = [...meanByScenario(rs).values()] + const ci = confidenceInterval(scores, 0.95, { seed: 7 }) + const passes = scores.filter((s) => s >= greenThreshold).length + const passCi = wilson(passes, scores.length, 0.95) + return { + harness, + n: scores.length, + meanComposite: ci.mean, + ci: { lower: ci.lower, upper: ci.upper }, + passRate: scores.length ? passes / scores.length : 0, + passCi: { lower: passCi.lower, upper: passCi.upper }, + } + }) + + // Every unordered harness pair, with a REAL paired test on matched scenarios. + const raw: Omit<PairResult, 'significant'>[] = [] + for (let i = 0; i < harnesses.length; i += 1) { + for (let j = i + 1; j < harnesses.length; j += 1) { + const ha = harnesses[i] as string + const hb = harnesses[j] as string + const { aScores, bScores } = pairedScores(groups.get(ha) ?? [], groups.get(hb) ?? []) + if (aScores.length === 0) continue + // Effect size + CI from the paired bootstrap; the p-value from a real paired test. + const boot = pairedBootstrap(aScores, bScores, { seed: 7, statistic: 'median' }) + const p = + test === 'wilcoxon' + ? wilcoxonSignedRank(aScores, bScores).p + : pairedTTest(aScores, bScores).p + raw.push({ a: ha, b: hb, delta: boot.median, low: boot.low, high: boot.high, p }) + } + } + + // BH-correct the REAL p-values across all pairs (controls the false-discovery rate). + const { significant } = benjaminiHochberg( + raw.map((r) => r.p), + 0.05, + ) + const pairs: PairResult[] = raw.map((r, i) => ({ ...r, significant: significant[i] ??
false })) + + return { leaderboard, pairs } +} + +/** The power floor below which we never print a bare `SIGNIFICANT` claim — a paired + * test on fewer scenarios than this cannot defensibly separate harnesses, so the tag + * is suppressed regardless of the p-value (small-n mirage protection). */ +const powerFloor = 6 + +/** Render the report as a plain leaderboard + significance lines. */ +export function renderStats(report: StatsReport): string { + const lines: string[] = [] + lines.push('Harness leaderboard (mean composite, 95% CI; pass-rate, Wilson CI):') + for (const row of report.leaderboard) { + lines.push( + ` ${row.harness.padEnd(22)} composite ${row.meanComposite.toFixed(3)} ` + + `[${row.ci.lower.toFixed(3)}, ${row.ci.upper.toFixed(3)}] ` + + `pass ${(row.passRate * 100).toFixed(0)}% ` + + `[${(row.passCi.lower * 100).toFixed(0)}%, ${(row.passCi.upper * 100).toFixed(0)}%] (n=${row.n})`, + ) + } + // The honest n for the significance tests is the number of MATCHED scenarios — the + // paired unit. Below the power floor we suppress the SIGNIFICANT tag entirely (a + // near-constant gap on a few scenarios can return p<0.05 yet mean nothing — the + // small-n mirage), and a zero-variance pair (delta CI collapsed to a point) likewise + // never reads as a real effect. + const maxN = report.leaderboard.reduce((m, r) => Math.max(m, r.n), 0) + const underpowered = maxN < powerFloor + lines.push('') + lines.push('Pairwise (paired delta + bootstrap CI; paired-test p, BH-corrected):') + for (const p of report.pairs) { + const degenerate = p.low === p.high // bootstrap CI collapsed → no variance to test + const claimSignificant = p.significant && !underpowered && !degenerate + const tag = claimSignificant ? 'SIGNIFICANT' : underpowered ? 'n.s. (underpowered)' : 'n.s.' 
+ lines.push( + ` ${p.b} − ${p.a}: Δ=${p.delta.toFixed(3)} [${p.low.toFixed(3)}, ${p.high.toFixed(3)}] ` + + `p=${p.p.toFixed(3)} ${tag}`, + ) + } + // Power caveat: with a tiny scenario corpus the significance machinery is structurally + // underpowered — the Wilcoxon path returns p=1 for n<6 non-zero diffs, and the paired + // t-test has ~1 df. The tests show the WIRING; a real claim needs 20-50 tasks. + if (underpowered) { + lines.push('') + lines.push( + ` NOTE: n=${maxN} scenarios — below the power floor (${powerFloor}). The paired tests ` + + 'above cannot defensibly reach significance at this corpus size, so the SIGNIFICANT ' + + 'tag is suppressed (they demonstrate the wiring). Use 20-50 tasks for a real ' + + 'harness comparison.', + ) + } + return lines.join('\n') +} diff --git a/package.json b/package.json index 49bcf12..61d2b38 100644 --- a/package.json +++ b/package.json @@ -95,6 +95,7 @@ "@types/node": "^25.9.3", "playwright": "^1.61.0", "tsup": "^8.0.0", + "tsx": "^4.22.4", "typedoc": "0.28.19", "typedoc-plugin-markdown": "4.12.0", "typescript": "^5.7.0", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 8e19c4e..43dc26c 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -32,7 +32,10 @@ importers: version: 1.61.0 tsup: specifier: ^8.0.0 - version: 8.5.1(postcss@8.5.13)(typescript@5.9.3)(yaml@2.9.0) + version: 8.5.1(postcss@8.5.13)(tsx@4.22.4)(typescript@5.9.3)(yaml@2.9.0) + tsx: + specifier: ^4.22.4 + version: 4.22.4 typedoc: specifier: 0.28.19 version: 0.28.19(typescript@5.9.3) @@ -44,7 +47,7 @@ importers: version: 5.9.3 vitest: specifier: ^3.0.0 - version: 3.2.4(@types/node@25.9.3)(yaml@2.9.0) + version: 3.2.4(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0) packages: @@ -124,156 +127,312 @@ packages: cpu: [ppc64] os: [aix] + '@esbuild/aix-ppc64@0.28.1': + resolution: {integrity: sha512-Svl7tq8k/08+p6CXPpRjQ1fKX+1odH/BQbb48fV6fj3CWHhsoIOoY87w1oHXm0qEpkIK3ZfVgp0hed3XBXzXMQ==} + engines: {node: '>=18'} + cpu: [ppc64] + os: [aix] + 
'@esbuild/android-arm64@0.27.7': resolution: {integrity: sha512-62dPZHpIXzvChfvfLJow3q5dDtiNMkwiRzPylSCfriLvZeq0a1bWChrGx/BbUbPwOrsWKMn8idSllklzBy+dgQ==} engines: {node: '>=18'} cpu: [arm64] os: [android] + '@esbuild/android-arm64@0.28.1': + resolution: {integrity: sha512-34EGEbCIAgosYz6goLcopX6Mo7NyGv9tfwEM2/7Ce2VcVRk568iSvniGWcUXIy7wEDR1wzolcxcriFVrWYcwBg==} + engines: {node: '>=18'} + cpu: [arm64] + os: [android] + '@esbuild/android-arm@0.27.7': resolution: {integrity: sha512-jbPXvB4Yj2yBV7HUfE2KHe4GJX51QplCN1pGbYjvsyCZbQmies29EoJbkEc+vYuU5o45AfQn37vZlyXy4YJ8RQ==} engines: {node: '>=18'} cpu: [arm] os: [android] + '@esbuild/android-arm@0.28.1': + resolution: {integrity: sha512-0k2F129Xdio1TdJfzJ8sy1Q47vUD2NnwdhiAf7drUN1EBTfPf4hsFCtmMgu/6m8JSzsBrlmVjudMBQqOfG8usQ==} + engines: {node: '>=18'} + cpu: [arm] + os: [android] + '@esbuild/android-x64@0.27.7': resolution: {integrity: sha512-x5VpMODneVDb70PYV2VQOmIUUiBtY3D3mPBG8NxVk5CogneYhkR7MmM3yR/uMdITLrC1ml/NV1rj4bMJuy9MCg==} engines: {node: '>=18'} cpu: [x64] os: [android] + '@esbuild/android-x64@0.28.1': + resolution: {integrity: sha512-dbwY7ltSMDWsRatcRpCnES4F+im88OCUgGZjy52shC7GqHRE/cYlxNbB4Z4UpJswpcc4Qxd2oE/ufM0p61IKng==} + engines: {node: '>=18'} + cpu: [x64] + os: [android] + '@esbuild/darwin-arm64@0.27.7': resolution: {integrity: sha512-5lckdqeuBPlKUwvoCXIgI2D9/ABmPq3Rdp7IfL70393YgaASt7tbju3Ac+ePVi3KDH6N2RqePfHnXkaDtY9fkw==} engines: {node: '>=18'} cpu: [arm64] os: [darwin] + '@esbuild/darwin-arm64@0.28.1': + resolution: {integrity: sha512-TZbWkQY7kvTAXbXUT7uVACR5cMHsDiSz9z7ZKAX/RTq/WJEk3QyRr0wZpNhBDX+/0CtdqUIJlOiodQcta6tY3Q==} + engines: {node: '>=18'} + cpu: [arm64] + os: [darwin] + '@esbuild/darwin-x64@0.27.7': resolution: {integrity: sha512-rYnXrKcXuT7Z+WL5K980jVFdvVKhCHhUwid+dDYQpH+qu+TefcomiMAJpIiC2EM3Rjtq0sO3StMV/+3w3MyyqQ==} engines: {node: '>=18'} cpu: [x64] os: [darwin] + '@esbuild/darwin-x64@0.28.1': + resolution: {integrity: 
sha512-zfdzgK9ACBNZLI/CyHTOx81SyNbM6YXn7rxSgX97VjyiPl9W1i4Ka4fgKECEoFCKGpvBj5qArWIGgQjOwkgskQ==} + engines: {node: '>=18'} + cpu: [x64] + os: [darwin] + '@esbuild/freebsd-arm64@0.27.7': resolution: {integrity: sha512-B48PqeCsEgOtzME2GbNM2roU29AMTuOIN91dsMO30t+Ydis3z/3Ngoj5hhnsOSSwNzS+6JppqWsuhTp6E82l2w==} engines: {node: '>=18'} cpu: [arm64] os: [freebsd] + '@esbuild/freebsd-arm64@0.28.1': + resolution: {integrity: sha512-wG2EA8ENdEI0qhkSZMjfqrdY+ziCYCPMmtZjjIwOmXFjmyzEHn+UUxk5of+SYsjtfs3VpnlC7QLzSI5hY/rOAw==} + engines: {node: '>=18'} + cpu: [arm64] + os: [freebsd] + '@esbuild/freebsd-x64@0.27.7': resolution: {integrity: sha512-jOBDK5XEjA4m5IJK3bpAQF9/Lelu/Z9ZcdhTRLf4cajlB+8VEhFFRjWgfy3M1O4rO2GQ/b2dLwCUGpiF/eATNQ==} engines: {node: '>=18'} cpu: [x64] os: [freebsd] + '@esbuild/freebsd-x64@0.28.1': + resolution: {integrity: sha512-i7dZ9vQgnvSCzi/rYCXNgtF/U+eKZNJBzu3eTQbRgHnM7tNSizLOkRFAl3qzVc/Op/u5YkHHa4pf/3DOYHthLQ==} + engines: {node: '>=18'} + cpu: [x64] + os: [freebsd] + '@esbuild/linux-arm64@0.27.7': resolution: {integrity: sha512-RZPHBoxXuNnPQO9rvjh5jdkRmVizktkT7TCDkDmQ0W2SwHInKCAV95GRuvdSvA7w4VMwfCjUiPwDi0ZO6Nfe9A==} engines: {node: '>=18'} cpu: [arm64] os: [linux] + '@esbuild/linux-arm64@0.28.1': + resolution: {integrity: sha512-yHs+0uc8+nvEAfAfxrWQKK5peSNzBc4PegcMO0EJ2hT71uA7vB8Ihg2e77R2P7SG5uYjPbHlLLmve4LLLRCf0g==} + engines: {node: '>=18'} + cpu: [arm64] + os: [linux] + '@esbuild/linux-arm@0.27.7': resolution: {integrity: sha512-RkT/YXYBTSULo3+af8Ib0ykH8u2MBh57o7q/DAs3lTJlyVQkgQvlrPTnjIzzRPQyavxtPtfg0EopvDyIt0j1rA==} engines: {node: '>=18'} cpu: [arm] os: [linux] + '@esbuild/linux-arm@0.28.1': + resolution: {integrity: sha512-qVXBOHQS+d5Y722GwJzJUtOLlX7km3CraOaGormF1pDtPd2C/l1SHRPgjLunLGe51Sh5YYWKMFDyV4SxgMQYTQ==} + engines: {node: '>=18'} + cpu: [arm] + os: [linux] + '@esbuild/linux-ia32@0.27.7': resolution: {integrity: sha512-GA48aKNkyQDbd3KtkplYWT102C5sn/EZTY4XROkxONgruHPU72l+gW+FfF8tf2cFjeHaRbWpOYa/uRBz/Xq1Pg==} engines: {node: '>=18'} cpu: [ia32] os: 
[linux] + '@esbuild/linux-ia32@0.28.1': + resolution: {integrity: sha512-d1z4ZuP0ajrfz/FhGT4vv278rX8KnPPJx8i5+AtK7TYbx9Le9F1hyzurZpkEyjkGa9dUGhQow4C1NmeGvqxN2w==} + engines: {node: '>=18'} + cpu: [ia32] + os: [linux] + '@esbuild/linux-loong64@0.27.7': resolution: {integrity: sha512-a4POruNM2oWsD4WKvBSEKGIiWQF8fZOAsycHOt6JBpZ+JN2n2JH9WAv56SOyu9X5IqAjqSIPTaJkqN8F7XOQ5Q==} engines: {node: '>=18'} cpu: [loong64] os: [linux] + '@esbuild/linux-loong64@0.28.1': + resolution: {integrity: sha512-M5sRjUVZrkm1OAPR3dlOYzNmN+loZKGVi1VUQGrwuqLcbR6qeAz+famMhjASeH3YVKvZz+zT1jlh/keC3Rj/lg==} + engines: {node: '>=18'} + cpu: [loong64] + os: [linux] + '@esbuild/linux-mips64el@0.27.7': resolution: {integrity: sha512-KabT5I6StirGfIz0FMgl1I+R1H73Gp0ofL9A3nG3i/cYFJzKHhouBV5VWK1CSgKvVaG4q1RNpCTR2LuTVB3fIw==} engines: {node: '>=18'} cpu: [mips64el] os: [linux] + '@esbuild/linux-mips64el@0.28.1': + resolution: {integrity: sha512-mRObBZeHh2OxcBFPWE/FjylkRgZdYuiTR3vaTozquCGOH14iP9oN4x4Ge81CoIDYQrXmIxpFumJBu5MtZpnQJQ==} + engines: {node: '>=18'} + cpu: [mips64el] + os: [linux] + '@esbuild/linux-ppc64@0.27.7': resolution: {integrity: sha512-gRsL4x6wsGHGRqhtI+ifpN/vpOFTQtnbsupUF5R5YTAg+y/lKelYR1hXbnBdzDjGbMYjVJLJTd2OFmMewAgwlQ==} engines: {node: '>=18'} cpu: [ppc64] os: [linux] + '@esbuild/linux-ppc64@0.28.1': + resolution: {integrity: sha512-slScBsMAb3GFDcdrCgLwZtPYRoH2H/youv10QiZyRjmsP48fznoveWytSgCI/R0ZcUgpc0ZhIUEx6LHts8yrfQ==} + engines: {node: '>=18'} + cpu: [ppc64] + os: [linux] + '@esbuild/linux-riscv64@0.27.7': resolution: {integrity: sha512-hL25LbxO1QOngGzu2U5xeXtxXcW+/GvMN3ejANqXkxZ/opySAZMrc+9LY/WyjAan41unrR3YrmtTsUpwT66InQ==} engines: {node: '>=18'} cpu: [riscv64] os: [linux] + '@esbuild/linux-riscv64@0.28.1': + resolution: {integrity: sha512-kw0owk1o0GFETUJyW0jc0G4Yzs0BHZn0JDZ8JRT088vjJYX777BAs1fDGxAC+q831qOs2DTC96mNsG2opdfyyQ==} + engines: {node: '>=18'} + cpu: [riscv64] + os: [linux] + '@esbuild/linux-s390x@0.27.7': resolution: {integrity: 
sha512-2k8go8Ycu1Kb46vEelhu1vqEP+UeRVj2zY1pSuPdgvbd5ykAw82Lrro28vXUrRmzEsUV0NzCf54yARIK8r0fdw==} engines: {node: '>=18'} cpu: [s390x] os: [linux] + '@esbuild/linux-s390x@0.28.1': + resolution: {integrity: sha512-/lAIjX8aYFRByhh6L5rYtPEDRqa9de/4V/juOXcta5frjvzXO4/sqEtyytse0g3zZFuWu5cDN0MkLz2qRDD2Ag==} + engines: {node: '>=18'} + cpu: [s390x] + os: [linux] + '@esbuild/linux-x64@0.27.7': resolution: {integrity: sha512-hzznmADPt+OmsYzw1EE33ccA+HPdIqiCRq7cQeL1Jlq2gb1+OyWBkMCrYGBJ+sxVzve2ZJEVeePbLM2iEIZSxA==} engines: {node: '>=18'} cpu: [x64] os: [linux] + '@esbuild/linux-x64@0.28.1': + resolution: {integrity: sha512-u/anNYF2mmVOEDwLtnQ1wOr3EZ9sTNGLWrsYGYwHWzGA3Si84IOkHXlbWTD1NB+9/1lcnweYKO54uhxZydNzfA==} + engines: {node: '>=18'} + cpu: [x64] + os: [linux] + '@esbuild/netbsd-arm64@0.27.7': resolution: {integrity: sha512-b6pqtrQdigZBwZxAn1UpazEisvwaIDvdbMbmrly7cDTMFnw/+3lVxxCTGOrkPVnsYIosJJXAsILG9XcQS+Yu6w==} engines: {node: '>=18'} cpu: [arm64] os: [netbsd] + '@esbuild/netbsd-arm64@0.28.1': + resolution: {integrity: sha512-oks0DYbLwWMmaakTsCb+zL4E+aHRVLom9IJZOAthMQEPiQmydXHkziYEsGYRx0uNV/IjEKGAV941JzH02pflqw==} + engines: {node: '>=18'} + cpu: [arm64] + os: [netbsd] + '@esbuild/netbsd-x64@0.27.7': resolution: {integrity: sha512-OfatkLojr6U+WN5EDYuoQhtM+1xco+/6FSzJJnuWiUw5eVcicbyK3dq5EeV/QHT1uy6GoDhGbFpprUiHUYggrw==} engines: {node: '>=18'} cpu: [x64] os: [netbsd] + '@esbuild/netbsd-x64@0.28.1': + resolution: {integrity: sha512-aeL6lAnN89Hz43Mlh1G8ARasbuoYvSITDEx0tHh5b7jJnHcssqgjy9Yx430GDpmCa6OyrKoS0aNRjKundRizGg==} + engines: {node: '>=18'} + cpu: [x64] + os: [netbsd] + '@esbuild/openbsd-arm64@0.27.7': resolution: {integrity: sha512-AFuojMQTxAz75Fo8idVcqoQWEHIXFRbOc1TrVcFSgCZtQfSdc1RXgB3tjOn/krRHENUB4j00bfGjyl2mJrU37A==} engines: {node: '>=18'} cpu: [arm64] os: [openbsd] + '@esbuild/openbsd-arm64@0.28.1': + resolution: {integrity: sha512-MEFJe5C3R8pwXdZ5Y21oo6m7ePiS0d9pWucn99O/wvyJZChoIQKrQDxKrGeW8F5+T0okTHesAmDeiHDTIq0V/Q==} + engines: {node: '>=18'} + cpu: [arm64] + 
os: [openbsd] + '@esbuild/openbsd-x64@0.27.7': resolution: {integrity: sha512-+A1NJmfM8WNDv5CLVQYJ5PshuRm/4cI6WMZRg1by1GwPIQPCTs1GLEUHwiiQGT5zDdyLiRM/l1G0Pv54gvtKIg==} engines: {node: '>=18'} cpu: [x64] os: [openbsd] + '@esbuild/openbsd-x64@0.28.1': + resolution: {integrity: sha512-i/ZLIOafE0Z8cI/XANJAixoJL/uRAoS2xOA3rb0xN+KK0K177cMAsQYkzHtBrtMXAKuAc7HGgcWiZ/sRC1Nxgw==} + engines: {node: '>=18'} + cpu: [x64] + os: [openbsd] + '@esbuild/openharmony-arm64@0.27.7': resolution: {integrity: sha512-+KrvYb/C8zA9CU/g0sR6w2RBw7IGc5J2BPnc3dYc5VJxHCSF1yNMxTV5LQ7GuKteQXZtspjFbiuW5/dOj7H4Yw==} engines: {node: '>=18'} cpu: [arm64] os: [openharmony] + '@esbuild/openharmony-arm64@0.28.1': + resolution: {integrity: sha512-ge+Z7EXFNt2BO1oAMsVpiQ8EwndV9i1xXerAeTIK7AtPs3bKFXQM7nlRxDSIUIMeueR1CNXxqztLzdNeReKBJg==} + engines: {node: '>=18'} + cpu: [arm64] + os: [openharmony] + '@esbuild/sunos-x64@0.27.7': resolution: {integrity: sha512-ikktIhFBzQNt/QDyOL580ti9+5mL/YZeUPKU2ivGtGjdTYoqz6jObj6nOMfhASpS4GU4Q/Clh1QtxWAvcYKamA==} engines: {node: '>=18'} cpu: [x64] os: [sunos] + '@esbuild/sunos-x64@0.28.1': + resolution: {integrity: sha512-BEjgtECkL3vY+SaSQ6nzVfiALUeFxpawyp8Jmf5PtYhf1Ug40N1h/hxlhts+f1FvSvarEigdxS3BlSMI2PJLcQ==} + engines: {node: '>=18'} + cpu: [x64] + os: [sunos] + '@esbuild/win32-arm64@0.27.7': resolution: {integrity: sha512-7yRhbHvPqSpRUV7Q20VuDwbjW5kIMwTHpptuUzV+AA46kiPze5Z7qgt6CLCK3pWFrHeNfDd1VKgyP4O+ng17CA==} engines: {node: '>=18'} cpu: [arm64] os: [win32] + '@esbuild/win32-arm64@0.28.1': + resolution: {integrity: sha512-lCv9eK/H6ZJWbE7bh2nw54CZ9M2nupBxJcTsdk/QQnWkdSjKGuxmmH8/GWrlT1eMmZfn4dGcCjRte397WqfQXA==} + engines: {node: '>=18'} + cpu: [arm64] + os: [win32] + '@esbuild/win32-ia32@0.27.7': resolution: {integrity: sha512-SmwKXe6VHIyZYbBLJrhOoCJRB/Z1tckzmgTLfFYOfpMAx63BJEaL9ExI8x7v0oAO3Zh6D/Oi1gVxEYr5oUCFhw==} engines: {node: '>=18'} cpu: [ia32] os: [win32] + '@esbuild/win32-ia32@0.28.1': + resolution: {integrity: 
sha512-zvb/mB2bSCoJOpoCBgYKKpX6YM6mJBlBUVUtVj41DlZJVEB6/0CKlRYxP5wWl1C1ILiCoAU5wZZ4q1P3qeS6Eg==} + engines: {node: '>=18'} + cpu: [ia32] + os: [win32] + '@esbuild/win32-x64@0.27.7': resolution: {integrity: sha512-56hiAJPhwQ1R4i+21FVF7V8kSD5zZTdHcVuRFMW0hn753vVfQN8xlx4uOPT4xoGH0Z/oVATuR82AiqSTDIpaHg==} engines: {node: '>=18'} cpu: [x64] os: [win32] + '@esbuild/win32-x64@0.28.1': + resolution: {integrity: sha512-bm4Mowrv+GXMlpWX++EcXw/iLyd1o3+bJkC2DkWXYVvgZCqD/bSj9ctZeAMC3cIxgjRVR2Dufaiu4YPxr5gW1A==} + engines: {node: '>=18'} + cpu: [x64] + os: [win32] + '@gerrit0/mini-shiki@3.23.0': resolution: {integrity: sha512-bEMORlG0cqdjVyCEuU0cDQbORWX+kYCeo0kV1lbxF5bt4r7SID2l9bqsxJEM0zndaxpOUT7riCyIVEuqq/Ynxg==} @@ -718,6 +877,11 @@ packages: engines: {node: '>=18'} hasBin: true + esbuild@0.28.1: + resolution: {integrity: sha512-HrJrvZv5ayxBzPfwphOoNzkzOIIlifzk0KJrGK2c8R4+LKpMtpYLQeUdjnwjWv/LZlkH2laZk+4w78pi99D4Vw==} + engines: {node: '>=18'} + hasBin: true + estree-walker@3.0.3: resolution: {integrity: sha512-7RUKfXgSMMkzt6ZuXmqapOurLGPPfgj6l9uRZ7lRGolvk0y2yocc35LdcxKC5PQZdn2DMqioAQ2NoWcrTKmm6g==} @@ -979,6 +1143,11 @@ packages: typescript: optional: true + tsx@4.22.4: + resolution: {integrity: sha512-X8EX+XV4QR5xCsrgxaED954zTDfY8KqlDtskKEL0cHhyS/P8b4IFOvGDQpsC9Q1XnLq915wEfwwY/zzskCtmhg==} + engines: {node: '>=18.0.0'} + hasBin: true + typedoc-plugin-markdown@4.12.0: resolution: {integrity: sha512-eJDEMAfxCmede22c/Jw7d0FA13ggAQv+KkwQYKYCdqI02cin6Rc9QRwbG/7XvvHWinuFejySnZVUWDtvGk3Vbg==} engines: {node: '>= 18'} @@ -1166,81 +1335,159 @@ snapshots: '@esbuild/aix-ppc64@0.27.7': optional: true + '@esbuild/aix-ppc64@0.28.1': + optional: true + '@esbuild/android-arm64@0.27.7': optional: true + '@esbuild/android-arm64@0.28.1': + optional: true + '@esbuild/android-arm@0.27.7': optional: true + '@esbuild/android-arm@0.28.1': + optional: true + '@esbuild/android-x64@0.27.7': optional: true + '@esbuild/android-x64@0.28.1': + optional: true + '@esbuild/darwin-arm64@0.27.7': optional: true 
+ '@esbuild/darwin-arm64@0.28.1': + optional: true + '@esbuild/darwin-x64@0.27.7': optional: true + '@esbuild/darwin-x64@0.28.1': + optional: true + '@esbuild/freebsd-arm64@0.27.7': optional: true + '@esbuild/freebsd-arm64@0.28.1': + optional: true + '@esbuild/freebsd-x64@0.27.7': optional: true + '@esbuild/freebsd-x64@0.28.1': + optional: true + '@esbuild/linux-arm64@0.27.7': optional: true + '@esbuild/linux-arm64@0.28.1': + optional: true + '@esbuild/linux-arm@0.27.7': optional: true + '@esbuild/linux-arm@0.28.1': + optional: true + '@esbuild/linux-ia32@0.27.7': optional: true + '@esbuild/linux-ia32@0.28.1': + optional: true + '@esbuild/linux-loong64@0.27.7': optional: true + '@esbuild/linux-loong64@0.28.1': + optional: true + '@esbuild/linux-mips64el@0.27.7': optional: true + '@esbuild/linux-mips64el@0.28.1': + optional: true + '@esbuild/linux-ppc64@0.27.7': optional: true + '@esbuild/linux-ppc64@0.28.1': + optional: true + '@esbuild/linux-riscv64@0.27.7': optional: true + '@esbuild/linux-riscv64@0.28.1': + optional: true + '@esbuild/linux-s390x@0.27.7': optional: true + '@esbuild/linux-s390x@0.28.1': + optional: true + '@esbuild/linux-x64@0.27.7': optional: true + '@esbuild/linux-x64@0.28.1': + optional: true + '@esbuild/netbsd-arm64@0.27.7': optional: true + '@esbuild/netbsd-arm64@0.28.1': + optional: true + '@esbuild/netbsd-x64@0.27.7': optional: true + '@esbuild/netbsd-x64@0.28.1': + optional: true + '@esbuild/openbsd-arm64@0.27.7': optional: true + '@esbuild/openbsd-arm64@0.28.1': + optional: true + '@esbuild/openbsd-x64@0.27.7': optional: true + '@esbuild/openbsd-x64@0.28.1': + optional: true + '@esbuild/openharmony-arm64@0.27.7': optional: true + '@esbuild/openharmony-arm64@0.28.1': + optional: true + '@esbuild/sunos-x64@0.27.7': optional: true + '@esbuild/sunos-x64@0.28.1': + optional: true + '@esbuild/win32-arm64@0.27.7': optional: true + '@esbuild/win32-arm64@0.28.1': + optional: true + '@esbuild/win32-ia32@0.27.7': optional: true + 
'@esbuild/win32-ia32@0.28.1': + optional: true + '@esbuild/win32-x64@0.27.7': optional: true + '@esbuild/win32-x64@0.28.1': + optional: true + '@gerrit0/mini-shiki@3.23.0': dependencies: '@shikijs/engine-oniguruma': 3.23.0 @@ -1532,13 +1779,13 @@ snapshots: chai: 5.3.3 tinyrainbow: 2.0.0 - '@vitest/mocker@3.2.4(vite@7.3.2(@types/node@25.9.3)(yaml@2.9.0))': + '@vitest/mocker@3.2.4(vite@7.3.2(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0))': dependencies: '@vitest/spy': 3.2.4 estree-walker: 3.0.3 magic-string: 0.30.21 optionalDependencies: - vite: 7.3.2(@types/node@25.9.3)(yaml@2.9.0) + vite: 7.3.2(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0) '@vitest/pretty-format@3.2.4': dependencies: @@ -1655,6 +1902,35 @@ snapshots: '@esbuild/win32-ia32': 0.27.7 '@esbuild/win32-x64': 0.27.7 + esbuild@0.28.1: + optionalDependencies: + '@esbuild/aix-ppc64': 0.28.1 + '@esbuild/android-arm': 0.28.1 + '@esbuild/android-arm64': 0.28.1 + '@esbuild/android-x64': 0.28.1 + '@esbuild/darwin-arm64': 0.28.1 + '@esbuild/darwin-x64': 0.28.1 + '@esbuild/freebsd-arm64': 0.28.1 + '@esbuild/freebsd-x64': 0.28.1 + '@esbuild/linux-arm': 0.28.1 + '@esbuild/linux-arm64': 0.28.1 + '@esbuild/linux-ia32': 0.28.1 + '@esbuild/linux-loong64': 0.28.1 + '@esbuild/linux-mips64el': 0.28.1 + '@esbuild/linux-ppc64': 0.28.1 + '@esbuild/linux-riscv64': 0.28.1 + '@esbuild/linux-s390x': 0.28.1 + '@esbuild/linux-x64': 0.28.1 + '@esbuild/netbsd-arm64': 0.28.1 + '@esbuild/netbsd-x64': 0.28.1 + '@esbuild/openbsd-arm64': 0.28.1 + '@esbuild/openbsd-x64': 0.28.1 + '@esbuild/openharmony-arm64': 0.28.1 + '@esbuild/sunos-x64': 0.28.1 + '@esbuild/win32-arm64': 0.28.1 + '@esbuild/win32-ia32': 0.28.1 + '@esbuild/win32-x64': 0.28.1 + estree-walker@3.0.3: dependencies: '@types/estree': 1.0.8 @@ -1784,11 +2060,12 @@ snapshots: optionalDependencies: fsevents: 2.3.2 - postcss-load-config@6.0.1(postcss@8.5.13)(yaml@2.9.0): + postcss-load-config@6.0.1(postcss@8.5.13)(tsx@4.22.4)(yaml@2.9.0): dependencies: lilconfig: 3.1.3 
optionalDependencies: postcss: 8.5.13 + tsx: 4.22.4 yaml: 2.9.0 postcss@8.5.13: @@ -1885,7 +2162,7 @@ snapshots: ts-interface-checker@0.1.13: {} - tsup@8.5.1(postcss@8.5.13)(typescript@5.9.3)(yaml@2.9.0): + tsup@8.5.1(postcss@8.5.13)(tsx@4.22.4)(typescript@5.9.3)(yaml@2.9.0): dependencies: bundle-require: 5.1.0(esbuild@0.27.7) cac: 6.7.14 @@ -1896,7 +2173,7 @@ snapshots: fix-dts-default-cjs-exports: 1.0.1 joycon: 3.1.1 picocolors: 1.1.1 - postcss-load-config: 6.0.1(postcss@8.5.13)(yaml@2.9.0) + postcss-load-config: 6.0.1(postcss@8.5.13)(tsx@4.22.4)(yaml@2.9.0) resolve-from: 5.0.0 rollup: 4.60.2 source-map: 0.7.6 @@ -1913,6 +2190,12 @@ snapshots: - tsx - yaml + tsx@4.22.4: + dependencies: + esbuild: 0.28.1 + optionalDependencies: + fsevents: 2.3.3 + typedoc-plugin-markdown@4.12.0(typedoc@0.28.19(typescript@5.9.3)): dependencies: typedoc: 0.28.19(typescript@5.9.3) @@ -1951,13 +2234,13 @@ snapshots: - utf-8-validate - zod - vite-node@3.2.4(@types/node@25.9.3)(yaml@2.9.0): + vite-node@3.2.4(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0): dependencies: cac: 6.7.14 debug: 4.4.3 es-module-lexer: 1.7.0 pathe: 2.0.3 - vite: 7.3.2(@types/node@25.9.3)(yaml@2.9.0) + vite: 7.3.2(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0) transitivePeerDependencies: - '@types/node' - jiti @@ -1972,7 +2255,7 @@ snapshots: - tsx - yaml - vite@7.3.2(@types/node@25.9.3)(yaml@2.9.0): + vite@7.3.2(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0): dependencies: esbuild: 0.27.7 fdir: 6.5.0(picomatch@4.0.4) @@ -1983,13 +2266,14 @@ snapshots: optionalDependencies: '@types/node': 25.9.3 fsevents: 2.3.3 + tsx: 4.22.4 yaml: 2.9.0 - vitest@3.2.4(@types/node@25.9.3)(yaml@2.9.0): + vitest@3.2.4(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0): dependencies: '@types/chai': 5.2.3 '@vitest/expect': 3.2.4 - '@vitest/mocker': 3.2.4(vite@7.3.2(@types/node@25.9.3)(yaml@2.9.0)) + '@vitest/mocker': 3.2.4(vite@7.3.2(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0)) '@vitest/pretty-format': 3.2.4 '@vitest/runner': 3.2.4 
'@vitest/snapshot': 3.2.4 @@ -2007,8 +2291,8 @@ snapshots: tinyglobby: 0.2.16 tinypool: 1.1.1 tinyrainbow: 2.0.0 - vite: 7.3.2(@types/node@25.9.3)(yaml@2.9.0) - vite-node: 3.2.4(@types/node@25.9.3)(yaml@2.9.0) + vite: 7.3.2(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0) + vite-node: 3.2.4(@types/node@25.9.3)(tsx@4.22.4)(yaml@2.9.0) why-is-node-running: 2.3.0 optionalDependencies: '@types/node': 25.9.3
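Reviewer note on `examples/coding-benchmark/stats.ts`: the leaderboard leans on the `wilson` and `benjaminiHochberg` primitives from `@tangle-network/agent-eval`, whose internals are not part of this diff. The sketch below shows the textbook forms of both techniques, which may help when sanity-checking the rendered numbers. It is not the library's code: `wilsonInterval` and `bhSignificant` are hypothetical names, the fixed `z = 1.96` is an assumption, and the real primitives may handle ties, empty input, and confidence levels differently.

```typescript
// Illustrative sketch only: NOT the @tangle-network/agent-eval implementation.
// `wilsonInterval` and `bhSignificant` are hypothetical names used for this note.

interface Interval {
  lower: number
  upper: number
}

/** Wilson score interval for `passes` successes in `n` trials (z = 1.96 for roughly 95%). */
function wilsonInterval(passes: number, n: number, z = 1.96): Interval {
  if (n === 0) return { lower: 0, upper: 1 } // no data: the whole unit interval
  const pHat = passes / n
  const z2 = z * z
  const denom = 1 + z2 / n
  const center = (pHat + z2 / (2 * n)) / denom
  const half = (z * Math.sqrt((pHat * (1 - pHat)) / n + z2 / (4 * n * n))) / denom
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) }
}

/** Benjamini-Hochberg step-up: flags (in input order) the p-values rejected at FDR `alpha`. */
function bhSignificant(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length
  // Sort ascending while remembering each p-value's original position.
  const order = pValues.map((p, i) => ({ p, i })).sort((x, y) => x.p - y.p)
  // Find the LARGEST rank k with p_(k) <= (k / m) * alpha.
  let cutoff = -1
  order.forEach(({ p }, rank) => {
    if (p <= ((rank + 1) / m) * alpha) cutoff = rank
  })
  // Everything at or below that rank is rejected, even if it missed its own threshold.
  const flags = new Array<boolean>(m).fill(false)
  for (const { i } of order.slice(0, cutoff + 1)) flags[i] = true
  return flags
}

console.log(wilsonInterval(3, 3))
console.log(bhSignificant([0.01, 0.04, 0.03, 0.2]))
```

With three passes out of three, the Wilson lower bound is still only about 0.44, which is the same small-n caution that `powerFloor` encodes for the pairwise tests. The step-up rule also explains why `significant` is computed over all pairs at once rather than per pair: a p-value's fate depends on its rank among the others.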