
docs(examples): simplify + sharpen the example set + clean up the coding benchmark#374

Merged: drewstone merged 1 commit into main from examples-simplify on Jun 24, 2026
@drewstone

Applies the real DX wins from an example-set + coding-benchmark review. Every change cuts duplication or restatement prose; no behavior change. The load-bearing properties of the coding benchmark are preserved and execution-verified: the held-out test-execution anti-cheat, the firewall (held-out tests never in the box during the turn), the stats power-floor guard, and the catalog/docs gate.

examples set

  • README: a 3-example Quickstart (driver-loop / supervise / improve) now leads, above the full 6-tier table — the "run these first" path is no longer buried.
  • strategy-suite + strategy-evolution: the verbatim ~60-line counterEnv fixture (token-bucket counter Environment) is extracted to strategy-suite/counter-env.ts and imported by both; each example now shows only its distinct concept (compare strategies vs. the holdout gate).
  • fleet-delegation: renamed UPPERCASE module-globals (SIBLING_ENV/FLEET_ENV → siblingEnv/fleetEnv) to model the documented publish-gotcha house rule instead of teaching the banned pattern.
  • self-improving-loop + supervisor-loop/shared: stop re-teaching the shot/round vocabulary in 3 places; point at driver-loop (the canonical home). The "do NOT mistake this scripted brain for the pattern" warning is kept.
  • supervisor-loop/README: leads with the one-knob sentence + a 3-row table (runner → backend → command); the per-backend exposition is demoted under a ## Details section.
  • agents-of-all-shapes/README: the Python agno snippet is now flagged as illustrative — not run by run.ts, not typechecked.

coding benchmark — 9 files, 2067 → 1978 LOC (file count unchanged)

  • delete the dead wilcoxon/PairedTest path in stats.ts (no caller or CLI flag ever passed 'wilcoxon'); always run the paired t-test. Dropped the dead import + the README/comment mentions.
  • rename RunArtifactSummary → BenchmarkSummary: it counts {records, leaderboard} and is unrelated to the per-cell RunArtifact the judges score.
  • de-dup the firewall / scoring-order / offline-degeneracy prose to one canonical home each (the dispatch.ts firewall banner, the README tables); the other copies become one-line pointers.
  • profiles.ts: drop restatement comments (// NO prompt, NO resources etc.), keep the two non-obvious facts (metadata.harness carries the selector; the snapshot-date requirement).
  • note that score()'s field order is searchScore-only in this example.
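
For readers unfamiliar with the test that stats.ts now always runs, here is a minimal sketch of a paired t-test statistic. It is not the code in stats.ts: the function name `pairedTTest` and the result shape are hypothetical, and it stops at the t statistic and degrees of freedom rather than computing a p-value.

```typescript
// Hypothetical result shape, for illustration only.
interface PairedTResult {
  meanDiff: number; // mean of the per-pair differences a[i] - b[i]
  t: number;        // t statistic
  df: number;       // degrees of freedom (pairs - 1)
}

// Paired t-test statistic over two equally long samples, where a[i] and b[i]
// are measured on the same unit (e.g. the same scenario).
function pairedTTest(a: number[], b: number[]): PairedTResult {
  if (a.length !== b.length) {
    throw new Error("paired samples must have equal length");
  }
  const n = a.length;
  if (n < 2) {
    throw new Error("need at least two pairs");
  }
  const diffs = a.map((x, i) => x - b[i]);
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / n;
  // Sample variance of the differences (n - 1 in the denominator).
  const variance =
    diffs.reduce((sum, d) => sum + (d - meanDiff) ** 2, 0) / (n - 1);
  const standardError = Math.sqrt(variance / n);
  // If every difference is identical, standardError is 0 and t is not finite.
  return { meanDiff, t: meanDiff / standardError, df: n - 1 };
}

const result = pairedTTest([1, 2, 3, 4], [0, 1, 1, 2]);
console.log(result.meanDiff, result.df, result.t.toFixed(3));
```

The test works on per-pair differences, so the sample size is the number of pairs, not the number of raw observations.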

skipped (with reason)

  • intelligence-recommend importing fixtures from improve.ts: improve.ts runs main() as an import side effect, so importing it would execute the improve pipeline twice when running intelligence-recommend. Sharing would require guarding main() behind an import.meta.url check and exporting the pieces, which is more churn than the ~26 trivial fixture lines it would save, and it would cost each example its self-containedness. Not worth it.
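
For reference, the guard this bullet decided against would look roughly like the sketch below. It is not taken from improve.ts: `isEntrypoint`, `fixtures`, and the body of `main()` are hypothetical stand-ins, and the check assumes the module runs as an ES module under Node.

```typescript
import { pathToFileURL } from "node:url";

// Hypothetical fixture that a second example could import without side effects.
export const fixtures = { scenarios: ["a", "b"] };

// True when the module at `moduleUrl` is the script Node was started with.
// `argv1` is process.argv[1], which Node sets to the entry script's path.
export function isEntrypoint(
  moduleUrl: string,
  argv1: string | undefined,
): boolean {
  return argv1 !== undefined && moduleUrl === pathToFileURL(argv1).href;
}

// Stand-in for the example's pipeline.
export async function main(): Promise<void> {
  console.log(`running with ${fixtures.scenarios.length} scenarios`);
}

// At the bottom of the module (ES modules only), main() runs only when the
// file is executed directly, not when another example imports it:
//
//   if (isEntrypoint(import.meta.url, process.argv[1])) {
//     void main();
//   }
```

This is a plain string comparison, so it does not account for symlinked entry paths. It also shows the churn the bullet describes: each shared piece needs an export, and the example stops being a single self-contained script.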

verify (all green, offline, no creds)

  • pnpm run build ✓ · pnpm run typecheck (src + examples) ✓ · pnpm run lint (328 files) ✓
  • pnpm run docs:check (catalog regenerate + git diff --exit-code docs/api + freshness) ✓ — no drift
  • coding-benchmark + supervisor-loop example tests: 10/10 pass
  • offline pnpm tsx examples/coding-benchmark/benchmark.ts ✓ — power-floor NOTE + SIGNIFICANT-suppression intact
  • anti-cheat gap intact (execution-verified): rate-limiter cheat held-out 0.50 → composite 0.590; real held-out 1.00 → composite 0.940; gap +0.350
  • reps don't fake n (execution-verified): reps=1 vs reps=3 leave the leaderboard CI width unchanged (0.5000/0.3000), n stays = distinct scenarios
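
The "reps don't fake n" property can be sketched as follows. This is an illustration rather than the benchmark's code: `RunRecord` and `summarizeByScenario` are hypothetical, and the 1.96 multiplier is a normal-approximation assumption rather than whatever interval stats.ts computes. The idea is to average the reps within each scenario first, so the sample size stays at the number of distinct scenarios.

```typescript
// Hypothetical record shape: one row per (scenario, rep) run.
interface RunRecord {
  scenario: string;
  score: number;
}

interface Summary {
  n: number;         // number of distinct scenarios, not number of rows
  mean: number;      // mean of the per-scenario means
  halfWidth: number; // CI half-width under a normal approximation (assumption)
}

function summarizeByScenario(records: RunRecord[]): Summary {
  // Group scores by scenario so repeated runs collapse into one observation.
  const byScenario = new Map<string, number[]>();
  for (const r of records) {
    const scores = byScenario.get(r.scenario) ?? [];
    scores.push(r.score);
    byScenario.set(r.scenario, scores);
  }
  const means = Array.from(
    byScenario.values(),
    (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length,
  );
  const n = means.length;
  if (n < 2) {
    throw new Error("need at least two distinct scenarios");
  }
  const mean = means.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    means.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  return { n, mean, halfWidth: 1.96 * Math.sqrt(variance / n) };
}

// Repeating every run three times adds rows but no new scenarios.
const once: RunRecord[] = [
  { scenario: "a", score: 1 },
  { scenario: "b", score: 0 },
];
const thrice: RunRecord[] = [...once, ...once, ...once];

const single = summarizeByScenario(once);
const repeated = summarizeByScenario(thrice);
console.log(single.n === repeated.n, single.halfWidth === repeated.halfWidth);
```

If the rows were pooled instead, n would triple and the interval would narrow without any new evidence, which is the failure mode the check above guards against.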

DO NOT MERGE — operator review.

…ing benchmark

Real DX wins only — every change cuts duplication or restatement prose, with
no behavior change (anti-cheat, firewall, power-floor guard, and gates intact).

examples set:
- README: add a 3-example Quickstart (driver-loop / supervise / improve) above
  the full tier table so the "run these first" path leads.
- strategy-suite + strategy-evolution: extract the verbatim ~60-line counterEnv
  fixture into strategy-suite/counter-env.ts; each example now shows only its
  distinct concept (compare strategies vs the holdout gate).
- fleet-delegation: rename UPPERCASE module-globals (SIBLING_ENV/FLEET_ENV ->
  siblingEnv/fleetEnv) to model the documented publish-gotcha house rule.
- self-improving-loop + supervisor-loop/shared: stop re-teaching the shot/round
  vocabulary; point at driver-loop (the canonical home).
- supervisor-loop/README: lead with the one-knob sentence + a 3-row table, demote
  the per-backend exposition under a Details section.
- agents-of-all-shapes/README: flag the Python agno snippet as illustrative —
  not run by run.ts, not typechecked.

coding benchmark (9 files, 2067 -> 1978 LOC, file count unchanged):
- delete the dead wilcoxon / PairedTest path in stats.ts (no caller ever passed
  'wilcoxon'); always run the paired t-test.
- rename RunArtifactSummary -> BenchmarkSummary (it counted records/leaderboard;
  it never was the per-cell RunArtifact the judges score).
- de-dup the firewall / scoring-order / offline-degeneracy prose to one canonical
  home each (dispatch.ts banner, the README tables); shorten the others to a pointer.
- profiles.ts: drop restatement comments, keep the two non-obvious facts
  (metadata.harness selector, snapshot-date requirement).
- note that the score() field-order is searchScore-only in this example.

@tangletools tangletools left a comment

✅ Auto-approved PR — 1bdbe4ec

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T13:21:26Z

@drewstone drewstone merged commit 7982c2c into main Jun 24, 2026
1 check passed