docs(examples): simplify + sharpen the example set + clean up the coding benchmark #374
Merged
…ing benchmark

Real DX wins only: every change cuts duplication or restatement prose, with no behavior change (anti-cheat, firewall, power-floor guard, and gates intact).

examples set:
- README: add a 3-example Quickstart (driver-loop / supervise / improve) above the full tier table so the "run these first" path leads.
- strategy-suite + strategy-evolution: extract the verbatim ~60-line counterEnv fixture into strategy-suite/counter-env.ts; each example now shows only its distinct concept (compare strategies vs the holdout gate).
- fleet-delegation: rename UPPERCASE module-globals (SIBLING_ENV/FLEET_ENV -> siblingEnv/fleetEnv) to model the documented publish-gotcha house rule.
- self-improving-loop + supervisor-loop/shared: stop re-teaching the shot/round vocabulary; point at driver-loop (the canonical home).
- supervisor-loop/README: lead with the one-knob sentence + a 3-row table, demote the per-backend exposition under a Details section.
- agents-of-all-shapes/README: flag the Python agno snippet as illustrative: not run by run.ts, not typechecked.

coding benchmark (9 files, 2067 -> 1978 LOC, file count unchanged):
- delete the dead wilcoxon / PairedTest path in stats.ts (no caller ever passed 'wilcoxon'); always run the paired t-test.
- rename RunArtifactSummary -> BenchmarkSummary (it counted records/leaderboard; it was never the per-cell RunArtifact the judges score).
- de-dup the firewall / scoring-order / offline-degeneracy prose to one canonical home each (dispatch.ts banner, the README tables); shorten the others to a pointer.
- profiles.ts: drop restatement comments, keep the two non-obvious facts (metadata.harness selector, snapshot-date requirement).
- note that the score() field-order is searchScore-only in this example.
tangletools (Contributor) approved these changes on Jun 24, 2026 and left a comment:
✅ Auto-approved PR — 1bdbe4ec
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T13:21:26Z
Applies the real DX wins from an example-set + coding-benchmark review. Every change cuts duplication or restatement prose; no behavior change. The load-bearing properties of the coding benchmark are preserved and execution-verified: the held-out test-execution anti-cheat, the firewall (held-out tests never in the box during the turn), the stats power-floor guard, and the catalog/docs gate.
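To make the stats power-floor guard named above concrete, here is a minimal sketch of a paired t-test that withholds a SIGNIFICANT verdict when there are too few pairs. It is not the benchmark's `stats.ts`: the function name `pairedTTest`, the `minPairs` floor of 6, the verdict strings, and the caller-supplied critical value are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names, the minPairs floor, and the verdict strings are
// assumptions for illustration, not the benchmark's actual stats.ts API.
type Verdict = "SIGNIFICANT" | "not significant" | "underpowered";

interface PairedResult {
  n: number;        // number of paired observations
  meanDiff: number; // mean of a[i] - b[i]
  t: number;        // paired t statistic
  df: number;       // degrees of freedom (n - 1)
  verdict: Verdict;
}

function pairedTTest(
  a: number[],
  b: number[],
  criticalT: number, // two-sided critical value for df = n - 1, chosen by the caller
  minPairs = 6,      // assumed power floor
): PairedResult {
  if (a.length !== b.length) throw new Error("paired samples must have equal length");
  const n = a.length;
  if (n < 2) throw new Error("need at least two pairs");

  // Work on per-pair differences; that is what makes the test "paired".
  const diffs = a.map((x, i) => x - b[i]!);
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / n;
  const variance = diffs.reduce((s, d) => s + (d - meanDiff) ** 2, 0) / (n - 1);

  // Guard the zero-variance case so t is never NaN.
  let t: number;
  if (variance === 0) {
    t = meanDiff === 0 ? 0 : Math.sign(meanDiff) * Infinity;
  } else {
    t = meanDiff / Math.sqrt(variance / n);
  }

  // Power floor: below minPairs the verdict is never SIGNIFICANT,
  // however large the t statistic is.
  let verdict: Verdict;
  if (n < minPairs) {
    verdict = "underpowered";
  } else {
    verdict = Math.abs(t) > criticalT ? "SIGNIFICANT" : "not significant";
  }

  return { n, meanDiff, t, df: n - 1, verdict };
}

const after = [5, 6, 7, 8];
const before = [4, 4, 6, 6];
// 3.182 is the two-sided 5% critical value for df = 3.
console.log(pairedTTest(after, before, 3.182).verdict);    // prints "underpowered"
console.log(pairedTTest(after, before, 3.182, 4).verdict); // prints "SIGNIFICANT"
```

The two calls use the same data and differ only in the floor, which is the point of the guard: a large t statistic on four pairs is reported as underpowered, not as a win.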
examples set
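- README: a 3-example Quickstart (`driver-loop` / `supervise` / `improve`) now leads, above the full 6-tier table, so the "run these first" path is no longer buried.
- strategy-suite + strategy-evolution: the verbatim ~60-line `counterEnv` fixture (token-bucket counter `Environment`) is extracted to `strategy-suite/counter-env.ts` and imported by both; each example now shows only its distinct concept (compare strategies vs. the holdout gate).
- fleet-delegation: the UPPERCASE module-globals are renamed (`SIBLING_ENV`/`FLEET_ENV` → `siblingEnv`/`fleetEnv`), which models the documented publish-gotcha house rule instead of teaching the banned pattern.
- self-improving-loop + supervisor-loop/shared: stop re-teaching the shot/round vocabulary and point at `driver-loop` (the canonical home). The "do NOT mistake this scripted brain for the pattern" warning is kept.
- supervisor-loop/README: leads with the one-knob sentence + a 3-row table; the per-backend exposition is demoted under a `## Details` section.
- agents-of-all-shapes/README: the Python agno snippet is flagged as illustrative: not run by `run.ts`, not typechecked.

coding benchmark: 9 files, 2067 → 1978 LOC (file count unchanged)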
- Deleted the dead `wilcoxon`/`PairedTest` path in `stats.ts` (no caller or CLI flag ever passed `'wilcoxon'`); always run the paired t-test. Dropped the dead import + the README/comment mentions.
- Renamed `RunArtifactSummary` → `BenchmarkSummary`: it counts `{records, leaderboard}`, unrelated to the per-cell `RunArtifact` the judges score.
- De-duplicated the firewall / scoring-order / offline-degeneracy prose to one canonical home each (the `dispatch.ts` firewall banner, the README tables); the other copies become one-line pointers.
- `profiles.ts`: drop the restatement comments (`// NO prompt, NO resources` etc.), keep the two non-obvious facts (`metadata.harness` carries the selector; the snapshot-date requirement).
- Noted that `score()`'s field order is `searchScore`-only in this example.

skipped (with reason)
- `improve.ts` runs `main()` as an import side-effect, so importing it would execute the improve pipeline twice when running intelligence-recommend. Sharing would require guarding `main()` behind `import.meta.url` + exporting the pieces, which is more churn than the ~26 trivial fixture lines it would save, and it costs each example its self-containment. Not worth it.

verify (all green, offline, no creds)
- `pnpm run build` ✓ · `pnpm run typecheck` (src + examples) ✓ · `pnpm run lint` (328 files) ✓
- `pnpm run docs:check` (catalog regenerate + `git diff --exit-code docs/api` + freshness) ✓ (no drift)
- `pnpm tsx examples/coding-benchmark/benchmark.ts` ✓ (power-floor NOTE + SIGNIFICANT-suppression intact)

DO NOT MERGE: operator review.
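For reference, the `import.meta.url` guard weighed in the skipped item could look roughly like the sketch below. It is not code from this PR: `isEntryPoint`, `buildFixture`, and the body of `main()` are hypothetical stand-ins, and the comparison assumes the file is run as an ES module by a runner such as `tsx`.

```typescript
import { fileURLToPath } from "node:url";
import { resolve } from "node:path";

// Hypothetical helper: true only when this module is the script Node was
// asked to run, false when it is merely imported by another module.
export function isEntryPoint(
  moduleUrl: string,
  argv1: string | undefined = process.argv[1],
): boolean {
  if (!argv1) return false;
  return fileURLToPath(moduleUrl) === resolve(argv1);
}

// Stand-in for the pieces another example would import (assumed name).
export function buildFixture(): { name: string; steps: number } {
  return { name: "demo-fixture", steps: 3 };
}

// Stand-in for the pipeline entry point.
export async function main(): Promise<void> {
  const fixture = buildFixture();
  console.log(`running ${fixture.name} for ${fixture.steps} steps`);
}

// The guard: importing this file no longer triggers main().
if (isEntryPoint(import.meta.url)) {
  main().catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

With this shape, a second example could import `buildFixture` without running the pipeline. The exact-path comparison is a simplification; symlinked paths or invocations that omit the file extension would need extra normalization.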