docs(examples): simplify + sharpen the example set + clean up the coding benchmark by drewstone · Pull Request #374 · tangle-network/agent-runtime

drewstone · 2026-06-24T13:21:18Z

Applies the real DX wins from an example-set + coding-benchmark review. Every change cuts duplication or restatement prose; no behavior change. The load-bearing properties of the coding benchmark are preserved and execution-verified: the held-out test-execution anti-cheat, the firewall (held-out tests never in the box during the turn), the stats power-floor guard, and the catalog/docs gate.

examples set

README: a 3-example Quickstart (driver-loop / supervise / improve) now leads, above the full 6-tier table — the "run these first" path is no longer buried.
strategy-suite + strategy-evolution: the verbatim ~60-line counterEnv fixture (token-bucket counter Environment) is extracted to strategy-suite/counter-env.ts and imported by both; each example now shows only its distinct concept (compare strategies vs. the holdout gate).
fleet-delegation: renamed UPPERCASE module-globals (SIBLING_ENV/FLEET_ENV → siblingEnv/fleetEnv) — models the documented publish-gotcha house rule instead of teaching the banned pattern.
self-improving-loop + supervisor-loop/shared: stop re-teaching the shot/round vocabulary in 3 places; point at driver-loop (the canonical home). The "do NOT mistake this scripted brain for the pattern" warning is kept.
supervisor-loop/README: leads with the one-knob sentence + a 3-row table (runner → backend → command); the per-backend exposition is demoted under a ## Details section.
agents-of-all-shapes/README: the Python agno snippet is now flagged as illustrative — not run by run.ts, not typechecked.

coding benchmark — 9 files, 2067 → 1978 LOC (file count unchanged)

delete the dead wilcoxon/PairedTest path in stats.ts (no caller or CLI flag ever passed 'wilcoxon'); always run the paired t-test. Dropped the dead import + the README/comment mentions.
rename RunArtifactSummary → BenchmarkSummary — it counts {records, leaderboard}, unrelated to the per-cell RunArtifact the judges score.
de-dup the firewall / scoring-order / offline-degeneracy prose to one canonical home each (the dispatch.ts firewall banner, the README tables); the other copies become one-line pointers.
profiles.ts: drop restatement comments (// NO prompt, NO resources etc.), keep the two non-obvious facts (metadata.harness carries the selector; the snapshot-date requirement).
note that score()'s field order is searchScore-only in this example.
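
For readers unfamiliar with the test that stats.ts now always runs, here is a minimal sketch of a paired t-statistic. It is not the repository's code: the function name `pairedTStatistic` and its return shape are assumptions made for illustration, and it only computes the statistic and degrees of freedom, not a p-value.

```typescript
// Hypothetical helper for illustration; not the repository's stats.ts.
// Computes the paired t-statistic for two equal-length samples where
// a[i] and b[i] were measured on the same scenario.
function pairedTStatistic(
  a: readonly number[],
  b: readonly number[],
): { t: number; df: number } {
  if (a.length !== b.length) {
    throw new Error("paired samples must have equal length");
  }
  const n = a.length;
  if (n < 2) {
    throw new Error("need at least two pairs");
  }
  // Work on the per-pair differences; pairing removes between-scenario variance.
  const diffs = a.map((x, i) => x - (b[i] as number));
  const mean = diffs.reduce((sum, d) => sum + d, 0) / n;
  // Sample variance of the differences (n - 1 in the denominator).
  const variance =
    diffs.reduce((sum, d) => sum + (d - mean) ** 2, 0) / (n - 1);
  const standardError = Math.sqrt(variance / n);
  // If every difference is identical, standardError is 0 and t is not finite.
  return { t: mean / standardError, df: n - 1 };
}
```

The statistic is the mean difference divided by its standard error, with n - 1 degrees of freedom, where n is the number of pairs.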

skipped (with reason)

intelligence-recommend importing fixtures from improve.ts — improve.ts runs main() as an import side-effect, so importing it would execute the improve pipeline twice when running intelligence-recommend; sharing would require guarding main() behind import.meta.url + exporting the pieces, which is more churn than the ~26 trivial fixture lines it would save, and it costs each example's self-containedness. Not worth it.
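
For context, the guard this PR decided against adding would look roughly like the sketch below. The helper name `isEntryPoint` is hypothetical; the comparison of `import.meta.url` against `pathToFileURL(process.argv[1]).href` is the common Node ESM idiom, and it assumes the runner passes the script's absolute path in `process.argv[1]`.

```typescript
import { pathToFileURL } from "node:url";

// Hypothetical helper for illustration: returns true when the module at
// `moduleUrl` is the script Node was started with.
function isEntryPoint(moduleUrl: string, argv1: string | undefined): boolean {
  // argv1 is undefined in contexts with no script path (for example a REPL).
  return argv1 !== undefined && moduleUrl === pathToFileURL(argv1).href;
}

// Sketch of how improve.ts could use it (not part of this PR):
//
//   export const fixtures = { /* the ~26 fixture lines */ };
//   if (isEntryPoint(import.meta.url, process.argv[1])) {
//     void main();
//   }
//
// With the guard, importing the module would no longer run main() as a
// side effect; running the file directly would still run it.
```

This is the extra export-and-guard churn the PR weighs against the small amount of duplicated fixture code.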

verify (all green, offline, no creds)

pnpm run build ✓ · pnpm run typecheck (src + examples) ✓ · pnpm run lint (328 files) ✓
pnpm run docs:check (catalog regenerate + git diff --exit-code docs/api + freshness) ✓ — no drift
coding-benchmark + supervisor-loop example tests: 10/10 pass
offline pnpm tsx examples/coding-benchmark/benchmark.ts ✓ — power-floor NOTE + SIGNIFICANT-suppression intact
anti-cheat gap intact (execution-verified): rate-limiter cheat held-out 0.50 → composite 0.590; real held-out 1.00 → composite 0.940; gap +0.350
reps don't fake n (execution-verified): reps=1 vs reps=3 leave the leaderboard CI width unchanged (0.5000/0.3000), n stays = distinct scenarios
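
The "reps don't fake n" property can be illustrated with a small sketch: repetitions are averaged within each scenario first, so the sample size that reaches the confidence interval is the number of distinct scenarios. The record shape `RepRecord` and the function `perScenarioMeans` are assumptions for illustration, not the benchmark's actual types.

```typescript
// Hypothetical record shape for illustration; not the benchmark's types.
interface RepRecord {
  scenario: string;
  rep: number;
  score: number;
}

// Collapse repetitions to one mean score per scenario, so downstream
// statistics see n = number of distinct scenarios, not scenarios * reps.
function perScenarioMeans(records: readonly RepRecord[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of records) {
    const acc = sums.get(r.scenario) ?? { total: 0, count: 0 };
    acc.total += r.score;
    acc.count += 1;
    sums.set(r.scenario, acc);
  }
  const means = new Map<string, number>();
  sums.forEach((acc, scenario) => {
    means.set(scenario, acc.total / acc.count);
  });
  return means;
}
```

Under this scheme, raising reps from 1 to 3 changes how each scenario's score is estimated but leaves the number of observations in the interval unchanged.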

DO NOT MERGE — operator review.

…ing benchmark

Real DX wins only — every change cuts duplication or restatement prose, with no behavior change (anti-cheat, firewall, power-floor guard, and gates intact).

examples set:
- README: add a 3-example Quickstart (driver-loop / supervise / improve) above the full tier table so the "run these first" path leads.
- strategy-suite + strategy-evolution: extract the verbatim ~60-line counterEnv fixture into strategy-suite/counter-env.ts; each example now shows only its distinct concept (compare strategies vs the holdout gate).
- fleet-delegation: rename UPPERCASE module-globals (SIBLING_ENV/FLEET_ENV -> siblingEnv/fleetEnv) to model the documented publish-gotcha house rule.
- self-improving-loop + supervisor-loop/shared: stop re-teaching the shot/round vocabulary; point at driver-loop (the canonical home).
- supervisor-loop/README: lead with the one-knob sentence + a 3-row table, demote the per-backend exposition under a Details section.
- agents-of-all-shapes/README: flag the Python agno snippet as illustrative — not run by run.ts, not typechecked.

coding benchmark (9 files, 2067 -> 1978 LOC, file count unchanged):
- delete the dead wilcoxon / PairedTest path in stats.ts (no caller ever passed 'wilcoxon'); always run the paired t-test.
- rename RunArtifactSummary -> BenchmarkSummary (it counted records/leaderboard; it never was the per-cell RunArtifact the judges score).
- de-dup the firewall / scoring-order / offline-degeneracy prose to one canonical home each (dispatch.ts banner, the README tables); shorten the others to a pointer.
- profiles.ts: drop restatement comments, keep the two non-obvious facts (metadata.harness selector, snapshot-date requirement).
- note that the score() field-order is searchScore-only in this example.

tangletools

✅ Auto-approved PR — `1bdbe4ec`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T13:21:26Z}

tangletools approved these changes Jun 24, 2026

drewstone merged commit 7982c2c into main Jun 24, 2026
1 check passed

drewstone merged 1 commit into main from examples-simplify

2 participants
