refactor(examples): coding-benchmark consumes the agent-eval hidden-criteria grader port by drewstone · Pull Request #381 · tangle-network/agent-runtime

drewstone · 2026-06-24T18:31:27Z

What

Bumps the @tangle-network/agent-eval devDependency to >=0.100.0 and rewires examples/coding-benchmark off its hand-rolled firewall + blend onto the newly published domain-agnostic hidden-criteria grading port (hidden-criteria-grading.ts). This both dedups the example and proves the port on a real benchmark.

The port, consumed (PART A)

| Was hand-rolled | Now the substrate port |
| --- | --- |
| firewall = a prose banner in `dispatch.ts` (zero enforcement) | `routeFields` (field routing by destination) + `assertNoHiddenLeak` (throws on a leak), re-asserted at grading via `gradeOnHidden` |
| `runHeldout` inline pass-rate + notes | the `node --test` executor is the domain `HiddenCriteriaGrader` (`nodeTestGrader`), normalized by `hiddenGrade` (honest-zero on no-run) |
| `composeScore` arithmetic | `blendHeldout` (renormalized weights, clamped inputs) |
| local `blendHeldout` `JudgeConfig` wrapper | `withHeldoutBlend` |
| local `HeldoutResult` interface | the substrate `HiddenGradeResult` |
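To make the firewall row concrete, here is a minimal sketch of routing fields by destination and throwing when a hidden value reaches the agent. Every name in it (`RoutedFieldSketch`, `buildAgentContextSketch`, `assertNoLeakSketch`) is hypothetical and is not the agent-eval API. The verbatim substring check and the assumption that `develop-against` fields are agent-readable are simplifications made for illustration.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the agent-eval port.
type Destination = "agent-visible" | "develop-against" | "grading-only" | "judge-only";

interface RoutedFieldSketch {
  name: string;
  destination: Destination;
  value: string;
}

// Destinations whose values must never reach the agent (assumption based on the PR text).
const HIDDEN: ReadonlySet<Destination> = new Set<Destination>(["grading-only", "judge-only"]);

// Assemble the agent context from the fields the agent is allowed to read.
function buildAgentContextSketch(fields: RoutedFieldSketch[]): string {
  return fields
    .filter((f) => !HIDDEN.has(f.destination))
    .map((f) => f.value)
    .join("\n");
}

// Throw if any hidden field's value appears verbatim in the agent context.
function assertNoLeakSketch(fields: RoutedFieldSketch[], agentContext: string): void {
  for (const f of fields) {
    if (HIDDEN.has(f.destination) && f.value.length > 0 && agentContext.includes(f.value)) {
      throw new Error(`hidden field "${f.name}" leaked into the agent context`);
    }
  }
}

// Illustrative scenario data (made up for this sketch).
const fields: RoutedFieldSketch[] = [
  { name: "task", destination: "agent-visible", value: "Implement a token bucket." },
  { name: "visibleTests", destination: "develop-against", value: "test('allows first request', ...)" },
  { name: "heldoutTests", destination: "grading-only", value: "test('refills after interval', ...)" },
  { name: "rubricNote", destination: "judge-only", value: "Penalize hardcoded outputs." },
];

const cleanContext = buildAgentContextSketch(fields);
assertNoLeakSketch(fields, cleanContext); // a clean context does not throw
```

Running the same assertion at dispatch time and again at grading time, as the table describes for `gradeOnHidden`, means a leak introduced between the two steps would still be caught.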

The node --test execution stays coding-local — it's plugged in as the domain grader. scenarios.ts (the corpus + visible/held-out tests), offline-box.ts, and stats.ts are unchanged.
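The "honest-zero on no-run" normalization can be sketched as below. The shapes `RawRunSketch` and `GradeSketch` and the function `normalizeGradeSketch` are hypothetical; the real `HiddenGradeResult` and `hiddenGrade` may carry different fields. The idea shown is only that a suite which never executed scores zero rather than being skipped or treated as a pass.

```typescript
// Hypothetical shapes; the substrate's HiddenGradeResult may differ.
interface RawRunSketch {
  ran: boolean;   // whether the held-out suite actually executed
  passed: number; // number of passing tests
  total: number;  // number of tests discovered
}

interface GradeSketch {
  passRate: number; // always within [0, 1]
  notes: string;
}

// Normalize a raw test run into a pass rate, scoring a missing run as zero.
function normalizeGradeSketch(raw: RawRunSketch): GradeSketch {
  if (!raw.ran || raw.total <= 0) {
    return { passRate: 0, notes: "held-out suite did not run; scored as zero" };
  }
  // Guard against inconsistent counts from the executor.
  const passed = Math.min(Math.max(raw.passed, 0), raw.total);
  return { passRate: passed / raw.total, notes: `held-out ${passed}/${raw.total}` };
}

console.log(normalizeGradeSketch({ ran: true, passed: 2, total: 4 }).passRate); // 0.5
console.log(normalizeGradeSketch({ ran: false, passed: 0, total: 0 }).passRate); // 0
```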

Anti-cheat still holds — by execution

Verified through the port path:

CHEAT (round 0): held-out 2/4 (passRate=0.50) -> composite(blendHeldout 0.7/0.3, judge=0.8) = 0.590
REAL  (round 1): held-out 4/4 (passRate=1.00) -> composite(blendHeldout 0.7/0.3, judge=0.8) = 0.940

A hardcode-the-visible cheat passes the visible tests but fails the held-out suite it never saw; the real token-bucket passes. A new regression test proves the firewall throws when the held-out suite or rubric note leaks into the agent context (clean context passes).
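The composite figures reported above are consistent with a weighted blend of 0.7 on the held-out pass rate and 0.3 on the judge score. That weight assignment is inferred from the numbers, not read from the port's source. The sketch below reproduces the arithmetic with renormalized weights and clamped inputs; `blendSketch` is a hypothetical stand-in, not the signature of `blendHeldout`.

```typescript
// Clamp a score into [0, 1]; NaN is treated as 0.
const clamp01 = (x: number): number =>
  Number.isNaN(x) ? 0 : Math.min(1, Math.max(0, x));

// Hypothetical stand-in for the blend: weights are renormalized to sum to 1,
// and both inputs are clamped before mixing.
function blendSketch(
  heldoutPassRate: number,
  judgeScore: number,
  heldoutWeight = 0.7, // inferred from the reported composites
  judgeWeight = 0.3,
): number {
  const wH = Math.max(0, heldoutWeight);
  const wJ = Math.max(0, judgeWeight);
  const total = wH + wJ;
  if (total === 0) return 0;
  return (wH / total) * clamp01(heldoutPassRate) + (wJ / total) * clamp01(judgeScore);
}

console.log(blendSketch(0.5, 0.8).toFixed(3)); // "0.590" (cheat: held-out 2/4)
console.log(blendSketch(1.0, 0.8).toFixed(3)); // "0.940" (real: held-out 4/4)
```

With the judge held constant at 0.8, the whole 0.35 gap between the two composites comes from the held-out pass rate.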

`defineBenchmark` skeleton — deferred (PART B)

A thin defineBenchmark skeleton was considered and deliberately not built. The only other benchmark (examples/product-eval) shares nothing with this one beyond runProfileMatrix itself (already a one-line substrate call) — no held-out split, no firewall, no blend (it scores a transcript with a trivial count). A shared skeleton would have zero common logic to dedup and force two unrelated shapes under a premature abstraction. The deferral, its right home (@tangle-network/agent-runtime, not an example), and the second-benchmark trigger that should unlock it are documented in the README.

Verify (offline, no creds)

- `pnpm install --frozen-lockfile` (CI gate): green; lockfile pins agent-eval@0.100.0
- `pnpm run build`: green
- `pnpm run typecheck` (src + examples): 0 errors
- `pnpm run lint`: 0 errors
- `pnpm run docs:check`: green (catalog version-pin regenerated to 0.100.0)
- `pnpm test`: 117 files / 1131 passed / 2 skipped / 0 failed
- offline benchmark run: leaderboard renders (12 records, 4 harnesses x 3 scenarios), composite 0.944 each

…riteria grader port

Bump the agent-eval devDependency to >=0.100.0 and rewire the coding-benchmark example off its hand-rolled firewall + blend onto the published domain-agnostic hidden-criteria grading port:

- scenarios.ts: declare each field's destination (agent-visible / develop-against / grading-only / judge-only) and project a scenario into the substrate's RoutedField[] via routeFields (routeCodingFields).
- dispatch.ts: the firewall is now ENFORCED, not a comment. assertNoHiddenLeak runs each round against the assembled agent context (a grading-only/judge-only value in-context throws), and grading goes through gradeOnHidden, which re-asserts the firewall on the real run context before seeding + running the held-out suite.
- eval.ts: the node --test executor is the domain HiddenCriteriaGrader (nodeTestGrader); composeScore delegates to blendHeldout; the JudgeConfig blend wrapper delegates to withHeldoutBlend; hiddenGrade normalizes the honest-zero no-run. The local HeldoutResult is the substrate HiddenGradeResult.

The anti-cheat still holds by execution: the hardcode-the-visible cheat scores held-out 2/4 -> composite 0.59; the real token-bucket scores 4/4 -> composite 0.94 (judge held at 0.80). A new regression test proves the firewall throws when the held-out suite or rubric note leaks into the agent context.

product-eval shares nothing with this bench beyond runProfileMatrix (no firewall, no held-out, no blend), so a defineBenchmark skeleton is deferred. The deferral is documented in the README with the second-benchmark trigger that should unlock it.

tangletools

✅ Auto-approved PR — `e947344d`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T18:31:35Z

tangletools approved these changes Jun 24, 2026

drewstone merged commit c57ba93 into main Jun 24, 2026
1 check passed

drewstone merged 1 commit into main from chore/coding-bench-grader-port
