Skip to content

refactor(examples): coding-benchmark consumes the agent-eval hidden-criteria grader port#381

Merged
drewstone merged 1 commit into
mainfrom
chore/coding-bench-grader-port
Jun 24, 2026
Merged

refactor(examples): coding-benchmark consumes the agent-eval hidden-criteria grader port#381
drewstone merged 1 commit into
mainfrom
chore/coding-bench-grader-port

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

What

Bumps the @tangle-network/agent-eval devDependency to >=0.100.0 and rewires examples/coding-benchmark off its hand-rolled firewall + blend onto the newly published domain-agnostic hidden-criteria grading port (hidden-criteria-grading.ts). This both dedups the example and proves the port on a real benchmark.

The port, consumed (PART A)

Was hand-rolled Now the substrate port
firewall = a prose banner in dispatch.ts (zero enforcement) routeFields (field routing by destination) + assertNoHiddenLeak (throws on a leak), re-asserted at grading via gradeOnHidden
runHeldout inline pass-rate + notes the node --test executor is the domain HiddenCriteriaGrader (nodeTestGrader), normalized by hiddenGrade (honest-zero on no-run)
composeScore arithmetic blendHeldout (renormalized weights, clamped inputs)
local blendHeldout JudgeConfig wrapper withHeldoutBlend
local HeldoutResult interface the substrate HiddenGradeResult

The node --test execution stays coding-local — it's plugged in as the domain grader. scenarios.ts (the corpus + visible/held-out tests), offline-box.ts, and stats.ts are unchanged.

Anti-cheat still holds — by execution

Verified through the port path:

CHEAT (round 0): held-out 2/4 (passRate=0.50) -> composite(blendHeldout 0.7/0.3, judge=0.8) = 0.590
REAL  (round 1): held-out 4/4 (passRate=1.00) -> composite(blendHeldout 0.7/0.3, judge=0.8) = 0.940

A hardcode-the-visible cheat passes the visible tests but fails the held-out suite it never saw; the real token-bucket passes. A new regression test proves the firewall throws when the held-out suite or rubric note leaks into the agent context (clean context passes).

defineBenchmark skeleton — deferred (PART B)

A thin defineBenchmark skeleton was considered and deliberately not built. The only other benchmark (examples/product-eval) shares nothing with this one beyond runProfileMatrix itself (already a one-line substrate call) — no held-out split, no firewall, no blend (it scores a transcript with a trivial count). A shared skeleton would have zero common logic to dedup and force two unrelated shapes under a premature abstraction. The deferral, its right home (@tangle-network/agent-runtime, not an example), and the second-benchmark trigger that should unlock it are documented in the README.

Verify (offline, no creds)

  • pnpm install --frozen-lockfile (CI gate) — green; lockfile pins agent-eval@0.100.0
  • pnpm run build — green
  • pnpm run typecheck (src + examples) — 0 errors
  • pnpm run lint — 0 errors
  • pnpm run docs:check — green (catalog version-pin regenerated to 0.100.0)
  • pnpm test — 117 files / 1131 passed / 2 skipped / 0 failed
  • offline benchmark run — leaderboard renders (12 records, 4 harnesses x 3 scenarios), composite 0.944 each

…riteria grader port

Bump the agent-eval devDependency to >=0.100.0 and rewire the coding-benchmark
example off its hand-rolled firewall + blend onto the published domain-agnostic
hidden-criteria grading port:

- scenarios.ts: declare each field's destination (agent-visible / develop-against
  / grading-only / judge-only) and project a scenario into the substrate's
  RoutedField[] via routeFields (routeCodingFields).
- dispatch.ts: the firewall is now ENFORCED, not a comment — assertNoHiddenLeak
  runs each round against the assembled agent context (a grading-only/judge-only
  value in-context throws), and grading goes through gradeOnHidden, which
  re-asserts the firewall on the real run context before seeding + running the
  held-out suite.
- eval.ts: the node --test executor is the domain HiddenCriteriaGrader
  (nodeTestGrader); composeScore delegates to blendHeldout; the JudgeConfig blend
  wrapper delegates to withHeldoutBlend; hiddenGrade normalizes the honest-zero
  no-run. The local HeldoutResult is the substrate HiddenGradeResult.

The anti-cheat still holds by execution: the hardcode-the-visible cheat scores
held-out 2/4 -> composite 0.59; the real token-bucket scores 4/4 -> composite
0.94 (judge held at 0.80). A new regression test proves the firewall throws when
the held-out suite or rubric note leaks into the agent context.

product-eval shares nothing with this bench beyond runProfileMatrix (no firewall,
no held-out, no blend), so a defineBenchmark skeleton is deferred — documented in
the README with the second-benchmark trigger that should unlock it.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — e947344d

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T18:31:35Z

@drewstone drewstone merged commit c57ba93 into main Jun 24, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants