feat: support verifiers v1 eval in prime CLI by mikasenghaas · Pull Request #751 · PrimeIntellect-ai/prime

mikasenghaas · 2026-06-17T18:10:26Z

Summary

Companion PR to verifiers feat/nano-as-v1. Migrates the prime CLI's local eval lifecycle to the new config-driven v1 eval entrypoint and Trace output format, without regressing the v0 paths.

Dependency: pin verifiers to the feat/nano-as-v1 line (rev bbfd564) in both pyprojects; uv sync. Bump prime's requires-python to >=3.11,<3.14 (v1 verifiers requires it).
prime eval run now drives the v1 eval console script (via a -c shim in the workspace venv — verifiers.v1.cli.eval has no __main__). v0 hub/local environments run through the bridge's legacy --id path (which still produces v1 Trace output); a native v1 taskset can be selected by passing --taskset.id. Convenience flags (-m/-b/-k/--sampling-args/-n/-r/--header/...) are translated into a temporary v1 config TOML (handles nested sampling.extra_body and dash-cased headers that dotted CLI flags mangle), and remaining v1 flags (--harness.id, --client.*, @ file.toml, ...) are forwarded verbatim and override the temp config. Model validation + inference billing preflight are preserved.
prime eval view / tui consume the v1 output format: a new prime_cli/utils/v1_results.py adapts a serialized Trace to the v0 record shape the viewer renders (prompt/completion/reward/metrics/info/error) and synthesizes the run-level metadata from config.toml; data.py discovers v1 run dirs (outputs/<taskset>--<model>--<harness>/<uuid> with config.toml + results.jsonl). v0 discovery/rendering is unchanged.
prime lab setup now also adds tasksets + harnesses (the built-in v1 plugin packages that prime eval run resolves, e.g. the default harness).
Removed the dead auto-upload helpers (job-id / push footer) and obsolete hosted-run tests; kept the hosted helper functions and their direct tests.

Breaking

Python floor: prime now requires >=3.11,<3.14 (was >=3.10). v1 verifiers (0.1.15.dev*) drops Python 3.10.
prime eval run --hosted raises NotImplementedError until the platform backend understands the v1 eval format. The run config is still parsed and a HostedEvalConfig built (machinery preserved), but no submission happens. Migration: run locally with prime eval run <env> (omit --hosted).
prime eval push raises an informative error when pointed at a v1 run dir (the platform isn't v1-aware yet) — inspect locally with prime eval view. Pushing v0 result dirs is unchanged.
prime eval run no longer auto-uploads results for v1 runs; results stay local (--skip-upload is now a no-op). Use prime eval view.
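
For reference, the new floor can be checked before upgrading. This is an illustrative sketch; the `is_supported` helper is hypothetical and not part of the CLI.

```python
import sys


def is_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) satisfies requires-python >=3.11,<3.14."""
    return (3, 11) <= version < (3, 14)


if not is_supported():
    print("This interpreter is outside >=3.11,<3.14; prime will not install on it.")
```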

Verification

uv sync resolves verifiers==0.1.15.dev305 from the pinned rev.
uv run pytest packages/prime/tests -> 783 passed, 2 skipped.
End-to-end against a real v1 results.jsonl (built via the verifiers library): discover_local_eval_runs finds the run (format=v1), the LazyRunResults adapter yields prompt/completion/reward/metrics, compute_run_overview_stats aggregates rewards, synthesized metadata reports avg_reward, and prime eval push <v1 dir> raises the informative error.
prime eval run --hosted raises the gated NotImplementedError; prime eval --help lists the full command tree.
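
The v1 run-directory layout exercised above (outputs/<taskset>--<model>--<harness>/<uuid> containing config.toml and results.jsonl) can be sketched as follows. This is an assumption-laden illustration, not the `discover_local_eval_runs` implementation: `find_v1_run_dirs` is a hypothetical name, and the sketch assumes none of the three name components contains `--`.

```python
from pathlib import Path


def find_v1_run_dirs(outputs: Path) -> list[dict]:
    """Return one entry per directory that looks like a complete v1 run."""
    runs: list[dict] = []
    if not outputs.is_dir():
        return runs
    for group in sorted(outputs.iterdir()):
        parts = group.name.split("--")
        if not group.is_dir() or len(parts) != 3:
            continue  # not a <taskset>--<model>--<harness> directory
        taskset, model, harness = parts
        for run in sorted(group.iterdir()):
            # A v1 run needs both files; anything else is skipped.
            if (run / "config.toml").is_file() and (run / "results.jsonl").is_file():
                runs.append(
                    {
                        "path": run,
                        "run_id": run.name,
                        "taskset": taskset,
                        "model": model,
                        "harness": harness,
                        "format": "v1",
                    }
                )
    return runs
```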

Note: a full live prime eval run requires a lab workspace provisioned with the v1 stack (verifiers + tasksets + harnesses) and an inference model; this lands alongside the v1 verifiers release.

🤖 Generated with Claude Code

Companion to verifiers feat/nano-as-v1. Migrates the local eval lifecycle to the new config-driven v1 `eval` entrypoint and Trace output format.

- Pin verifiers to the feat/nano-as-v1 line (rev bbfd564); bump requires-python to >=3.11,<3.14 (v1 verifiers drops 3.10).
- `prime eval run` invokes the v1 `eval` console script. v0 hub/local envs run through the bridge's legacy `--id` path (-> v1 Trace output); convenience flags (-m/-b/-k/--sampling-args/...) become a temp v1 config TOML, with remaining flags forwarded verbatim. Auto-upload is disabled (results stay local).
- `prime eval view`/`tui` consume the v1 Trace `results.jsonl`: new utils/v1_results.py adapts a Trace to the v0 record shape the viewer renders, synthesizes run metadata from config.toml, and data.py discovers v1 run dirs. v0 discovery/rendering unchanged.
- `prime eval push` errors informatively on a v1 run dir (platform isn't v1-aware yet); v0 push is unchanged.
- `--hosted` raises NotImplementedError (the HostedEvalConfig is still built from the parsed run args so the hosted machinery stays wired).
- `prime lab setup` also adds tasksets + harnesses (the built-in v1 plugins that `prime eval run` resolves, e.g. the default harness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
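
A rough sketch of the Trace-to-v0 adaptation, under stated assumptions. The output keys (prompt/completion/reward/metrics/info/error) come from the PR description; the input keys (`messages`, `reward`, `metrics`, `info`, `error`) are placeholders and not the real serialized Trace schema, and both function names are hypothetical.

```python
import json


def trace_to_v0_record(trace: dict) -> dict:
    """Map one serialized trace (hypothetical keys) to the v0 viewer record shape."""
    messages = trace.get("messages", [])
    # Assumption: everything before the first assistant message is the prompt.
    first_reply = next(
        (i for i, m in enumerate(messages) if m.get("role") == "assistant"),
        len(messages),
    )
    error = trace.get("error")
    return {
        "prompt": messages[:first_reply],
        "completion": messages[first_reply:],
        "reward": trace.get("reward"),
        "metrics": trace.get("metrics", {}),
        "info": trace.get("info", {}),
        # Assumption: a structured error carries a "message" field.
        "error": error.get("message") if isinstance(error, dict) else error,
    }


def load_records(jsonl_text: str) -> list[dict]:
    """Adapt every non-empty line of a results.jsonl payload."""
    return [
        trace_to_v0_record(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
```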

…me) (#2831)

* chore(v1): add prime CLI as deps/prime submodule (feat/nano-as-v1)

  Tracks the companion PrimeIntellect-ai/prime#751 branch that adds verifiers v1 eval support to the prime CLI (run/view consume the v1 entrypoint + Trace format).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (bbfd5646 -> 220f21d4)

  New: #1727 (per-rollout isolation for shared writable tool servers) and #1702 (trim verifiers runtime deps: modal/notebook/quest/pdf moved to extras). No prime-rl-side changes needed: the only dropped transitive dep is pymupdf, used solely by verifiers' experimental quest PDF tool via a lazy import behind the quest extra (prime-rl never touches it). Imports resolve on dev307; reverse-text-v1 eval smoke clean (reward 1.0).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: validate env-server traces via model_dump (companion to verifiers to_wire removal)

  verifiers drops Trace.to_wire/from_wire and the derived computed fields (reward, is_truncated, error, duration are plain properties now). Swap wire.to_wire() -> wire.model_dump() when re-typing a returned Trace into ROLLOUT_TYPE; the .reward / .is_truncated the metrics/eval code reads are the Trace properties, so they still work.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: rename vf.Task(instruction=) -> prompt= (verifiers #1732 companion)

  verifiers #1732 renames Task.instruction -> Task.prompt; update the dispatcher's error-rollout Task construction to match. Pin bump to the merged commit comes when #1732 lands.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to chore/v1-general-cleanup tip (220f21d4 -> d78d7474)

  Pin #1732 (general v1 cleanup) so this companion's model_dump trace handling and the Task.prompt rename round-trip against a verifiers that actually has them. Re-pin to the feat/nano-as-v1 tip once #1732 merges.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (d78d7474 -> caaf0ff3)

  #1732 (general v1 cleanup) merged into feat/nano-as-v1; re-pin off the chore branch onto the integration tip. This companion's model_dump trace handling + Task.prompt rename now run against merged verifiers.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: error-rollout placeholder task uses prompt=None

  An error rollout has no prompt; None is the honest value (Task.prompt is str | Messages | None) rather than an empty string.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: add vf-nano as submodule Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump deps/vf-nano to feat/env-server (EnvServer) Points the submodule at the vf-nano EnvServer branch so the orchestrator can build on the env-server abstraction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: run orchestrator on a vf-nano env server (reverse-text) Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano EnvServer per env (it never loads an environment), dispatches rollouts by task index, and trains on the returned Trace dicts (branches + renderer tokens). - pyproject: dep verifiers -> vf-nano; drop v1/research env packages; only the vf-nano reverse-text example; override out the transitive v1 verifiers (pulled by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add orjson /pandas/msgspec (were transitive via verifiers). - EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns). - envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring, dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict. - trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output. - train_source: index sampling; client pool builds vf-nano ClientConfig; lag monitor vendored; env-server entrypoint repointed; ~14 files retyped off vf.RolloutOutput / vf.ClientConfig. - configs/debug/vf_nano_reverse_text.toml. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: consume vf-nano Trace natively (branches→samples, shared renderer config) - trace_to_samples stitches each Trace branch's tokens into one TrainingSample (prompt = branch start, then each turn's new context [masked] + generated tokens [trained]); drop the RolloutOutput adapter — read the Trace's native fields directly (reward, error{type,message}, timing generation/scoring, num_turns, branches). 
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics / orchestrator read native Trace fields (no token_usage/completion/timing.total). - client pool forwards the shared renderers.RendererConfig to the env server's renderer client (so it uses qwen3, not the tool-less default fallback). - debug config: tool_call_parser=hermes (vLLM accepts the agent's tools), max_steps=20. - bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: pass typed ClientConfig/SamplingConfig to the env client (no timeout) - Env.run_rollout/run_group pass the vf-nano ClientConfig object and a SamplingConfig (built from the env's sampling args) directly — no model_dump, no per-rollout timeout forwarded to the server. - debug config: max_steps=20. - bump deps/vf-nano (typed env-server RPC). * refactor: orchestrator holds a typed vf.Trace[EnvTask] (no dicts) The env server returns a Trace minus its derived fields; the orchestrator resolves the env's Task subclass (from config.id) and validates the wire dict into a strict Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace — typed task fields included (e.g. task.answer), nothing subscriptable. - envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask]. - trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils /orchestrator: attribute access on the typed Trace (reward, error{type,message}, branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the consumer. - Task/Trace/TimeSpan stay strict (StrictBaseModel) — no extra=ignore anywhere. - bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: depend on vf-nano[serve]; bump submodule The orchestrator spawns the env server, so request the serve extra (zmq/msgpack) explicitly now that vf-nano keeps them out of core. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump vf-nano (client docstring cleanup) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: drop redundant forward-ref quotes in advantage.py `from __future__ import annotations` already defers all annotations to strings, so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout` annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: rename FinishedRollout.raw -> trace The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which suggested an unparsed dict). Renames the field + every `.raw` access, the `emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the dispatcher cancel-path locals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: simplify FinishedRollout, read straight off the typed Trace - Drop the FinishedRollout proxy properties (error/reward/is_truncated and the example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...} directly. The trace is the single source of truth. - Use vf.Trace.has_error for existence checks instead of `.error is not None`. - Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len, total_tokens,has_response} (now on the trace); keep trace_to_samples. - Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx), source dict key) instead of the example/example_id dict carrier; identity comes off trace.task.idx. - Mark the local-package env arrangement as a temporary/experimental TODO. - Move the debug config to configs/debug/nano/reverse_text.toml. - Bump deps/vf-nano (Trace/Turn accessors). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: spawn env server on an OS-assigned port, drop the startup poll - The env server binds tcp://127.0.0.1:0 and reports its concrete address back over a queue; the orchestrator connects to that. Removes _get_free_port and its TOCTOU race (the OS assigns the port atomically). - A spawned server has already bound + loaded by the time it reports its address, so the untimed info() is enough — only poll wait_for_server_startup for an external (config.address) server, which has no spawn handshake. - Bump deps/vf-nano (port report + Trace/Branch token-length accessors). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: use vf.task_type instead of a local resolve_task_type The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: restore backfill_rollout_tokens for SFT (typed Trace) SFT trains on a teacher served over the chat client, which returns no token ids, so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore backfill: for each tokenless turn, render its prompt + assistant response with the student chat template and split on the longest common prefix to fill TurnTokens (masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills when any turn lacks tokens, before building samples. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: pass task_idx when building Cancelled traces on off-policy drop drop_group's error_rollout_output calls omitted the required task_idx, so an off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx (or -1 when the group is already gone), mirroring handle_completed_rollout. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: consume typed Trace[WireTask]; inline synthetic error traces - envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task subclass via self.trace_type.model_validate(wire.to_wire()). - dispatcher.py: drop the error_rollout_output helper — inline the synthetic error Trace at each call site using vf.Error's field names (type/message/traceback); the task-exception path carries a real traceback, cancels/empty-trajectory carry none. - Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: env-server file logging; align debug config batch/group to canonical - Spawned env servers now route their output (logging + subprocess-runtime output) to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned server logged nowhere. - Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128 (interval=1), matching configs/debug/training_modes/rl.toml. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: don't double the envs/ segment in the env-server log path The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train, .../logs/envs/eval), so _spawn must drop the file directly under it (<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the train/eval split under logs/envs/<kind>/envs/<name>.log. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump vf-nano (Error.traceback str | None) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump vf-nano (to_wire ordering) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: launch env servers as separate processes from the rl entrypoint Instead of the orchestrator sidecar-spawning each env server as an mp child, the rl launcher now spawns one `env-server` process per env (train + eval), each on a free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same model as inference/trainer. It sets env.address in the orchestrator config so the orchestrator attaches (its existing external path) instead of spawning. Envs that already set address (user-managed external server) are left alone; the orchestrator's mp sidecar stays as the fallback for running `orchestrator` directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: env servers use fixed configurable ports, not get_free_port Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds base_port + i. Drops the get_free_port dependency in the launcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: separate train/eval env-server port blocks Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i (stride 1000), so each kind has headroom for many envs without the blocks colliding (was a single running index — train and eval sat adjacent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: env-server logs + sidecar queue cleanup; train-only debug config - env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs (EnvServer up, request failures) land in logs/envs/<kind>/<name>.log — previously only loguru output was captured, swallowing them. 
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker leaked-semaphore warning on the sidecar path). - configs/debug/nano/reverse_text.toml: drop the eval block, mirroring examples/reverse_text/rl.toml (train-only smoke; eval path validated separately). - bump deps/vf-nano (serve/types docstring trim). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump vf-nano (BaseRequest marker, no request_type field) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: env client uses client= (was client_config=); bump vf-nano Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump vf-nano (drop renderers dep comment) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump vf-nano (configs/ + cli/ split, serve/ runtime-only) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: drop FinishedRollout.to_dict; serialize the Trace to disk directly The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata merge — the on-disk rollout is just the trace. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: track vf-nano agent->harness rename vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig); env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump deps/vf-nano to the renamed branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: track vf-nano plugin reorg (reverse-text path + bump) vf-nano reorganized examples into examples/{tasksets,harnesses}/; point the reverse-text editable source at examples/tasksets/reverse_text and bump deps/vf-nano. 
No prime-rl code change — EnvConfig.harness (default) resolves via vf-nano's built-in harness registry. Verified: 3-step reverse-text smoke trains (0.26 -> 0.42, 0% error). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: reuse vf.EnvConfig in the orchestrator (typed taskset/harness, drop args) prime-rl's EnvConfig now subclasses vf.EnvConfig and resolves taskset + harness to their specific config types by id (taskset_config_type / harness_config_type), so env-specific fields are validated against the real config — the untyped `args` dict and the top-level `id` are gone (id/stripped_id/resolved_name are now properties off taskset.id). Timeouts come from vf.TimeoutConfig (timeout.rollout / timeout.scoring), superseding prime-rl's flat timeout. The env server is spawned with the typed taskset_config (no env_id/args). - pyproject: install the vf-nano plugin packages (default/rlm harnesses, gsm8k taskset) as path sources; bump deps/vf-nano to the plugin-packages branch. - configs/debug/nano/reverse_text.toml: taskset = { id = ... }, harness.id (was harness.type). Verified: custom taskset (gsm8k.split) + harness (rlm.ref) resolve to typed configs; bad fields are rejected; the migrated TOML loads through RLConfig. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: inherit vf's shared plugin resolution; trim dead EnvConfig fields EnvConfig drops its own _resolve_plugins (now inherited from vf.EnvConfig's shared validator) and the dead v1-forwarding fields: extra_env_kwargs, max_total_completion_tokens, state_columns (no readers on the vf-nano branch). Also drop stripped_id (no hub installs -> no @version, so id == it) — callers use .id. Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: require a configured taskset per env (no reverse-text default) EnvConfig no longer auto-defaults its taskset to reverse-text; an env with no taskset id errors at validation. 
Env-list defaults are empty (eval's non-empty check still fires only when an eval block is configured). Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump deps/vf-nano (dashboard taskset.id/harness.id fix) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump deps/vf-nano (sampling max_tokens fix) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(env-server): require a configured env; build EnvServer from the EnvConfig EnvServerConfig.env was a bare EnvConfig() default, which now raises (no taskset) and crashed the env-server at import. Make env required — the orchestrator always passes a real env config. Build the server straight from config.env (vf-nano EnvServer takes the EnvConfig). Bump deps/vf-nano (is_truncated computed field). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(nano): reverse-text uses default harness with enable_bash=false The model answers directly (no bash tool), so the tool_call_parser is no longer needed. Bump deps/vf-nano (enable_bash flag). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(nano): hendrycks-sanity config on the math-env taskset Add configs/debug/nano/hendrycks_sanity.toml (vf-nano analog of configs/debug/hendrycks_sanity): math-env train env on the sanity dataset (R1-distill, default renderer with think parser, single-turn no-bash), aime24 eval env. Register math-env + aime24 in the envs group + sources. Bump deps/vf-nano (math/aime tasksets, enable_bash, SerializeAsAny + is_truncated fixes). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump deps/vf-nano to merged main (math/aime tasksets, fixes) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: drop deps/verifiers, bump deps/vf-nano to merged main vf-nano main (#8) adds the subprocess-runtime env passthrough (inherit host env minus API_KEY, so UV_CACHE_DIR reaches rollout workers and lands the uv cache on local disk) and the math-verify scoring fix (wrap gold + prediction in \boxed, matching v1 math-env — fixes matrix/vector answers scoring 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(nano): hendrycks-sanity mirrors examples/hendrycks_sanity + slurm overlay nano config is now a verbatim copy of examples/hendrycks_sanity/rl.toml with only the env sections swapped to vf-nano taskset/harness syntax (math-env taskset, default harness, subprocess runtime; aime24 eval). Adds a slurm overlay setting the output dir, wandb run name, and partition. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(nano): fold wandb + slurm into hendrycks_sanity config Move the [wandb] and [slurm] sections from the slurm overlay into the base config, so it submits to slurm by default (run locally with --no-slurm). The overlay now only redirects output_dir. wandb run name is vf-nano-subprocess. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(nano): fold output_dir into hendrycks_sanity config, drop slurm overlay Everything (output_dir, wandb, slurm) now lives in the one config; remove the now-empty slurm overlay. Run locally with --no-slurm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(vf-v1): run on the unified verifiers package (v0 envs + v1=nano + legacy bridge) Depend on the single verifiers package (deps/verifiers, nesting vf-nano) instead of standalone vf-nano. 
EnvConfig is dual-mode: v0 envs via id+args (legacy bridge), v1 envs via taskset/harness (native nano). The env-server spawn (standalone + mp) picks LegacyEnvServer for v0 and EnvServer for v1; both return vf.Trace so the orchestrator is unchanged. Plugin sources repointed under deps/verifiers; reverse-text is the v0 env, reverse-text-v1 the nano taskset. Verified: uv run rl @ examples/reverse_text/rl.toml (v0, unchanged) trains via the bridge (reward 0.14->0.79); vf-eval + vf-eval-v1 reverse-text both reward 1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v0): wire alphabet-sort (multi-turn) + bump verifiers (State scrub) Add the v0 alphabet-sort env as an installable source; bump deps/verifiers to the State-scrub commit. Verified the bridge on a multi-turn v0 env: alphabet-sort runs (Turns 2.0, 128/128 trainable, 0% error) alongside single-turn reverse-text. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v0): wire wordle + add configs/wordle/rl.toml (2-GPU) Wire the v0 wordle TextArena env (multi-turn game) as an installable source; add a 2-GPU (1 trainer + 1 inference) wordle config. Verified through the legacy bridge: wordle trains (Turns ~5.3, reward ~0.81, 128/128 trainable, 0% error). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump deps/verifiers (v1 hygiene: init.py/tests/docs) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): track de-vendored verifiers, rename nano -> v1 - Bump deps/verifiers to the de-vendored commit: vf-nano is now vendored as the verifiers.v1 subpackage (no nested deps/vf-nano submodule). - Repoint env-plugin sources from deps/verifiers/deps/vf-nano/... to deps/verifiers/{examples,packages}/... - Rename verifiers.nano -> verifiers.v1 across the orchestrator/utils/configs; rename configs/debug/nano -> configs/debug/v1. "v1" is the name now (no "nano"). 
Verification: v0 reverse-text (legacy bridge) and v1 reverse-text-v1 (native) both train over a 3-step smoke (reward present, 0% error). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers (eval/serve, v1 deps in base, bundled plugins) - Bump deps/verifiers: v1 runtime deps moved to base (no `v1` extra), v1 CLIs renamed eval/serve, shipped plugins bundled in the tasksets/harnesses umbrella packages, verifiers/v1/harness.py (flattened harnesses subpackage). - Depend on `verifiers` (was verifiers[v1]) and the `harnesses` umbrella package (was standalone default/rlm). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers (Task.system_prompt); write full rollout jsonl - Bump deps/verifiers to the system_prompt commit: Task.system_prompt + harness APPENDS_SYSTEM_PROMPT support, reverse_text_v1 byte-identical to the v0 env (separate system message), plus this branch's eval/serve, v1-deps-in-base, CI fixes, and the retired semgrep policy. - save_rollouts: drop exclude_keys={"trajectory"} for both train and eval rollouts so the jsonl carries the full trajectory. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: drop semgrep from uv.lock (verifiers retired the policy group) Follow-up to the deps/verifiers bump (00c7b77a removed the `policy` dependency group); regenerates the lock to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: pin the env-server renderer tokenizer to the base model for LoRA runs Wire `renderer_model_name` (the base model) into the renderer client config so the env-server renderer builds its tokenizer from the base model instead of the per-request LoRA adapter name, which has no published HF tokenizer and 404'd once LoRA training set the served model name to the adapter. Mirrors the `renderer_model_name` wiring already in `setup_clients` on main. Bumps deps/verifiers to the matching fix. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(orchestrator): request vLLM token ids for MITO training (#2745) On the MITO path (no renderer), set return_token_ids in the train env sampling args so the openai_chat_completions client gets the prompt and completion token ids back from vLLM and can carry them for training instead of re-tokenizing the messages downstream. Scoped to renderer is None so it never reaches the renderer /inference/v1/generate endpoint (which forwards sampling params to vLLM verbatim). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers submodule; restore v0 env catalog + v1 -v1 imports - bump deps/verifiers to the feat/nano-as-v1 tip (token-parse #1585, bridge meta #1586, taskset -v1 rename #1587, native retries #1588, runtime resource cleanup #1590) - restore the full v0 env catalog in optional-dependencies.envs + [tool.uv.sources] (init the research-environments submodule); the taskset -v1 rename freed the v0 names, so v0 envs (legacy bridge) and v1 tasksets coexist - point the v1 taskset sources at their -v1 names / _v1 paths (gsm8k-v1, math-env-v1, aime24-v1; reverse-text-v1 already correct) - update v1 configs to the -v1 taskset ids (hendrycks_sanity) and drop the superseded standalone configs/debug/reverse_text_v1.toml - relock Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers submodule to feat/nano-as-v1 tip (textarena #1592) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers submodule to feat/nano-as-v1 tip Pulls in runtime resources named after the rollout id (#1596) and the alphabet-sort-v1 taskset (#1595). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers submodule to feat/nano-as-v1 tip Brings in env-hub ids (#1597), v0 envs on the eval CLI (#1598), modal runtime (#1594). 
* chore(v1): add v1 alphabet-sort debug config (port of examples/alphabet_sort) * fix: forward extra_env_kwargs to v0 legacy envs; drop dead trajectory tests (#2749) Restores main's escape hatch for v0 (legacy-bridge) envs: a legacy env's extra_env_kwargs is auto-populated (timeout_seconds <- timeout.rollout, max_total_completion_tokens <- max_output_tokens, max_seq_len <- seq_len) and forwarded to LegacyEnvServer at spawn, so v0 rollouts again honor the wall-clock timeout, multi-turn completion budget, and seq-len truncation (all were silently dropped on this branch). Also removes tests/unit/orchestrator/test_{trajectories,sft_trajectories}.py, which imported the deleted interleave_rollout and broke test collection. * fix: install the bundled `tasksets` package (harbor-v1, textarena-v1) The `envs` extra wired `harnesses` and the individual `*-v1` example tasksets but never the bundled `tasksets` package, so the integration tasksets it ships (`harbor-v1`, `textarena-v1`) couldn't be resolved — `import_taskset("harbor-v1")` raised ModuleNotFoundError ("tried to import 'harbor_v1'"). Add `tasksets` to the `envs` extra + a path source, and bump the verifiers submodule to the feat/nano-as-v1 tip (#1600), where the bundled tasksets live under the `tasksets` namespace package (`tasksets.harbor_v1`) and the loader resolves the namespaced module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: consume the v1 message-graph trace (#2763) * feat: consume the v1 message-graph trace (graph-walk trace_to_samples) Walk the new message graph (verifiers feat/trace-message-graph, PR #1606): trace_to_samples builds one TrainingSample per branch by concatenating each branch path's node token_ids / sampled_mask / logprobs (graph.branch_token_sequences), splitting prompt|completion at the first sampled token — identical training tensors to the old per-turn stitching, off a trace that is now linear (not quadratic) in turns. 
backfill_rollout_tokens is a no-op (training is renderer-only; `trajectory` is now a read-only view over the graph). Bumps the verifiers submodule to the graph-trace branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump verifiers submodule (MessageNode.mask rename)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: wire alphabet-sort-v1 taskset
Add alphabet-sort-v1 to the `envs` extra + `[tool.uv.sources]` so configs/debug/v1/alphabet_sort.toml resolves (it referenced an example taskset that was never wired into prime-rl). Used to verify graph-based training-sample construction on real RL runs - v0 (legacy bridge) and v1 (native renderer path) both train cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: consume nodes/branches directly (drop Turn/trajectory readers)
`trace_to_samples` already walks the graph; the remaining readers move off the removed Turn/trajectory API: the gibberish/repetition filters iterate per-node completions, advantage/dispatcher use `trace.num_turns`/`trace.completion_len`, `get_model_completion_len` is dropped (use `trace.completion_len`), and the renderer-only train_sink drops the backfill path (also removing `backfill_rollout_tokens`). Bumps the verifiers submodule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump verifiers submodule (merge #1605 multiplex interception)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump verifiers submodule (dead-code cleanup)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers (readme highlight + ruff format)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: enforce renderer, SFT backfill, branch-first-class logging
Training is renderer-only now.
RL/OPD roll out through the renderer client (exact sampled token ids + logprobs); SFT rolls out against a chat-completions teacher that returns no tokens and re-renders the conversation to backfill them (`backfill_trace`). A renderer is required for every mode (`renderer=None` rejected) - the oai client never produces correct training tokens for the message graph. Drops the MITO no-renderer training path. Logging consumes `trace.branches` as the first-class unit (`branch.token_ids` / `branch.messages`) instead of the removed `trajectory` field; `trace_to_samples` builds one sample per branch from the same accessors. Sample loggers take the rollout objects so env_name/advantage are available. Add configs/v1/training_mode (rl/opd/sft + lora/external) mirroring the v0 debug configs. Fix the v0 SFT debug configs + rlm_swe to validate under the renderer requirement.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: flat TrainingSample (token_ids + mask), required renderer
Drop the prompt/completion split from TrainingSample - it doesn't fit a multi-turn/agentic branch, where context and model-sampled spans interleave. A sample now carries the branch's flat `token_ids` plus per-token `mask` (True = trainable), `logprobs`, and `temperatures` (all aligned). `prepare_sample` passes them straight into the MicroBatch (already flat), and the packer validates against `token_ids` length. Make `orchestrator.renderer` a non-optional type (drop the `enforce_renderer` validator) - training is renderer-only, so the type carries the requirement. Bump the verifiers submodule to feat/nano-as-v1 (merged #1606 + Branch.branches inlined).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: SFT teacher rolls out through the renderer client (drop backfill)
Training is renderer-only across every mode, so the SFT teacher now rolls out through the renderer client too - its rollouts carry tokens directly, the same as RL/OPD.
Drops the chat-completions backfill (`backfill_trace` + the SFT path in TrainSink) and the now-unused TrainSink renderer. This requires a self-hosted teacher that shares the student's tokenizer (the student trains on exactly the ids the renderer feeds the teacher); distilling from an external chat API is no longer supported. Remove the `sft_external` debug configs. Validated: SFT on reverse-text-v1 trains cleanly (Trainable 128/128, eval reward ~0.1 -> ~0.82 over 20 steps) with the teacher on the renderer client, no backfill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: drop configs/v1/training_mode README
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: consolidate rollout types into Rollout(vf.Trace)
The trace *is* the rollout: replace the FinishedRollout/TrainRollout/EvalRollout wrappers with a prime-rl Rollout(vf.Trace[TaskT]) subclass that carries the orchestration metadata (kind, env_name, group_id, policy_version, off_policy_steps, samples, advantage, is_filtered, filter_results, eval_step) as exclude=True fields - so dumping a Rollout still yields a plain trace (on-disk results.jsonl unchanged). envs.py validates the wire trace into Rollout; the dispatcher stamps the metadata; train vs eval is the `kind` discriminator (replacing the isinstance check). All consumers read rollout.X directly instead of rollout.trace.X. Drop the monitor's SampleRollout duck-type Protocol - the loggers take the real Rollout (TYPE_CHECKING import) and read branch.token_ids / branch.messages. Also drop the prime monitor's _split_branch_messages and _json helpers: the conversation is the unit (no prompt/completion split - meaningless multi-turn). Fix a latent dispatcher bug surfaced along the way: synthetic error traces used `error=` / `r.error = ` (a read-only computed field) - now `errors=[...]` / `r.errors.append(...)`.
Rewrite the (long-stale, dict/`raw`-based) advantage + filters unit tests to build real Rollouts - they now exercise the current trace-based code (previously all failing on import/construction).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: allow verifiers + datasets in the slim-config dep check
The v1 config types (EnvConfig, Task, ...) extend `verifiers.v1`, which is a declared, pure-pydantic dependency of prime-rl-configs (it pulls `datasets` for the taskset/Task types but no GPU/ML deps). Drop `verifiers` and `datasets` from the slim-install forbidden list - keep the real heavy training deps (torch, vllm, transformers, ...).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip
Picks up the v1 end-to-end eval test suite (#1609) and the v0 legacy env-server group-scoring fix (#1612).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip
Picks up the v0 legacy-bridge fixes: guard against non-renderer training clients (#1613) and serve the eval split for eval-only v0 envs (#1614).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): fix stale sft.toml teacher comments
The SFT teacher rolls out through the renderer client (token-in/out) and must share the student's tokenizer; drop the leftover oai-client / token backfill description removed in the renderer-only refactor.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers (v0 eval chat-completions client)
Picks up verifiers#1615: the legacy bridge builds a chat-completions client for v0 eval rollouts (renderer for training), instead of raising on the non-renderer eval client.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): scaleswe configs + taskset registration (#2765)
- register the scaleswe-v1 taskset (pyproject envs list + uv source)
- point the existing rlm-swe config (configs/rlm_swe/qwen35_4b.toml) at the scaleswe taskset (task_type="scaleswe", train + eval)
- add configs/debug/v1/scaleswe.toml - a per-env v1 port of that config using the scaleswe-v1 taskset via the rlm harness on the prime runtime
Companion to verifiers feat/scaleswe-v1 (scaleswe-v1 taskset + setup/workdir hooks). Needs the deps/verifiers submodule bumped to that branch once it lands.

* feat(v1): multimodal training through the message graph + color-codeword-v1
Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch, `mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's `mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the mapping through. The wandb sample logger now renders the task as a Table-safe JSON string with image data elided - an image-bearing instruction crashed wandb's Table type inference on the nested content list. Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1, 2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule for the multimodal message-graph support. Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78, Trainable 256/256 - mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword` eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Revert "feat(v1): multimodal training through the message graph + color-codeword-v1"
This reverts commit 85b27cfc0be12b84f9b36456675b10387d01dc8a.
* feat: multimodal training through the v1 message graph + color-codeword-v1 (#2766)

* feat(v1): multimodal training through the message graph + color-codeword-v1
Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch, `mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's `mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the mapping through. The wandb sample logger now renders the task as a Table-safe JSON string with image data elided - an image-bearing instruction crashed wandb's Table type inference on the nested content list. Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1, 2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule for the multimodal message-graph support. Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78, Trainable 256/256 - mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword` eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin - multimodal review-pass cleanups
Picks up the verifiers feat/v1-multimodal head: the multimodal review-pass (capability-flag docstrings, trimmed mm comments, color-codeword-v1 config validator + module constants) and the merged malloc_trim worker-RSS fix (#1621).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (content-part mm attribution) + config/test sync
- Bump deps/verifiers to the content-part multimodal attribution (drops the unused placeholder offset machinery).
- Drop max_turns/seed from the color-codeword-v1 taskset args in the config - the taskset hard-codes them as module constants now, and passing them is rejected.
- Update the mm egress unit test to assert mm_items order (the new attribution), not placeholder offsets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(trainer): slice mm_kwargs on truncation so tokens match embeddings
When a sample exceeds seq_len, prepare_sample truncated input_ids and mm_token_type_ids but passed mm_kwargs through whole - leaving more image embeddings than surviving image placeholders. Now truncation cuts to a whole-image boundary (never splitting an image's placeholder block) and slices mm_kwargs (pixel_values + image_grid_thw) to the images that fully survive, so image-placeholder count == image-embedding count.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to ruff-formatted graph.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): remove test_trajectories_mm.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: thread num_workers to the env-server worker pool (#2768)

* feat(v1): thread num_workers to the env-server worker pool
Wire the verifiers env-server worker pool into prime-rl: the orchestrator's spawned env server (envs.py) and the `env-server` CLI now serve via verifiers' serve_env with num_workers, so requests fan out across N worker processes instead of one event loop. num_workers was already a config field but dropped on the floor; it's now passed through and defaults to 4. Companion to verifiers feat/v1-env-workers; needs deps/verifiers bumped to it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): default num_workers to 4
Make the worker pool the default: num_workers defaults to 4 (was "auto"->1) across the per-env, train, and eval configs, so training/eval env servers fan rollouts across 4 worker processes out of the box. "auto" stays a valid value (scales per concurrency); set num_workers=1 for the old single-process server.
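The whole-image truncation described in the `fix(trainer)` entry above can be sketched in plain Python. This is only an illustration under stated assumptions, not the prime-rl `prepare_sample` code: the function name, the `IMAGE_TOKEN_TYPE` marker, and the `image_token_counts` argument are hypothetical, and each image's placeholder tokens are assumed to form one contiguous block.

```python
from typing import List, Tuple

IMAGE_TOKEN_TYPE = 1  # assumed marker for image placeholder tokens


def truncate_to_image_boundary(
    token_ids: List[int],
    mm_token_type_ids: List[int],
    image_token_counts: List[int],
    seq_len: int,
) -> Tuple[List[int], List[int], int]:
    """Truncate to at most seq_len tokens without splitting an image block.

    Returns the truncated token ids, the truncated type ids, and the number
    of leading images whose placeholder blocks fully survive.
    """
    positions = [i for i, t in enumerate(mm_token_type_ids) if t == IMAGE_TOKEN_TYPE]
    if len(positions) != sum(image_token_counts):
        raise ValueError("placeholder count does not match image_token_counts")

    # Recover each image's [start, end) span, assuming contiguous blocks.
    spans = []
    offset = 0
    for count in image_token_counts:
        block = positions[offset : offset + count]
        spans.append((block[0], block[-1] + 1))
        offset += count

    cut = min(seq_len, len(token_ids))
    for start, end in spans:
        if start < cut < end:  # the cut would land inside this image
            cut = start  # back off to just before the image
            break

    kept_images = sum(1 for _, end in spans if end <= cut)
    return token_ids[:cut], mm_token_type_ids[:cut], kept_images
```

A caller would then slice its per-image tensors (the entry names `pixel_values` and `image_grid_thw`) down to the first `kept_images` images, so the placeholder count and the embedding count stay equal.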
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): keep num_workers="auto" default on the orchestrator
Revert the orchestrator's per-env / train / eval num_workers defaults back to "auto" (was 4) so they keep scaling 1 worker per 256 concurrent rollouts out of the box. The standalone env server can't scale (no concurrency context - it's driven by external clients), so its resolver collapses "auto" to a fixed 4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to be76cbc3 (env-server worker pool)
Align the pin with #1623 (env-server worker pool: router + N workers), which the just-merged #2768 (thread num_workers to the pool) requires; the pin had lagged at the pre-#1623 multimodal tip.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): register r2e-gym-v1 taskset
Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through the v1 loader, matching the other -v1 tasksets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(swe): use r2e-gym for rlm_swe configs (v0 + v1)
- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env at the r2e-gym-v1 taskset, and drop the eval block
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(swe): point rlm_swe configs at r2e-gym (content)
Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): restore env-server worker logs to the env log file (#2770)
Env servers spawn their worker pool as fresh `spawn` processes with no logging handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers' `serve_env` as `log_setup`; it runs in the broker and in every worker. A worker inherits the broker's redirected stdout/stderr, so its logs land in the same `envs/{train,eval}/<name>.log` as before - no new files or paths. Bumps deps/verifiers to the worker-logging fix.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 955b6cdf (dashboard token usage fallback)
Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich dashboard's token counts fall back to provider usage when the endpoint returns no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626 variant; 955b6cdf already contains the equivalent #1626 (env-server worker logging) plus #1627.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 8e4ad735 (clean env-server teardown)
Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no longer print a spurious KeyboardInterrupt traceback into the env logs on shutdown.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(deps): bump verifiers (renderers floor 0.1.8.dev40)
Picks up the verifiers floor bump so the renderers offset-tokenizer fix (dev40, PRs #72/#75) can't be undercut by a pre-fix PyPI resolution. Re-locks uv.lock to the dev40 specifier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to db82b38a (reap subprocess tree on cancel)
Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: elastic env-server pool (inherit static/elastic pool config) (#2774)

* feat(v1): elastic env-server pool (inherit pool config from verifiers)
Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so each env inherits the `pool` discriminated union (static{num_workers=4} | elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env servers scale workers on demand instead of pre-spawning a fixed `auto` count.
- Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution (ceil(max_inflight/256)); the elastic pool self-sizes from load.
- envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
- Bump deps/verifiers to the elastic-pool branch.
Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic", multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug configs are migrated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): back-compat shim mapping legacy num_workers -> pool
EnvConfig forbids extra fields, so configs still setting the removed `num_workers` would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`: an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit `pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)
The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was redundant - just remove the legacy `num_workers` and inherit the default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to f404e97f (elastic env-server pool)
Realign the pin onto origin/feat/nano-as-v1: the prior pin d0c5bc98 was the unsquashed #1629 feature branch, now squash-merged as f404e97f (content-identical). Picks up #1629 (static/elastic env-server pool config).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 88e9bedd
Picks up #1631 (per-rollout setup timing as a distinct phase) and #1632 (per-call model + runtime retries).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 40a2e89f (fix trace.timing.*.duration to_wire validation)
Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra inputs are not permitted' that crashed every rollout (#1636 drops computed durations from to_wire).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 5dc084f5
Picks up #1638 (add --resume for evals: re-run a previous run's missing/errored rollouts).
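The `num_workers` -> `pool` back-compat mapping from the shim entry above can be illustrated on a plain dict. The commit implements it as a pydantic `model_validator(mode="before")` on EnvConfig; the function below is a hypothetical, dependency-free sketch of the same rules (int -> `static`, `"auto"` -> default `elastic`, explicit `pool` wins), and its name is an assumption.

```python
from typing import Any, Dict


def migrate_num_workers(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a legacy `num_workers` key onto `pool` before validation."""
    data = dict(raw)  # never mutate the caller's config
    if "num_workers" not in data:
        return data

    legacy = data.pop("num_workers")
    if "pool" in data:
        # An explicit `pool` always wins; the legacy key is simply dropped.
        return data

    if legacy == "auto":
        # Rely on the elastic pool's own defaults.
        data["pool"] = {"type": "elastic"}
    elif isinstance(legacy, int) and not isinstance(legacy, bool):
        data["pool"] = {"type": "static", "num_workers": legacy}
    else:
        raise ValueError(f"unsupported num_workers value: {legacy!r}")
    return data
```

Running the mapping before field validation is what lets a model that forbids extra fields keep accepting older configs.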
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 472622ba
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: stop importing env modules in the orchestrator (always Rollout[WireTask]) (#2781)

* chore(v1): stop importing env modules in the orchestrator
The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and vf.task_type imports the env package just to read its Task subclass for typing the wire trace. Nothing reads typed env task fields - only task.idx and a full task.model_dump - and WireTask (extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the orchestrator never imports an env package: the env's type and runtime both live only in the server process.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE
It no longer varies per env, so it doesn't belong as a per-instance attribute set in Env.__init__ - lift it to a module constant used directly in run_rollout/run_group.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 7270e69b
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to 66c87d5b
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers to ef45f720
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (alphabet-sort host user sim, #1645)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (modal creates_per_sec 5 -> 40, #1646)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: cap v1 hendrycks-sanity scoring at 10s (#2790)

* fix(v1): cap hendrycks-sanity scoring at 10s
Without a scoring timeout (the default is no limit), a wedged math verify holds its rollout's permit forever - sympy can spin past the in-script alarm - and at 512 concurrency that starves the pool and stalls long runs. Set timeout.scoring = 10 on the train and eval envs so the framework cancels and the subprocess runtime kills a runaway verify, freeing the permit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: drop inline comment on the scoring timeout
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers (eval/train client rename) + adopt config imports (#2792)
Bump deps/verifiers to feat/nano-as-v1 HEAD (8873a740), which includes verifiers#1654 - the v1 interception rework: role-named clients (EvalClient/TrainClient), route-detected wire dialects (chat/responses/anthropic), 1:1 relay + streaming, reasoning preserved. Adopt the renamed client config classes in prime_rl/utils/client.py: OpenAIClientConfig -> EvalClientConfig, RendererClientConfig -> TrainClientConfig.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (subprocess cached-interpreter uv-scripts, #1660)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (codex built-in harness, #1661)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: track terminal-bench-2-v1 as a taskset dependency
It was only a manual editable install, so `uv sync` pruned it. Add it to the env dependency group + [tool.uv.sources] (mirroring r2e-gym-v1) so it persists across syncs and is available out of the box.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): keep multimodal tensors out of rollout dumps (#2794)
verifiers#1653 (carry mm tensors across the env-server wire) is merged and pinned, so `MessageNode.multi_modal_data` is no longer `exclude=True` - `model_dump(mode="json")` now serializes the base64 pixel tensors into `train_rollouts.jsonl` and the wandb sample tables, bloating every line. They're the training `mm_kwargs` carrier, not part of the rollout record, so exclude them at the dump boundary (train + eval paths).
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): install all v1 example tasksets in prime-rl
Declare the remaining 7 verifiers v1 example tasksets (code-golf, deepwiki, glossary, swelego, wiki-search, wikispeedia, wordle) as editable deps so uv sync installs every example, matching the verifiers examples set. chromadb/textarena were already present via the v0 wiki-search/wordle envs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): install the compact example harness in prime-rl
The example harness (examples/harnesses/compact) was missing from prime-rl deps, so the documented --harness.id compact branching example failed to resolve (ModuleNotFoundError: harness compact not found). Declare it like the example tasksets so uv sync installs it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (v1 docs #1662 + Trace.info #1664)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): encode router-replay routed_experts into transport (#2808)

* feat(v1): encode router-replay routed_experts into transport
Pack Branch.routed_experts ([tokens, layers, top_k]) into the transport RoutedExperts the trainer replays, realigning the token axis to len(token_ids) as a backstop. Bumps the verifiers pin to the companion router-replay commit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (raw-bytes wire) + exclude routed_experts from disk dumps
Bumps deps/verifiers to the raw-bytes router-replay codec (RoutedExperts rename + msgpack bin wire). routed_experts now rides the env-server wire as raw bytes, so it can't round-trip the json disk dump - add it to ROLLOUT_DUMP_EXCLUDE alongside multi_modal_data (both are training inputs, not part of the rollout record).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): guard router-replay vs prefix caching + bump verifiers/renderers pins
- Add disable_prefix_caching_for_router_replay validator: prefix-cache hits skip recomputing the cached prefix, so the engine returns no routed-expert decisions for those tokens. Router replay needs routing for every token, so force enable_prefix_caching=False (mirrors the existing kv_cache_offload guard).
- Bump deps/verifiers: tail-pad routed_experts so the final node aligns (the engine omits the last position's routing).
- Bump deps/renderers: _get_offset_tokenizer immune to the global fastokens patch race.
Together these make v1 MoE router replay work end-to-end (Qwen3-30B-A3B): verified mismatch_kl drops vs no-replay (0.0005/0.0002/0.0002 vs 0.0015/0.0015/0.0005 @ steps 0/1/2).
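Excluding training-only inputs at the dump boundary, as in the `multi_modal_data` and `routed_experts` entries above, might look roughly like the following on an already-serialized record. The constant name `ROLLOUT_DUMP_EXCLUDE` comes from the commit message; its shape as a set of key names, the helper name, and the recursive-dict approach are assumptions for illustration (the actual code presumably relies on pydantic's dump-time exclusion).

```python
import json
from typing import Any

# Assumed shape: a set of field names that must not reach the on-disk record.
ROLLOUT_DUMP_EXCLUDE = frozenset({"multi_modal_data", "routed_experts"})


def strip_training_inputs(obj: Any) -> Any:
    """Recursively drop excluded keys from a JSON-like structure."""
    if isinstance(obj, dict):
        return {
            key: strip_training_inputs(value)
            for key, value in obj.items()
            if key not in ROLLOUT_DUMP_EXCLUDE
        }
    if isinstance(obj, list):
        return [strip_training_inputs(item) for item in obj]
    return obj


def dump_rollout_line(record: dict) -> str:
    """Serialize one rollout record as a JSONL line without training tensors."""
    return json.dumps(strip_training_inputs(record))
```

The training path still sees the full in-memory object; only the serialized copy is trimmed.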
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump deps/renderers pin (ruff format fix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump deps/verifiers pin (ruff format)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump deps/verifiers pin to merged feat/nano-as-v1 (router-replay #1672)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Remove disable_prefix_caching_for_router_replay validator
Drops the guard that force-disabled inference.enable_prefix_caching under router replay. Validating whether prefix caching is actually compatible with router replay (cache hits may carry no routed-expert decisions); re-add if the A/B shows it drops routing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(orchestrator): keep top-level advantage wiring after merge
The merge auto-resolved train_sink.py and the orchestrator's TrainSink construction to main's per-env advantage design (#2721), but envs.py and the configs were kept on the branch's top-level design - so TrainSink read `train_envs.get(env_name).advantage_fn`, which TrainEnv no longer defines (AttributeError at rollout time). Restore the branch's top-level advantage: TrainSink takes `advantage_config` and builds its own `self.advantage_fn` once. Verified end-to-end with a reverse-text-v1 RL smoke (reward climbs 0.16 -> 0.40 over 6 steps).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to merged main (1b7736a8)
deps/verifiers feat/nano-as-v1 now includes origin/main merged in (composable tasksets + sandbox/save utils). Pin moves 7a98b566 -> 1b7736a8.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump pydantic-config submodule to main (single-dash short flags)
deps/pydantic-config 896ade4 -> 99f47c6 (origin/main).
Picks up #16 (single-dash short flags for single-character aliases, e.g. `eval -n -m`) and upstreams the dict[str,str] preservation fix (#14). No uv.lock change (editable path source).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (5fc0b295)
Picks up the py3.11 TypedDict CI fix and #1676 (prime-sandbox programs run for the sandbox lifetime instead of the 15-min background-job default, + retry-log wording). Pin moves 1b7736a8 -> 5fc0b295.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to 3af08c9b
Includes verifiers#1677 - scaleswe-v1 resolves task images via Prime's Artifact Registry instead of the incomplete public Docker Hub mirror.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(orchestrator): trim glibc heap back to the OS each step (#2821)
Factor the existing end-of-run malloc_trim into a `trim_process_memory()` helper and also call it at the end of `finalize_train_batch`. Each step frees a step's worth of rollouts / traces / transport buffers, but glibc keeps those pages in per-arena free lists, so orchestrator RSS climbs over a long run without ever shrinking. Trimming per step returns them to the OS. Mirrors the per-step trim in prime-rl #2807 - the trim only; that PR's `.raw`-based retention helpers don't apply to this branch's typed Rollout/Trace.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to 7862985b
Includes verifiers#1678 - scaleswe-v1 Prime Artifact Registry is now opt-in (`use_prime_registry`, default off), so the default Docker Hub image path works on local docker runtimes again.
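A helper along the lines of the `trim_process_memory()` named in the heap-trim entry above could be written with `ctypes`. The body below is a guess at one reasonable implementation, not the prime-rl source: it calls glibc's `malloc_trim(0)` when available and quietly does nothing on platforms whose C library lacks it.

```python
import ctypes
import ctypes.util


def trim_process_memory() -> bool:
    """Ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim was called, False if it is unavailable
    (for example on macOS or other non-glibc platforms).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # glibc-specific symbol
    except (OSError, AttributeError):
        return False

    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    malloc_trim(0)  # pad=0: release as much free memory as possible
    return True
```

Calling it once per training step, after the step's rollouts are dropped, is the pattern the entry describes; the call is cheap relative to a step, so the cost of the extra trim should be negligible in this setting.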
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: route consistent_hash on X-Session-ID for cross-turn prefix reuse (#2822)
The vllm-router `consistent_hash` policy keys on request-id headers (default x-request-id / x-correlation-id / x-trace-id / request-id). The v1 inference clients send none of those per rollout, so a multi-turn rollout's turns hashed to random DP shards and re-prefilled the growing prompt cold each turn (~40% prefix cache hit rate on an agentic SWE run, where it should approach the eviction-limited ceiling). Launch the router with `--request-id-headers x-session-id` so it keys on the per-rollout `X-Session-ID` header the v1 clients now emit (companion verifiers change). Every turn of a rollout pins to one engine, keeping its cross-turn prefix warm. Harmless for the random / round_robin policies, which ignore request-id headers.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to 211faecb (session-affinity routing)
Picks up verifiers #1680 (session-affinity routing header for cross-turn prefix reuse) - the counterpart to prime-rl #2822. Pin 7862985b -> 211faecb.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to 3f45f73b (scaleswe-v1 image filter)
Picks up verifiers #1679 (filter unavailable scaleswe-v1 images at load). Pin 211faecb -> 3f45f73b.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to 44da378c (r2e-gym-v1 registry opt-in)
Picks up verifiers #1681 (r2e-gym-v1 Prime Artifact Registry opt-in, default off). Pin 3f45f73b -> 44da378c.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin to a723977c (scaleswe-v1 image_url filter)
Picks up verifiers #1683 (filter on Docker Hub image_url, not the resolved image). Pin 44da378c -> a723977c.
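Why keying on `X-Session-ID` pins every turn of a rollout to one engine (the #2822 entry above) can be shown with a toy router. This is deliberately simplified: it uses plain modulo hashing rather than a consistent-hash ring, it does not model vllm-router internals, and the function name and `key_header` parameter are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def pick_engine(
    headers: Dict[str, str],
    num_engines: int,
    key_header: str = "x-session-id",
) -> Optional[int]:
    """Map a request to an engine index from a session header.

    Returns None when the header is absent, standing in for "no affinity,
    the router falls back to some other placement".
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    session = lowered.get(key_header)
    if session is None:
        return None
    digest = hashlib.sha256(session.encode("utf-8")).digest()
    # Same session id -> same digest -> same engine on every turn.
    return int.from_bytes(digest[:8], "big") % num_engines
```

Because the index depends only on the session id, turn 1 and turn 20 of the same rollout reach the same engine, which is what keeps the growing prompt's prefix cache warm.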
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers pin to 0c8d0aa1 (rootless harness install) Picks up verifiers #1685 (harnesses install/run without root, pinned /tmp/vf-{harness}) and #1684 (remove Claude Code harness). Pin a723977c -> 0c8d0aa1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers pin to d4b054e3 (skip region-limited e2e test) Picks up verifiers #1686 (skip test_multi_turn on prime — colocated user-sim port exposure is region-limited). Pin 0c8d0aa1 -> d4b054e3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: point verifiers env paths at the flattened environments/ dir (#2824) * chore: point verifiers env paths at the flattened environments/ dir Companion to verifiers #1695 (flatten examples/ into a single environments/). Update the deps/verifiers editable source paths in pyproject.toml + uv.lock: examples/tasksets/<x> -> environments/<x> and examples/harnesses/compact -> environments/compact. Bump the verifiers pin to the flatten branch (7ddc78b2) so the relocated paths resolve. Re-pin to the merged commit once verifiers #1695 lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(deps): bump verifiers to 22b02e4b (flatten examples/ into environments/, #1695 merged) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): advance verifiers pin to feat/nano-as-v1 tip (Trace.state + DNS fix) #2824 retargeted the env paths but pinned deps/verifiers at 22b02e4b (#1695, just after the examples/->environments/ flatten), which predates the later merges. Advance the submodule to the actual feat/nano-as-v1 tip bf30ae5f (verifiers 0.1.15.dev298): adds Trace.state (#1711), the unconditional MCP DNS-rebinding relax (#1715, fixes the in-prime tool 421 CI failures), and the init scaffold (#1713). 
bf30ae5f also dropped verifiers' self-referential `packages` extra, so drop `[packages]` from the verifiers override-dependency (and refresh the stale sources comment). Smoke: `eval reverse-text-v1 -n 1` (subprocess, temp=0) runs clean — no errors, not truncated, reward ~1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers pin to feat/nano-as-v1 tip (#1717 + #1719) Advance deps/verifiers bf30ae5f -> f7fa7482: - #1717 train-client prefix bridging + token-based message-graph branching - #1719 default the harness runtime to subprocess No prime-rl-side changes needed: every prime-rl v1 config already sets harness.runtime.type explicitly (#1719's default flip is a no-op here), and the consumed verifiers surface (serve EnvClient/serve_env/pool_serve_kwargs, clients.config Eval/TrainClientConfig, task.TaskT, the Trace branch/token API) is untouched by #1717. Imports resolve; reverse-text-v1 eval smoke clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers pin to feat/nano-as-v1 tip (f7fa7482 -> bbfd5646) New since the last pin: #1718/#1721 (CI + tests), #1723 (per-harness tool disabling), #1666 (client consolidation + chat-completion header forwarding), #1722 (warn on non-default subprocess harnesses). No prime-rl-side changes needed: clients.config (the consumed surface) is untouched by #1666's client refactor, #1722 only warns and exempts the default harness (what prime-rl's subprocess configs use), and the rest is additive/CI-only. Imports resolve; reverse-text-v1 eval smoke clean (reward 1.0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: verifiers #1732 companion (model_dump traces, Task.prompt rename) (#2831) * chore(v1): add prime CLI as deps/prime submodule (feat/nano-as-v1) Tracks the companion PrimeIntellect-ai/prime#751 branch that adds verifiers v1 eval support to the prime CLI (run/view consume the v1 entrypoint + Trace format). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): bump verifiers pin to feat/nano-as-v1 tip (bbfd5646 -> 220f21d4) New: #1727 (per-rollout isolation for shared writable tool servers) and #1702 (trim verifiers runtime deps — modal/notebook/quest/pdf moved to extras). No prime-rl-side changes needed: the only dropped transitive dep is pymupdf, used solely by verifiers' experimental quest PDF tool via a lazy import behind the quest extra (prime-rl never touches it). Imports resolve on dev307; reverse-text-v1 eval smoke clean (reward 1.0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: validate env-server traces via model_dump (companion to verifiers to_wire removal) verifiers drops Trace.to_wire/from_wire and the derived computed fields (reward, is_truncated, error, duration are plain properties now). Swap wire.to_wire() -> wire.model_dump() when re-typing a returned Trace into ROLLOUT_TYPE; the .reward / .is_truncated the metrics/eval code reads are the Trace properties, so they still work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: rename vf.Task(instruction=) -> prompt= (verifiers #1732 companion) verifiers #1732 renames Task.instruction -> Task.prompt; update the dispatcher's error-rollout Task construction to match. Pin bump to the merged commit comes when #1732 lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(…
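
For readers unfamiliar with the per-step trim described in the #2821 commit above, here is a minimal sketch of what a `trim_process_memory()` helper might look like. It is an illustration under assumptions, not the code from that PR: it calls glibc's `malloc_trim(0)` through `ctypes` and degrades to a no-op where the symbol is unavailable (musl, macOS). `finalize_step` is a hypothetical stand-in for `finalize_train_batch`.

```python
import ctypes
import ctypes.util


def trim_process_memory() -> bool:
    """Ask the C allocator to hand free heap pages back to the OS.

    Returns True if malloc_trim reported that memory was released,
    False otherwise (including when malloc_trim is not available).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # glibc-specific symbol
    except (OSError, AttributeError):
        return False
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # pad=0: keep no extra slack at the top of the heap
    return bool(malloc_trim(0))


def finalize_step(step_buffers: list) -> None:
    """Hypothetical end-of-step hook: drop references, then trim."""
    step_buffers.clear()   # release this step's rollouts / traces
    trim_process_memory()  # return the freed pages to the OS
```

Calling the trim once per step, rather than once at the end of the run, is what keeps RSS from ratcheting upward: pages freed by Python stay in glibc's per-arena free lists until something asks for them to be released.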

feat: support verifiers v1 eval in prime CLI#751

mikasenghaas wants to merge 1 commit into main from feat/nano-as-v1

mikasenghaas commented Jun 17, 2026
