feat: support verifiers v1 eval in prime CLI#751
Draft
mikasenghaas wants to merge 1 commit into
Companion to verifiers feat/nano-as-v1. Migrates the local eval lifecycle to the new config-driven v1 `eval` entrypoint and Trace output format.
- Pin verifiers to the feat/nano-as-v1 line (rev bbfd564); bump requires-python to >=3.11,<3.14 (v1 verifiers drops 3.10).
- `prime eval run` invokes the v1 `eval` console script. v0 hub/local envs run through the bridge's legacy `--id` path (-> v1 Trace output); convenience flags (-m/-b/-k/--sampling-args/...) become a temp v1 config TOML, with remaining flags forwarded verbatim. Auto-upload is disabled (results stay local).
- `prime eval view`/`tui` consume the v1 Trace `results.jsonl`: new utils/v1_results.py adapts a Trace to the v0 record shape the viewer renders, synthesizes run metadata from config.toml, and data.py discovers v1 run dirs. v0 discovery/rendering unchanged.
- `prime eval push` errors informatively on a v1 run dir (platform isn't v1-aware yet); v0 push is unchanged.
- `--hosted` raises NotImplementedError (the HostedEvalConfig is still built from the parsed run args so the hosted machinery stays wired).
- `prime lab setup` also adds tasksets + harnesses (the built-in v1 plugins that `prime eval run` resolves, e.g. the default harness).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
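For illustration, turning convenience flags into a temporary config file could look roughly like the sketch below. It is not the PR's code: the helper names, the key names (`model`, `num_examples`, `temperature`) and the `[sampling]` table are hypothetical, since the v1 config schema is not shown here.

```python
import json
import tempfile
from pathlib import Path


def _toml_value(value):
    """Render a simple scalar as a TOML literal (bool, int, float, str only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # JSON-style escaping is adequate for TOML basic strings in simple cases.
        return json.dumps(value, ensure_ascii=False)
    raise TypeError(f"unsupported TOML value: {value!r}")


def write_temp_config(top_level, sampling=None):
    """Write parsed flag values to a temp .toml file and return its path.

    Assumption: a flat top-level table plus an optional [sampling] table.
    The real v1 schema may differ.
    """
    lines = [f"{key} = {_toml_value(val)}" for key, val in top_level.items()]
    if sampling:
        lines += ["", "[sampling]"]
        lines += [f"{key} = {_toml_value(val)}" for key, val in sampling.items()]
    with tempfile.NamedTemporaryFile("w", suffix=".toml", delete=False) as handle:
        handle.write("\n".join(lines) + "\n")
    return Path(handle.name)
```

The path returned here would then be handed to the `eval` script alongside the flags that are forwarded verbatim; how that hand-off is spelled on the command line is not stated in the description.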
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request Jun 17, 2026
…me) (#2831)
* chore(v1): add prime CLI as deps/prime submodule (feat/nano-as-v1)
Tracks the companion PrimeIntellect-ai/prime#751 branch that adds verifiers v1 eval support to the prime CLI (run/view consume the v1 entrypoint + Trace format).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (bbfd5646 -> 220f21d4)
New: #1727 (per-rollout isolation for shared writable tool servers) and #1702 (trim verifiers runtime deps: modal/notebook/quest/pdf moved to extras). No prime-rl-side changes needed: the only dropped transitive dep is pymupdf, used solely by verifiers' experimental quest PDF tool via a lazy import behind the quest extra (prime-rl never touches it). Imports resolve on dev307; reverse-text-v1 eval smoke clean (reward 1.0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: validate env-server traces via model_dump (companion to verifiers to_wire removal)
verifiers drops Trace.to_wire/from_wire and the derived computed fields (reward, is_truncated, error, duration are plain properties now). Swap wire.to_wire() -> wire.model_dump() when re-typing a returned Trace into ROLLOUT_TYPE; the .reward / .is_truncated the metrics/eval code reads are the Trace properties, so they still work.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: rename vf.Task(instruction=) -> prompt= (verifiers #1732 companion)
verifiers #1732 renames Task.instruction -> Task.prompt; update the dispatcher's error-rollout Task construction to match. Pin bump to the merged commit comes when #1732 lands.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to chore/v1-general-cleanup tip (220f21d4 -> d78d7474)
Pin #1732 (general v1 cleanup) so this companion's model_dump trace handling and the Task.prompt rename round-trip against a verifiers that actually has them. Re-pin to the feat/nano-as-v1 tip once #1732 merges.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (d78d7474 -> caaf0ff3)
#1732 (general v1 cleanup) merged into feat/nano-as-v1; re-pin off the chore branch onto the integration tip. This companion's model_dump trace handling + Task.prompt rename now run against merged verifiers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: error-rollout placeholder task uses prompt=None
An error rollout has no prompt; None is the honest value (Task.prompt is str | Messages | None) rather than an empty string.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request Jun 24, 2026
* feat: add vf-nano as submodule
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/vf-nano to feat/env-server (EnvServer)
Points the submodule at the vf-nano EnvServer branch so the orchestrator can
build on the env-server abstraction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: run orchestrator on a vf-nano env server (reverse-text)
Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano
EnvServer per env (it never loads an environment), dispatches rollouts by task
index, and trains on the returned Trace dicts (branches + renderer tokens).
- pyproject: dep verifiers -> vf-nano; drop v1/research env packages; only the
vf-nano reverse-text example; override out the transitive v1 verifiers (pulled
by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add orjson
/pandas/msgspec (were transitive via verifiers).
- EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns).
- envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring,
dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict.
- trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output.
- train_source: index sampling; client pool builds vf-nano ClientConfig; lag
monitor vendored; env-server entrypoint repointed; ~14 files retyped off
vf.RolloutOutput / vf.ClientConfig.
- configs/debug/vf_nano_reverse_text.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: consume vf-nano Trace natively (branches→samples, shared renderer config)
- trace_to_samples stitches each Trace branch's tokens into one TrainingSample
(prompt = branch start, then each turn's new context [masked] + generated
tokens [trained]); drop the RolloutOutput adapter — read the Trace's native
fields directly (reward, error{type,message}, timing generation/scoring,
num_turns, branches).
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics /
orchestrator read native Trace fields (no token_usage/completion/timing.total).
- client pool forwards the shared renderers.RendererConfig to the env server's
renderer client (so it uses qwen3, not the tool-less default fallback).
- debug config: tool_call_parser=hermes (vLLM accepts the agent's tools),
max_steps=20.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: pass typed ClientConfig/SamplingConfig to the env client (no timeout)
- Env.run_rollout/run_group pass the vf-nano ClientConfig object and a
SamplingConfig (built from the env's sampling args) directly — no model_dump,
no per-rollout timeout forwarded to the server.
- debug config: max_steps=20.
- bump deps/vf-nano (typed env-server RPC).
* refactor: orchestrator holds a typed vf.Trace[EnvTask] (no dicts)
The env server returns a Trace minus its derived fields; the orchestrator resolves
the env's Task subclass (from config.id) and validates the wire dict into a strict
Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace —
typed task fields included (e.g. task.answer), nothing subscriptable.
- envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask].
- trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils
/orchestrator: attribute access on the typed Trace (reward, error{type,message},
branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the
consumer.
- Task/Trace/TimeSpan stay strict (StrictBaseModel) — no extra=ignore anywhere.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: depend on vf-nano[serve]; bump submodule
The orchestrator spawns the env server, so request the serve extra
(zmq/msgpack) explicitly now that vf-nano keeps them out of core.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump vf-nano (client docstring cleanup)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop redundant forward-ref quotes in advantage.py
`from __future__ import annotations` already defers all annotations to strings,
so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout`
annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: rename FinishedRollout.raw -> trace
The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which
suggested an unparsed dict). Renames the field + every `.raw` access, the
`emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the
dispatcher cancel-path locals.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: simplify FinishedRollout, read straight off the typed Trace
- Drop the FinishedRollout proxy properties (error/reward/is_truncated and the
example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...}
directly. The trace is the single source of truth.
- Use vf.Trace.has_error for existence checks instead of `.error is not None`.
- Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len,
total_tokens,has_response} (now on the trace); keep trace_to_samples.
- Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx),
source dict key) instead of the example/example_id dict carrier; identity comes
off trace.task.idx.
- Mark the local-package env arrangement as a temporary/experimental TODO.
- Move the debug config to configs/debug/nano/reverse_text.toml.
- Bump deps/vf-nano (Trace/Turn accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: spawn env server on an OS-assigned port, drop the startup poll
- The env server binds tcp://127.0.0.1:0 and reports its concrete address back
over a queue; the orchestrator connects to that. Removes _get_free_port and its
TOCTOU race (the OS assigns the port atomically).
- A spawned server has already bound + loaded by the time it reports its address,
so the untimed info() is enough — only poll wait_for_server_startup for an
external (config.address) server, which has no spawn handshake.
- Bump deps/vf-nano (port report + Trace/Branch token-length accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
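The bind-then-report handshake above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the prime-rl code: it uses a plain TCP socket and a thread where the commit uses a vf-nano env server in a child process, and `serve` / `connect_to_spawned_server` are hypothetical names.

```python
import queue
import socket
import threading


def serve(address_queue: queue.Queue) -> None:
    """Bind to an OS-assigned port, report the address, serve one client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: the kernel picks a free port
        server.listen(1)
        # Report only after bind + listen, so the address is usable on receipt.
        address_queue.put(server.getsockname())
        conn, _ = server.accept()
        with conn:
            conn.sendall(b"ready")


def connect_to_spawned_server() -> bytes:
    """Start the server, wait for its reported address, then connect to it."""
    address_queue: queue.Queue = queue.Queue()
    worker = threading.Thread(target=serve, args=(address_queue,), daemon=True)
    worker.start()
    host, port = address_queue.get(timeout=5)  # the handshake replaces polling
    with socket.create_connection((host, port), timeout=5) as client:
        data = client.recv(16)
    worker.join(timeout=5)
    return data
```

Because the port is chosen at bind time, there is no window between "find a free port" and "bind it" for another process to take it, which is the race the commit removes.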
* refactor: use vf.task_type instead of a local resolve_task_type
The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the
prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump
deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: restore backfill_rollout_tokens for SFT (typed Trace)
SFT trains on a teacher served over the chat client, which returns no token ids,
so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore
backfill: for each tokenless turn, render its prompt + assistant response with the
student chat template and split on the longest common prefix to fill TurnTokens
(masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills
when any turn lacks tokens, before building samples.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
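The longest-common-prefix split can be sketched as below. It assumes the two inputs are token-id lists from rendering the prompt alone and the prompt plus the assistant response; `split_on_common_prefix` is a hypothetical helper, not the name used in prime-rl.

```python
def split_on_common_prefix(prompt_ids, full_ids):
    """Split full_ids into (shared prefix, remainder) relative to prompt_ids.

    Assumption: prompt_ids is the rendered prompt and full_ids the rendered
    prompt + response. The remainder is treated as the response tokens.
    """
    shared = 0
    for prompt_token, full_token in zip(prompt_ids, full_ids):
        if prompt_token != full_token:
            break
        shared += 1
    return full_ids[:shared], full_ids[shared:]
```

Splitting at the shared prefix rather than at `len(prompt_ids)` keeps the split correct when the template renders the tail of the prompt differently once a response follows it (for example a generation marker that only appears in the prompt-only rendering).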
* fix: pass task_idx when building Cancelled traces on off-policy drop
drop_group's error_rollout_output calls omitted the required task_idx, so an
off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx
(or -1 when the group is already gone), mirroring handle_completed_rollout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consume typed Trace[WireTask]; inline synthetic error traces
- envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task
subclass via self.trace_type.model_validate(wire.to_wire()).
- dispatcher.py: drop the error_rollout_output helper — inline the synthetic error
Trace at each call site using vf.Error's field names (type/message/traceback); the
task-exception path carries a real traceback, cancels/empty-trajectory carry none.
- Bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: env-server file logging; align debug config batch/group to canonical
- Spawned env servers now route their output (logging + subprocess-runtime output)
to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects
stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned
server logged nowhere.
- Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128
(interval=1), matching configs/debug/training_modes/rl.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: don't double the envs/ segment in the env-server log path
The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train,
.../logs/envs/eval), so _spawn must drop the file directly under it
(<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the
train/eval split under logs/envs/<kind>/envs/<name>.log.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
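A rough sketch of the resulting log layout, assuming the caller passes a `log_dir` that is already split by kind (for example `.../logs/envs/train`). It redirects at process launch via `subprocess` rather than inside a `_run_env_server` wrapper as the commits describe, and both function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def env_log_path(log_dir: Path, name: str) -> Path:
    # log_dir already ends in .../logs/envs/<kind>, so the file goes directly
    # under it; no extra "envs" segment is added here.
    return Path(log_dir) / f"{name}.log"


def run_logged(cmd: list[str], log_dir: Path, name: str) -> int:
    """Run cmd with stdout and stderr appended to <log_dir>/<name>.log."""
    log_path = env_log_path(log_dir, name)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "ab") as log_file:
        # The child (and anything it spawns) inherits the redirected streams.
        completed = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode
```

Redirecting at the file-descriptor level is what lets output from nested subprocesses land in the same file, which a Python-level `sys.stdout` swap alone would not capture.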
* chore: bump vf-nano (Error.traceback str | None)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump vf-nano (to_wire ordering)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: launch env servers as separate processes from the rl entrypoint
Instead of the orchestrator sidecar-spawning each env server as an mp child, the
rl launcher now spawns one `env-server` process per env (train + eval), each on a
free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same
model as inference/trainer. It sets env.address in the orchestrator config so the
orchestrator attaches (its existing external path) instead of spawning. Envs that
already set address (user-managed external server) are left alone; the orchestrator's
mp sidecar stays as the fallback for running `orchestrator` directly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: env servers use fixed configurable ports, not get_free_port
Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds
base_port + i. Drops the get_free_port dependency in the launcher.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: separate train/eval env-server port blocks
Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i
(stride 1000), so each kind has headroom for many envs without the blocks colliding
(was a single running index — train and eval sat adjacent).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
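The port arithmetic from the two commits above, as a small sketch. The default base port (5000) and the stride (1000) come from the commit messages; the function name and the bounds check are assumptions added for illustration.

```python
ENV_SERVER_KIND_STRIDE = 1000  # stride between the train and eval port blocks


def env_server_port(kind: str, index: int, base_port: int = 5000) -> int:
    """Port for the index-th launcher-managed env of the given kind."""
    if kind not in ("train", "eval"):
        raise ValueError(f"unknown env kind: {kind!r}")
    # Assumption: reject indices that would spill into the next block.
    if not 0 <= index < ENV_SERVER_KIND_STRIDE:
        raise ValueError(f"index {index} does not fit in a block of {ENV_SERVER_KIND_STRIDE}")
    offset = ENV_SERVER_KIND_STRIDE if kind == "eval" else 0
    return base_port + offset + index
```

With the defaults, train envs occupy 5000, 5001, ... and eval envs 6000, 6001, ..., so adding a train env no longer shifts the eval ports.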
* fix: env-server logs + sidecar queue cleanup; train-only debug config
- env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs
(EnvServer up, request failures) land in logs/envs/<kind>/<name>.log — previously
only loguru output was captured, swallowing them.
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker
leaked-semaphore warning on the sidecar path).
- configs/debug/nano/reverse_text.toml: drop the eval block, mirroring
examples/reverse_text/rl.toml (train-only smoke; eval path validated separately).
- bump deps/vf-nano (serve/types docstring trim).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump vf-nano (BaseRequest marker, no request_type field)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: env client uses client= (was client_config=); bump vf-nano
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump vf-nano (drop renderers dep comment)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump vf-nano (configs/ + cli/ split, serve/ runtime-only)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: drop FinishedRollout.to_dict; serialize the Trace to disk directly
The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed
vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata
merge — the on-disk rollout is just the trace.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: track vf-nano agent->harness rename
vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the
integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig);
env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint
passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump
deps/vf-nano to the renamed branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: track vf-nano plugin reorg (reverse-text path + bump)
vf-nano reorganized examples into examples/{tasksets,harnesses}/; point the
reverse-text editable source at examples/tasksets/reverse_text and bump deps/vf-nano.
No prime-rl code change — EnvConfig.harness (default) resolves via vf-nano's built-in
harness registry. Verified: 3-step reverse-text smoke trains (0.26 -> 0.42, 0% error).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: reuse vf.EnvConfig in the orchestrator (typed taskset/harness, drop args)
prime-rl's EnvConfig now subclasses vf.EnvConfig and resolves taskset + harness to their
specific config types by id (taskset_config_type / harness_config_type), so env-specific
fields are validated against the real config — the untyped `args` dict and the top-level
`id` are gone (id/stripped_id/resolved_name are now properties off taskset.id). Timeouts
come from vf.TimeoutConfig (timeout.rollout / timeout.scoring), superseding prime-rl's flat
timeout. The env server is spawned with the typed taskset_config (no env_id/args).
- pyproject: install the vf-nano plugin packages (default/rlm harnesses, gsm8k taskset)
as path sources; bump deps/vf-nano to the plugin-packages branch.
- configs/debug/nano/reverse_text.toml: taskset = { id = ... }, harness.id (was harness.type).
Verified: custom taskset (gsm8k.split) + harness (rlm.ref) resolve to typed configs; bad
fields are rejected; the migrated TOML loads through RLConfig.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: inherit vf's shared plugin resolution; trim dead EnvConfig fields
EnvConfig drops its own _resolve_plugins (now inherited from vf.EnvConfig's shared
validator) and the dead v1-forwarding fields: extra_env_kwargs, max_total_completion_tokens,
state_columns (no readers on the vf-nano branch). Also drop stripped_id (no hub installs ->
no @version, so id == it) — callers use .id. Bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: require a configured taskset per env (no reverse-text default)
EnvConfig no longer auto-defaults its taskset to reverse-text; an env with no taskset id
errors at validation. Env-list defaults are empty (eval's non-empty check still fires only
when an eval block is configured). Bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/vf-nano (dashboard taskset.id/harness.id fix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/vf-nano (sampling max_tokens fix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(env-server): require a configured env; build EnvServer from the EnvConfig
EnvServerConfig.env was a bare EnvConfig() default, which now raises (no taskset) and
crashed the env-server at import. Make env required — the orchestrator always passes a
real env config. Build the server straight from config.env (vf-nano EnvServer takes the
EnvConfig). Bump deps/vf-nano (is_truncated computed field).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(nano): reverse-text uses default harness with enable_bash=false
The model answers directly (no bash tool), so the tool_call_parser is no longer needed.
Bump deps/vf-nano (enable_bash flag).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(nano): hendrycks-sanity config on the math-env taskset
Add configs/debug/nano/hendrycks_sanity.toml (vf-nano analog of configs/debug/hendrycks_sanity):
math-env train env on the sanity dataset (R1-distill, default renderer with think parser,
single-turn no-bash), aime24 eval env. Register math-env + aime24 in the envs group + sources.
Bump deps/vf-nano (math/aime tasksets, enable_bash, SerializeAsAny + is_truncated fixes).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/vf-nano to merged main (math/aime tasksets, fixes)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop deps/verifiers, bump deps/vf-nano to merged main
vf-nano main (#8) adds the subprocess-runtime env passthrough (inherit host env
minus API_KEY, so UV_CACHE_DIR reaches rollout workers and lands the uv cache on
local disk) and the math-verify scoring fix (wrap gold + prediction in \boxed,
matching v1 math-env — fixes matrix/vector answers scoring 0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(nano): hendrycks-sanity mirrors examples/hendrycks_sanity + slurm overlay
nano config is now a verbatim copy of examples/hendrycks_sanity/rl.toml with only
the env sections swapped to vf-nano taskset/harness syntax (math-env taskset,
default harness, subprocess runtime; aime24 eval). Adds a slurm overlay setting
the output dir, wandb run name, and partition.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(nano): fold wandb + slurm into hendrycks_sanity config
Move the [wandb] and [slurm] sections from the slurm overlay into the base config, so it
submits to slurm by default (run locally with --no-slurm). The overlay now only redirects
output_dir. wandb run name is vf-nano-subprocess.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(nano): fold output_dir into hendrycks_sanity config, drop slurm overlay
Everything (output_dir, wandb, slurm) now lives in the one config; remove the now-empty
slurm overlay. Run locally with --no-slurm.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(vf-v1): run on the unified verifiers package (v0 envs + v1=nano + legacy bridge)
Depend on the single verifiers package (deps/verifiers, nesting vf-nano) instead of standalone
vf-nano. EnvConfig is dual-mode: v0 envs via id+args (legacy bridge), v1 envs via taskset/harness
(native nano). The env-server spawn (standalone + mp) picks LegacyEnvServer for v0 and EnvServer
for v1; both return vf.Trace so the orchestrator is unchanged. Plugin sources repointed under
deps/verifiers; reverse-text is the v0 env, reverse-text-v1 the nano taskset.
Verified: uv run rl @ examples/reverse_text/rl.toml (v0, unchanged) trains via the bridge
(reward 0.14->0.79); vf-eval + vf-eval-v1 reverse-text both reward 1.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v0): wire alphabet-sort (multi-turn) + bump verifiers (State scrub)
Add the v0 alphabet-sort env as an installable source; bump deps/verifiers to the State-scrub
commit. Verified the bridge on a multi-turn v0 env: alphabet-sort runs (Turns 2.0, 128/128
trainable, 0% error) alongside single-turn reverse-text.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v0): wire wordle + add configs/wordle/rl.toml (2-GPU)
Wire the v0 wordle TextArena env (multi-turn game) as an installable source; add a 2-GPU
(1 trainer + 1 inference) wordle config. Verified through the legacy bridge: wordle trains
(Turns ~5.3, reward ~0.81, 128/128 trainable, 0% error).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/verifiers (v1 hygiene: `__init__.py`/tests/docs)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): track de-vendored verifiers, rename nano -> v1
- Bump deps/verifiers to the de-vendored commit: vf-nano is now vendored as the
verifiers.v1 subpackage (no nested deps/vf-nano submodule).
- Repoint env-plugin sources from deps/verifiers/deps/vf-nano/... to
deps/verifiers/{examples,packages}/...
- Rename verifiers.nano -> verifiers.v1 across the orchestrator/utils/configs;
rename configs/debug/nano -> configs/debug/v1. "v1" is the name now (no "nano").
Verification: v0 reverse-text (legacy bridge) and v1 reverse-text-v1 (native)
both train over a 3-step smoke (reward present, 0% error).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (eval/serve, v1 deps in base, bundled plugins)
- Bump deps/verifiers: v1 runtime deps moved to base (no `v1` extra), v1 CLIs
renamed eval/serve, shipped plugins bundled in the tasksets/harnesses umbrella
packages, verifiers/v1/harness.py (flattened harnesses subpackage).
- Depend on `verifiers` (was verifiers[v1]) and the `harnesses` umbrella package
(was standalone default/rlm).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (Task.system_prompt); write full rollout jsonl
- Bump deps/verifiers to the system_prompt commit: Task.system_prompt + harness
APPENDS_SYSTEM_PROMPT support, reverse_text_v1 byte-identical to the v0 env (separate
system message), plus this branch's eval/serve, v1-deps-in-base, CI fixes, and the
retired semgrep policy.
- save_rollouts: drop exclude_keys={"trajectory"} for both train and eval rollouts so the
jsonl carries the full trajectory.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop semgrep from uv.lock (verifiers retired the policy group)
Follow-up to the deps/verifiers bump (00c7b77a removed the `policy` dependency
group); regenerates the lock to match.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: pin the env-server renderer tokenizer to the base model for LoRA runs
Wire `renderer_model_name` (the base model) into the renderer client config so
the env-server renderer builds its tokenizer from the base model instead of the
per-request LoRA adapter name, which has no published HF tokenizer and 404'd
once LoRA training set the served model name to the adapter. Mirrors the
`renderer_model_name` wiring already in `setup_clients` on main. Bumps
deps/verifiers to the matching fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(orchestrator): request vLLM token ids for MITO training (#2745)
On the MITO path (no renderer), set return_token_ids in the train env
sampling args so the openai_chat_completions client gets the prompt and
completion token ids back from vLLM and can carry them for training
instead of re-tokenizing the messages downstream. Scoped to
renderer is None so it never reaches the renderer /inference/v1/generate
endpoint (which forwards sampling params to vLLM verbatim).
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers submodule; restore v0 env catalog + v1 -v1 imports
- bump deps/verifiers to the feat/nano-as-v1 tip (token-parse #1585, bridge meta #1586,
taskset -v1 rename #1587, native retries #1588, runtime resource cleanup #1590)
- restore the full v0 env catalog in optional-dependencies.envs + [tool.uv.sources]
(init the research-environments submodule); the taskset -v1 rename freed the v0 names,
so v0 envs (legacy bridge) and v1 tasksets coexist
- point the v1 taskset sources at their -v1 names / _v1 paths (gsm8k-v1, math-env-v1,
aime24-v1; reverse-text-v1 already correct)
- update v1 configs to the -v1 taskset ids (hendrycks_sanity) and drop the superseded
standalone configs/debug/reverse_text_v1.toml
- relock
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip (textarena #1592)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip
Pulls in runtime resources named after the rollout id (#1596) and the
alphabet-sort-v1 taskset (#1595).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip
Brings in env-hub ids (#1597), v0 envs on the eval CLI (#1598), modal runtime (#1594).
* chore(v1): add v1 alphabet-sort debug config (port of examples/alphabet_sort)
* fix: forward extra_env_kwargs to v0 legacy envs; drop dead trajectory tests (#2749)
Restores main's escape hatch for v0 (legacy-bridge) envs: a legacy env's extra_env_kwargs
is auto-populated (timeout_seconds <- timeout.rollout, max_total_completion_tokens <-
max_output_tokens, max_seq_len <- seq_len) and forwarded to LegacyEnvServer at spawn, so v0
rollouts again honor the wall-clock timeout, multi-turn completion budget, and seq-len
truncation (all were silently dropped on this branch).
Also removes tests/unit/orchestrator/test_{trajectories,sft_trajectories}.py, which imported
the deleted interleave_rollout and broke test collection.
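The auto-population described above amounts to a small mapping. The three target keys and their sources come from the commit message; the function name, the skipping of unset values, and the rule that explicit user-supplied kwargs win are assumptions.

```python
def legacy_extra_env_kwargs(rollout_timeout, max_output_tokens, seq_len, user_kwargs=None):
    """Build the kwargs forwarded to a v0 (legacy-bridge) env at spawn."""
    derived = {
        "timeout_seconds": rollout_timeout,                # <- timeout.rollout
        "max_total_completion_tokens": max_output_tokens,  # <- max_output_tokens
        "max_seq_len": seq_len,                            # <- seq_len
    }
    # Assumption: unset (None) values are not forwarded at all.
    kwargs = {key: value for key, value in derived.items() if value is not None}
    # Assumption: values the user set explicitly override the derived ones.
    kwargs.update(user_kwargs or {})
    return kwargs
```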
* fix: install the bundled `tasksets` package (harbor-v1, textarena-v1)
The `envs` extra wired `harnesses` and the individual `*-v1` example tasksets but
never the bundled `tasksets` package, so the integration tasksets it ships
(`harbor-v1`, `textarena-v1`) couldn't be resolved — `import_taskset("harbor-v1")`
raised ModuleNotFoundError ("tried to import 'harbor_v1'").
Add `tasksets` to the `envs` extra + a path source, and bump the verifiers
submodule to the feat/nano-as-v1 tip (#1600), where the bundled tasksets live
under the `tasksets` namespace package (`tasksets.harbor_v1`) and the loader
resolves the namespaced module.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: consume the v1 message-graph trace (#2763)
* feat: consume the v1 message-graph trace (graph-walk trace_to_samples)
Walk the new message graph (verifiers feat/trace-message-graph, PR #1606): trace_to_samples
builds one TrainingSample per branch by concatenating each branch path's node token_ids /
sampled_mask / logprobs (graph.branch_token_sequences), splitting prompt|completion at the
first sampled token — identical training tensors to the old per-turn stitching, off a trace
that is now linear (not quadratic) in turns. backfill_rollout_tokens is a no-op (training is
renderer-only; `trajectory` is now a read-only view over the graph). Bumps the verifiers
submodule to the graph-trace branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
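The per-branch construction can be pictured as follows. This is a sketch, not the verifiers or prime-rl implementation: `Node` is a hypothetical stand-in for a message-graph node carrying the three per-token fields named in the commit, and the output dict keys are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one message-graph node on a branch path."""
    token_ids: list[int]
    sampled_mask: list[bool]  # True where the model sampled the token
    logprobs: list[float]


def branch_to_sample(path: list[Node]) -> dict:
    """Concatenate a branch path and split prompt|completion at the first sampled token."""
    token_ids = [tok for node in path for tok in node.token_ids]
    mask = [flag for node in path for flag in node.sampled_mask]
    logprobs = [lp for node in path for lp in node.logprobs]
    # If nothing was sampled, the whole branch counts as prompt.
    first = mask.index(True) if True in mask else len(mask)
    return {
        "prompt_ids": token_ids[:first],
        "completion_ids": token_ids[first:],
        "completion_mask": mask[first:],
        "completion_logprobs": logprobs[first:],
    }
```

Each node's tokens are stored once and shared by every branch that passes through it, which is why the trace grows linearly with the number of turns even though each sample still spans the whole branch.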
* chore: bump verifiers submodule (MessageNode.mask rename)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: wire alphabet-sort-v1 taskset
Add alphabet-sort-v1 to the `envs` extra + `[tool.uv.sources]` so
configs/debug/v1/alphabet_sort.toml resolves (it referenced an example taskset that was
never wired into prime-rl). Used to verify graph-based training-sample construction on real
RL runs — v0 (legacy bridge) and v1 (native renderer path) both train cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consume nodes/branches directly (drop Turn/trajectory readers)
`trace_to_samples` already walks the graph; the remaining readers move off the removed
Turn/trajectory API: the gibberish/repetition filters iterate per-node completions,
advantage/dispatcher use `trace.num_turns`/`trace.completion_len`, `get_model_completion_len`
is dropped (use `trace.completion_len`), and the renderer-only train_sink drops the backfill
path (also removing `backfill_rollout_tokens`). Bumps the verifiers submodule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (merge #1605 multiplex interception)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (dead-code cleanup)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (readme highlight + ruff format)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: enforce renderer, SFT backfill, branch-first-class logging
Training is renderer-only now. RL/OPD roll out through the renderer client
(exact sampled token ids + logprobs); SFT rolls out against a chat-completions
teacher that returns no tokens and re-renders the conversation to backfill them
(`backfill_trace`). A renderer is required for every mode (`renderer=None`
rejected) — the oai client never produces correct training tokens for the
message graph. Drops the MITO no-renderer training path.
Logging consumes `trace.branches` as the first-class unit (`branch.token_ids` /
`branch.messages`) instead of the removed `trajectory` field; `trace_to_samples`
builds one sample per branch from the same accessors. Sample loggers take the
rollout objects so env_name/advantage are available.
Add configs/v1/training_mode (rl/opd/sft + lora/external) mirroring the v0
debug configs. Fix the v0 SFT debug configs + rlm_swe to validate under the
renderer requirement.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: flat TrainingSample (token_ids + mask), required renderer
Drop the prompt/completion split from TrainingSample — it doesn't fit a
multi-turn/agentic branch, where context and model-sampled spans interleave. A
sample now carries the branch's flat `token_ids` plus per-token `mask` (True =
trainable), `logprobs`, and `temperatures` (all aligned). `prepare_sample`
passes them straight into the MicroBatch (already flat), and the packer
validates against `token_ids` length.
Make `orchestrator.renderer` a non-optional type (drop the `enforce_renderer`
validator) — training is renderer-only, so the type carries the requirement.
Bump the verifiers submodule to feat/nano-as-v1 (merged #1606 + Branch.branches
inlined).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
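A sketch of what a flat sample with aligned per-token lists could look like. `FlatSample` and `num_trainable` are hypothetical names for this illustration; the field names follow the commit message, but the real `TrainingSample` may differ in types and validation.

```python
from dataclasses import dataclass


@dataclass
class FlatSample:
    """Hypothetical flat sample: four per-token lists that must share one length."""
    token_ids: list
    mask: list          # True = trainable (model-sampled) token
    logprobs: list
    temperatures: list

    def __post_init__(self):
        n = len(self.token_ids)
        for name in ("mask", "logprobs", "temperatures"):
            got = len(getattr(self, name))
            if got != n:
                raise ValueError(f"{name} has length {got}, expected {n}")

    @property
    def num_trainable(self):
        return sum(self.mask)


# Context and sampled spans interleave freely; no prompt/completion boundary is needed.
sample = FlatSample(
    token_ids=[10, 11, 12, 13, 14],
    mask=[False, True, True, False, True],
    logprobs=[0.0, -0.2, -0.1, 0.0, -0.4],
    temperatures=[1.0, 1.0, 1.0, 1.0, 1.0],
)
```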
* refactor: SFT teacher rolls out through the renderer client (drop backfill)
Training is renderer-only across every mode, so the SFT teacher now rolls out
through the renderer client too — its rollouts carry tokens directly, the same
as RL/OPD. Drops the chat-completions backfill (`backfill_trace` + the SFT path
in TrainSink) and the now-unused TrainSink renderer.
This requires a self-hosted teacher that shares the student's tokenizer (the
student trains on exactly the ids the renderer feeds the teacher); distilling
from an external chat API is no longer supported. Remove the `sft_external`
debug configs.
Validated: SFT on reverse-text-v1 trains cleanly (Trainable 128/128, eval reward
~0.1 -> ~0.82 over 20 steps) with the teacher on the renderer client, no backfill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop configs/v1/training_mode README
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consolidate rollout types into Rollout(vf.Trace)
The trace *is* the rollout: replace the FinishedRollout/TrainRollout/EvalRollout
wrappers with a prime-rl Rollout(vf.Trace[TaskT]) subclass that carries the
orchestration metadata (kind, env_name, group_id, policy_version,
off_policy_steps, samples, advantage, is_filtered, filter_results, eval_step) as
exclude=True fields — so dumping a Rollout still yields a plain trace (on-disk
results.jsonl unchanged). envs.py validates the wire trace into Rollout; the
dispatcher stamps the metadata; train vs eval is the `kind` discriminator
(replacing the isinstance check). All consumers read rollout.X directly instead
of rollout.trace.X.
Drop the monitor's SampleRollout duck-type Protocol — the loggers take the real
Rollout (TYPE_CHECKING import) and read branch.token_ids / branch.messages. Also
drop the prime monitor's _split_branch_messages and _json helpers: the
conversation is the unit (a prompt/completion split is meaningless for multi-turn).
Fix a latent dispatcher bug surfaced along the way: synthetic error traces used
`error=` / `r.error = ` (a read-only computed field) — now `errors=[...]` /
`r.errors.append(...)`.
Rewrite the (long-stale, dict/`raw`-based) advantage + filters unit tests to
build real Rollouts — they now exercise the current trace-based code (previously
all failing on import/construction).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
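The "metadata rides on the rollout but never reaches disk" idea can be illustrated with a stdlib analogue. The commit uses `exclude=True` fields on the `vf.Trace` subclass; the sketch below swaps in plain dataclasses with an `exclude` metadata flag, and `TraceSketch`, `RolloutSketch`, and `dump` are hypothetical names.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TraceSketch:
    """Hypothetical stand-in for a trace: only these fields are persisted."""
    task_idx: int
    reward: float


@dataclass
class RolloutSketch(TraceSketch):
    """Trace plus orchestration metadata that must not appear in results.jsonl."""
    kind: str = field(default="train", metadata={"exclude": True})
    env_name: str = field(default="", metadata={"exclude": True})
    advantage: float = field(default=0.0, metadata={"exclude": True})


def dump(obj):
    """Serialize every dataclass field that is not marked as excluded."""
    return {
        f.name: getattr(obj, f.name)
        for f in fields(obj)
        if not f.metadata.get("exclude", False)
    }


rollout = RolloutSketch(task_idx=3, reward=1.0, kind="eval", env_name="reverse-text-v1")
```

Consumers read `rollout.kind` or `rollout.reward` directly, while `dump(rollout)` yields the same record as dumping the plain trace.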
* ci: allow verifiers + datasets in the slim-config dep check
The v1 config types (EnvConfig, Task, ...) extend `verifiers.v1`, which is a
declared, pure-pydantic dependency of prime-rl-configs (it pulls `datasets` for
the taskset/Task types but no GPU/ML deps). Drop `verifiers` and `datasets` from
the slim-install forbidden list — keep the real heavy training deps (torch,
vllm, transformers, ...).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip
Picks up the v1 end-to-end eval test suite (#1609) and the v0 legacy
env-server group-scoring fix (#1612).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers submodule to feat/nano-as-v1 tip
Picks up the v0 legacy-bridge fixes: guard against non-renderer training
clients (#1613) and serve the eval split for eval-only v0 envs (#1614).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): fix stale sft.toml teacher comments
The SFT teacher rolls out through the renderer client (token-in/out) and
must share the student's tokenizer; drop the leftover oai-client / token
backfill description removed in the renderer-only refactor.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (v0 eval chat-completions client)
Picks up verifiers#1615: the legacy bridge builds a chat-completions client
for v0 eval rollouts (renderer for training), instead of raising on the
non-renderer eval client.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): scaleswe configs + taskset registration (#2765)
- register the scaleswe-v1 taskset (pyproject envs list + uv source)
- point the existing rlm-swe config (configs/rlm_swe/qwen35_4b.toml) at the
scaleswe taskset (task_type="scaleswe", train + eval)
- add configs/debug/v1/scaleswe.toml — a per-env v1 port of that config using
the scaleswe-v1 taskset via the rlm harness on the prime runtime
Companion to verifiers feat/scaleswe-v1 (scaleswe-v1 taskset + setup/workdir hooks).
Needs the deps/verifiers submodule bumped to that branch once it lands.
* feat(v1): multimodal training through the message graph + color-codeword-v1
Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch,
`mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and
EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's
`mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the
mapping through. The wandb sample logger now renders the task as a Table-safe JSON
string with image data elided — an image-bearing instruction crashed wandb's
Table type inference on the nested content list.
Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1,
2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule
for the multimodal message-graph support.
Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78,
Trainable 256/256 — mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword`
eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Revert "feat(v1): multimodal training through the message graph + color-codeword-v1"
This reverts commit 85b27cfc0be12b84f9b36456675b10387d01dc8a.
* feat: multimodal training through the v1 message graph + color-codeword-v1 (#2766)
* feat(v1): multimodal training through the message graph + color-codeword-v1
Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch,
`mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and
EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's
`mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the
mapping through. The wandb sample logger now renders the task as a Table-safe JSON
string with image data elided — an image-bearing instruction crashed wandb's
Table type inference on the nested content list.
Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1,
2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule
for the multimodal message-graph support.
Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78,
Trainable 256/256 — mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword`
eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin — multimodal review-pass cleanups
Picks up the verifiers feat/v1-multimodal head: the multimodal review-pass
(capability-flag docstrings, trimmed mm comments, color-codeword-v1 config
validator + module constants) and the merged malloc_trim worker-RSS fix (#1621).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (content-part mm attribution) + config/test sync
- Bump deps/verifiers to the content-part multimodal attribution (drops the unused
placeholder offset machinery).
- Drop max_turns/seed from the color-codeword-v1 taskset args in the config — the
taskset hard-codes them as module constants now, and passing them is rejected.
- Update the mm egress unit test to assert mm_items order (the new attribution),
not placeholder offsets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(trainer): slice mm_kwargs on truncation so tokens match embeddings
When a sample exceeds seq_len, prepare_sample truncated input_ids and
mm_token_type_ids but passed mm_kwargs through whole — leaving more image
embeddings than surviving image placeholders. Now truncation cuts to a
whole-image boundary (never splitting an image's placeholder block) and slices
mm_kwargs (pixel_values + image_grid_thw) to the images that fully survive, so
image-placeholder count == image-embedding count.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
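A sketch of the whole-image truncation rule described above, on plain Python lists. The function names and the `tokens_per_image` input are assumptions for illustration; the sketch also assumes each image's placeholders form one contiguous block, and it leaves the actual slicing of `pixel_values` / `image_grid_thw` to the caller.

```python
def image_spans(is_image_token, tokens_per_image):
    """Return [start, end) token spans, one per image, in order of appearance."""
    spans = []
    pos = 0
    for count in tokens_per_image:
        # Skip ahead to the next placeholder token.
        while pos < len(is_image_token) and not is_image_token[pos]:
            pos += 1
        spans.append((pos, pos + count))
        pos += count
    return spans


def truncate_at_image_boundary(is_image_token, tokens_per_image, seq_len):
    """Return (cut, num_images_kept) so placeholder count matches embedding count."""
    cut = min(seq_len, len(is_image_token))
    spans = image_spans(is_image_token, tokens_per_image)
    for start, end in spans:
        if start < cut < end:
            cut = start  # never split an image's placeholder block
            break
    kept = sum(1 for _, end in spans if end <= cut)
    return cut, kept


# text, image(3), text, image(3), text
flags = [False, True, True, True, False, True, True, True, False]
cut, kept = truncate_at_image_boundary(flags, [3, 3], seq_len=6)
```

The caller would then keep `input_ids[:cut]` and only the first `kept` images' tensors.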
* chore(v1): bump verifiers pin to ruff-formatted graph.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): remove test_trajectories_mm.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: thread num_workers to the env-server worker pool (#2768)
* feat(v1): thread num_workers to the env-server worker pool
Wire the verifiers env-server worker pool into prime-rl: the orchestrator's
spawned env server (envs.py) and the `env-server` CLI now serve via
verifiers' serve_env with num_workers, so requests fan out across N worker
processes instead of one event loop. num_workers was already a config field but
was silently ignored; it's now passed through and defaults to 4.
Companion to verifiers feat/v1-env-workers; needs deps/verifiers bumped to it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): default num_workers to 4
Make the worker pool the default: num_workers defaults to 4 (was "auto"->1)
across the per-env, train, and eval configs, so training/eval env servers fan
rollouts across 4 worker processes out of the box. "auto" stays a valid value
(scales per concurrency); set num_workers=1 for the old single-process server.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): keep num_workers="auto" default on the orchestrator
Revert the orchestrator's per-env / train / eval num_workers defaults back to
"auto" (was 4) so they keep scaling 1 worker per 256 concurrent rollouts out of
the box. The standalone env server can't scale (no concurrency context — it's
driven by external clients), so its resolver collapses "auto" to a fixed 4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
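The two resolution rules above can be sketched as one function. `resolve_num_workers` is a hypothetical name, and the floor of one worker is an assumption of this sketch; the 256-rollouts-per-worker ratio and the fixed 4 for the standalone server come from the commit messages.

```python
import math


def resolve_num_workers(num_workers, max_inflight=None):
    """Resolve "auto" to a worker count; explicit integers pass through unchanged."""
    if num_workers != "auto":
        return int(num_workers)
    if max_inflight is None:
        # Standalone env server: no concurrency context, so use a fixed pool.
        return 4
    # Orchestrator: one worker per 256 concurrent rollouts (at least one, by assumption).
    return max(1, math.ceil(max_inflight / 256))
```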
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to be76cbc3 (env-server worker pool)
Align the pin with #1623 (env-server worker pool: router + N workers),
which the just-merged #2768 (thread num_workers to the pool) requires;
the pin had lagged at the pre-#1623 multimodal tip.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): register r2e-gym-v1 taskset
Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from
deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through
the v1 loader, matching the other -v1 tasksets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(swe): use r2e-gym for rlm_swe configs (v0 + v1)
- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the
eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env
at the r2e-gym-v1 taskset, and drop the eval block
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(swe): point rlm_swe configs at r2e-gym (content)
Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): restore env-server worker logs to the env log file (#2770)
Env servers spawn their worker pool as fresh `spawn` processes with no logging
handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed
warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers'
`serve_env` as `log_setup`; it runs in the broker and in every worker. A worker
inherits the broker's redirected stdout/stderr, so its logs land in the same
`envs/{train,eval}/<name>.log` as before — no new files or paths.
Bumps deps/verifiers to the worker-logging fix.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 955b6cdf (dashboard token usage fallback)
Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich
dashboard's token counts fall back to provider usage when the endpoint returns
no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626
variant; 955b6cdf already contains the equivalent #1626 (env-server worker
logging) plus #1627.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 8e4ad735 (clean env-server teardown)
Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no
longer print a spurious KeyboardInterrupt traceback into the env logs on
shutdown.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(deps): bump verifiers (renderers floor 0.1.8.dev40)
Picks up the verifiers floor bump so the renderers offset-tokenizer fix (dev40,
PRs #72/#75) can't be undercut by a pre-fix PyPI resolution. Re-locks uv.lock to
the dev40 specifier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to db82b38a (reap subprocess tree on cancel)
Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: elastic env-server pool (inherit static/elastic pool config) (#2774)
* feat(v1): elastic env-server pool (inherit pool config from verifiers)
Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so
each env inherits the `pool` discriminated union (static{num_workers=4} |
elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env
servers scale workers on demand instead of pre-spawning a fixed `auto` count.
- Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution
(ceil(max_inflight/256)); the elastic pool self-sizes from load.
- envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
- Bump deps/verifiers to the elastic-pool branch.
Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic",
multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug
configs are migrated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): back-compat shim mapping legacy num_workers -> pool
EnvConfig forbids extra fields, so configs still setting the removed `num_workers`
would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`:
an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit
`pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
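The mapping rules of the shim, sketched as a standalone function on the raw config dict. The commit implements this as a `model_validator(mode="before")` on `EnvConfig`; `map_legacy_num_workers` is a hypothetical name, and the `pool` dict shapes follow the TOML forms quoted in the elastic-pool commit.

```python
def map_legacy_num_workers(data):
    """Map the removed `num_workers` key onto `pool` before validation."""
    data = dict(data)  # do not mutate the caller's dict
    if "num_workers" not in data:
        return data
    legacy = data.pop("num_workers")
    if "pool" in data:
        return data  # an explicit pool always wins
    if legacy == "auto":
        data["pool"] = {"type": "elastic"}  # the default elastic pool
    else:
        data["pool"] = {"type": "static", "num_workers": int(legacy)}
    return data
```

Popping the legacy key matters because the config model forbids extra fields.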
* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)
The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was
redundant — just remove the legacy `num_workers` and inherit the default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to f404e97f (elastic env-server pool)
Realign the pin onto origin/feat/nano-as-v1: the prior pin d0c5bc98 was the
unsquashed #1629 feature branch, now squash-merged as f404e97f
(content-identical). Picks up #1629 (static/elastic env-server pool config).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 88e9bedd
Picks up #1631 (per-rollout setup timing as a distinct phase) and #1632
(per-call model + runtime retries).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 40a2e89f (fix trace.timing.*.duration to_wire validation)
Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra
inputs are not permitted' that crashed every rollout (#1636 drops computed
durations from to_wire).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 5dc084f5
Picks up #1638 (add --resume for evals: re-run a previous run's
missing/errored rollouts).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 472622ba
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: stop importing env modules in the orchestrator (always Rollout[WireTask]) (#2781)
* chore(v1): stop importing env modules in the orchestrator
The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and
vf.task_type imports the env package just to read its Task subclass for typing the wire trace.
Nothing reads typed env task fields - only task.idx and a full task.model_dump - and WireTask
(extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the
orchestrator never imports an env package: the env's type and runtime both live only in the
server process.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE
It no longer varies per env, so it doesn't belong as a per-instance attribute set in
Env.__init__ - lift it to a module constant used directly in run_rollout/run_group.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 7270e69b
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to 66c87d5b
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers to ef45f720
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (alphabet-sort host user sim, #1645)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (modal creates_per_sec 5 -> 40, #1646)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: cap v1 hendrycks-sanity scoring at 10s (#2790)
* fix(v1): cap hendrycks-sanity scoring at 10s
Without a scoring timeout (the default is no limit), a wedged math verify holds its
rollout's permit forever — sympy can spin past the in-script alarm — and at 512
concurrency that starves the pool and stalls long runs. Set timeout.scoring = 10 on the
train and eval envs so the framework cancels and the subprocess runtime kills a runaway
verify, freeing the permit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
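Why a scoring timeout frees the permit can be sketched with asyncio. This is an illustration under assumptions, not the framework's code: `score_with_timeout` is a hypothetical name, the permit is modeled as an `asyncio.Semaphore`, and a timed-out score is returned as `None`.

```python
import asyncio


async def score_with_timeout(permits, verify, timeout):
    """Hold a permit while scoring; release it even when scoring times out."""
    async with permits:
        try:
            return await asyncio.wait_for(verify(), timeout)
        except asyncio.TimeoutError:
            return None  # treated as a failed score in this sketch


async def _wedged_verify():
    # Stands in for a verify call that never returns.
    await asyncio.sleep(3600)


async def _demo():
    permits = asyncio.Semaphore(1)
    result = await score_with_timeout(permits, _wedged_verify, 0.01)
    return result, permits.locked()


demo_result = asyncio.run(_demo())
```

`asyncio.wait_for` can only cancel a coroutine that yields control; a CPU-bound verify that spins inside a subprocess also needs that subprocess killed, which is the part the commit attributes to the subprocess runtime.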
* chore: drop inline comment on the scoring timeout
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (eval/train client rename) + adopt config imports (#2792)
Bump deps/verifiers to feat/nano-as-v1 HEAD (8873a740), which includes verifiers#1654 — the v1
interception rework: role-named clients (EvalClient/TrainClient), route-detected wire dialects
(chat/responses/anthropic), 1:1 relay + streaming, reasoning preserved.
Adopt the renamed client config classes in prime_rl/utils/client.py:
OpenAIClientConfig -> EvalClientConfig, RendererClientConfig -> TrainClientConfig.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (subprocess cached-interpreter uv-scripts, #1660)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (codex built-in harness, #1661)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: track terminal-bench-2-v1 as a taskset dependency
It was only a manual editable install, so `uv sync` pruned it. Add it to the env dependency
group + [tool.uv.sources] (mirroring r2e-gym-v1) so it persists across syncs and is available
out of the box.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): keep multimodal tensors out of rollout dumps (#2794)
verifiers#1653 (carry mm tensors across the env-server wire) is merged and pinned, so
`MessageNode.multi_modal_data` is no longer `exclude=True` — `model_dump(mode="json")` now
serializes the base64 pixel tensors into `train_rollouts.jsonl` and the wandb sample tables,
bloating every line. They're the training `mm_kwargs` carrier, not part of the rollout
record, so exclude them at the dump boundary (train + eval paths).
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): install all v1 example tasksets in prime-rl
Declare the remaining 7 verifiers v1 example tasksets (code-golf, deepwiki,
glossary, swelego, wiki-search, wikispeedia, wordle) as editable deps so uv sync
installs every example, matching the verifiers examples set. chromadb/textarena
were already present via the v0 wiki-search/wordle envs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): install the compact example harness in prime-rl
The example harness (examples/harnesses/compact) was missing from prime-rl deps,
so the documented --harness.id compact branching example failed to resolve
(ModuleNotFoundError: harness compact not found). Declare it like the example
tasksets so uv sync installs it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (v1 docs #1662 + Trace.info #1664)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): encode router-replay routed_experts into transport (#2808)
* feat(v1): encode router-replay routed_experts into transport
Pack Branch.routed_experts ([tokens, layers, top_k]) into the transport
RoutedExperts the trainer replays, realigning the token axis to len(token_ids)
as a backstop. Bumps the verifiers pin to the companion router-replay commit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (raw-bytes wire) + exclude routed_experts from disk dumps
Bumps deps/verifiers to the raw-bytes router-replay codec (RoutedExperts rename +
msgpack bin wire). routed_experts now rides the env-server wire as raw bytes, so it
can't round-trip the json disk dump — add it to ROLLOUT_DUMP_EXCLUDE alongside
multi_modal_data (both are training inputs, not part of the rollout record).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
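A sketch of excluding training-only inputs at the dump boundary. The commit's `ROLLOUT_DUMP_EXCLUDE` is presumably applied when dumping the model; this illustration swaps in a recursive filter over a plain dict, and `DUMP_EXCLUDE` and `strip_training_inputs` are hypothetical names.

```python
import json

# Hypothetical exclude set for this sketch: training inputs, not rollout record.
DUMP_EXCLUDE = {"multi_modal_data", "routed_experts"}


def strip_training_inputs(obj):
    """Recursively drop excluded keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {
            key: strip_training_inputs(value)
            for key, value in obj.items()
            if key not in DUMP_EXCLUDE
        }
    if isinstance(obj, list):
        return [strip_training_inputs(value) for value in obj]
    return obj


record = {
    "reward": 1.0,
    "nodes": [
        {"token_ids": [1, 2], "multi_modal_data": "<base64 pixels>"},
        {"token_ids": [3], "routed_experts": b"\x00\x01"},
    ],
}
line = json.dumps(strip_training_inputs(record))
```

Without the filter, `json.dumps(record)` would fail on the raw-bytes `routed_experts` value.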
* fix(v1): guard router-replay vs prefix caching + bump verifiers/renderers pins
- Add disable_prefix_caching_for_router_replay validator: prefix-cache hits skip
recomputing the cached prefix, so the engine returns no routed-expert decisions for
those tokens. Router replay needs routing for every token, so force
enable_prefix_caching=False (mirrors the existing kv_cache_offload guard).
- Bump deps/verifiers: tail-pad routed_experts so the final node aligns (the engine
omits the last position's routing).
- Bump deps/renderers: _get_offset_tokenizer immune to the global fastokens patch race.
Together these make v1 MoE router replay work end-to-end (Qwen3-30B-A3B): verified
mismatch_kl drops vs no-replay (0.0005/0.0002/0.0002 vs 0.0015/0.0015/0.0005 @ steps 0/1/2).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/renderers pin (ruff format fix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/verifiers pin (ruff format)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump deps/verifiers pin to merged feat/nano-as-v1 (router-replay #1672)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Remove disable_prefix_caching_for_router_replay validator
Drops the guard that force-disabled inference.enable_prefix_caching under router replay.
Whether prefix caching is actually compatible with router replay is still being validated
(cache hits may carry no routed-expert decisions); re-add the guard if the A/B shows it drops routing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(orchestrator): keep top-level advantage wiring after merge
The merge auto-resolved train_sink.py and the orchestrator's TrainSink
construction to main's per-env advantage design (#2721), but envs.py and the
configs were kept on the branch's top-level design — so TrainSink read
`train_envs.get(env_name).advantage_fn`, which TrainEnv no longer defines
(AttributeError at rollout time).
Restore the branch's top-level advantage: TrainSink takes `advantage_config`
and builds its own `self.advantage_fn` once. Verified end-to-end with a
reverse-text-v1 RL smoke (reward climbs 0.16 -> 0.40 over 6 steps).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to merged main (1b7736a8)
deps/verifiers feat/nano-as-v1 now includes origin/main merged in
(composable tasksets + sandbox/save utils). Pin moves 7a98b566 -> 1b7736a8.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump pydantic-config submodule to main (single-dash short flags)
deps/pydantic-config 896ade4 -> 99f47c6 (origin/main). Picks up #16
(single-dash short flags for single-character aliases, e.g. `eval -n -m`)
and upstreams the dict[str,str] preservation fix (#14). No uv.lock change
(editable path source).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (5fc0b295)
Picks up the py3.11 TypedDict CI fix and #1676 (prime-sandbox programs run for
the sandbox lifetime instead of the 15-min background-job default, + retry-log
wording). Pin moves 1b7736a8 -> 5fc0b295.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to 3af08c9b
Includes verifiers#1677 - scaleswe-v1 resolves task images via Prime's
Artifact Registry instead of the incomplete public Docker Hub mirror.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(orchestrator): trim glibc heap back to the OS each step (#2821)
Factor the existing end-of-run malloc_trim into a `trim_process_memory()`
helper and also call it at the end of `finalize_train_batch`. Each step frees a
step's worth of rollouts / traces / transport buffers, but glibc keeps those
pages in per-arena free lists, so orchestrator RSS climbs over a long run
without ever shrinking. Trimming per step returns them to the OS.
Mirrors the per-step trim in prime-rl #2807 — the trim only; that PR's
`.raw`-based retention helpers don't apply to this branch's typed Rollout/Trace.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
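One plausible body for such a helper, sketched with `ctypes`. Only the name `trim_process_memory` and the use of glibc's `malloc_trim` come from the commit; the platform guard, the error handling, and the boolean return value are assumptions of this sketch.

```python
import ctypes
import sys


def trim_process_memory():
    """Ask glibc to return free heap pages to the OS; do nothing elsewhere."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)  # 0 = keep no extra padding at the top of the heap
    except (OSError, AttributeError):
        return False  # not glibc (e.g. musl), or the symbol is missing
    return True
```

Calling it once per step, after the step's buffers have been released, is what lets RSS drop between steps rather than only at the end of the run.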
* chore(v1): bump verifiers pin to 7862985b
Includes verifiers#1678 - scaleswe-v1 Prime Artifact Registry is now opt-in
(`use_prime_registry`, default off), so the default Docker Hub image path works
on local docker runtimes again.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: route consistent_hash on X-Session-ID for cross-turn prefix reuse (#2822)
The vllm-router `consistent_hash` policy keys on request-id headers (default
x-request-id / x-correlation-id / x-trace-id / request-id). The v1 inference
clients send none of those per rollout, so a multi-turn rollout's turns hashed
to random DP shards and re-prefilled the growing prompt cold each turn (~40%
prefix cache hit rate on an agentic SWE run, where it should approach the
eviction-limited ceiling).
Launch the router with `--request-id-headers x-session-id` so it keys on the
per-rollout `X-Session-ID` header the v1 clients now emit (companion verifiers
change). Every turn of a rollout pins to one engine, keeping its cross-turn
prefix warm. Harmless for the random / round_robin policies, which ignore
request-id headers.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
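The keying behaviour can be illustrated with a small sketch. It shows only which header the hash is keyed on and swaps plain modulo hashing in for the router's consistent-hash ring; `pick_shard` and its parameters are hypothetical, not the vllm-router API.

```python
import hashlib


def pick_shard(headers, num_shards, key_headers=("x-session-id",)):
    """Hash the first configured header that is present onto a shard index."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in key_headers:
        if name in lowered:
            digest = hashlib.sha256(lowered[name].encode()).digest()
            return int.from_bytes(digest[:8], "big") % num_shards
    return None  # no key header: the router has nothing stable to hash on


# Every turn of one rollout carries the same session id, so it lands on one shard.
turn_1 = pick_shard({"X-Session-ID": "rollout-42"}, num_shards=8)
turn_2 = pick_shard({"X-Session-ID": "rollout-42"}, num_shards=8)

# With the router's default key headers, the session header is not consulted.
default_keys = ("x-request-id", "x-correlation-id", "x-trace-id", "request-id")
unkeyed = pick_shard({"X-Session-ID": "rollout-42"}, 8, key_headers=default_keys)
```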
* chore(v1): bump verifiers pin to 211faecb (session-affinity routing)
Picks up verifiers #1680 (session-affinity routing header for cross-turn
prefix reuse) — the counterpart to prime-rl #2822. Pin 7862985b -> 211faecb.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to 3f45f73b (scaleswe-v1 image filter)
Picks up verifiers #1679 (filter unavailable scaleswe-v1 images at load).
Pin 211faecb -> 3f45f73b.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to 44da378c (r2e-gym-v1 registry opt-in)
Picks up verifiers #1681 (r2e-gym-v1 Prime Artifact Registry opt-in, default
off). Pin 3f45f73b -> 44da378c.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to a723977c (scaleswe-v1 image_url filter)
Picks up verifiers #1683 (filter on Docker Hub image_url, not the resolved
image). Pin 44da378c -> a723977c.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to 0c8d0aa1 (rootless harness install)
Picks up verifiers #1685 (harnesses install/run without root, pinned
/tmp/vf-{harness}) and #1684 (remove Claude Code harness). Pin a723977c -> 0c8d0aa1.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to d4b054e3 (skip region-limited e2e test)
Picks up verifiers #1686 (skip test_multi_turn on prime — colocated user-sim
port exposure is region-limited). Pin 0c8d0aa1 -> d4b054e3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: point verifiers env paths at the flattened environments/ dir (#2824)
* chore: point verifiers env paths at the flattened environments/ dir
Companion to verifiers #1695 (flatten examples/ into a single environments/).
Update the deps/verifiers editable source paths in pyproject.toml + uv.lock:
examples/tasksets/<x> -> environments/<x> and
examples/harnesses/compact -> environments/compact. Bump the verifiers pin to
the flatten branch (7ddc78b2) so the relocated paths resolve.
Re-pin to the merged commit once verifiers #1695 lands.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(deps): bump verifiers to 22b02e4b (flatten examples/ into environments/, #1695 merged)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): advance verifiers pin to feat/nano-as-v1 tip (Trace.state + DNS fix)
#2824 retargeted the env paths but pinned deps/verifiers at 22b02e4b (#1695,
just after the examples/->environments/ flatten), which predates the later
merges. Advance the submodule to the actual feat/nano-as-v1 tip bf30ae5f
(verifiers 0.1.15.dev298): adds Trace.state (#1711), the unconditional MCP
DNS-rebinding relax (#1715, fixes the in-prime tool 421 CI failures), and the
init scaffold (#1713).
bf30ae5f also dropped verifiers' self-referential `packages` extra, so drop
`[packages]` from the verifiers override-dependency (and refresh the stale
sources comment).
Smoke: `eval reverse-text-v1 -n 1` (subprocess, temp=0) runs clean — no errors,
not truncated, reward ~1.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (#1717 + #1719)
Advance deps/verifiers bf30ae5f -> f7fa7482:
- #1717 train-client prefix bridging + token-based message-graph branching
- #1719 default the harness runtime to subprocess
No prime-rl-side changes needed: every prime-rl v1 config already sets
harness.runtime.type explicitly (#1719's default flip is a no-op here), and the
consumed verifiers surface (serve EnvClient/serve_env/pool_serve_kwargs,
clients.config Eval/TrainClientConfig, task.TaskT, the Trace branch/token API) is
untouched by #1717. Imports resolve; reverse-text-v1 eval smoke clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (f7fa7482 -> bbfd5646)
New since the last pin: #1718/#1721 (CI + tests), #1723 (per-harness tool
disabling), #1666 (client consolidation + chat-completion header forwarding),
#1722 (warn on non-default subprocess harnesses). No prime-rl-side changes
needed: clients.config (the consumed surface) is untouched by #1666's client
refactor, #1722 only warns and exempts the default harness (what prime-rl's
subprocess configs use), and the rest is additive/CI-only. Imports resolve;
reverse-text-v1 eval smoke clean (reward 1.0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: verifiers #1732 companion (model_dump traces, Task.prompt rename) (#2831)
* chore(v1): add prime CLI as deps/prime submodule (feat/nano-as-v1)
Tracks the companion PrimeIntellect-ai/prime#751 branch that adds verifiers v1
eval support to the prime CLI (run/view consume the v1 entrypoint + Trace format).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to feat/nano-as-v1 tip (bbfd5646 -> 220f21d4)
New: #1727 (per-rollout isolation for shared writable tool servers) and #1702
(trim verifiers runtime deps — modal/notebook/quest/pdf moved to extras). No
prime-rl-side changes needed: the only dropped transitive dep is pymupdf, used
solely by verifiers' experimental quest PDF tool via a lazy import behind the
quest extra (prime-rl never touches it). Imports resolve on dev307;
reverse-text-v1 eval smoke clean (reward 1.0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: validate env-server traces via model_dump (companion to verifiers to_wire removal)
verifiers drops Trace.to_wire/from_wire and the derived computed fields (reward,
is_truncated, error, duration are plain properties now). Swap wire.to_wire() ->
wire.model_dump() when re-typing a returned Trace into ROLLOUT_TYPE; the .reward /
.is_truncated the metrics/eval code reads are the Trace properties, so they still work.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: rename vf.Task(instruction=) -> prompt= (verifiers #1732 companion)
verifiers #1732 renames Task.instruction -> Task.prompt; update the dispatcher's
error-rollout Task construction to match. Pin bump to the merged commit comes when #1732
lands.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(…
Summary
Companion PR to verifiers `feat/nano-as-v1`. Migrates the prime CLI's local eval lifecycle to the new config-driven v1 `eval` entrypoint and `Trace` output format, without regressing the v0 paths.

- Pin `verifiers` to the `feat/nano-as-v1` line (rev `bbfd564`) in both pyprojects; `uv sync`. Bump `prime`'s `requires-python` to `>=3.11,<3.14` (v1 verifiers requires it).
- `prime eval run` now drives the v1 `eval` console script (via a `-c` shim in the workspace venv, since `verifiers.v1.cli.eval` has no `__main__`). v0 hub/local environments run through the bridge's legacy `--id` path (which still produces v1 `Trace` output); a native v1 taskset can be selected by passing `--taskset.id`. Convenience flags (`-m`/`-b`/`-k`/`--sampling-args`/`-n`/`-r`/`--header`/...) are translated into a temporary v1 config TOML (handles nested `sampling.extra_body` and dash-cased headers that dotted CLI flags mangle), and remaining v1 flags (`--harness.id`, `--client.*`, `@ file.toml`, ...) are forwarded verbatim and override the temp config. Model validation + inference billing preflight are preserved.
- `prime eval view`/`tui` consume the v1 output format: a new `prime_cli/utils/v1_results.py` adapts a serialized `Trace` to the v0 record shape the viewer renders (`prompt`/`completion`/`reward`/`metrics`/`info`/`error`) and synthesizes the run-level metadata from `config.toml`; `data.py` discovers v1 run dirs (`outputs/<taskset>--<model>--<harness>/<uuid>` with `config.toml` + `results.jsonl`). v0 discovery/rendering is unchanged.
- `prime lab setup` now also adds `tasksets` + `harnesses` (the built-in v1 plugin packages that `prime eval run` resolves, e.g. the `default` harness).

Breaking

- `prime` now requires `>=3.11,<3.14` (was `>=3.10`). v1 verifiers (`0.1.15.dev*`) drops Python 3.10.
- `prime eval run --hosted` raises `NotImplementedError` until the platform backend understands the v1 eval format. The run config is still parsed and a `HostedEvalConfig` built (machinery preserved), but no submission happens. Migration: run locally with `prime eval run <env>` (omit `--hosted`).
- `prime eval push` raises an informative error when pointed at a v1 run dir (the platform isn't v1-aware yet); inspect locally with `prime eval view`. Pushing v0 result dirs is unchanged.
- `prime eval run` no longer auto-uploads results for v1 runs; results stay local (`--skip-upload` is now a no-op). Use `prime eval view`.

Verification

- `uv sync` resolves `verifiers==0.1.15.dev305` from the pinned rev.
- `uv run pytest packages/prime/tests` -> 783 passed, 2 skipped.
- `results.jsonl` (built via the verifiers library): `discover_local_eval_runs` finds the run (`format=v1`), the `LazyRunResults` adapter yields `prompt`/`completion`/`reward`/`metrics`, `compute_run_overview_stats` aggregates rewards, synthesized metadata reports `avg_reward`, and `prime eval push <v1 dir>` raises the informative error.
- `prime eval run --hosted` raises the gated `NotImplementedError`; `prime eval --help` lists the full command tree.

🤖 Generated with Claude Code
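The v1 run-dir layout described in the summary (`outputs/<taskset>--<model>--<harness>/<uuid>` containing `config.toml` and `results.jsonl`) can be walked with a few lines of `pathlib`. The sketch below is illustrative only: `find_v1_run_dirs` and its returned dict shape are hypothetical and are not the `data.py` implementation in this PR. It also assumes none of the three name components itself contains `--`.

```python
from pathlib import Path


def find_v1_run_dirs(outputs_root):
    """Return one dict per directory that looks like a v1 eval run."""
    root = Path(outputs_root)
    runs = []
    if not root.is_dir():
        return runs
    for group in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumed naming scheme: <taskset>--<model>--<harness>
        parts = group.name.split("--")
        if len(parts) != 3:
            continue
        taskset, model, harness = parts
        for run in sorted(p for p in group.iterdir() if p.is_dir()):
            # A run dir must hold both the config and the Trace results.
            has_config = (run / "config.toml").is_file()
            has_results = (run / "results.jsonl").is_file()
            if has_config and has_results:
                runs.append(
                    {
                        "taskset": taskset,
                        "model": model,
                        "harness": harness,
                        "run_id": run.name,
                        "path": run,
                    }
                )
    return runs
```

Requiring both files keeps half-written or interrupted runs out of the listing, which is one plausible reason to check for `config.toml` and `results.jsonl` together rather than for the directory name alone.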