[codex] Support raw image refs for multimodal rendering by eligotts · Pull Request #89 · PrimeIntellect-ai/renderers

eligotts · 2026-06-18T07:04:47Z

Design update — inline/offload image storage

This PR now supports both raw image transport modes used by prime-rl:

- offload: the existing behavior. Raw image bytes are written to run-scoped image assets, and refs carry a file-backed image id.
- inline: data-image URIs remain inline, and raw refs carry the inline source instead of requiring raw_image_id.

This repo adds inline-capable mmraw:v3 refs while preserving mmraw:v2 parsing, keeps Qwen image hashes aligned to the raw decoded bytes, and emits raw descriptor items with either raw_image_id or raw_uri.
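
The two transport modes can be illustrated with a minimal sketch. Only the `raw_image_id` / `raw_uri` field names and the idea of hashing the decoded bytes come from the description above; the function names, the SHA-256 choice, the dict shape, and the "hash as file name" id scheme are assumptions for illustration, not this repo's actual API.

```python
from __future__ import annotations

import base64
import hashlib
import tempfile
from pathlib import Path


def decode_data_uri(uri: str) -> bytes:
    """Return the decoded bytes of a base64 data-image URI."""
    header, sep, payload = uri.partition(",")
    if not sep or not header.startswith("data:image/") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data-image URI")
    return base64.b64decode(payload)


def build_raw_descriptor(uri: str, mode: str, asset_dir: Path | None = None) -> dict:
    """Build a descriptor dict carrying either raw_uri or raw_image_id."""
    raw = decode_data_uri(uri)
    # Hash the decoded bytes, not the URI text, so both modes agree on the hash.
    digest = hashlib.sha256(raw).hexdigest()  # hash algorithm is an assumption
    if mode == "inline":
        return {"hash": digest, "raw_uri": uri}
    if mode == "offload":
        if asset_dir is None:
            raise ValueError("offload mode needs an asset directory")
        asset_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical id scheme: the content hash doubles as the file name.
        (asset_dir / digest).write_bytes(raw)
        return {"hash": digest, "raw_image_id": digest}
    raise ValueError(f"unknown mode: {mode!r}")


def demo() -> tuple[dict, dict]:
    """Build one descriptor per mode for the same (fake) image bytes."""
    uri = "data:image/png;base64," + base64.b64encode(b"not-a-real-png").decode()
    with tempfile.TemporaryDirectory() as tmp:
        inline = build_raw_descriptor(uri, "inline")
        offload = build_raw_descriptor(uri, "offload", Path(tmp))
    return inline, offload
```

In this sketch both modes produce the same hash for the same image, so a downstream cache keyed on the hash would not depend on the transport mode.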

Validation after latest push: `uv run pytest tests/test_client.py -q` passed (14 passed).

Design update — dropped the `None`/cache-only image path

This PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send None for already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns); /inference/v1/generate rematerializes each ref from disk on every request.

Why: the None path coupled correctness to deployment (LRU cache present, single replica / DP-affinity, no eviction) and surfaced a miss as a hard vLLM EngineDeadError (qwen3-vl mrope dereferences a None image_grid_thw) that the retry net couldn't catch across the engine→API IPC. Dropping it is deployment-agnostic (a miss is impossible) and non-hacky. vLLM's mm_hash encoder cache still skips the expensive GPU re-encode for free — we only forgo the cheap IPC/CPU-reprocess dedup.

Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0 data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.

This repo: every image emits a raw descriptor ref at every slot. _descriptor_only_mm_data no longer strips the pointer (pixel_values were never present in v1, so the strip was both stale and the root cause of the descriptor-only/rebuild churn). Removed the materialize_all_image_refs flag and the now-orphaned materialize_image_refs / materialize_kimi_image_refs.
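
The "ref at every slot" contract can be sketched as follows. This is an illustrative helper over OpenAI-style chat messages, not code from this repo; the function name and the example URIs are hypothetical.

```python
from __future__ import annotations


def image_refs_for_request(messages: list[dict]) -> list[str]:
    """Return one ref per image slot, in order, across all turns."""
    refs: list[str] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-text turns have no image slots
        for part in content:
            if part.get("type") == "image_url":
                # Always forward the ref itself; never substitute None on the
                # assumption that a server-side cache still holds the image.
                refs.append(part["image_url"]["url"])
    return refs
```

Because prior-turn images are re-sent as refs rather than `None`, any replica can serve the request without relying on cache state.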

Original description

Summary

- adds generic mmraw:v2 raw multimodal refs in renderers.mm_store, parsed as RawMMRef objects with family, fingerprint, modality, hash, asset id, and adapter-owned payload
- emits strict prime_raw_mm_item envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering
- keeps adapter-specific layout details in renderer-owned payloads (image_grid_thw for Qwen, grid_thws/media token metadata for Kimi)
- supports materializing all raw image refs for retry paths after vLLM multimodal cache misses
- keeps run-scoped image asset refs file-backed so downstream Prime-RL trainer materializes images with its own processor
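
A rough sketch of what a parsed ref with those fields could look like. The class below is a stand-in named `RawMMRefSketch`, not the real `RawMMRef`; the wire encoding (an `mmraw:v2:` prefix followed by URL-safe base64 JSON) and the helper names are assumptions, since the actual format is not shown on this page.

```python
from __future__ import annotations

import base64
import json
from dataclasses import asdict, dataclass, field
from typing import Any

PREFIX = "mmraw:v2:"  # assumed prefix layout, for illustration only


@dataclass(frozen=True)
class RawMMRefSketch:
    family: str        # e.g. a renderer family name
    fingerprint: str   # layout fingerprint
    modality: str      # e.g. "image"
    hash: str          # content hash of the raw bytes
    asset_id: str      # file-backed asset id
    payload: dict[str, Any] = field(default_factory=dict)  # adapter-owned


def encode_ref(ref: RawMMRefSketch) -> str:
    """Serialize a ref to a prefixed, URL-safe string (hypothetical format)."""
    body = json.dumps(asdict(ref), sort_keys=True, separators=(",", ":")).encode()
    return PREFIX + base64.urlsafe_b64encode(body).decode()


def decode_ref(text: str) -> RawMMRefSketch:
    """Parse a string produced by encode_ref back into a ref."""
    if not text.startswith(PREFIX):
        raise ValueError("not an mmraw:v2 ref")
    body = base64.urlsafe_b64decode(text[len(PREFIX):])
    return RawMMRefSketch(**json.loads(body))
```

Keeping the layout details inside `payload` means the generic parser never needs to know about `image_grid_thw` or `grid_thws`; only the owning adapter interprets them.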

Companion PRs

- Prime-RL: Support v1 raw multimodal image offload prime-rl#2836
- Verifiers: [codex] Support raw image offload in v1 train client verifiers#1746

Notes

- Draft/WIP: stacked with the Verifiers and Prime-RL raw image offload PRs.
- Verifiers is expected to offload image content to file://.../assets/images/... refs before rendering.
- This intentionally treats raw image refs as the supported path, not processed multimodal feature sidecars.

Validation

- `uvx ruff@0.15.18 check .` passed.
- `uvx ruff@0.15.18 format --check .` passed.
- `uvx 'ty<0.0.22' check .` exited 0; remaining diagnostics are warning-level advisories under the repo config.
- `PYTHONPATH=/home/ubuntu/renderers uv run --no-project --active pytest -q tests/test_client.py` passed: 14 passed.
- End-to-end hosted-style smoke through Prime-RL with /home/ubuntu/renderers, /home/ubuntu/verifiers, and /home/ubuntu/prime-rl-v1-raw-mm-offload completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.

[!NOTE]

Add raw image ref support to multimodal renderers for inference-time rendering

Renderers for Qwen3-VL, Qwen3.5, and Kimi K2.5 now emit raw image descriptors (layout fingerprint + image_grid_thw + raw_image_uri) by default instead of processed pixel tensors; processed output is still available via config.multimodal_output='processed'.

Adds mm_store.py, a new module providing image offloading, content-addressed storage, and raw multimodal reference encoding/decoding utilities.

_build_vllm_mm_features in client.py is rewritten to consume raw multimodal descriptor envelopes and produce reference-based kwargs_data, replacing the removed Qwen-specific tensor serializer.

BaseRendererConfig gains a multimodal_output: Literal['raw', 'processed'] field defaulting to 'raw', and AutoRendererConfig propagates this setting to resolved renderer configs.

A new vision install extra is added to pyproject.toml with pillow, torch, and torchvision dependencies, required only for processed-mode rendering.

Behavioral Change: existing callers relying on pixel_values tensors in mm_items will receive raw descriptor dicts unless they explicitly set multimodal_output='processed'.
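
For callers affected by this change, a small guard can make the expected item shape explicit. This is a sketch only: `RendererConfigSketch` mirrors the `multimodal_output` field described above but is not the real `BaseRendererConfig`, and `check_mm_item` is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Literal


@dataclass
class RendererConfigSketch:
    # Mirrors the described field; "raw" is the stated default.
    multimodal_output: Literal["raw", "processed"] = "raw"


def check_mm_item(config: RendererConfigSketch, item: dict[str, Any]) -> None:
    """Raise if an mm item does not match the configured output mode."""
    has_pixels = "pixel_values" in item
    if config.multimodal_output == "processed" and not has_pixels:
        raise TypeError("processed mode expects pixel_values in each mm item")
    if config.multimodal_output == "raw" and has_pixels:
        raise TypeError("raw mode expects descriptor dicts without pixel_values")
```

A caller that still needs tensors would construct its config with `multimodal_output="processed"` and, per the summary above, install the vision extra.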

Macroscope summarized a7953b9.

Note

High Risk
Default mm_items shape changes for all multimodal callers (raw descriptors vs pixel_values), and inference now requires offloaded file:// image assets plus companion stack alignment; incorrect integration will break vLLM generate or training paths.

Overview
Multimodal rendering now defaults to lightweight raw image descriptors instead of embedding pixel_values in MultiModalData. A new multimodal_output config ("raw" | "processed", default "raw") selects inference refs vs processor tensors for SFT; auto-resolution carries it into concrete renderer configs.

New renderers/mm_store handles run-scoped image offload (file:// assets), layout fingerprints, prime_raw_mm_item envelopes, and mmraw: refs for vLLM. Qwen-VL, Qwen3.5, and Kimi K2.5 compute placeholder counts from layout math (or lazy processors in "processed" mode) rather than always running the HF image processor on the hot path; image_cache_max is removed from configs.
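
As an illustration of "layout math", the sketch below computes an image placeholder count from a `(t, h, w)` patch grid. It assumes the Qwen2-VL-style convention in which neighbouring patches are merged in `merge_size` x `merge_size` blocks with a default of 2; whether this repo uses exactly this formula is an assumption, and the function name is hypothetical.

```python
from __future__ import annotations


def qwen_placeholder_count(image_grid_thw: tuple[int, int, int], merge_size: int = 2) -> int:
    """Number of image placeholder tokens for a (t, h, w) patch grid."""
    t, h, w = image_grid_thw
    if h % merge_size or w % merge_size:
        raise ValueError("grid height and width must be multiples of merge_size")
    # Each merge_size x merge_size block of patches becomes one token.
    return t * (h // merge_size) * (w // merge_size)
```

For example, a grid of `(1, 32, 48)` with `merge_size=2` gives `1 * 16 * 24 = 384` placeholders, with no image processor involved.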

generate() / _build_vllm_mm_features no longer depend on vLLM/torch to batch-encode Qwen tensors—they serialize raw descriptor refs per item (with optional vllm_modality, e.g. Kimi vision_chunk). Prebuilt prompt_ids can pull multi_modal_data from prompt_attribution when omitted.

Optional renderers[vision] extra (pillow/torch/torchvision); multimodal parity tests use offloaded file:// images for renderers while HF processors still use in-memory PIL.

Reviewed by Cursor Bugbot for commit a7953b9. Bugbot is set up for automated code reviews on this repo.

Drop the cache-only None path. Every image (current and prior turns) carries its raw descriptor ref; _descriptor_only_mm_data no longer strips the pointer, so refs carry forward without a rebuild. Removes the now-orphaned materialize_image_refs / materialize_kimi_image_refs and the materialize_all_image_refs flag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tale comments
- Drop the render-time processor constructor arg from Qwen3VL/Qwen35/Kimi renderers: geometry is computed deterministically from config; no renderer runs the HF image processor at render. Remove Kimi dead _get_processor/_process_image/self._processor/_image_cache.
- mm_store: remove all backcompat aliases (MMRAW_PREFIX, MM_RAW_PAYLOAD_KEY/VALUE, mmraw_ref, split_mmraw_ref, image_asset_dir) -- no consumers.
- client.py: fix stale generate() docstring + comment that referenced the removed None/cache path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


macroscopeapp · 2026-06-29T16:37:02Z

Approvability

Verdict: Needs human review

This PR introduces a new multimodal output mode system that changes the default behavior for image rendering - now emitting raw image refs instead of processed pixel values. The scope includes a new module, new wire format, and refactored serialization across multiple renderers, warranting human review for correctness and compatibility.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 4e3502f.

Support raw image refs for multimodal rendering

32d5a9d

This was referenced Jun 18, 2026

[codex] Support raw image offload in v1 train client PrimeIntellect-ai/verifiers#1746

Open

Support v1 raw multimodal image offload PrimeIntellect-ai/prime-rl#2836

Open

eligotts and others added 10 commits June 20, 2026 07:41

Emit generic raw multimodal refs

4bc1766

Merge remote-tracking branch 'origin/main' into codex/raw-image-asset…

eaa07bb

…s-renderers

Fix raw image renderer style checks

a8f4386

Remove orphaned image_cache_max config field

f8ca354

It only sized Kimi per-renderer image cache, which was deleted with the render-time processor path. No consumers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Align raw multimodal renderer descriptors

c33805d

Merge remote-tracking branch 'origin/main' into codex/raw-image-asset…

8fcd0c7

…s-renderers # Conflicts: # renderers/configs.py # renderers/qwen3_vl.py

feat: support inline raw image refs

e97c812

Clean up raw multimodal offload renderers

673b790

eligotts marked this pull request as ready for review June 29, 2026 16:36

cursor Bot reviewed Jun 29, 2026

Comment thread renderers/client.py

Comment thread renderers/client.py Outdated

eligotts added 2 commits June 29, 2026 16:57

Clarify raw image asset contract

af84c19

Apply ruff formatting

4e3502f

cursor Bot reviewed Jun 29, 2026

Comment thread renderers/client.py

eligotts added 5 commits June 29, 2026 17:37

Preserve multimodal sidecar for prebuilt prompts

998e1db

Use URI-based raw image refs

2a19d75

Drop raw multimodal version markers

aa2d44d

Support processed multimodal renderer output

ed5b404

Trim uv lock churn

a7953b9

eligotts wants to merge 18 commits into main from codex/raw-image-assets-renderers

2 participants
