
[codex] Support raw image refs for multimodal rendering#89

Open: eligotts wants to merge 18 commits into main from codex/raw-image-assets-renderers


@eligotts eligotts commented Jun 18, 2026


Design update — inline/offload image storage

This PR now supports both raw image transport modes used by prime-rl:

  • offload: existing behavior, raw image bytes are written to run-scoped image assets and refs carry a file-backed image id.
  • inline: data-image URIs remain inline and raw refs carry the inline source instead of requiring raw_image_id.

This repo adds inline-capable mmraw:v3 refs while preserving mmraw:v2 parsing, keeps Qwen image hashes aligned to the raw decoded bytes, and emits raw descriptor items with either raw_image_id or raw_uri.
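A minimal sketch of how the two transport modes could differ when a descriptor item is built. The helper name `store_image`, the flat asset directory layout, and the use of SHA-256 as the content hash are assumptions for illustration; only the `raw_image_id` / `raw_uri` keys and the "hash the raw decoded bytes" rule come from the description above.

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def store_image(data_uri: str, mode: str, assets_dir: Path) -> dict:
    """Build a descriptor fragment for one image (hypothetical helper).

    offload: decode the bytes, write them under assets_dir, return an id.
    inline:  leave the data URI untouched and return it as the source.
    """
    header, _, b64 = data_uri.partition(",")
    if not header.startswith("data:image/") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data:image/ URI")
    raw = base64.b64decode(b64)
    # Hash the decoded bytes, not the URI text, so both modes agree.
    digest = hashlib.sha256(raw).hexdigest()

    if mode == "inline":
        return {"hash": digest, "raw_uri": data_uri}
    if mode == "offload":
        assets_dir.mkdir(parents=True, exist_ok=True)
        path = assets_dir / digest  # content-addressed: same bytes, same file
        if not path.exists():
            path.write_bytes(raw)
        return {"hash": digest, "raw_image_id": digest}
    raise ValueError(f"unknown mode: {mode!r}")


# Usage: the same image yields the same hash in either mode.
uri = "data:image/png;base64," + base64.b64encode(b"not-a-real-png").decode()
with tempfile.TemporaryDirectory() as tmp:
    inline_item = store_image(uri, "inline", Path(tmp))
    offload_item = store_image(uri, "offload", Path(tmp))
    stored = (Path(tmp) / offload_item["raw_image_id"]).read_bytes()
```

Hashing the decoded bytes rather than the URI string is what keeps the hash stable when the same image arrives through different transports.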

Validation after latest push: uv run pytest tests/test_client.py -q passed (14 passed).

Design update — dropped the None/cache-only image path

This PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send None for already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns); /inference/v1/generate rematerializes each ref from disk every request.

Why: the None path coupled correctness to deployment details (an LRU cache being present, a single replica or DP affinity, no eviction). A cache miss surfaced as a hard vLLM EngineDeadError (qwen3-vl mrope dereferences a None image_grid_thw) that the retry net couldn't catch across the engine→API IPC boundary. Dropping the path is deployment-agnostic (a miss is impossible) and non-hacky. vLLM's mm_hash encoder cache still skips the expensive GPU re-encode; we only forgo the cheap IPC/CPU-reprocess dedup.

Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0 data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.

This repo: every image emits a raw descriptor ref at every slot. _descriptor_only_mm_data no longer strips the pointer (pixel_values were never present in v1, so the strip was both stale and the root cause of the descriptor-only/rebuild churn). Removed the materialize_all_image_refs flag and the now-orphaned materialize_image_refs / materialize_kimi_image_refs.


Original description

Summary

  • adds generic mmraw:v2 raw multimodal refs in renderers.mm_store, parsed as RawMMRef objects with family, fingerprint, modality, hash, asset id, and adapter-owned payload
  • emits strict prime_raw_mm_item envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering
  • keeps adapter-specific layout details in renderer-owned payloads (image_grid_thw for Qwen, grid_thws/media token metadata for Kimi)
  • supports materializing all raw image refs for retry paths after vLLM multimodal cache misses
  • keeps run-scoped image asset refs file-backed so downstream Prime-RL trainer materializes images with its own processor
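To make the first bullet concrete, here is a rough sketch of a `RawMMRef`-shaped object and a round trip through a string ref. The field list follows the summary above; the wire framing (`mmraw:v2:` followed by URL-safe base64 of JSON) and the helper names `encode_ref` / `parse_ref` are assumptions for illustration, not the format `renderers.mm_store` actually uses.

```python
import base64
import json
from dataclasses import asdict, dataclass, field
from typing import Any

PREFIX = "mmraw:v2:"  # assumed framing; the real wire layout may differ


@dataclass(frozen=True)
class RawMMRef:
    family: str          # renderer family, e.g. a Qwen-VL or Kimi adapter
    fingerprint: str     # layout fingerprint
    modality: str        # e.g. "image"
    hash: str            # hash of the raw bytes
    asset_id: str        # run-scoped, file-backed asset id
    payload: dict[str, Any] = field(default_factory=dict)  # adapter-owned


def encode_ref(ref: RawMMRef) -> str:
    """Serialize a ref to a prefixed string (hypothetical encoding)."""
    body = json.dumps(asdict(ref), sort_keys=True, separators=(",", ":"))
    return PREFIX + base64.urlsafe_b64encode(body.encode()).decode()


def parse_ref(text: str) -> RawMMRef:
    """Parse a string produced by encode_ref back into a RawMMRef."""
    if not text.startswith(PREFIX):
        raise ValueError("not an mmraw:v2 ref")
    body = base64.urlsafe_b64decode(text[len(PREFIX):])
    return RawMMRef(**json.loads(body))


# Usage: the adapter-owned payload travels with the ref unchanged.
ref = RawMMRef("qwen3_vl", "fp0", "image", "abc123", "img-1",
               {"image_grid_thw": [1, 4, 6]})
round_tripped = parse_ref(encode_ref(ref))
```

Keeping the payload opaque to the generic ref type is what lets each adapter carry its own layout details (`image_grid_thw` for Qwen, `grid_thws` for Kimi) without the store needing to understand them.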

Companion PRs

Notes

  • Draft/WIP: stacked with the Verifiers and Prime-RL raw image offload PRs.
  • Verifiers is expected to offload image content to file://.../assets/images/... refs before rendering.
  • This intentionally treats raw image refs as the supported path, not processed multimodal feature sidecars.

Validation

  • uvx ruff@0.15.18 check . passed.
  • uvx ruff@0.15.18 format --check . passed.
  • uvx 'ty<0.0.22' check . exited 0; remaining diagnostics are warning-level advisories under the repo config.
  • PYTHONPATH=/home/ubuntu/renderers uv run --no-project --active pytest -q tests/test_client.py passed: 14 passed.
  • End-to-end hosted-style smoke through Prime-RL with /home/ubuntu/renderers, /home/ubuntu/verifiers, and /home/ubuntu/prime-rl-v1-raw-mm-offload completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.

[!NOTE]

Add raw image ref support to multimodal renderers for inference-time rendering

  • Renderers for Qwen3-VL, Qwen3.5, and Kimi K2.5 now emit raw image descriptors (layout fingerprint + image_grid_thw + raw_image_uri) by default instead of processed pixel tensors; processed output is still available via config.multimodal_output='processed'.
  • Adds mm_store.py, a new module providing image offloading, content-addressed storage, and raw multimodal reference encoding/decoding utilities.
  • _build_vllm_mm_features in client.py is rewritten to consume raw multimodal descriptor envelopes and produce reference-based kwargs_data, replacing the removed Qwen-specific tensor serializer.
  • BaseRendererConfig gains a multimodal_output: Literal['raw', 'processed'] field defaulting to 'raw', and AutoRendererConfig propagates this setting to resolved renderer configs.
  • A new vision install extra is added to pyproject.toml with pillow, torch, and torchvision dependencies, required only for processed-mode rendering.
  • Behavioral Change: existing callers relying on pixel_values tensors in mm_items will receive raw descriptor dicts unless they explicitly set multimodal_output='processed'.
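The config change in the bullets above can be pictured with stand-in dataclasses. The class names and the `multimodal_output` field come from the summary; everything else (the dataclass form, the `resolve` method, the empty `Qwen3VLRendererConfig`) is a simplified assumption, and the real classes likely carry more fields.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class BaseRendererConfig:
    # "raw" emits descriptor refs; "processed" emits processor tensors.
    multimodal_output: Literal["raw", "processed"] = "raw"


@dataclass(frozen=True)
class Qwen3VLRendererConfig(BaseRendererConfig):
    """Stand-in for a concrete renderer config."""


@dataclass(frozen=True)
class AutoRendererConfig(BaseRendererConfig):
    def resolve(self, concrete: type[BaseRendererConfig]) -> BaseRendererConfig:
        # Hypothetical method: carry the output mode into the resolved config.
        return concrete(multimodal_output=self.multimodal_output)


# Callers that still need pixel_values opt back in explicitly.
default_cfg = AutoRendererConfig().resolve(Qwen3VLRendererConfig)
processed_cfg = AutoRendererConfig(multimodal_output="processed").resolve(
    Qwen3VLRendererConfig
)
```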

Macroscope summarized a7953b9.


Note

High Risk
Default mm_items shape changes for all multimodal callers (raw descriptors vs pixel_values), and inference now requires offloaded file:// image assets plus companion stack alignment; incorrect integration will break vLLM generate or training paths.

Overview
Multimodal rendering now defaults to lightweight raw image descriptors instead of embedding pixel_values in MultiModalData. A new multimodal_output config ("raw" | "processed", default "raw") selects inference refs vs processor tensors for SFT; auto-resolution carries it into concrete renderer configs.

New renderers/mm_store handles run-scoped image offload (file:// assets), layout fingerprints, prime_raw_mm_item envelopes, and mmraw: refs for vLLM. Qwen-VL, Qwen3.5, and Kimi K2.5 compute placeholder counts from layout math (or lazy processors in "processed" mode) rather than always running the HF image processor on the hot path; image_cache_max is removed from configs.
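As an illustration of "placeholder counts from layout math", the sketch below derives the number of image placeholder tokens from an `image_grid_thw` triple without touching an image processor. It assumes the grid is expressed in patches and that the spatial merge size is 2, which is the default in the Hugging Face Qwen-VL processors; the function name is hypothetical and the renderer's own computation is not shown in this description.

```python
def qwen_placeholder_count(
    image_grid_thw: tuple[int, int, int], merge_size: int = 2
) -> int:
    """Number of placeholder tokens for one image.

    The vision tower merges merge_size x merge_size patches into one token,
    so the count is t * h * w divided by merge_size squared.
    """
    t, h, w = image_grid_thw
    if h % merge_size or w % merge_size:
        raise ValueError("grid height and width must be multiples of merge_size")
    return (t * h * w) // (merge_size**2)


small = qwen_placeholder_count((1, 4, 6))    # 24 patches -> 6 tokens
large = qwen_placeholder_count((1, 32, 32))  # 1024 patches -> 256 tokens
```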

generate() / _build_vllm_mm_features no longer depend on vLLM/torch to batch-encode Qwen tensors—they serialize raw descriptor refs per item (with optional vllm_modality, e.g. Kimi vision_chunk). Prebuilt prompt_ids can pull multi_modal_data from prompt_attribution when omitted.

Optional renderers[vision] extra (pillow/torch/torchvision); multimodal parity tests use offloaded file:// images for renderers while HF processors still use in-memory PIL.

Reviewed by Cursor Bugbot for commit a7953b9.

eligotts and others added 10 commits June 20, 2026 07:41
Drop the cache-only None path. Every image (current and prior turns) carries its raw descriptor ref; _descriptor_only_mm_data no longer strips the pointer, so refs carry forward without a rebuild. Removes the now-orphaned materialize_image_refs / materialize_kimi_image_refs and the materialize_all_image_refs flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tale comments

- Drop the render-time processor constructor arg from Qwen3VL/Qwen35/Kimi renderers: geometry is computed deterministically from config; no renderer runs the HF image processor at render. Remove Kimi dead _get_processor/_process_image/self._processor/_image_cache.

- mm_store: remove all backcompat aliases (MMRAW_PREFIX, MM_RAW_PAYLOAD_KEY/VALUE, mmraw_ref, split_mmraw_ref, image_asset_dir) -- no consumers.

- client.py: fix stale generate() docstring + comment that referenced the removed None/cache path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It only sized Kimi per-renderer image cache, which was deleted with the render-time processor path. No consumers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-renderers

# Conflicts:
#	renderers/configs.py
#	renderers/qwen3_vl.py
@eligotts eligotts marked this pull request as ready for review June 29, 2026 16:36
macroscopeapp Bot commented Jun 29, 2026

Approvability

Verdict: Needs human review

This PR introduces a new multimodal output mode system that changes the default behavior for image rendering - now emitting raw image refs instead of processed pixel values. The scope includes a new module, new wire format, and refactored serialization across multiple renderers, warranting human review for correctness and compatibility.

Comment thread renderers/client.py
Comment thread renderers/client.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 4e3502f.

Comment thread renderers/client.py