[codex] Support raw image refs for multimodal rendering #89
Drop the cache-only None path. Every image (current and prior turns) carries its raw descriptor ref; _descriptor_only_mm_data no longer strips the pointer, so refs carry forward without a rebuild. Removes the now-orphaned materialize_image_refs / materialize_kimi_image_refs and the materialize_all_image_refs flag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tale comments - Drop the render-time processor constructor arg from Qwen3VL/Qwen35/Kimi renderers: geometry is computed deterministically from config; no renderer runs the HF image processor at render. Remove Kimi dead _get_processor/_process_image/self._processor/_image_cache. - mm_store: remove all backcompat aliases (MMRAW_PREFIX, MM_RAW_PAYLOAD_KEY/VALUE, mmraw_ref, split_mmraw_ref, image_asset_dir) -- no consumers. - client.py: fix stale generate() docstring + comment that referenced the removed None/cache path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It only sized the Kimi per-renderer image cache, which was deleted along with the render-time processor path. No consumers remain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-renderers # Conflicts: # renderers/configs.py # renderers/qwen3_vl.py
Approvability
Verdict: Needs human review
This PR introduces a new multimodal output mode system that changes the default behavior for image rendering, which now emits raw image refs instead of processed pixel values. The scope includes a new module, a new wire format, and refactored serialization across multiple renderers, warranting human review for correctness and compatibility.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 4e3502f.

Design update: inline/offload image storage
This PR now supports both raw image transport modes used by prime-rl:
- offload: existing behavior; raw image bytes are written to run-scoped image assets and refs carry a file-backed image id.
- inline: data-image URIs remain inline and raw refs carry the inline source instead of requiring raw_image_id.
This repo adds inline-capable mmraw:v3 refs while preserving mmraw:v2 parsing, keeps Qwen image hashes aligned to the raw decoded bytes, and emits raw descriptor items with either raw_image_id or raw_uri.
Validation after latest push:
- uv run pytest tests/test_client.py -q passed (14 passed).

Design update: dropped the None/cache-only image path
This PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send None for already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns); /inference/v1/generate rematerializes each ref from disk every request.
Why: the None path coupled correctness to deployment (LRU cache present, single replica / DP-affinity, no eviction) and surfaced a miss as a hard vLLM EngineDeadError (qwen3-vl mrope dereferences a None image_grid_thw) that the retry net couldn't catch across the engine→API IPC. Dropping it is deployment-agnostic (a miss is impossible) and non-hacky. vLLM's mm_hash encoder cache still skips the expensive GPU re-encode for free; we only forgo the cheap IPC/CPU-reprocess dedup.
Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0 data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.
This repo: every image emits a raw descriptor ref at every slot. _descriptor_only_mm_data no longer strips the pointer (pixel_values were never present in v1, so the strip was both stale and the root cause of the descriptor-only/rebuild churn). Removed the materialize_all_image_refs flag and the now-orphaned materialize_image_refs / materialize_kimi_image_refs.

Original description
Summary
- mmraw:v2 raw multimodal refs in renderers.mm_store, parsed as RawMMRef objects with family, fingerprint, modality, hash, asset id, and adapter-owned payload
- prime_raw_mm_item envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering
- … image_grid_thw for Qwen, grid_thws/media token metadata for Kimi)

Companion PRs

Notes
- … file://.../assets/images/... refs before rendering.

Validation
- uvx ruff@0.15.18 check . passed.
- uvx ruff@0.15.18 format --check . passed.
- uvx 'ty<0.0.22' check . exited 0; remaining diagnostics are warning-level advisories under the repo config.
- PYTHONPATH=/home/ubuntu/renderers uv run --no-project --active pytest -q tests/test_client.py passed: 14 passed.
- /home/ubuntu/renderers, /home/ubuntu/verifiers, and /home/ubuntu/prime-rl-v1-raw-mm-offload completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.

Note
High Risk
Default mm_items shape changes for all multimodal callers (raw descriptors vs pixel_values), and inference now requires offloaded file:// image assets plus companion stack alignment; incorrect integration will break vLLM generate or training paths.

Overview
Multimodal rendering now defaults to lightweight raw image descriptors instead of embedding pixel_values in MultiModalData. A new multimodal_output config ("raw" | "processed", default "raw") selects inference refs vs processor tensors for SFT; auto-resolution carries it into concrete renderer configs.
New renderers/mm_store handles run-scoped image offload (file:// assets), layout fingerprints, prime_raw_mm_item envelopes, and mmraw: refs for vLLM. Qwen-VL, Qwen3.5, and Kimi K2.5 compute placeholder counts from layout math (or lazy processors in "processed" mode) rather than always running the HF image processor on the hot path; image_cache_max is removed from configs.
generate() / _build_vllm_mm_features no longer depend on vLLM/torch to batch-encode Qwen tensors; they serialize raw descriptor refs per item (with optional vllm_modality, e.g. Kimi vision_chunk). Prebuilt prompt_ids can pull multi_modal_data from prompt_attribution when omitted.
Optional renderers[vision] extra (pillow/torch/torchvision); multimodal parity tests use offloaded file:// images for renderers while HF processors still use in-memory PIL.
Reviewed by Cursor Bugbot for commit a7953b9. Bugbot is set up for automated code reviews on this repo.
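
For readers unfamiliar with the raw-descriptor shape, here is a minimal sketch of how a per-item descriptor and the multimodal_output switch might fit together. Only the names multimodal_output, raw_image_id, raw_uri, vllm_modality, and prime_raw_mm_item come from the PR text. The class RawImageDescriptor, the function serialize_mm_items, the "kind" key, the dict layout, and the exactly-one-source rule are assumptions made for illustration; the actual renderers.mm_store types and wire format may differ.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# The two modes and the "raw" default are taken from the PR overview.
MultimodalOutput = Literal["raw", "processed"]


@dataclass(frozen=True)
class RawImageDescriptor:
    """Hypothetical per-image descriptor; not the actual mm_store type."""

    raw_image_id: Optional[str] = None   # offload mode: file-backed asset id
    raw_uri: Optional[str] = None        # inline mode: data-image URI
    vllm_modality: Optional[str] = None  # optional, e.g. "vision_chunk" for Kimi

    def __post_init__(self) -> None:
        # Assumption: exactly one source is set (offloaded id or inline URI).
        if (self.raw_image_id is None) == (self.raw_uri is None):
            raise ValueError("set exactly one of raw_image_id or raw_uri")

    def to_item(self) -> dict:
        # Assumed envelope layout; the real wire format is not shown in the PR.
        item = {"kind": "prime_raw_mm_item"}
        if self.raw_image_id is not None:
            item["raw_image_id"] = self.raw_image_id
        else:
            item["raw_uri"] = self.raw_uri
        if self.vllm_modality is not None:
            item["vllm_modality"] = self.vllm_modality
        return item


def serialize_mm_items(descriptors, multimodal_output: MultimodalOutput = "raw"):
    """Emit one raw descriptor per image slot (current and prior turns)."""
    if multimodal_output != "raw":
        # "processed" mode would run the HF image processor; not sketched here.
        raise NotImplementedError("only the raw path is sketched")
    return [d.to_item() for d in descriptors]
```

Under these assumptions, an offloaded image and an inline Kimi image would serialize as two small dicts, with no pixel_values and no None placeholders for earlier turns.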