perf(rendering): cross-call render batching (Track A) #213
Merged
added 12 commits
June 28, 2026 15:00
The overlay rendered 1600 sprites with one context.render() per sprite, emitting one draw call each. Adding them to a Container and rendering it once batches them into a single draw call (7.8ms -> 1.8ms on the spike).
…hing setView flushes only on real view change; the transform buffer resets once per frame; the plan builder bases node indices at the frame buffer count; nested plans isolate their rows. Per-call renders now batch (1000 -> 1 draw). Leaves the barrier-path allocation gate red — fixed in the next commit.
A frame-scoped buffer made barrier flushes re-upload a growing buffer (O(N^2)). Uploading only [uploadedRows, count) per flush via commitRect makes it O(N) while keeping the cross-frame hash-guard skip. Fixes the effect-barrier gate.
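The O(N^2) versus O(N) difference can be illustrated with a small cost model. This is only a sketch: the function names are hypothetical, and it counts uploaded rows rather than calling commitRect or any real backend API.

```typescript
// Hypothetical cost model: rows uploaded over one frame when a frame-scoped
// buffer is flushed after every `rowsPerFlush` newly written rows.

/** Strategy before the fix: re-upload the whole buffer [0, count) at each flush. */
function rowsUploadedFull(totalRows: number, rowsPerFlush: number): number {
    let uploaded = 0;
    for (let count = rowsPerFlush; count <= totalRows; count += rowsPerFlush) {
        uploaded += count;
    }
    return uploaded;
}

/** Strategy after the fix: upload only [uploadedRows, count) at each flush. */
function rowsUploadedDelta(totalRows: number, rowsPerFlush: number): number {
    let uploaded = 0;
    let uploadedRows = 0;
    for (let count = rowsPerFlush; count <= totalRows; count += rowsPerFlush) {
        uploaded += count - uploadedRows;
        uploadedRows = count;
    }
    return uploaded;
}
```

In this model, 1000 rows with a flush after every row cost 500500 uploaded rows with the full strategy and 1000 with the delta strategy.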
…ter mark Task 3's delta upload tracked only the highest uploaded row, so a slot reused below that mark (a filter composite reusing a row a nested plan had rewound) was never re-uploaded, leaving stale transform data — the filter-boundary browser test rendered the wrong color. Track the exact written-slot range [dirtyMin, dirtyMax] in TransformBuffer instead; the delta upload pushes precisely the changed rows regardless of reuse. Restores filter-boundary (browser 149/149), keeps effect-barrier under budget and cross-call batching.
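A minimal sketch of exact dirty-range tracking, assuming a tracker that records the lowest and highest written slot and hands the range back once. The class name, method names, and the null return value are hypothetical; the real TransformBuffer.consumeDirtyRange may use a different shape or sentinel.

```typescript
// Hypothetical model of [dirtyMin, dirtyMax] tracking; not the real TransformBuffer.
interface DirtyRange {
    start: number;
    rowCount: number;
}

class DirtyRangeTracker {
    private dirtyMin = Number.POSITIVE_INFINITY;
    private dirtyMax = -1;

    /** Record that `slot` was written since the last upload. */
    markWritten(slot: number): void {
        if (slot < this.dirtyMin) this.dirtyMin = slot;
        if (slot > this.dirtyMax) this.dirtyMax = slot;
    }

    /** Return the range to upload and clear it; null when nothing was written. */
    consume(): DirtyRange | null {
        if (this.dirtyMax < 0) return null;
        const range: DirtyRange = {
            start: this.dirtyMin,
            rowCount: this.dirtyMax - this.dirtyMin + 1,
        };
        this.dirtyMin = Number.POSITIVE_INFINITY;
        this.dirtyMax = -1;
        return range;
    }
}

const tracker = new DirtyRangeTracker();
for (let slot = 0; slot < 8; slot++) tracker.markWritten(slot);
const first = tracker.consume(); // covers slots 0..7
tracker.markWritten(3); // a rewound slot is reused below the previous maximum
const second = tracker.consume(); // covers slot 3 only
const third = tracker.consume(); // nothing pending
```

A tracker that remembered only the highest uploaded row would see no new work after the write to slot 3, which is the stale-transform case described above.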
Mirror the WebGl2 backend's Tasks 2-4 lifecycle changes onto WebGPU:
- TransformBuffer is now frame-scoped (reset in resetStats, not per plan)
- Add transformBufferCount getter so RenderPlanBuilder offsets node indices correctly for WebGPU (previously fell back to 0 -> no cross-call batching)
- _beginDrawPlan: push base/hash stacks instead of resetting; reserve is based on frame-global count + plan nodes to avoid mid-frame reallocations
- _endDrawPlan: pop stacks; nested plans flush + rewindTo to free their rows
- setView: conditional flush (only on real view change) to stop breaking batches on every render() call that re-applies the same camera view
- WebGpuTransformStorage.getBuffer: delta upload via consumeDirtyRange instead of full-buffer writeBuffer on every flush boundary
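The effect of the conditional setView flush can be sketched with a toy batching model. Everything here is hypothetical (BatchModel, View, drawCallsFor); it only counts flushes and does not mirror the real backend classes.

```typescript
// Hypothetical model of a batching context; counts draw calls, renders nothing.
interface View {
    x: number;
    y: number;
    zoom: number;
}

const sameView = (a: View, b: View): boolean =>
    a.x === b.x && a.y === b.y && a.zoom === b.zoom;

class BatchModel {
    drawCalls = 0;
    private pending = 0;
    private view: View | null = null;
    private readonly flushOnEverySetView: boolean;

    constructor(flushOnEverySetView: boolean) {
        this.flushOnEverySetView = flushOnEverySetView;
    }

    /** Flush either unconditionally (old behavior) or only on a real change. */
    setView(next: View): void {
        const changed = this.view === null || !sameView(this.view, next);
        if (this.flushOnEverySetView || changed) this.flush();
        this.view = { ...next };
    }

    /** Queue one drawable into the active batch. */
    render(): void {
        this.pending++;
    }

    /** Emit one draw call for everything queued so far, if anything. */
    flush(): void {
        if (this.pending === 0) return;
        this.drawCalls++;
        this.pending = 0;
    }
}

function drawCallsFor(model: BatchModel, sprites: number): number {
    const camera: View = { x: 0, y: 0, zoom: 1 };
    for (let i = 0; i < sprites; i++) {
        model.setView(camera); // every render call re-applies the same view
        model.render();
    }
    model.flush(); // trailing flush at the end of the frame
    return model.drawCalls;
}
```

In this model, 1000 per-call renders produce 1000 draw calls with the unconditional flush and 1 with the conditional one.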
…gin() After _growBuffer creates a new empty GPUBuffer, set _needsFullUpload=true. In getBuffer, always consumeDirtyRange first (clears stale range), then branch: full [0,count) upload when _needsFullUpload, else delta rowCount>0. Mirrors WebGl2's full-upload-on-grow so mid-frame reallocated slots are never read as uninitialized transforms by the shader. Also removes the dead begin(nodeCount) wrapper — callers use buffer.begin() directly.
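The branch described above can be sketched as a small decision helper. UploadPlanner and its methods are hypothetical names for illustration, and the sketch assumes the dirty range has already been consumed from the CPU-side buffer before plan() is called.

```typescript
// Hypothetical sketch of the upload decision; it issues no GPU calls.
interface UploadRange {
    start: number;
    rowCount: number;
}

class UploadPlanner {
    private needsFullUpload = false;

    /** Call after the GPU buffer was reallocated: the new buffer holds no data. */
    markGrown(): void {
        this.needsFullUpload = true;
    }

    /**
     * `dirty` is the already-consumed dirty range (rowCount 0 means nothing dirty);
     * `count` is the number of rows written so far this frame.
     */
    plan(dirty: UploadRange, count: number): UploadRange | null {
        if (this.needsFullUpload) {
            if (count === 0) return null; // nothing to copy yet, keep the flag set
            this.needsFullUpload = false;
            return { start: 0, rowCount: count }; // full [0, count) upload
        }
        return dirty.rowCount > 0 ? dirty : null; // delta upload or skip
    }
}
```

For example, after markGrown() a dirty range of one row at slot 6 with count 7 still yields a full upload of rows 0..6, so rows written before the reallocation are present in the new buffer.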
…scoped slots The prior commit removed WebGpuTransformStorage.begin() as dead code, but it has ~25 test call sites (30 tests broke). Restore it. Also the webgpu-backend RenderTexture+Sprite test asserted the sprite transform in slot 0, but with frame-scoped batching the graphics-into-RT is slot 0 and the sprite lands in slot 1 — read slot 1 (transform verified: tx=24, ty=18 present after the full-upload-on-grow). Full exojs project green (2510); no other regressions. Process note: the exojs unit project (test/**) was not run during the earlier tasks — only rendering-perf + browser-webgl; this surfaced both issues.
Autofix: sort the playRenderTree import in the perf harness; prettier-format WebGpuTransformStorage after the begin() restore. verify:quick green.
The particle GPU-injection test mocks the backend via Object.create(prototype), bypassing the constructor that initializes _planBaseStack/_planHashStack (used by _beginDrawPlan since the cross-call batching work). Seed them like the existing device mock. Full test suite green (3609).
…nge tests
- RenderingContext: setView flushes only on view change (not unconditionally); correctness rests on trailing flush() and renderer-switch flushes
- RenderInstruction: nodeIndex is frame-global [frameBase, frameBase+nodeCount), not plan-local [0, nodeCount)
- WebGpuTransformStorage: clarify consumeDirtyRange is inside the upload branch only; add upload-guard note explaining why a skipped flush is safe
- WebGl2Backend: same upload-guard safety note as WebGpu counterpart
- test: 6 new TransformBuffer dirty-range cases (consumeDirtyRange sentinel, coverage+self-clearing, below-HWM reuse, clamping, rewindTo, begin reset)
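The frame-global nodeIndex range [frameBase, frameBase+nodeCount) can be sketched with a small slot allocator. FrameSlotAllocator is a hypothetical stand-in, not the real RenderPlanBuilder or TransformBuffer.

```typescript
// Hypothetical model: each plan's node indices start at the current frame count.
class FrameSlotAllocator {
    private count = 0;

    /** Reset once per frame, not once per plan. */
    beginFrame(): void {
        this.count = 0;
    }

    /** Reserve `nodeCount` slots and return [frameBase, frameBase + nodeCount). */
    allocate(nodeCount: number): number[] {
        const frameBase = this.count;
        this.count += nodeCount;
        return Array.from({ length: nodeCount }, (_, i) => frameBase + i);
    }
}

const slots = new FrameSlotAllocator();
slots.beginFrame();
const planA = slots.allocate(2); // first render call of the frame
const planB = slots.allocate(3); // second render call gets distinct slots
slots.beginFrame();
const nextFrame = slots.allocate(1); // indices restart on the next frame
```

With plan-local indices both calls would have started at slot 0 and overwritten each other's transforms unless the batch was flushed between them.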
Problem
Calling context.render(node) once per drawable in a Scene.draw loop emitted one GPU draw call per call: 1600 sprites sharing one texture became 1600 draw calls. The debug-layer/performance-overlay example did exactly this. This is not a regression (the render path was unchanged since v0.14) but a structural property: the per-call path could never batch, and on weaker hardware the per-draw-call cost tanks the frame rate.
Fix (transparent, no public API change)
Move the transform-buffer lifecycle from per-render()-call to per-frame so per-call renders within a frame accumulate into one batch:
- setView flushes the active batch only on a real view change (was unconditional → one flush per call).
- _transformBuffer.begin() moves from _beginDrawPlan (per plan) to resetStats (per frame).
- RenderPlanBuilder bases node indices at the current frame buffer count → every render call writes distinct transform slots.
- Transform uploads are delta-only (TransformBuffer.consumeDirtyRange), avoiding O(N²) re-upload in barrier-heavy frames, with a full-upload-on-grow safeguard.
WebGl2 + WebGPU both updated (shared TransformBuffer/RenderPlanBuilder).
Measured (real GPU, RTX 5070 Ti, WebGL2, per-call path)
The per-call path now matches the Container path. The performance-overlay example is also updated to use a Container.
Tests
- pnpm test (exojs 2510 + extensions + rendering-perf): 3609 pass / 1 skip
- pnpm test:browser:webgl:chromium: 149/149
- pnpm verify:quick: green (typecheck ×4, lint:all, format:check, docs:api:check)
- TransformBuffer dirty-range unit tests.
Caveat
The WebGPU render path is not headless-device-testable here (requestDevice fails: missing dxil.dll); its browser tests skip. WebGPU parity is verified by analogy to the fully-tested WebGl2 path + a mock-device unit test that exercises the upload mechanics. A one-time manual run on a real WebGPU device is the honest closing step before leaning on it in production.
Follow-up
Track B (collect/plan throughput — retained/dirty-tracked plan toward Pixi-class scaling) will build on this branch as a separate PR.