perf(rendering): cross-call render batching (Track A) by Exoridus · Pull Request #213 · Exoridus/ExoJS

Exoridus · 2026-06-28T15:47:48Z

Problem

Calling context.render(node) once per drawable in a Scene.draw loop emitted one GPU draw call per call — 1600 sprites sharing one texture became 1600 draw calls. The debug-layer/performance-overlay example did exactly this. Not a regression (the render path was unchanged since v0.14); a structural property: the per-call path could never batch, and on weaker hardware the per-draw-call cost tanks the frame rate.

Fix (transparent — no public API change)

Move the transform-buffer lifecycle from per-render()-call to per-frame so per-call renders within a frame accumulate into one batch:

- setView flushes the active batch only on a real view change (was unconditional → one flush per call).
- _transformBuffer.begin() moves from _beginDrawPlan (per plan) to resetStats (per frame).
- RenderPlanBuilder bases node indices at the current frame buffer count → every render call writes distinct transform slots.
- Nested plans (filters / cacheAsBitmap) isolate their slots via flush-before-rewind + count/hash restore.
- The transform texture/storage uploads only the dirty row range per flush (TransformBuffer.consumeDirtyRange), avoiding O(N²) re-upload in barrier-heavy frames, with a full-upload-on-grow safeguard.
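
The effect of the conditional setView flush can be illustrated with a toy model. This is a minimal sketch, not ExoJS code: `ToyContext`, `CameraView`, `drawFrame`, and the `flushOnEverySetView` switch are hypothetical names, and a "draw call" is just a counter incremented whenever a non-empty batch is flushed.

```typescript
// Toy model of the batching lifecycle. NOT ExoJS code; all names are hypothetical.
type CameraView = { readonly id: number };

class ToyContext {
    drawCalls = 0;
    private pending = 0; // sprites accumulated in the active batch
    private view: CameraView | null = null;
    private readonly flushOnEverySetView: boolean;

    constructor(flushOnEverySetView: boolean) {
        this.flushOnEverySetView = flushOnEverySetView;
    }

    setView(view: CameraView): void {
        const changed = this.view === null || this.view.id !== view.id;
        // Old behaviour: flush unconditionally. New behaviour: flush only on a real change.
        if (this.flushOnEverySetView || changed) {
            this.flush();
        }
        this.view = view;
    }

    render(view: CameraView): void {
        this.setView(view); // every render() re-applies the same camera view
        this.pending++;
    }

    flush(): void {
        if (this.pending > 0) {
            this.drawCalls++;
            this.pending = 0;
        }
    }
}

function drawFrame(flushOnEverySetView: boolean, spriteCount: number): number {
    const context = new ToyContext(flushOnEverySetView);
    const camera: CameraView = { id: 1 };
    for (let i = 0; i < spriteCount; i++) {
        context.render(camera);
    }
    context.flush(); // trailing end-of-frame flush
    return context.drawCalls;
}

console.log(drawFrame(true, 1600));  // 1600: one flush per render() call
console.log(drawFrame(false, 1600)); // 1: the whole frame accumulates into one batch
```

The model ignores texture changes, batch capacity, and nested plans, which is why it reports exactly one draw where the real measurements show a handful at higher sprite counts.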

WebGl2 + WebGPU both updated (shared TransformBuffer/RenderPlanBuilder).

Measured (real GPU, RTX 5070 Ti, WebGL2, per-call path)

Sprites	before	after
1600	60fps, 7.8ms, 1600 draws	60fps, 2.4ms, 1 draw
5000	30fps, 32ms, 4999 draws	60fps, 6.9ms, 2 draws
10000	15fps, 65ms, 9982 draws	60fps, 14.1ms, 3 draws
20000	(too slow)	30fps, 30.1ms, 5 draws

The per-call path now matches the Container path. The performance-overlay example is also updated to use a Container.

Tests

pnpm test (exojs 2510 + extensions + rendering-perf): 3609 pass / 1 skip
pnpm test:browser:webgl:chromium: 149/149
pnpm verify:quick: green (typecheck ×4, lint:all, format:check, docs:api:check)
New: cross-call structural regression test, per-call-vs-Container equality test, TransformBuffer dirty-range unit tests.

Caveat

The WebGPU render path is not headless-device-testable here (requestDevice fails — missing dxil.dll); its browser tests skip. WebGPU parity is verified by analogy to the fully-tested WebGl2 path + a mock-device unit test that exercises the upload mechanics. A one-time manual run on a real WebGPU device is the honest closing step before leaning on it in production.

Follow-up

Track B (collect/plan throughput — retained/dirty-tracked plan toward Pixi-class scaling) will build on this branch as a separate PR.

The overlay rendered 1600 sprites with one context.render() per sprite, emitting one draw call each. Adding them to a Container and rendering it once batches them into a single draw call (7.8ms -> 1.8ms on the spike).

…hing setView flushes only on real view change; the transform buffer resets once per frame; the plan builder bases node indices at the frame buffer count; nested plans isolate their rows. Per-call renders now batch (1000 -> 1 draw). Leaves the barrier-path allocation gate red — fixed in the next commit.

A frame-scoped buffer made barrier flushes re-upload a growing buffer (O(N^2)). Uploading only [uploadedRows, count) per flush via commitRect makes it O(N) while keeping the cross-frame hash-guard skip. Fixes the effect-barrier gate.
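
The O(N^2) versus O(N) claim can be checked with simple row counting. The sketch below assumes a worst-case frame with one barrier flush after every appended row; `rowsUploaded` and its `strategy` parameter are hypothetical and only model the high-water-mark delta described in this commit.

```typescript
// Counts rows sent to the GPU over one frame with a flush after every appended row.
// Hypothetical model: "rows" stand in for transform-buffer rows.
function rowsUploaded(totalRows: number, strategy: "full" | "delta"): number {
    let uploaded = 0;     // running total of rows uploaded
    let uploadedRows = 0; // high-water mark already on the GPU (delta only)
    for (let count = 1; count <= totalRows; count++) {
        if (strategy === "full") {
            uploaded += count; // re-upload [0, count) on every flush
        } else {
            uploaded += count - uploadedRows; // upload only [uploadedRows, count)
            uploadedRows = count;
        }
    }
    return uploaded;
}

console.log(rowsUploaded(1000, "full"));  // 500500 = N(N+1)/2
console.log(rowsUploaded(1000, "delta")); // 1000 = N
```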

…ter mark Task 3's delta upload tracked only the highest uploaded row, so a slot reused below that mark (a filter composite reusing a row a nested plan had rewound) was never re-uploaded, leaving stale transform data — the filter-boundary browser test rendered the wrong color. Track the exact written-slot range [dirtyMin, dirtyMax] in TransformBuffer instead; the delta upload pushes precisely the changed rows regardless of reuse. Restores filter-boundary (browser 149/149), keeps effect-barrier under budget and cross-call batching.
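
A minimal sketch of exact dirty-range tracking follows, assuming a buffer that hands out slots sequentially. `DirtyRangeBuffer` is a hypothetical stand-in, not the ExoJS TransformBuffer, and the `{ start, rowCount }` return shape (with `null` for "nothing dirty") is an assumption made for illustration.

```typescript
// Hypothetical sketch of [dirtyMin, dirtyMax] tracking; not the ExoJS TransformBuffer.
class DirtyRangeBuffer {
    private count = 0;           // next free slot
    private dirtyMin = Infinity; // lowest slot written since the last consume
    private dirtyMax = -1;       // highest slot written since the last consume

    write(): number {
        const slot = this.count++;
        this.dirtyMin = Math.min(this.dirtyMin, slot);
        this.dirtyMax = Math.max(this.dirtyMax, slot);
        return slot;
    }

    // Frees the rows at and above `count`, e.g. after a nested plan finishes.
    rewindTo(count: number): void {
        this.count = Math.min(this.count, count);
    }

    // Returns the rows to upload and clears the range; null when nothing changed.
    consumeDirtyRange(): { start: number; rowCount: number } | null {
        if (this.dirtyMax < this.dirtyMin) {
            return null;
        }
        const range = { start: this.dirtyMin, rowCount: this.dirtyMax - this.dirtyMin + 1 };
        this.dirtyMin = Infinity;
        this.dirtyMax = -1;
        return range;
    }
}

const transforms = new DirtyRangeBuffer();
for (let i = 0; i < 4; i++) {
    transforms.write(); // slots 0..3
}
console.log(transforms.consumeDirtyRange()); // { start: 0, rowCount: 4 }
transforms.rewindTo(2);                      // a nested plan frees slots 2..3
transforms.write();                          // slot 2 is reused below the old high-water mark
console.log(transforms.consumeDirtyRange()); // { start: 2, rowCount: 1 }
console.log(transforms.consumeDirtyRange()); // null
```

A tracker that remembers only the highest uploaded row (4 here) would compute an empty range after the reuse and skip the upload, which is the stale-slot failure this commit describes.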

Mirror the WebGl2 backend's Tasks 2-4 lifecycle changes onto WebGPU:
- TransformBuffer is now frame-scoped (reset in resetStats, not per plan)
- Add transformBufferCount getter so RenderPlanBuilder offsets node indices correctly for WebGPU (previously fell back to 0 -> no cross-call batching)
- _beginDrawPlan: push base/hash stacks instead of resetting; reserve is based on frame-global count + plan nodes to avoid mid-frame reallocations
- _endDrawPlan: pop stacks; nested plans flush + rewindTo to free their rows
- setView: conditional flush (only on real view change) to stop breaking batches on every render() call that re-applies the same camera view
- WebGpuTransformStorage.getBuffer: delta upload via consumeDirtyRange instead of full-buffer writeBuffer on every flush boundary
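
The base-stack idea can be sketched as follows. This is an illustrative model only: `FrameSlots` and its methods are hypothetical, it reserves a plan's rows up front rather than on write, and it omits the hash stack and the flush that precedes a rewind.

```typescript
// Hypothetical model of frame-global slot bases with nested-plan rewind.
class FrameSlots {
    count = 0; // rows in use this frame
    private baseStack: number[] = [];

    beginFrame(): void {
        this.count = 0;
        this.baseStack.length = 0;
    }

    // Returns the frame-global base index for this plan's nodes.
    beginPlan(nodeCount: number): number {
        const base = this.count;
        this.baseStack.push(base);
        this.count += nodeCount;
        return base;
    }

    endPlan(): void {
        const base = this.baseStack.pop();
        if (base === undefined) {
            throw new Error("endPlan without beginPlan");
        }
        // A plan that ends while another is still open is nested: free its rows.
        if (this.baseStack.length > 0) {
            this.count = base;
        }
    }
}

const slots = new FrameSlots();
slots.beginFrame();
const firstBase = slots.beginPlan(3);  // 0: top-level render call, rows 0..2
slots.endPlan();
const secondBase = slots.beginPlan(2); // 3: next render call gets distinct rows 3..4
const nestedBase = slots.beginPlan(4); // 5: nested plan (e.g. a filter) uses rows 5..8
slots.endPlan();                       // nested rows are released
slots.endPlan();
console.log(firstBase, secondBase, nestedBase, slots.count); // 0 3 5 5
```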

…gin() After _growBuffer creates a new empty GPUBuffer, set _needsFullUpload=true. In getBuffer, always consumeDirtyRange first (clears stale range), then branch: full [0,count) upload when _needsFullUpload, else delta rowCount>0. Mirrors WebGl2's full-upload-on-grow so mid-frame reallocated slots are never read as uninitialized transforms by the shader. Also removes the dead begin(nodeCount) wrapper — callers use buffer.begin() directly.
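
The full-upload-on-grow safeguard can be sketched with a plain array standing in for the GPU buffer. `ToyStorage`, `RowRange`, and `sync` are hypothetical names, and the model folds growth and upload into one method, whereas the commit describes them as separate steps (_growBuffer and getBuffer).

```typescript
// Hypothetical storage model; a number[] stands in for a GPUBuffer.
type RowRange = { start: number; rowCount: number };

class ToyStorage {
    uploads: RowRange[] = []; // log of uploaded ranges
    private gpuRows: number[];
    private needsFullUpload = false;

    constructor(capacity: number) {
        // NaN marks rows the "GPU" has never received.
        this.gpuRows = new Array<number>(capacity).fill(Number.NaN);
    }

    // `cpuRows` holds the current transforms; `dirty` is the consumed dirty range (or null).
    sync(cpuRows: readonly number[], dirty: RowRange | null): void {
        if (cpuRows.length > this.gpuRows.length) {
            // Growing allocates a fresh, empty buffer: previously uploaded rows are gone.
            this.gpuRows = new Array<number>(cpuRows.length * 2).fill(Number.NaN);
            this.needsFullUpload = true;
        }
        let range: RowRange | null = dirty;
        if (this.needsFullUpload) {
            range = { start: 0, rowCount: cpuRows.length };
            this.needsFullUpload = false;
        }
        if (range === null || range.rowCount === 0) {
            return;
        }
        const start = range.start;
        cpuRows.slice(start, start + range.rowCount).forEach((value, offset) => {
            this.gpuRows[start + offset] = value;
        });
        this.uploads.push(range);
    }

    hasUninitialisedRows(rowCount: number): boolean {
        return this.gpuRows.slice(0, rowCount).some((row) => Number.isNaN(row));
    }
}

const toyStorage = new ToyStorage(2);
toyStorage.sync([10, 11], { start: 0, rowCount: 2 });
toyStorage.sync([10, 11, 12], { start: 2, rowCount: 1 }); // exceeds capacity, so the buffer grows
console.log(toyStorage.uploads[1]);              // { start: 0, rowCount: 3 }
console.log(toyStorage.hasUninitialisedRows(3)); // false
```

Without the flag, the second call would upload only row 2 into the new buffer and rows 0 and 1 would remain uninitialised.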

…scoped slots The prior commit removed WebGpuTransformStorage.begin() as dead code, but it has ~25 test call sites (30 tests broke). Restore it. Also the webgpu-backend RenderTexture+Sprite test asserted the sprite transform in slot 0, but with frame-scoped batching the graphics-into-RT is slot 0 and the sprite lands in slot 1 — read slot 1 (transform verified: tx=24, ty=18 present after the full-upload-on-grow). Full exojs project green (2510); no other regressions. Process note: the exojs unit project (test/**) was not run during the earlier tasks — only rendering-perf + browser-webgl; this surfaced both issues.

Autofix: sort the playRenderTree import in the perf harness; prettier-format WebGpuTransformStorage after the begin() restore. verify:quick green.

The particle GPU-injection test mocks the backend via Object.create(prototype), bypassing the constructor that initializes _planBaseStack/_planHashStack (used by _beginDrawPlan since the cross-call batching work). Seed them like the existing device mock. Full test suite green (3609).
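
The failure mode is general JavaScript behaviour and can be reproduced without ExoJS. In the sketch below, `Backend` and `beginPlan` are hypothetical stand-ins; only the `_planBaseStack` field name is taken from the commit message.

```typescript
// Hypothetical stand-in for a backend whose fields are initialised during construction.
class Backend {
    private _planBaseStack: number[] = [];

    beginPlan(base: number): number {
        this._planBaseStack.push(base);
        return this._planBaseStack.length;
    }
}

// Object.create sets up the prototype chain but never runs the constructor,
// so the field initialiser above does not execute.
const mock = Object.create(Backend.prototype) as Backend;

let threw = false;
try {
    mock.beginPlan(0); // _planBaseStack is undefined here
} catch {
    threw = true;
}
console.log(threw); // true

// Seeding the field by hand, as the test fix does, makes the mock usable.
(mock as unknown as { _planBaseStack: number[] })._planBaseStack = [];
console.log(mock.beginPlan(0)); // 1
```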

…nge tests
- RenderingContext: setView flushes only on view change (not unconditionally); correctness rests on trailing flush() and renderer-switch flushes
- RenderInstruction: nodeIndex is frame-global [frameBase, frameBase+nodeCount), not plan-local [0, nodeCount)
- WebGpuTransformStorage: clarify consumeDirtyRange is inside the upload branch only; add upload-guard note explaining why a skipped flush is safe
- WebGl2Backend: same upload-guard safety note as WebGpu counterpart
- test: 6 new TransformBuffer dirty-range cases (consumeDirtyRange sentinel, coverage+self-clearing, below-HWM reuse, clamping, rewindTo, begin reset)

Exoridus added 12 commits June 28, 2026 15:00

fix(examples): batch performance-overlay sprites via a Container

8c84bb2

test(rendering): add cross-call sprite batching regression test (red)

cea0c6d

test(rendering): per-call render output matches Container render

f10e36c

chore(rendering): fix import order + prettier formatting

afb0ec7

Exoridus merged commit 25f5fef into feat/v0.15 Jun 28, 2026
1 check passed

Exoridus deleted the feat/v0.15-cross-call-batching branch June 28, 2026 17:44

Exoridus merged 12 commits into feat/v0.15 from feat/v0.15-cross-call-batching
