perf(rendering): cross-call render batching (Track A) #213

Merged
Exoridus merged 12 commits into feat/v0.15 from feat/v0.15-cross-call-batching on Jun 28, 2026

Conversation

@Exoridus

Owner

Problem

Calling context.render(node) once per drawable in a Scene.draw loop emitted one GPU draw call per render() call: 1600 sprites sharing one texture became 1600 draw calls. The debug-layer/performance-overlay example did exactly this. This is not a regression (the render path was unchanged since v0.14) but a structural property: the per-call path could never batch, and on weaker hardware the per-draw-call cost tanks the frame rate.
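To make the structural problem concrete, here is a minimal counting model. It is a sketch only: `SpriteLike`, `CountingRenderer` and `countDrawCalls` are hypothetical stand-ins, not ExoJS classes, and "draw call" is just a counter.

```typescript
// Hypothetical stand-ins for illustration; these are not ExoJS types.
interface SpriteLike {
    texture: string;
}

class CountingRenderer {
    drawCalls = 0;
    private pendingCount = 0;
    private pendingTexture: string | null = null;
    private readonly flushPerCall: boolean;

    // flushPerCall = true models the old per-call path.
    constructor(flushPerCall: boolean) {
        this.flushPerCall = flushPerCall;
    }

    render(sprite: SpriteLike): void {
        // A texture change breaks the batch in either mode.
        if (this.pendingCount > 0 && this.pendingTexture !== sprite.texture) {
            this.flush();
        }
        this.pendingTexture = sprite.texture;
        this.pendingCount++;
        if (this.flushPerCall) {
            this.flush();
        }
    }

    flush(): void {
        if (this.pendingCount === 0) return;
        this.drawCalls++;
        this.pendingCount = 0;
    }
}

function countDrawCalls(spriteCount: number, flushPerCall: boolean): number {
    const renderer = new CountingRenderer(flushPerCall);
    const sprite: SpriteLike = { texture: 'shared-atlas' };
    for (let i = 0; i < spriteCount; i++) {
        renderer.render(sprite);
    }
    renderer.flush(); // end-of-frame flush
    return renderer.drawCalls;
}
```

With 1600 sprites on one texture, the per-call mode counts 1600 draws and the accumulating mode counts 1.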

Fix (transparent — no public API change)

Move the transform-buffer lifecycle from per-render()-call to per-frame so per-call renders within a frame accumulate into one batch:

  • setView flushes the active batch only on a real view change (was unconditional → one flush per call).
  • _transformBuffer.begin() moves from _beginDrawPlan (per plan) to resetStats (per frame).
  • RenderPlanBuilder bases node indices at the current frame buffer count → every render call writes distinct transform slots.
  • Nested plans (filters / cacheAsBitmap) isolate their slots via flush-before-rewind + count/hash restore.
  • The transform texture/storage uploads only the dirty row range per flush (TransformBuffer.consumeDirtyRange), avoiding O(N²) re-upload in barrier-heavy frames, with a full-upload-on-grow safeguard.
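The first three bullets can be sketched as a small state machine. This is an illustration under assumptions, not the backend code: `FrameBatcher` and `simulateFrame` are hypothetical names, a view is reduced to a string id, and each render call writes one transform row.

```typescript
// Hypothetical model of the per-frame lifecycle; not the ExoJS backend.
class FrameBatcher {
    drawCalls = 0;
    private view: string | null = null;
    private rowCount = 0;   // transform rows written this frame
    private batchStart = 0; // first row not yet drawn

    // Plays the role of the per-frame reset (resetStats in the PR).
    beginFrame(): void {
        this.rowCount = 0;
        this.batchStart = 0;
        this.drawCalls = 0;
    }

    // Flush only when the view really changes.
    setView(view: string): void {
        if (view === this.view) return;
        this.flush();
        this.view = view;
    }

    // Each call is based at the current frame row count, so calls never share slots.
    render(nodeCount: number): number {
        const base = this.rowCount;
        this.rowCount += nodeCount;
        return base;
    }

    flush(): void {
        if (this.rowCount === this.batchStart) return;
        this.drawCalls++;
        this.batchStart = this.rowCount;
    }
}

function simulateFrame(views: string[]): { bases: number[]; drawCalls: number } {
    const batcher = new FrameBatcher();
    batcher.beginFrame();
    const bases: number[] = [];
    for (const view of views) {
        batcher.setView(view);
        bases.push(batcher.render(1));
    }
    batcher.flush();
    return { bases, drawCalls: batcher.drawCalls };
}
```

Three calls under the same view get bases 0, 1, 2 and one draw. Alternating views still get distinct bases but flush at each change.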

WebGl2 + WebGPU both updated (shared TransformBuffer/RenderPlanBuilder).

Measured (real GPU, RTX 5070 Ti, WebGL2, per-call path)

| Sprites | Before | After |
| --- | --- | --- |
| 1600 | 60fps, 7.8ms, 1600 draws | 60fps, 2.4ms, 1 draw |
| 5000 | 30fps, 32ms, 4999 draws | 60fps, 6.9ms, 2 draws |
| 10000 | 15fps, 65ms, 9982 draws | 60fps, 14.1ms, 3 draws |
| 20000 | (too slow) | 30fps, 30.1ms, 5 draws |
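The frame rates in the table are consistent with a vsynced 60 Hz display, where a frame that misses the 16.7 ms budget waits for the next refresh. That reading is an assumption: the PR states neither the refresh rate nor that the millisecond figure is per-frame time. A small check of the arithmetic, with `vsyncFps` and `speedup` as hypothetical helpers:

```typescript
// Assumption: 60 Hz display with vsync; the ms column is per-frame time.
const REFRESH_HZ = 60;
const FRAME_BUDGET_MS = 1000 / REFRESH_HZ; // about 16.67 ms

// A frame occupies a whole number of refresh intervals.
function vsyncFps(frameMs: number): number {
    const intervals = Math.max(1, Math.ceil(frameMs / FRAME_BUDGET_MS));
    return REFRESH_HZ / intervals;
}

// Ratio of frame times before and after the change.
function speedup(beforeMs: number, afterMs: number): number {
    return beforeMs / afterMs;
}

// Rows copied from the table (the 20000 row has no "before" timing).
const measured = [
    { sprites: 1600, beforeMs: 7.8, afterMs: 2.4 },
    { sprites: 5000, beforeMs: 32, afterMs: 6.9 },
    { sprites: 10000, beforeMs: 65, afterMs: 14.1 },
];

const speedups = measured.map((m) => ({
    sprites: m.sprites,
    factor: speedup(m.beforeMs, m.afterMs),
}));
```

Under that assumption, 32 ms spans two refresh intervals (30 fps), 65 ms spans four (15 fps), and 14.1 ms fits in one (60 fps). The frame-time ratio is 3.25x at 1600 sprites and about 4.6x at 5000 and 10000.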

The per-call path now matches the Container path. The performance-overlay example is also updated to use a Container.

Tests

  • pnpm test (exojs 2510 + extensions + rendering-perf): 3609 pass / 1 skip
  • pnpm test:browser:webgl:chromium: 149/149
  • pnpm verify:quick: green (typecheck ×4, lint:all, format:check, docs:api:check)
  • New: cross-call structural regression test, per-call-vs-Container equality test, TransformBuffer dirty-range unit tests.

Caveat

The WebGPU render path is not headless-device-testable here (requestDevice fails — missing dxil.dll); its browser tests skip. WebGPU parity is verified by analogy to the fully-tested WebGl2 path + a mock-device unit test that exercises the upload mechanics. A one-time manual run on a real WebGPU device is the honest closing step before leaning on it in production.

Follow-up

Track B (collect/plan throughput — retained/dirty-tracked plan toward Pixi-class scaling) will build on this branch as a separate PR.

Exoridus added 12 commits June 28, 2026 15:00
The overlay rendered 1600 sprites with one context.render() per sprite,
emitting one draw call each. Adding them to a Container and rendering it
once batches them into a single draw call (7.8ms -> 1.8ms on the spike).
…hing

setView flushes only on real view change; the transform buffer resets once
per frame; the plan builder bases node indices at the frame buffer count;
nested plans isolate their rows. Per-call renders now batch (1000 -> 1 draw).
Leaves the barrier-path allocation gate red; fixed in the next commit.

A frame-scoped buffer made barrier flushes re-upload a growing buffer
(O(N^2)). Uploading only [uploadedRows, count) per flush via commitRect makes
it O(N) while keeping the cross-frame hash-guard skip. Fixes the effect-barrier
gate.
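
To see why the full re-upload is quadratic, count the rows uploaded over a frame with many barrier flushes. This is a sketch: `rowsUploaded` is a hypothetical helper that models only row counts, not the commitRect upload itself.

```typescript
// Counts transform rows uploaded over one frame.
// rowsPerFlush[i] is how many new rows were written before flush i.
function rowsUploaded(rowsPerFlush: number[], delta: boolean): number {
    let count = 0;        // rows in the frame-scoped buffer
    let uploadedRows = 0; // high-water mark of rows already on the GPU
    let total = 0;
    for (const rows of rowsPerFlush) {
        count += rows;
        if (delta) {
            // Upload only [uploadedRows, count).
            total += count - uploadedRows;
            uploadedRows = count;
        } else {
            // Re-upload the whole buffer [0, count).
            total += count;
        }
    }
    return total;
}

// 100 barrier flushes, 10 new rows before each one.
const barrierHeavyFrame: number[] = new Array<number>(100).fill(10);
const fullUploadRows = rowsUploaded(barrierHeavyFrame, false);
const deltaUploadRows = rowsUploaded(barrierHeavyFrame, true);
```

For this frame the full re-upload moves 50500 rows, while the delta upload moves 1000, which is exactly the number of rows written.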
…ter mark

Task 3's delta upload tracked only the highest uploaded row, so a slot reused
below that mark (a filter composite reusing a row a nested plan had rewound)
was never re-uploaded, leaving stale transform data — the filter-boundary
browser test rendered the wrong color. Track the exact written-slot range
[dirtyMin, dirtyMax] in TransformBuffer instead; the delta upload pushes
precisely the changed rows regardless of reuse. Restores filter-boundary
(browser 149/149), keeps effect-barrier under budget and cross-call batching.
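
A minimal sketch of exact dirty-range tracking, to show why it survives slot reuse where a high-water mark does not. `DirtyRows`, `mark` and `consume` are hypothetical names; the real TransformBuffer.consumeDirtyRange signature and sentinel value are not shown in this PR text.

```typescript
// Inclusive range of rows written since the last upload.
interface RowRange {
    min: number;
    max: number;
}

// Hypothetical tracker; not the ExoJS TransformBuffer.
class DirtyRows {
    private min = Infinity;
    private max = -1;

    mark(row: number): void {
        if (row < this.min) this.min = row;
        if (row > this.max) this.max = row;
    }

    // Returns the written range and clears it; null means nothing to upload.
    consume(): RowRange | null {
        if (this.max < 0) return null;
        const range: RowRange = { min: this.min, max: this.max };
        this.min = Infinity;
        this.max = -1;
        return range;
    }
}

function demoReuseBelowHighWaterMark(): Array<RowRange | null> {
    const dirty = new DirtyRows();
    const consumed: Array<RowRange | null> = [];
    for (let row = 0; row < 10; row++) dirty.mark(row);
    consumed.push(dirty.consume()); // rows 0..9
    dirty.mark(3);                  // a rewound slot is reused
    consumed.push(dirty.consume()); // row 3 only
    consumed.push(dirty.consume()); // nothing left
    return consumed;
}
```

A tracker that remembered only "rows up to 10 are uploaded" would skip the second write to row 3; the range tracker reports it as { min: 3, max: 3 }.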
Mirror the WebGl2 backend's Tasks 2-4 lifecycle changes onto WebGPU:

- TransformBuffer is now frame-scoped (reset in resetStats, not per plan)
- Add transformBufferCount getter so RenderPlanBuilder offsets node indices
  correctly for WebGPU (previously fell back to 0 -> no cross-call batching)
- _beginDrawPlan: push base/hash stacks instead of resetting; reserve is
  based on frame-global count + plan nodes to avoid mid-frame reallocations
- _endDrawPlan: pop stacks; nested plans flush + rewindTo to free their rows
- setView: conditional flush (only on real view change) to stop breaking
  batches on every render() call that re-applies the same camera view
- WebGpuTransformStorage.getBuffer: delta upload via consumeDirtyRange
  instead of full-buffer writeBuffer on every flush boundary
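
The _beginDrawPlan / _endDrawPlan items describe a stack discipline, which can be sketched as follows. `PlanScopes` is a hypothetical model that tracks only the base stack and row count; the hash stack and the actual GPU flush are left out.

```typescript
// Hypothetical model of nested plan scopes; not the WebGPU backend.
class PlanScopes {
    flushes = 0;
    private count = 0;                // rows used in the frame so far
    private baseStack: number[] = [];

    get rowCount(): number {
        return this.count;
    }

    // Push the current count as this plan's base instead of resetting to 0.
    beginPlan(nodeCount: number): number {
        const base = this.count;
        this.baseStack.push(base);
        this.count += nodeCount;
        return base;
    }

    endPlan(): void {
        const base = this.baseStack.pop();
        if (base === undefined) {
            throw new Error('endPlan called without a matching beginPlan');
        }
        if (this.baseStack.length > 0) {
            // Nested plan: flush draws that reference its rows, then free them.
            this.flushes++;
            this.count = base;
        }
        // Top-level plan: rows stay allocated until the frame ends.
    }
}
```

A nested plan (for example a filter pass) returns its rows when it ends, while a top-level plan keeps them so that the next render call in the same frame starts after them.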
…gin()

After _growBuffer creates a new empty GPUBuffer, set _needsFullUpload=true.
In getBuffer, always consumeDirtyRange first (clears stale range), then
branch: full [0,count) upload when _needsFullUpload, else delta rowCount>0.
Mirrors WebGl2's full-upload-on-grow so mid-frame reallocated slots are never
read as uninitialized transforms by the shader.

Also removes the dead begin(nodeCount) wrapper — callers use buffer.begin() directly.
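
The full-upload-on-grow branch described in this commit can be sketched as below. `GrowableStorage` is a hypothetical model that records upload ranges instead of calling a GPU API; the doubling growth policy is an assumption.

```typescript
// Inclusive dirty range as produced by a dirty-range tracker.
interface DirtyRange {
    min: number;
    max: number;
}

// Hypothetical storage model; uploads are recorded as [start, end) row ranges.
class GrowableStorage {
    uploads: Array<[number, number]> = [];
    private capacity: number;
    private needsFullUpload = false;

    constructor(capacity: number) {
        this.capacity = capacity;
    }

    // Growing replaces the GPU buffer with a new, empty one.
    ensureCapacity(rows: number): void {
        if (rows <= this.capacity) return;
        this.capacity = Math.max(rows, this.capacity * 2); // assumed growth policy
        this.needsFullUpload = true;
    }

    // The caller consumes the dirty range first, then passes it in.
    upload(count: number, dirty: DirtyRange | null): void {
        if (this.needsFullUpload) {
            this.uploads.push([0, count]); // the new buffer holds nothing yet
            this.needsFullUpload = false;
        } else if (dirty !== null) {
            this.uploads.push([dirty.min, dirty.max + 1]);
        }
    }
}
```

After a grow, a dirty range of only the newest rows would leave earlier rows uninitialized in the new buffer, so the first upload covers [0, count).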
…scoped slots

The prior commit removed WebGpuTransformStorage.begin() as dead code, but it has
~25 test call sites (30 tests broke). Restore it. Also the webgpu-backend
RenderTexture+Sprite test asserted the sprite transform in slot 0, but with
frame-scoped batching the graphics-into-RT is slot 0 and the sprite lands in
slot 1 — read slot 1 (transform verified: tx=24, ty=18 present after the
full-upload-on-grow). Full exojs project green (2510); no other regressions.

Process note: the exojs unit project (test/**) was not run during the earlier
tasks, only rendering-perf + browser-webgl; this surfaced both issues.

Autofix: sort the playRenderTree import in the perf harness; prettier-format
WebGpuTransformStorage after the begin() restore. verify:quick green.

The particle GPU-injection test mocks the backend via Object.create(prototype),
bypassing the constructor that initializes _planBaseStack/_planHashStack (used by
_beginDrawPlan since the cross-call batching work). Seed them like the existing
device mock. Full test suite green (3609).
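
The failure mode is general JavaScript behavior: Object.create(prototype) never runs the constructor, so instance fields stay undefined. A minimal reproduction with hypothetical names (`BackendLike` and `makeMockBackend` are illustrative, not the real backend or test helper):

```typescript
// Hypothetical backend with a constructor-initialized stack.
class BackendLike {
    private planBaseStack: number[] = [];

    beginPlan(base: number): number {
        this.planBaseStack.push(base);
        return this.planBaseStack.length;
    }
}

// Builds a mock the way the test does: prototype only, no constructor call.
function makeMockBackend(seedStacks: boolean): BackendLike {
    const mock = Object.create(BackendLike.prototype) as BackendLike;
    if (seedStacks) {
        // Seed the field the constructor would have initialized.
        (mock as unknown as { planBaseStack: number[] }).planBaseStack = [];
    }
    return mock;
}
```

Calling beginPlan on the unseeded mock throws a TypeError because planBaseStack is undefined; the seeded mock behaves like a constructed instance.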
…nge tests

- RenderingContext: setView flushes only on view change (not unconditionally);
  correctness rests on trailing flush() and renderer-switch flushes
- RenderInstruction: nodeIndex is frame-global [frameBase, frameBase+nodeCount),
  not plan-local [0, nodeCount)
- WebGpuTransformStorage: clarify consumeDirtyRange is inside the upload branch only;
  add upload-guard note explaining why a skipped flush is safe
- WebGl2Backend: same upload-guard safety note as WebGpu counterpart
- test: 6 new TransformBuffer dirty-range cases (consumeDirtyRange sentinel,
  coverage+self-clearing, below-HWM reuse, clamping, rewindTo, begin reset)
@Exoridus Exoridus merged commit 25f5fef into feat/v0.15 Jun 28, 2026
1 check passed
@Exoridus Exoridus deleted the feat/v0.15-cross-call-batching branch June 28, 2026 17:44