Skip to content

feat: prototype allocator-level OOM protection (OomGuard breaker + cooperative real-usage accounting)#4582

Draft
andygrove wants to merge 24 commits into
apache:mainfrom
andygrove:oom-guard-circuit-breaker
Draft

feat: prototype allocator-level OOM protection (OomGuard breaker + cooperative real-usage accounting)#4582
andygrove wants to merge 24 commits into
apache:mainfrom
andygrove:oom-guard-circuit-breaker

Conversation

@andygrove

@andygrove andygrove commented Jun 3, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

Relates to #4576. This is an exploratory prototype covering both halves of that issue: the RSS circuit breaker ("OomGuard") and cooperative online accounting that feeds the real allocator balance into DataFusion's MemoryPool. It is not a complete, production-default implementation, so it does not close the issue.

Rationale for this change

Comet's memory accounting relies on voluntary MemoryPool reservations, which miss allocations made by Arrow buffers, join scratch space, and expression kernels. Two problems follow:

  1. When real native memory exceeds the container limit, the OS/YARN/Kubernetes kills the entire executor JVM, losing every task and all cached data on it.
  2. Because the pool undercounts real usage, users compensate by manually lowering spark.comet.exec.memoryPool.fraction, a crude, workload-specific knob.

This prototype attacks both by tracking the real bytes the global allocator hands out and using that signal in two layers:

  • A cooperative gate that rejects pool growth (triggering a DataFusion spill and retry) once real usage plus the request would exceed the off-heap budget. This lets the pool act on real usage rather than tracked reservations alone, so fraction no longer needs tuning.
  • A last-resort, executor-global circuit breaker that fails a single task with a retriable ResourcesExhausted error instead of letting the executor get OOM-killed.

The byte-tracking allocator adapts the AccountingAllocator from apache/datafusion#22626 rather than depending on it, since that code lives in DataFusion's test-only sqllogictest crate.

What changes are included in this PR?

Gated behind a new oom-guard cargo feature; the default build is unchanged with zero added per-allocation overhead.

Allocator accounting and circuit breaker:

  • native/core/src/execution/memory_pools/oom_guard.rs (new): AccountingAllocator<A> wrapping the inner global allocator; a single process-wide balance with per-thread drift settled at a 64 KiB threshold; arm/disarm/stamp_current_thread/current_balance; a typed OomGuardPanic, raised via panic_any on an armed, stamped thread that crosses the limit, with reentrancy protection so the panic's own boxing allocation does not recurse.
  • native/core/src/lib.rs: under the oom-guard feature, installs the wrapper as #[global_allocator] over jemalloc / mimalloc / system; mutually exclusive cfgs leave the default build untouched.

Cooperative real-usage gate:

  • native/core/src/execution/memory_pools/real_usage_pool.rs (new): RealUsagePool, a MemoryPool decorator that checks the real allocator balance against a process-global ceiling before delegating growth to the inner pool, returning ResourcesExhausted (which DataFusion catches to spill and retry) when real usage plus the request would exceed the ceiling. It composes around any pool type and adds one relaxed atomic read per try_grow.

Wiring and config:

  • native/core/src/execution/jni_api.rs: stamps tokio worker threads (on_thread_start) and the JNI caller thread; arms the guard from config in createPlan; wraps the memory pool in RealUsagePool (ceiling = the off-heap budget) when the guard is enabled; maps OomGuardPanic to DataFusionError::ResourcesExhausted at both execution boundaries (the spawned/channel path, both producer and consumer, and the busy-poll block_on path).
  • spark/src/main/scala/org/apache/comet/CometConf.scala: registers spark.comet.exec.memoryGuard.enabled (default false) and spark.comet.exec.memoryGuard.size (optional; defaults to the executor off-heap size). Both layers ride this single switch. They default to the same threshold but order correctly: the cooperative gate trips on projected usage (balance + additional) so it spills first, while the breaker trips on actual usage (balance) as the backstop.

Known limitations / out of scope for this prototype (candidates for follow-ups):

  • Executor-global granularity only; no per-task attribution or fairness.
  • Only tokio workers and the JNI caller thread are stamped, so allocations on spawn_blocking/IO/other pools are tracked but cannot themselves trip the breaker.
  • Layout-byte accounting only; no periodic resync to real jemalloc/mimalloc resident stats.
  • The cooperative gate reacts to real usage but does not yet auto-size the pool budget or deprecate spark.comet.exec.memoryPool.fraction.
  • JVM end-to-end spill test and benchmark validation are deferred until the feature moves toward a default build.

How are these changes tested?

  • Rust unit tests in oom_guard.rs cover the decision/settle helpers, and that the breaker trips only on an armed, stamped thread (disarmed never trips, unstamped never trips).
  • Rust unit tests in real_usage_pool.rs cover the cooperative gate: under the ceiling it delegates, over the ceiling it rejects without reserving the inner pool, an unset ceiling never gates, and shrink/reserved/memory_limit delegate to the inner pool.
  • End-to-end Rust tests drive real heap allocations through the installed AccountingAllocator: one asserts an OomGuardPanic is raised and caught, and another asserts the cooperative gate rejects an over-budget grow.
  • Verified the build and clippy -D warnings across the default, oom-guard, and jemalloc,oom-guard feature combinations.

// exceed isize::MAX on any real platform, so no wrapping or overflow occurs.
let old = layout.size() as isize;
let new = new_size as isize;
track(new - old);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just fixed this bug in DF. You need to panic before the realloc, otherwise the caller still has the old pointer and tries to free it on unwind and segfaults.

…ard [skip ci]

Account for and enforce the size delta before delegating to the inner
realloc. Panicking after inner.realloc is unsound: realloc may have freed
or moved the old block, leaving the caller to free a dangling old pointer
on unwind and segfault. Enforce while the old pointer is still valid.

Gate panic_any behind a compare_exchange on ARMED so at most one thread
fires the guard panic per arm cycle. The relaxed ARMED load on the hot
path is not a serialization point: several threads can read ARMED=true in
the same window and each dispatch a panic, which Rust's unwind ABI can
turn into a process abort ("failed to initiate panic", exit 133). The
guard re-arms on the next createPlan.
@cetra3

cetra3 commented Jun 10, 2026

Copy link
Copy Markdown

We've been using https://github.com/cetra3/thresher ourselves to handle something similar with datafusion. Namely, we log a trace when the threshold is reached with the heap dump of jemalloc, based upon the example

It's helped us so far nail down some gnarly memory allocation related bugs with our compaction process. I'm hoping to expand this to queries soon

Also fix clippy redundant_closure in jni_api.rs on_thread_start call.
… idioms

Check the real-usage ceiling before inner.try_grow instead of reserving then
rolling back on rejection, removing the rollback path and avoiding a JVM
re-entry on Spark-backed pools. Use resources_datafusion_err! and delegate
memory_limit() to match the sibling pool decorators, and confine the
balance-source test seam to cfg(test).
@andygrove andygrove changed the title feat: prototype allocator-level OOM circuit breaker (OomGuard) feat: prototype allocator-level OOM protection (OomGuard breaker + cooperative real-usage accounting) Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants