SOLR-18284: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks by markrmiller · Pull Request #4514 · apache/solr

markrmiller · 2026-06-09T16:09:20Z

Load-average and memory circuit breakers were doing more harm than good under high concurrency.

LoadAverageCircuitBreaker called OperatingSystemMXBean.getSystemLoadAverage() on every request. The OS load average is a one-minute moving average, so re-polling it per-request is wasted work - and when many requests arrived concurrently, the syscall stampede hammered the CPU, which is the condition this breaker exists to prevent rather than cause.

MemoryCircuitBreaker sampled MemoryMXBean.getHeapMemoryUsage().getUsed() on a 30-second moving average. With a generational collector that signal climbs toward max between collections during normal operation; the breaker would trip on transient pre-GC peaks that GC was about to reclaim.

https://issues.apache.org/jira/browse/SOLR-18284

gerlowskija

This generally looks pretty close IMO. I have two remaining big questions:

Do we really need to write TtlSampledMetric ourselves, or is there something off the shelf (e.g. Caffeine?) that we could use instead?
Why the MemoryCB change to start considering various memory "pools"? If this is a bug, it'd be great to that acknowledged and a sentence or two on it somewhere for posterity to understand this part of the change down the road...

Note that those are "just" questions though; they may not even require changes so much as clarification 🤷

Looks great Mark!

gerlowskija · 2026-06-22T19:11:03Z

+ * the supplier. The {@code refreshing} flag is always released, so a thrown sampler does not wedge
+ * the single-flight latch.
+ */
+final class TtlSampledMetric<T> {


[Q] The implementation here is great and I applaud your concurrency code. But I'm a little surprised that we'd have to implement this functionality ourselves and that there's nothing "off the shelf" that we could use instead. Surely we're not the first folks out there to calculate primitives with a TTL.

Did you consider (and discard) the idea of using Caffeine or something similar here instead? Or might that be worth considering?

Yes, Caffeine can do this: refreshAfterWrite provides a single-flight refresh that serves the stale value to other callers while one thread refreshes, which is largely the behavior I want.

But this is one value, not a keyed map. Caffeine's refreshAfterWrite also refreshes async. It uses ForkJoinPool.commonPool() unless you provide your own executor. This changes the timing from "at most one sample per window" to "at most one in-flight refresh that completes whenever the pool runs it." A heap-pool walk or load-average syscall ends up on the shared pool, which I didn't like. The reasons are similar to my concerns about using a simple Timer, refreshing async, and writing to a shared volatile. This method instead gives me the exact behavior I want on a critical hot path. I also did not want to allocate cache entries or wrap a key just for a single primitive.

So I considered it and decided to stay with the small class. It is about 50 lines that we have exact control over, with no new dependencies that force our use case onto them on the hot path. No extra executors, no extra threads, no nondeterministic timing, no extra allocations.

Yeah, it's small in that it's 50 LOC and no-dependencies. But even concise concurrency code is complex.

If you've considered alternatives then I'm sure you've thought about it more than I have and I trust your judgement there. So, SGTM. Just wanted to make sure the tradeoff was called out 👍

gerlowskija · 2026-06-22T19:28:49Z

+ * <p>Pool selection by collector:
+ *
+ * <ul>
+ *   <li><b>G1 / Parallel / Serial / generational ZGC:</b> uses the pool whose name matches the


[Q] Can you provide a bit more context on this and the stuff at L75-L112 below? Specifically: what value does this differentiation serve? MemoryCircuitBreaker was agnostic of pool and GC-algorithm prior to this PR and there's nothing in SOLR-18284 that points to that being an issue.

Is this a fix for an additional bug that you found while working on this PR? If so, is that additional bug documented somewhere?

Yes, this is the fix, not a side change. I will say that plainly in the JIRA and the class javadoc.

MemoryCircuitBreaker was agnostic to the pool and GC algorithms, but that was because it could do the wrong thing at times. The old breaker read MemoryMXBean.getHeapMemoryUsage().getUsed() That is whole-heap occupancy, including garbage that has not yet been collected. On a generational collector, that number climbs toward max between collections during normal operation. That is the steady-state shape of a healthy JVM, not heap pressure to trigger a circuit breaker. Using 99% of the heap is meaningless - that can be 1 second before a harmless GC drops it to 10%. The breaker tripped on pre-GC peaks that the next collection was about to reclaim.

getCollectionUsage() on the old-gen pool reports occupancy right after the last collection - live data, not garbage, not live data + garbage. That is the only reading that answers "Is the heap actually full?"

getCollectionUsage() is per-pool, so you have to pick a pool, and the one you want is the old/tenured generation where long-lived data is, and each collector names that differently:

G1 → G1 Old Gen

Parallel → PS Old Gen

Serial → Tenured Gen

generational ZGC → ZGC Old Generation

Those match the \b(Old|Tenured)\b name pattern, so the circuit breaker reads getCollectionUsage() on that pool. Non-generational ZGC and Shenandoah are different, and some yet to be introduced collectors may be too - they have one combined heap pool and no old-gen name to match. When nothing matches the pattern, the fallback sums getCollectionUsage() across every HEAP-typed pool instead. So the name match plus fallback is there because the pool layout changes depending on the collector.

Gotcha - appreciate the clarification. So the "correct" metric is only exposed at the level of individual pools; makes more sense, thanks!

…de or trip on transient pre-GC heap peaks Load-average and memory circuit breakers were doing more harm than good under high concurrency. LoadAverageCircuitBreaker called OperatingSystemMXBean.getSystemLoadAverage() on every request. The OS load average is a one-minute moving average, so re-polling it per-request is wasted work - and when many requests arrived concurrently, the syscall stampede hammered the CPU, which is the condition this breaker exists to prevent rather than cause. MemoryCircuitBreaker sampled MemoryMXBean.getHeapMemoryUsage().getUsed() on a 30-second moving average. With a generational collector that signal climbs toward max between collections during normal operation; the breaker would trip on transient pre-GC peaks that GC was about to reclaim.

gerlowskija

LGTM, and thanks for the thorough replies!

(Gentle reminder to avoid force-push in the future if possible. A force-push breaks GH's "Changes since your last review" distinction that I find a huge timesaver.)

github-actions Bot added documentation Improvements or additions to documentation tests labels Jun 9, 2026

gerlowskija reviewed Jun 22, 2026

View reviewed changes

markrmiller force-pushed the cb-load-heap-improvements branch from 12b2237 to 9de1fd3 Compare June 24, 2026 23:09

gerlowskija approved these changes Jun 25, 2026

View reviewed changes

markrmiller merged commit 80fa614 into apache:main Jun 25, 2026
5 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

SOLR-18284: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks#4514

SOLR-18284: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks#4514
markrmiller merged 1 commit into
apache:mainfrom
markrmiller:cb-load-heap-improvements

markrmiller commented Jun 9, 2026

Uh oh!

gerlowskija left a comment

Uh oh!

Uh oh!

Uh oh!

gerlowskija Jun 22, 2026

Uh oh!

markrmiller Jun 24, 2026 •

edited

Loading

Uh oh!

gerlowskija Jun 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gerlowskija Jun 22, 2026

Uh oh!

markrmiller Jun 24, 2026 •

edited

Loading

Uh oh!

gerlowskija Jun 25, 2026

Uh oh!

Uh oh!

Uh oh!

gerlowskija left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

markrmiller commented Jun 9, 2026

Uh oh!

gerlowskija left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gerlowskija Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

markrmiller Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gerlowskija Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gerlowskija Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

markrmiller Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gerlowskija Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gerlowskija left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

markrmiller Jun 24, 2026 •

edited

Loading

markrmiller Jun 24, 2026 •

edited

Loading