perf(compress): inline the row search into the lazy parse monolith #461
The lazy row parse called an out-of-line per-tier #[target_feature] search method (`find_best_<tier>`) at both probe sites (current position + the lazy_decide lookahead). A #[target_feature] fn cannot inline across the call boundary, so every position paid call + argument-marshalling overhead — a large share of the ~2.24x instruction-count gap vs C on the lazy band, whose ZSTD_searchMax is FORCE_INLINE_TEMPLATE into ZSTD_compressBlock_lazy_generic. Splice the rep + row-probe body (row_best_match!) inline at both sites instead, exactly as the greedy monolith already does, so each lazy tier kernel is one target_feature function with no per-position search call. Removed the now-unused gen_row_find_monolith standalone-method generator. Byte-identical (841 lib + 59 ffi incl cross-validation). Measuring decodecorpus instruction count + speed.
Summary
Speeds up the row-hash lazy parse (levels 6–12) by inlining the match search
into the parse loop, removing a per-position out-of-line call.

The lazy parse called a per-tier `#[target_feature]` search method (`find_best_<tier>`) at both probe sites: the current position and the `lazy_decide!` lookahead. A `#[target_feature]` function cannot be inlined across the call boundary, so every position paid call + argument-marshalling overhead. Upstream C's `ZSTD_searchMax` is `FORCE_INLINE_TEMPLATE` into `ZSTD_compressBlock_lazy_generic`; this brings our shape in line.

The rep + row-probe body (`row_best_match!`) is now spliced inline at both probe sites, exactly as `gen_greedy_monolith` already does for the greedy band, so each lazy tier kernel is a single `target_feature` monolith with no per-position search call. The now-unused `gen_row_find_monolith` standalone-method generator was removed.
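
The shape described above can be sketched in miniature. This is a toy under stated assumptions, not the crate's code: `longest_match!`, `lazy_step_body!`, `lazy_step_avx2`, `lazy_step_scalar`, and `lazy_step` are hypothetical names, and a brute-force scan stands in for the real rep + row-probe search. The point is only that a `macro_rules!` body expands textually at each probe site, so the tier kernel contains the search itself rather than a call to a separate method.

```rust
// Toy illustration only: every name here is hypothetical.

/// Search body as a macro, so it expands in place at each use site.
/// Brute-force stand-in for a real row-hash probe.
/// Evaluates to (offset back from `pos`, match length).
macro_rules! longest_match {
    ($data:expr, $pos:expr) => {{
        let data: &[u8] = $data;
        let pos: usize = $pos;
        let mut best = (0usize, 0usize);
        for cand in 0..pos {
            let mut len = 0usize;
            // cand < pos, so cand + len stays in bounds whenever pos + len does.
            while pos + len < data.len() && data[cand + len] == data[pos + len] {
                len += 1;
            }
            if len > best.1 {
                best = (pos - cand, len);
            }
        }
        best
    }};
}

/// One lazy decision: probe the current position and the lookahead,
/// and defer by one byte if the lookahead match is strictly longer.
/// Evaluates to (chosen position, offset, length).
macro_rules! lazy_step_body {
    ($data:expr, $pos:expr) => {{
        let data: &[u8] = $data;
        let pos: usize = $pos;
        let cur = longest_match!(data, pos); // probe site 1: current position
        let next = longest_match!(data, pos + 1); // probe site 2: lookahead
        if next.1 > cur.1 {
            (pos + 1, next.0, next.1)
        } else {
            (pos, cur.0, cur.1)
        }
    }};
}

/// Tier kernel: a single target_feature function holding both probes inline.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn lazy_step_avx2(data: &[u8], pos: usize) -> (usize, usize, usize) {
    lazy_step_body!(data, pos)
}

/// Scalar tier built from the same spliced body.
fn lazy_step_scalar(data: &[u8], pos: usize) -> (usize, usize, usize) {
    lazy_step_body!(data, pos)
}

/// Dispatch once per call; nothing inside a kernel calls out for the search.
pub fn lazy_step(data: &[u8], pos: usize) -> (usize, usize, usize) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { lazy_step_avx2(data, pos) };
        }
    }
    lazy_step_scalar(data, pos)
}
```

For the input `b"abcXbcdeYabcde"` at position 9, the current-position probe finds a 3-byte match (`abc`) while the lookahead at position 10 finds a 4-byte match (`bcde`, 6 bytes back), so `lazy_step` defers and returns `(10, 6, 4)`. Both tiers expand the same body, so the result does not depend on which kernel the dispatcher picks.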

Results (i9, x86_64, ours-vs-`c_ffi`, flat control)

`decodecorpus-z000033` (1 MiB):
- `level_9_lazy` compress
- `level_11_lazy` compress

Per-compress instruction count drops ~3% (the removed call + marshalling); the larger wall-clock win comes from better register allocation and scheduling once the search body lives in the parse frame. Small lazy fixtures (`small-10k-random` L9, `small-4k-log-lines` L6) are unchanged: the inlined body does not regress the cold-icache small-input path.

Testing
- `cargo nextest run -p structured-zstd --features hash,std,dict_builder`: 841 pass; `-p ffi-bench --features bench_internals,dict_builder`: 59 pass (cross-validation round-trips + skippable + fuzz_interop).
- `--tests`, `--no-default-features --features kernel_scalar,hash`) and `cargo fmt --check` clean.