perf(compress): inline the row search into the lazy parse monolith#461

Merged
polaz merged 2 commits into
mainfrom
perf/row-lazy-monolith
Jul 1, 2026

@polaz polaz commented Jun 30, 2026

Member

Summary

Speeds up the row-hash lazy parse (levels 6–12) by inlining the match search
into the parse loop, removing a per-position out-of-line call.

The lazy parse called a per-tier #[target_feature] search method
(find_best_<tier>) at both probe sites — the current position and the
lazy_decide! lookahead. A #[target_feature] function cannot be inlined
across the call boundary, so every position paid call + argument-marshalling
overhead. Upstream C's ZSTD_searchMax is FORCE_INLINE_TEMPLATE into
ZSTD_compressBlock_lazy_generic; this brings our shape in line.

The rep + row-probe body (row_best_match!) is now spliced inline at both probe
sites — exactly as gen_greedy_monolith already does for the greedy band — so
each lazy tier kernel is a single target_feature monolith with no per-position
search call. The now-unused gen_row_find_monolith standalone-method generator
was removed.
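
A minimal sketch of the shape change described above, assuming a toy search body. `best_match_at!`, `find_best`, `parse_with_call`, `parse_inline`, `WINDOW`, and `MIN_MATCH` are hypothetical names for illustration; the real `row_best_match!` does a rep + row-hash probe inside a `#[target_feature]` kernel, which this portable example does not attempt to reproduce. It only shows the same search body reached through an out-of-line call versus spliced in at both probe sites, with identical output either way.

```rust
const WINDOW: usize = 64; // how far back the toy search looks (assumption)
const MIN_MATCH: usize = 3; // shortest match worth emitting (assumption)

#[derive(Debug, PartialEq)]
enum Token {
    Literal(u8),
    Match { offset: usize, len: usize },
}

// Hypothetical search body: longest earlier match for data[pos..],
// returned as (offset, length). Stands in for the real row probe.
macro_rules! best_match_at {
    ($data:expr, $pos:expr) => {{
        let (data, pos): (&[u8], usize) = ($data, $pos);
        let mut best = (0usize, 0usize);
        for cand in pos.saturating_sub(WINDOW)..pos {
            let len = data[cand..]
                .iter()
                .zip(&data[pos..])
                .take_while(|(a, b)| a == b)
                .count();
            if len > best.1 {
                best = (pos - cand, len);
            }
        }
        best
    }};
}

// "Before" shape: the search lives in its own function, so each probe is a call.
#[inline(never)]
fn find_best(data: &[u8], pos: usize) -> (usize, usize) {
    best_match_at!(data, pos)
}

fn parse_with_call(data: &[u8]) -> Vec<Token> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let (offset, len) = find_best(data, pos); // probe site 1
        // Probe site 2, the lazy lookahead: defer if pos + 1 matches longer.
        let defer = len >= MIN_MATCH && find_best(data, pos + 1).1 > len;
        if len < MIN_MATCH || defer {
            out.push(Token::Literal(data[pos]));
            pos += 1;
        } else {
            out.push(Token::Match { offset, len });
            pos += len;
        }
    }
    out
}

// "After" shape: the same body is expanded in place at both probe sites.
fn parse_inline(data: &[u8]) -> Vec<Token> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let (offset, len) = best_match_at!(data, pos);
        let defer = len >= MIN_MATCH && best_match_at!(data, pos + 1).1 > len;
        if len < MIN_MATCH || defer {
            out.push(Token::Literal(data[pos]));
            pos += 1;
        } else {
            out.push(Token::Match { offset, len });
            pos += len;
        }
    }
    out
}

fn main() {
    let data = b"abcabcabcabcXbcabcabc";
    // Both shapes make the same decisions, so the token streams are equal.
    assert_eq!(parse_with_call(data), parse_inline(data));
    println!("{:?}", parse_inline(data));
}
```

Because both parsers expand the same macro body, any difference between them is purely in code shape, which is the property the byte-identical testing below relies on.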

Results (i9, x86_64, ours-vs-c_ffi, flat control)

decodecorpus-z000033 (1 MiB):

| Level | Before | After | Δ |
|---|---|---|---|
| level_9_lazy compress | 31.7 ms | 30.6 ms | −3.5 % |
| level_11_lazy compress | 44.8 ms | 41.7 ms | −7.0 % |

Per-compress instruction count drops ~3 % (the removed call + marshalling); the
larger wall-clock win comes from better register allocation and scheduling once
the search body lives in the parse frame. Small lazy fixtures
(small-10k-random L9, small-4k-log-lines L6) are unchanged — the inlined
body does not regress the cold-icache small-input path.

Testing

  • Byte-identical: pure inlining, identical match decisions.
    cargo nextest run -p structured-zstd --features hash,std,dict_builder: 841 pass;
    with -p ffi-bench --features bench_internals,dict_builder: 59 pass (cross-validation
    round-trips + skippable + fuzz_interop).
  • clippy (default + --tests, --no-default-features --features kernel_scalar,hash)
    and cargo fmt --check clean.

Summary by CodeRabbit

  • Refactor
    • Improved the way lazy matching is performed under the hood, streamlining the search path used during compression.
    • Kept compression behavior the same while reducing extra indirection in the matching flow.

The lazy row parse called an out-of-line per-tier #[target_feature] search
method (`find_best_<tier>`) at both probe sites (current position + the
lazy_decide lookahead). A #[target_feature] fn cannot inline across the call
boundary, so every position paid call + argument-marshalling overhead — a
large share of the ~2.24x instruction-count gap vs C on the lazy band, whose
ZSTD_searchMax is FORCE_INLINE_TEMPLATE into ZSTD_compressBlock_lazy_generic.

Splice the rep + row-probe body (row_best_match!) inline at both sites instead,
exactly as the greedy monolith already does, so each lazy tier kernel is one
target_feature function with no per-position search call. Removed the now-unused
gen_row_find_monolith standalone-method generator. Byte-identical (841 lib + 59
ffi incl cross-validation). Measuring decodecorpus instruction count + speed.

coderabbitai Bot commented Jun 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 169e7b8f-7c6a-409c-8d56-01302c3465de

📥 Commits

Reviewing files that changed from the base of the PR and between 14e31ff and dcccb9c.

📒 Files selected for processing (1)
  • zstd/src/encoding/row/mod.rs

Walkthrough

The PR refactors the lazy row-parsing macro pipeline in zstd/src/encoding/row/mod.rs. The lazy_parse_body! and gen_lazy_monolith! macros drop their $find:ident parameter, gaining $use_mask, $maskmac, and $cpl tier parameters instead. The rep+row probe is now expanded inline via row_best_match! at both the main probe site and the lazy_decide! lookahead closure. All five tier instantiations are updated to match.

Changes

Inline lazy probe refactor

| Layer / File(s) | Summary |
|---|---|
| lazy_parse_body! and gen_lazy_monolith! parameter changes (zstd/src/encoding/row/mod.rs) | lazy_parse_body! signature changed to accept $use_mask, $maskmac, $cpl; gen_lazy_monolith! drops $find:ident and threads the new parameters through to lazy_parse_body!. |
| Inline row_best_match! at both probe sites (zstd/src/encoding/row/mod.rs) | Both the main carried/best selection block and the lazy_decide! lookahead search closure replace $m.$find::<K, $rl>(...) with an inline row_best_match! expansion. |
| Tier instantiation updates (zstd/src/encoding/row/mod.rs) | All five lazy monolith call sites (lazy_scalar, lazy_sse42, lazy_avx2bmi2, lazy_neon, lazy_simd128) remove find_best_* arguments and pass row_tag_mask_* plus the tier's common_prefix_len_ptr. |
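
The parameter threading summarized above can be sketched as follows. Only the parameter names `$maskmac` and `$cpl` come from the walkthrough; everything else (`gen_probe!`, `tag_mask_loop!`, `tag_mask_fold!`, `cpl_bytes`, `cpl_words`, the 8-entry tag row) is invented for illustration and is not the crate's code. The point is the macro pattern: a generator takes a macro name and a function name as identifiers and expands them inside the generated function, so each tier gets its own body with no shared out-of-line helper.

```rust
// Hypothetical tag-mask macros: bit i is set when tags[i] == tag.
macro_rules! tag_mask_loop {
    ($tags:expr, $tag:expr) => {{
        let mut mask = 0u8;
        for (i, &t) in $tags.iter().enumerate() {
            if t == $tag {
                mask |= 1u8 << i;
            }
        }
        mask
    }};
}

macro_rules! tag_mask_fold {
    ($tags:expr, $tag:expr) => {
        $tags
            .iter()
            .enumerate()
            .fold(0u8, |m, (i, &t)| m | (u8::from(t == $tag) << i))
    };
}

// Hypothetical common-prefix-length helpers, byte-wise and 8 bytes at a time.
fn cpl_bytes(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn load_le(s: &[u8]) -> u64 {
    let mut w = [0u8; 8];
    w.copy_from_slice(&s[..8]);
    u64::from_le_bytes(w)
}

fn cpl_words(a: &[u8], b: &[u8]) -> usize {
    let n = a.len().min(b.len());
    let mut i = 0;
    while i + 8 <= n {
        let diff = load_le(&a[i..]) ^ load_le(&b[i..]);
        if diff != 0 {
            // Little-endian load: the lowest set bit marks the first differing byte.
            return i + (diff.trailing_zeros() / 8) as usize;
        }
        i += 8;
    }
    i + cpl_bytes(&a[i..n], &b[i..n])
}

// Generator: `$maskmac` is a macro name, `$cpl` a function name; both are
// expanded inside the generated function rather than called through a method.
macro_rules! gen_probe {
    ($name:ident, $maskmac:ident, $cpl:ident) => {
        fn $name(tags: &[u8; 8], tag: u8, a: &[u8], b: &[u8]) -> (u8, usize) {
            ($maskmac!(tags, tag), $cpl(a, b))
        }
    };
}

gen_probe!(probe_bytes, tag_mask_loop, cpl_bytes);
gen_probe!(probe_words, tag_mask_fold, cpl_words);

fn main() {
    let tags = [1u8, 2, 3, 2, 5, 2, 7, 8];
    let r1 = probe_bytes(&tags, 2, b"abcdefghijkl", b"abcdefghijXl");
    let r2 = probe_words(&tags, 2, b"abcdefghijkl", b"abcdefghijXl");
    // Different tier implementations, same answer.
    assert_eq!(r1, r2);
    println!("{:?}", r1);
}
```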

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐇 Hoppity-hop through the macro maze,
No more $find calls to end my days!
row_best_match! inline at last,
The search is spliced in, monolith fast.
One fewer hop per probe site—hooray!
This rabbit inlines and bounds away~ 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: inlining row search into the lazy parse monolith for compression performance. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov Bot commented Jun 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

greptile-apps Bot commented Jun 30, 2026

Greptile Summary

This PR inlines the lazy row-match search inside the compression parse loop. The main changes are:

  • Removed the generated per-tier lazy search helper.
  • Expanded row_best_match! directly at the current-position probe.
  • Expanded the same search body in the lazy lookahead path.
  • Kept the existing scalar and SIMD lazy kernel dispatch shape.

Confidence Score: 5/5

The compression refactor appears merge-safe with no code issues identified.

The change is localized to the row-hash lazy parse path and is described as behavior-preserving inlining, with existing cross-validation and formatting checks reported clean.

T-Rex Logs

What T-Rex did

  • Ran the lazy-byte-identical test and compared the base run to the head run, confirming matching digests, that all rows report roundtrip=true, and EXIT_CODE: 0.
  • Reviewed the inline-shape test results, confirming the head source only uses inline row_best_match and lacks any row_find_* matches in emitted assembly, with perf-smoke numbers showing changes across big and small workloads.

Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "Merge branch 'main' into perf/row-lazy-m..."

@polaz polaz merged commit e7e8adf into main Jul 1, 2026
28 checks passed
@polaz polaz deleted the perf/row-lazy-monolith branch July 1, 2026 01:04