perf(compress): inline the row search into the lazy parse monolith by polaz · Pull Request #461 · structured-world/structured-zstd

polaz · 2026-06-30T10:56:59Z

Summary

Speeds up the row-hash lazy parse (levels 6–12) by inlining the match search
into the parse loop, removing a per-position out-of-line call.

The lazy parse called a per-tier #[target_feature] search method
(find_best_<tier>) at both probe sites — the current position and the
lazy_decide! lookahead. A #[target_feature] function cannot be inlined
across the call boundary, so every position paid call + argument-marshalling
overhead. Upstream C's ZSTD_searchMax is FORCE_INLINE_TEMPLATE into
ZSTD_compressBlock_lazy_generic; this brings our shape in line.

The rep + row-probe body (row_best_match!) is now spliced inline at both probe
sites — exactly as gen_greedy_monolith already does for the greedy band — so
each lazy tier kernel is a single target_feature monolith with no per-position
search call. The now-unused gen_row_find_monolith standalone-method generator
was removed.
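The shape change can be sketched in portable Rust. This is a toy under stated assumptions, not the crate's code: `probe_inline!`, `find_best`, `parse_with_calls`, and `parse_inlined` are hypothetical names, the "search" is a single fixed-distance prefix compare rather than a row-hash probe, and `#[inline(never)]` only stands in for the call boundary that the PR attributes to the `#[target_feature]` method.

```rust
/// Toy stand-in for the rep + row probe: length of the common prefix between
/// the bytes at `pos` and the bytes `dist` positions earlier.
macro_rules! probe_inline {
    ($data:expr, $pos:expr, $dist:expr) => {{
        let (data, pos, dist): (&[u8], usize, usize) = ($data, $pos, $dist);
        let mut len = 0usize;
        if dist != 0 && dist <= pos {
            while pos + len < data.len() && data[pos + len] == data[pos - dist + len] {
                len += 1;
            }
        }
        len
    }};
}

/// "Before" shape: the search lives in a separate function.
/// `#[inline(never)]` models the out-of-line call (an assumption of this sketch).
#[inline(never)]
fn find_best(data: &[u8], pos: usize, dist: usize) -> usize {
    probe_inline!(data, pos, dist)
}

/// Lazy parse that calls the helper at both probe sites.
fn parse_with_calls(data: &[u8], dist: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let cur = find_best(data, pos, dist);
        let next = if pos + 1 < data.len() { find_best(data, pos + 1, dist) } else { 0 };
        if cur >= 3 && cur >= next {
            out.push((pos, cur)); // take the match at the current position
            pos += cur;
        } else {
            pos += 1; // emit a literal and retry one byte later
        }
    }
    out
}

/// "After" shape: the same body is spliced in at both probe sites.
fn parse_inlined(data: &[u8], dist: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let cur = probe_inline!(data, pos, dist);
        let next = if pos + 1 < data.len() { probe_inline!(data, pos + 1, dist) } else { 0 };
        if cur >= 3 && cur >= next {
            out.push((pos, cur));
            pos += cur;
        } else {
            pos += 1;
        }
    }
    out
}
```

Both functions make the same decisions by construction, which is the property the PR relies on: only where the search body is emitted changes, not what it computes.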

Results (i9, x86_64, ours-vs-`c_ffi`, flat control)

decodecorpus-z000033 (1 MiB):

| Level | before | after | Δ |
| --- | --- | --- | --- |
| `level_9_lazy` compress | 31.7 ms | 30.6 ms | −3.5 % |
| `level_11_lazy` compress | 44.8 ms | 41.7 ms | −7.0 % |

Per-compress instruction count drops ~3 % (the removed call + marshalling); the
larger wall-clock win comes from better register allocation and scheduling once
the search body lives in the parse frame. Small lazy fixtures
(small-10k-random L9, small-4k-log-lines L6) are unchanged — the inlined
body does not regress the cold-icache small-input path.
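For readers checking the Δ column, a minimal sketch of the arithmetic (the helper name `pct_delta` is hypothetical). Recomputed from the rounded timings in the table it gives roughly −3.5 % and −6.9 %; the reported −7.0 % is within the rounding of the inputs and presumably comes from unrounded measurements, which is an assumption here.

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean `after` is faster (smaller).
fn pct_delta(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For example, `pct_delta(31.7, 30.6)` covers the `level_9_lazy` row and `pct_delta(44.8, 41.7)` the `level_11_lazy` row.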

Testing

Byte-identical: pure inlining, identical match decisions.

- `cargo nextest run -p structured-zstd --features hash,std,dict_builder`: 841 pass.
- `-p ffi-bench --features bench_internals,dict_builder`: 59 pass (cross-validation round-trips + skippable + fuzz_interop).
- clippy (default + `--tests`, `--no-default-features --features kernel_scalar,hash`) and `cargo fmt --check`: clean.
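A byte-identical claim of this kind can be checked by digesting the output of the old and new encoder over a corpus and comparing. The sketch below is an illustration only: `digest` and `first_mismatch` are hypothetical helpers, the encoders are passed in as closures, and the standard library's `DefaultHasher` is used purely as a convenient in-process digest, not as whatever the project's test suite actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// In-process digest of a byte buffer (not stable across Rust versions).
fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Runs both encoders over every input and returns the index of the first
/// input whose output digests differ, or `None` if all outputs agree.
fn first_mismatch<F, G>(inputs: &[&[u8]], base: F, head: G) -> Option<usize>
where
    F: Fn(&[u8]) -> Vec<u8>,
    G: Fn(&[u8]) -> Vec<u8>,
{
    inputs
        .iter()
        .position(|&input| digest(&base(input)) != digest(&head(input)))
}
```

Returning the index of the first divergent input, rather than a bare boolean, makes a failure easier to reproduce against a single fixture.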

Summary by CodeRabbit

Refactor
- Improved the way lazy matching is performed under the hood, streamlining the search path used during compression.
- Kept compression behavior the same while reducing extra indirection in the matching flow.

The lazy row parse called an out-of-line per-tier #[target_feature] search method (`find_best_<tier>`) at both probe sites (current position + the lazy_decide lookahead). A #[target_feature] fn cannot inline across the call boundary, so every position paid call + argument-marshalling overhead. That overhead is a large share of the ~2.24x instruction-count gap vs C on the lazy band, whose ZSTD_searchMax is FORCE_INLINE_TEMPLATE into ZSTD_compressBlock_lazy_generic.

Splice the rep + row-probe body (row_best_match!) inline at both sites instead, exactly as the greedy monolith already does, so each lazy tier kernel is one target_feature function with no per-position search call. Removed the now-unused gen_row_find_monolith standalone-method generator.

Byte-identical (841 lib + 59 ffi incl cross-validation). Measuring decodecorpus instruction count + speed.

coderabbitai · 2026-06-30T10:57:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 169e7b8f-7c6a-409c-8d56-01302c3465de

📥 Commits

Reviewing files that changed from the base of the PR and between 14e31ff and dcccb9c.

📒 Files selected for processing (1)

zstd/src/encoding/row/mod.rs

Walkthrough

The PR refactors the lazy row-parsing macro pipeline in zstd/src/encoding/row/mod.rs. The lazy_parse_body! and gen_lazy_monolith! macros drop their $find:ident parameter, gaining $use_mask, $maskmac, and $cpl tier parameters instead. The rep+row probe is now expanded inline via row_best_match! at both the main probe site and the lazy_decide! lookahead closure. All five tier instantiations are updated to match.

Changes

Inline lazy probe refactor

| Layer / File(s) | Summary |
| --- | --- |
| `lazy_parse_body!` and `gen_lazy_monolith!` parameter changes (`zstd/src/encoding/row/mod.rs`) | `lazy_parse_body!` signature changed to accept `$use_mask`, `$maskmac`, `$cpl`; `gen_lazy_monolith!` drops `$find:ident` and threads the new parameters through to `lazy_parse_body!`. |
| Inline `row_best_match!` at both probe sites (`zstd/src/encoding/row/mod.rs`) | Both the main carried/best selection block and the `lazy_decide!` lookahead `search` closure replace `$m.$find::<K, $rl>(...)` with an inline `row_best_match!` expansion. |
| Tier instantiation updates (`zstd/src/encoding/row/mod.rs`) | All five lazy monolith call sites (`lazy_scalar`, `lazy_sse42`, `lazy_avx2bmi2`, `lazy_neon`, `lazy_simd128`) remove `find_best_` arguments and pass `row_tag_mask_` plus the tier's `common_prefix_len_ptr`. |
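The parameter threading described above can be illustrated with a small `macro_rules!` sketch. Everything here is hypothetical and much simpler than the real macros: `gen_kernel!` and `probe_body!` only mimic the idea of a generator passing a tier's mask macro and prefix-length function down to a shared body, so that each generated kernel is one function with no separate search call.

```rust
// Two hypothetical "tiers" of a tag-mask macro with the same contract:
// bit i of the result is set when tags[i] == tag.
macro_rules! tag_mask_loop {
    ($tags:expr, $tag:expr) => {{
        let mut mask = 0u32;
        for (i, t) in $tags.iter().enumerate() {
            if *t == $tag {
                mask |= 1 << i;
            }
        }
        mask
    }};
}
macro_rules! tag_mask_fold {
    ($tags:expr, $tag:expr) => {
        $tags
            .iter()
            .enumerate()
            .fold(0u32, |m, (i, t)| m | (((*t == $tag) as u32) << i))
    };
}

// Two hypothetical common-prefix-length helpers, one per tier.
fn prefix_len_bytewise(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
fn prefix_len_chunked(a: &[u8], b: &[u8]) -> usize {
    let n = a.len().min(b.len());
    let mut i = 0;
    while i + 8 <= n && a[i..i + 8] == b[i..i + 8] {
        i += 8; // compare eight bytes at a time while they agree
    }
    while i < n && a[i] == b[i] {
        i += 1; // finish the tail byte by byte
    }
    i
}

// Shared body: among slots whose tag matches, keep the longest candidate.
// The tier is selected by the `$maskmac` and `$cpl` parameters.
macro_rules! probe_body {
    ($maskmac:ident, $cpl:ident, $tags:expr, $cands:expr, $cur:expr, $tag:expr) => {{
        let mut mask: u32 = $maskmac!($tags, $tag);
        let mut best = (0usize, 0usize); // (slot, match length)
        while mask != 0 {
            let slot = mask.trailing_zeros() as usize;
            mask &= mask - 1; // clear the lowest set bit
            let len = $cpl($cands[slot], $cur);
            if len > best.1 {
                best = (slot, len);
            }
        }
        best
    }};
}

// Generator: one function per tier, with the probe body expanded inside it.
macro_rules! gen_kernel {
    ($name:ident, $maskmac:ident, $cpl:ident) => {
        fn $name(tags: &[u8; 4], cands: &[&[u8]; 4], cur: &[u8], tag: u8) -> (usize, usize) {
            probe_body!($maskmac, $cpl, tags, cands, cur, tag)
        }
    };
}

gen_kernel!(kernel_loop, tag_mask_loop, prefix_len_bytewise);
gen_kernel!(kernel_fold, tag_mask_fold, prefix_len_chunked);
```

Passing the mask macro by name (`$maskmac:ident`, invoked as `$maskmac!(...)`) is what lets the body be expanded in place instead of reached through a method such as the removed `$find` parameter.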

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

- structured-world/structured-zstd#335: Directly overlaps on the same lazy row search/match generation code in mod.rs, monomorphizing tiered lazy parsing around *_rl helpers.
- structured-world/structured-zstd#455: Modifies the same lazy row parsing pipeline, routing the lazy decision through the shared lazy_decide! macro that this PR also modifies.

Poem

🐇 Hoppity-hop through the macro maze,
No more $find calls to end my days!
row_best_match! inline at last,
The search is spliced in, monolith fast.
One fewer hop per probe site—hooray!
This rabbit inlines and bounds away~ 🎉

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: inlining row search into the lazy parse monolith for compression performance. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov · 2026-06-30T11:00:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


greptile-apps · 2026-06-30T11:03:33Z

Greptile Summary

This PR inlines the lazy row-match search inside the compression parse loop. The main changes are:

- Removed the generated per-tier lazy search helper.
- Expanded row_best_match! directly at the current-position probe.
- Expanded the same search body in the lazy lookahead path.
- Kept the existing scalar and SIMD lazy kernel dispatch shape.

Confidence Score: 5/5

The compression refactor appears merge-safe with no code issues identified.

The change is localized to the row-hash lazy parse path and is described as behavior-preserving inlining, with existing cross-validation and formatting checks reported clean.

T-Rex Logs

What T-Rex did

Ran the lazy-byte-identical test and compared the base run to the head run, confirming matching digests, that all rows report roundtrip=true, and EXIT_CODE: 0.
Reviewed the inline-shape test results, confirming the head source only uses inline row_best_match and lacks any row_find_* matches in emitted assembly, with perf-smoke numbers showing changes across big and small workloads.

Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "Merge branch 'main' into perf/row-lazy-m..."

Merge branch 'main' into perf/row-lazy-monolith

476626d

polaz merged commit e7e8adf into main Jul 1, 2026
28 checks passed

polaz deleted the perf/row-lazy-monolith branch July 1, 2026 01:04

sw-release-bot Bot mentioned this pull request Jul 1, 2026

chore: release v0.0.49 #463

Open

polaz merged 2 commits into main from perf/row-lazy-monolith