feat(writer): byte-based row group sizing in ParquetWriter#2741
feat(writer): byte-based row group sizing in ParquetWriter#2741shanielh wants to merge 1 commit into
Conversation
2577533 to
8639746
Compare
viirya
left a comment
There was a problem hiding this comment.
Nice addition — this is the analogue of Java's write.parquet.row-group-size-bytes, which we were missing.
The mechanism checks out against the parquet-rs API: AsyncArrowWriter::in_progress_size() is documented as "Anticipated encoded size of the in progress row group" (i.e. post-encoding/compression estimate, which is what you want for byte-uniform groups), and flush() cuts the buffered rows "into a new row group" rather than closing the file — so calling it mid-write is correct. Checking after each FileWriter::write and the one-batch-overshoot caveat are both documented honestly on the builder method.
Default stays None so existing writers are unchanged, and the row-count cap remains an upper bound — both good. The test disables dictionary/compression to get ~incompressible data and asserts multiple row groups at a 64 KiB target, which actually exercises the size-driven cut.
Pulled the branch: 15 parquet_writer tests pass (incl. the new one), clippy + rustfmt clean.
One optional thought, non-blocking: the PR mentions no tracking issue. Given Java exposes this as a table property (write.parquet.row-group-size-bytes), a natural follow-up would be wiring it from table properties too, but the builder-level knob is a fine standalone first step.
8639746 to
b5386b9
Compare
`WriterProperties::max_row_group_size` bounds a parquet row group only by row count, so on-disk row-group size swings with the data's compression ratio — uniform-row-count groups can vary by orders of magnitude in bytes, and a caller that wants N-MiB groups has to guess a ratio to convert the byte target into a row count. Add `ParquetWriterBuilder::with_row_group_size_bytes`: when set, the writer cuts a row group as soon as its anticipated encoded size (`AsyncArrowWriter::in_progress_size`) reaches the target, using parquet's recommended size-driven flush pattern. This yields byte-uniform row groups with no ratio guess — the Rust analogue of Java Iceberg's `write.parquet.row-group-size-bytes`. The row-count cap still applies as an upper bound; whichever fires first cuts the group. Checked after each write, so overshoot is bounded by one batch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
b5386b9 to
1f31c3e
Compare
|
Cool @viirya , I've just rebased over main and generated the public api. Hope this get merged soon, after merging I'll PR wiring the table property to writer. |
|
Thank you @shanielh |
|
cc @kevinjqliu |
Which issue does this PR close?
No tracking issue yet — small, self-contained enhancement; happy to file one if maintainers prefer it for changelog purposes.
What changes are included in this PR?
Adds
ParquetWriterBuilder::with_row_group_size_bytes(bytes)— byte-based row-group sizing forParquetWriter.Today
WriterProperties::max_row_group_sizebounds a row group only by row count. The on-disk size of a row group therefore depends entirely on how well the data compresses: with a fixed row count, well-compressing data produces tiny row groups and poorly-compressing data produces large ones. A caller that wants ~N-MiB row groups has no choice but to guess a compression ratio to convert a byte target into a row count, and that guess is wrong whenever the ratio drifts (across columns, files, or within a file).This is exactly the knob Java Iceberg exposes as
write.parquet.row-group-size-bytes, which has no equivalent in iceberg-rust.When
with_row_group_size_bytesis set, after each write the writer checks parquet'sAsyncArrowWriter::in_progress_size()— the anticipated encoded (post-compression) size of the open row group — and callsflush()to cut the group once it reaches the target. This is the size-driven flush pattern parquet-rs documents. Result: byte-uniform row groups with no ratio guess.Notes:
FileWriter::write, so a group can overshoot by at most one written batch (documented on the builder method). Streaming callers that write in modest batches get tight byte-uniform groups; a caller handing the writer one enormous batch gets coarser cuts.None, so existing writers cut by row count exactly as before.Are these changes tested?
Yes — new unit test
test_parquet_writer_row_group_size_byteswrites ~800 KB of (dictionary- and compression-disabled, so ~incompressible) data with a 64 KiB byte target and the row-count cap left at the parquet default, then reads the file back and asserts it contains multiple row groups — proving the size-based cut fired rather than emitting one default row group.cargo fmt,cargo clippy -p iceberg, and the existingparquet_writertests all pass.