feat: Native Parquet Iceberg Data File Writes In Comet #4487
comphead
left a comment
Thanks @jordepic, this is an epic PR. @mbutrovich FYI. My understanding, though, is that writes should be addressed in iceberg-rs so that they are supported by the Iceberg community and reusable by other users.
Can you elaborate on that, @comphead? I use iceberg-rs to perform the writes here!
@mbutrovich sorry for the additional tag here. I've been running this locally and it has been really effective. Would you like me to try to split it a bit further, clean up some comments, etc.?
Thanks for the epic PR @jordepic. I've started looking through it and am using AI to help me comprehend and review it, since I am not an Iceberg expert. I noticed that this PR is creating directly from the ... I like that this functionality is disabled by default so users can opt in while it goes through more testing. Could you share some performance numbers and explain how you are benchmarking this?
Thanks, Andy! I'd always be happy to jump on a call to discuss the PR in more detail. I'm not really an expert in anything (I have medium knowledge of Spark and slightly better knowledge of Iceberg). Let me see if I can get some benchmarks! I find that the difference is more apparent with wider datasets, because there is more of a penalty for the extra transpose when reading Iceberg into rows and writing it back. Also, I'm starting to break this PR into chunks to make it easier to review, so that Matt can review it when he is back from PTO. Here is the first link:

Which issue does this PR close?
Closes #4322.
Rationale for this change
Up to this point, Comet has mainly focused on accelerating reads from Iceberg tables. However, significant resources are being spent across various companies to rewrite Iceberg data. Large tables need to be compacted to maintain their sort/Z order, and general data pipelines may write significant amounts of Iceberg data. Transposing between columnar and row-wise data at this scale is inefficient, and we'd much prefer to go directly from Arrow-based column batches to Parquet on disk.
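To make the transpose cost concrete, here is a minimal sketch in plain Rust. It is not code from this PR: `ColumnBatch`, `to_rows`, `to_columns`, `row_based_path` and `columnar_path` are hypothetical names, and plain `Vec`s stand in for Arrow arrays and Spark rows. It only illustrates that a row-based write path touches every cell twice before a columnar file format can encode it, while a columnar path can hand the batch over unchanged.

```rust
// Illustrative only: plain Vecs stand in for Arrow arrays and Spark rows.
// None of these types or functions come from Comet or iceberg-rust.

/// A columnar batch: one Vec per column, all of equal length.
#[derive(Debug, Clone, PartialEq)]
struct ColumnBatch {
    columns: Vec<Vec<i64>>,
}

/// Columnar -> row-wise: allocates one Vec per row and copies every cell.
fn to_rows(batch: &ColumnBatch) -> Vec<Vec<i64>> {
    let num_rows = batch.columns.first().map_or(0, |c| c.len());
    (0..num_rows)
        .map(|r| batch.columns.iter().map(|col| col[r]).collect())
        .collect()
}

/// Row-wise -> columnar: the second transpose, needed before column
/// chunks can be encoded.
fn to_columns(rows: &[Vec<i64>], num_cols: usize) -> ColumnBatch {
    let mut columns: Vec<Vec<i64>> = (0..num_cols)
        .map(|_| Vec::with_capacity(rows.len()))
        .collect();
    for row in rows {
        for (c, value) in row.iter().enumerate() {
            columns[c].push(*value);
        }
    }
    ColumnBatch { columns }
}

/// Row-based write path: two full copies of the data (rows x columns each).
fn row_based_path(batch: &ColumnBatch) -> ColumnBatch {
    let rows = to_rows(batch);
    to_columns(&rows, batch.columns.len())
}

/// Columnar write path: the batch is passed through without reshaping.
fn columnar_path(batch: &ColumnBatch) -> &ColumnBatch {
    batch
}

fn main() {
    let batch = ColumnBatch {
        columns: vec![vec![1, 2, 3], vec![10, 20, 30]],
    };
    // Both paths yield the same columnar data; only the work differs.
    let round_tripped = row_based_path(&batch);
    assert_eq!(&round_tripped, columnar_path(&batch));
    println!("row view: {:?}", to_rows(&batch));
}
```

The work in `row_based_path` grows with rows times columns, which is consistent with the observation in the conversation that wider datasets show a larger difference.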
What changes are included in this PR?
This change is split into three parts.
iceberg-writes.md, no delete files since iceberg-rust doesn't support positional deletes/DVs)
How are these changes tested?
This change is tested extensively.