Add Spark interop test for reading Puffin deletion vectors#3476
Add Spark interop test for reading Puffin deletion vectors#3476moomindani wants to merge 3 commits into
Conversation
ebyhr
left a comment
There was a problem hiding this comment.
Looks good to me except for comments.
| def run_spark_commands(spark: SparkSession, sqls: list[str]) -> None: | ||
| for sql in sqls: | ||
| spark.sql(sql) |
There was a problem hiding this comment.
I'm not certain if we truly require this method, as calling the sql method is simpler.
spark.sql(f"DROP TABLE IF EXISTS {identifier}")
vs
run_spark_commands(spark, [f"DROP TABLE IF EXISTS {identifier}"])No requested change. I know there's already a similar method in other classes.
There was a problem hiding this comment.
Agreed. I've replaced the run_spark_commands calls with direct spark.sql() calls.
|
|
||
|
|
||
| @pytest.mark.integration | ||
| def test_read_spark_written_puffin_dv(spark: SparkSession, session_catalog: RestCatalog) -> None: |
There was a problem hiding this comment.
How about moving this test to test_deletes.py?
There was a problem hiding this comment.
Moved the test to test_deletes.py and removed the test_puffin_spark_interop.py file.
| assert len(delete_manifests) > 0, "Expected delete manifest with DVs" | ||
|
|
||
| delete_manifest = delete_manifests[0] | ||
| entries = list(delete_manifest.fetch_manifest_entry(table.io)) | ||
| assert len(entries) > 0, "Expected at least one delete file entry" |
There was a problem hiding this comment.
Is there any reason we avoid using the ... == 1 condition?
There was a problem hiding this comment.
No particular reason — there's exactly one delete manifest here, so I changed it to == 1. Thanks!
- Move test_read_spark_written_puffin_dv from test_puffin_spark_interop.py to test_deletes.py; delete the old file - Replace run_spark_commands helper calls with direct spark.sql() calls - Tighten entries assertion from > 0 to == 1
Co-authored-by: Isaac
Rationale for this change
Extracted from #3474 per review feedback (#3474 (comment)): this integration test verifies that PyIceberg can read Puffin deletion vectors written by Spark, which holds independently of the
PuffinWriterchanges in that PR.Are these changes tested?
This PR is test-only. The test passed CI as part of #3474 (integration-test job) before being extracted.
Are there any user-facing changes?
No.