Spliterator currently handles CSV, TSV, and JSONL, three common line-oriented text formats. In data engineering pipelines, Parquet is the dominant columnar storage format, and Arrow IPC is a widely used format for zero-copy data exchange between processes.
It would be valuable to add a ParquetSpliterator (or ArrowSpliterator) that could:
- Stream a Parquet file row-by-row (mapping row groups to async iterables)
- Accept an optional Arrow schema for typed column projection
- Work with the same Generator/AsyncGenerator interfaces the library already exposes
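To make the proposal concrete, here is a rough sketch of the shape I have in mind. Every name in it (`RowGroup`, `RowGroupSource`, `rowsOf`, `parquetRows`, `memorySource`) is hypothetical and not part of Spliterator or any Arrow/Parquet package today; `RowGroupSource` is only a stand-in for whichever Parquet reader ends up being wrapped, and the in-memory source exists just so the example runs without a real file.

```typescript
// Hypothetical types for illustration only; not an existing Spliterator API.
type Row = Record<string, unknown>;

// One decoded row group: column name -> values, all columns the same length.
interface RowGroup {
  numRows: number;
  columns: Record<string, unknown[]>;
}

// Stand-in for the underlying Parquet reader (an assumption, not a real library).
interface RowGroupSource {
  rowGroupCount(): number;
  readRowGroup(index: number, columns?: string[]): Promise<RowGroup>;
}

// Sync Generator: turn one decoded row group into row objects,
// keeping only the requested columns when a projection is given.
function* rowsOf(group: RowGroup, columns?: string[]): Generator<Row> {
  const names = columns ?? Object.keys(group.columns);
  for (let i = 0; i < group.numRows; i++) {
    const row: Row = {};
    for (const name of names) {
      row[name] = group.columns[name][i];
    }
    yield row;
  }
}

// AsyncGenerator: read one row group at a time, so at most one
// decoded row group is held in memory while rows are being consumed.
async function* parquetRows(
  source: RowGroupSource,
  columns?: string[],
): AsyncGenerator<Row> {
  for (let g = 0; g < source.rowGroupCount(); g++) {
    const group = await source.readRowGroup(g, columns);
    yield* rowsOf(group, columns);
  }
}

// In-memory source so the sketch can run without a real Parquet file.
function memorySource(groups: RowGroup[]): RowGroupSource {
  return {
    rowGroupCount: () => groups.length,
    readRowGroup: async (index) => groups[index],
  };
}
```

A caller would consume it with `for await (const row of parquetRows(source, ["id"])) { ... }`, the same way the existing async interfaces are consumed. Projection is applied in `rowsOf` here for simplicity; a real reader would ideally skip unrequested columns at decode time.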
Since Spliterator already uses a streaming/iterator model, this would fit naturally: a Parquet file is split into row groups that can be read one at a time. The Apache Arrow JS bindings (@apache-arrow/*) cover Arrow IPC; as far as I can tell they do not ship a Parquet reader, so Parquet decoding would likely need a separate library (for example parquet-wasm or hyparquet) wrapped behind Spliterator's existing interfaces.
Would a PR along these lines be welcome? I'd be happy to contribute an initial implementation.