Feature request: streaming Parquet / Arrow IPC support #4

Description

@cschanhniem

Spliterator currently handles CSV, TSV, and JSONL, three common line-oriented text formats. In data engineering pipelines, Parquet is the dominant columnar storage format, and Arrow IPC is a standard way to move columnar data between processes without copying or re-serializing it.

It would be valuable to add a ParquetSpliterator (or ArrowSpliterator) that could:

  1. Stream a Parquet file row-by-row (mapping row groups to async iterables)
  2. Accept an optional Arrow schema for typed column projection
  3. Work with the same Generator/AsyncGenerator interfaces the library already exposes

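Since Spliterator already uses a streaming/iterator model, this would fit naturally: a Parquet file is divided into row groups that can be read one at a time. The Apache Arrow JS library (apache-arrow on npm) can read Arrow IPC streams batch by batch, but as far as I know it does not include a Parquet reader, so Parquet decoding would likely need a separate package (for example parquet-wasm or hyparquet) wrapped behind Spliterator's existing interfaces.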
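
To make the proposal concrete, here is a minimal sketch of how row groups could be mapped onto an AsyncGenerator with optional column projection. Everything in it is an assumption for illustration: RowGroupSource, parquetRows, project, fakeSource, and collect are hypothetical names, not part of Spliterator or of any Parquet library, and the in-memory fakeSource only stands in for a real reader.

```typescript
// Hypothetical row shape and reader interface (assumptions, not a real API).
type Row = Record<string, unknown>;

interface RowGroupSource {
  rowGroupCount: number;
  // Resolves to the decoded rows of a single row group.
  readRowGroup(index: number): Promise<Row[]>;
}

// Keep only the requested columns; names missing from the row are skipped.
function project(row: Row, columns: string[]): Row {
  const out: Row = {};
  for (const name of columns) {
    if (name in row) out[name] = row[name];
  }
  return out;
}

// Yields rows one at a time, holding at most one row group in memory.
async function* parquetRows(
  source: RowGroupSource,
  columns?: string[],
): AsyncGenerator<Row> {
  for (let i = 0; i < source.rowGroupCount; i++) {
    const rows = await source.readRowGroup(i);
    for (const row of rows) {
      yield columns ? project(row, columns) : row;
    }
  }
}

// In-memory stand-in for a real reader, used only to exercise the generator.
function fakeSource(groups: Row[][]): RowGroupSource {
  return {
    rowGroupCount: groups.length,
    readRowGroup: async (index: number) => groups[index],
  };
}

// Drains any async iterable into an array (handy for tests and small files).
async function collect<T>(it: AsyncIterable<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of it) out.push(item);
  return out;
}
```

A consumer would iterate with for await (const row of parquetRows(source, ["id"])). In this sketch projection happens after a row group is decoded; a real implementation would presumably push the column list down to the reader so that unneeded column chunks are never decoded.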

Would a PR along these lines be welcome? I'd be happy to contribute an initial implementation.
