Transform on points fails with non-unique index: "cannot reindex on an axis with duplicate labels" #1159

Description

@pakiessling

spatialdata.transform on a points DaskDataFrame raises ValueError: "cannot reindex on an axis with
duplicate labels" when the global index is non-unique, even though PointsModel.parse accepts such an
element and spatialdata_io.merscope produces one by default.

I think that either spatialdata_io should ensure a unique points index or .transform should support a non-unique index.

How this can happen day to day

Reading a large MERSCOPE dataset goes through dd.read_csv, which creates a multi-partition Dask DataFrame whose range index restarts at 0 in every partition, so the global index is non-unique. If the element has a non-identity transformation, calling transform then fails with "cannot reindex on an axis with duplicate labels", for example in sopa.io.explorer.write.
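
The index layout and the error message can be reproduced with pandas alone. This is only an illustration of the mechanism, assuming that a reindex on the duplicated index is what fails; it does not trace spatialdata's internal code path.

```python
import pandas as pd

# Two "partitions", each with its own RangeIndex starting at 0,
# standing in for the blocks that dd.read_csv yields.
p0 = pd.DataFrame({"x": [0.0, 1.0, 2.0]})
p1 = pd.DataFrame({"x": [3.0, 4.0, 5.0]})

# Stacking them keeps the per-partition labels, so labels repeat.
combined = pd.concat([p0, p1])
print(combined.index.tolist())
print(combined.index.is_unique)

# pandas refuses to reindex when the existing labels are ambiguous.
try:
    combined.reindex([0, 1, 2])
    message = None
except ValueError as e:
    message = str(e)
print(message)
```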

Minimal example

import numpy as np
import pandas as pd
import dask.dataframe as dd
from spatialdata.models import PointsModel
from spatialdata.transformations import Affine, set_transformation
from spatialdata._core.operations.transform import transform


def make_points(unique_index: bool) -> dd.DataFrame:
    """Two-partition points frame mimicking `dd.read_csv` output.

    unique_index=False -> each partition is indexed 0..4 (non-unique global index),
    which is what dask's CSV/parquet block-splitting produces.
    """
    p0 = pd.DataFrame({"x": np.arange(5.0), "y": np.arange(5.0), "gene": list("ABABA")})
    p1 = pd.DataFrame({"x": np.arange(5.0) + 5, "y": np.arange(5.0) + 5, "gene": list("BABAB")})
    if unique_index:
        p1.index = p1.index + len(p0)  # 5..9  -> globally unique
    ddf = dd.from_map(lambda i: [p0, p1][i], [0, 1])
    return PointsModel.parse(ddf)


def run(unique_index: bool) -> None:
    pts = make_points(unique_index=unique_index)
    idx = pts.index.compute()
    label = "unique" if unique_index else "per-partition (non-unique)"
    print(f"\n[{label}] npartitions={pts.npartitions} index={idx.tolist()} unique={idx.is_unique}")

    # non-identity transform so spatialdata actually runs the failing code path
    aff = Affine(
        np.array([[2, 0, 1], [0, 2, 1], [0, 0, 1]], dtype=float),
        input_axes=("x", "y"),
        output_axes=("x", "y"),
    )
    set_transformation(pts, aff, "global")

    try:
        out = transform(pts, aff, maintain_positioning=True).compute()
        print(f"[{label}] transform OK -> {out.shape}")
    except ValueError as e:
        print(f"[{label}] transform FAILED -> ValueError: {e}")


if __name__ == "__main__":
    run(unique_index=False)  # reproduces the crash
    run(unique_index=True)   # same data, unique index -> works

Environment

spatialdata==0.7.3
spatialdata-io==0.6.0
dask==2026.1.1
pandas==2.3.3
numpy==2.3.5
