Description
spatialdata.transform on a points DaskDataFrame raises "cannot reindex on an axis with duplicate labels"
when the global index is non-unique, even though PointsModel.parse accepts such an element and
spatialdata_io.merscope produces one by default.
I think that either spatialdata_io should ensure a unique points index or .transform should support a non-unique index.
How this can happen day to day
Reading a large Merscope dataset uses dd.read_csv, which creates a multi-partition dask dataframe whose per-partition range indices are not globally unique. If the element has a non-identity transformation, calling transform then fails with "cannot reindex on an axis with duplicate labels", for example in sopa.io.explorer.write.
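The mechanism can be illustrated with pandas alone. This is a simplified sketch, not the actual code path inside spatialdata, which I have not traced: it only shows that stacking chunks that each carry their own range index yields duplicate labels, and that pandas refuses to reindex such an axis.

```python
import pandas as pd

# Two chunks, each with its own 0..2 RangeIndex, as a per-partition read gives.
chunk0 = pd.DataFrame({"x": [0.0, 1.0, 2.0]})
chunk1 = pd.DataFrame({"x": [3.0, 4.0, 5.0]})

# concat keeps the original labels, so 0, 1 and 2 each appear twice.
combined = pd.concat([chunk0, chunk1])
print(combined.index.tolist(), combined.index.is_unique)

# Reindexing an axis with duplicate labels is ambiguous, so pandas raises.
try:
    combined.reindex([0, 1, 2])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print(f"reindex failed: {err}")
```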
Minimal example
import numpy as np
import pandas as pd
import dask.dataframe as dd
from spatialdata.models import PointsModel
from spatialdata.transformations import Affine, set_transformation
from spatialdata._core.operations.transform import transform


def make_points(unique_index: bool) -> dd.DataFrame:
    """Two-partition points frame mimicking `dd.read_csv` output.

    unique_index=False -> each partition is indexed 0..4 (non-unique global index),
    which is what dask's CSV/parquet block-splitting produces.
    """
    p0 = pd.DataFrame({"x": np.arange(5.0), "y": np.arange(5.0), "gene": list("ABABA")})
    p1 = pd.DataFrame({"x": np.arange(5.0) + 5, "y": np.arange(5.0) + 5, "gene": list("BABAB")})
    if unique_index:
        p1.index = p1.index + len(p0)  # 5..9 -> globally unique
    ddf = dd.from_map(lambda i: [p0, p1][i], [0, 1])
    return PointsModel.parse(ddf)


def run(unique_index: bool) -> None:
    pts = make_points(unique_index=unique_index)
    idx = pts.index.compute()
    label = "unique" if unique_index else "per-partition (non-unique)"
    print(f"\n[{label}] npartitions={pts.npartitions} index={idx.tolist()} unique={idx.is_unique}")
    # non-identity transform so spatialdata actually runs the failing code path
    aff = Affine(
        np.array([[2, 0, 1], [0, 2, 1], [0, 0, 1]], dtype=float),
        input_axes=("x", "y"),
        output_axes=("x", "y"),
    )
    set_transformation(pts, aff, "global")
    try:
        out = transform(pts, aff, maintain_positioning=True).compute()
        print(f"[{label}] transform OK -> {out.shape}")
    except ValueError as e:
        print(f"[{label}] transform FAILED -> ValueError: {e}")


if __name__ == "__main__":
    run(unique_index=False)  # reproduces the crash
    run(unique_index=True)  # same data, unique index -> works
Environment
spatialdata==0.7.3
spatialdata-io==0.6.0
dask==2026.1.1
pandas==2.3.3
numpy==2.3.5