Supervoxel splitting with base+fork support and locks#534
Merged
Conversation
Drop the inline propagate_to_coarser_scales call from write_seg; coarser mips are now the async downsample worker's responsibility. write_seg is back to a single base-scale tensorstore write, so SV splits no longer block on the full pyramid update. TestWriteSeg updated to assert the coarser scales stay zero after write_seg (propagation tested separately via TestPropagateToCoarserScales).
split_supervoxel now returns its base-resolution bbox. split_with_sv_splits collects one per call and attaches the list to Result.seg_bbox (new optional field). publish_edit includes the list in the payload and sets downsample="true" so the worker only runs on edits that touched base seg. List kept unmerged — lets the worker skip tiles outside the actual change region.
workers/downsample_worker.py consumes edits-exchange messages flagged downsample="true" and writes each non-base mip within the SV-split bbox. graph/downsample.py splits the region into pyramid_blocks (sized so no two blocks share a storage chunk at any mip), then either tinybrain'd in one call (fast path, typical small edits) or per-mip (fallback when base read exceeds memory budget). Write filtering keeps OCDBT delta proportional to the actual change. DownsampleBlockLock serializes overlapping jobs via kvdbclient's row-key lock API; 26-byte hash-prefixed keys avoid tablet hot-spots. Depends on kvdbclient lock_by_row_key / unlock_by_row_key / renew_lock_by_row_key landing first.
Used by the async downsample worker as the mip-pyramid kernel.
Closes a concurrency gap where two SV splits on overlapping L2 chunks but distinct roots can't be serialized by root locks — they acquire disjoint root-lock sets and race on seg state. L2ChunkLock serializes them via the kvdbclient row-key lock primitive. Row key = 2-byte blake2b hash + 8-byte uint64 chunk_id (10 bytes). Hash prefix keeps spatially-clustered L2 chunks from hot-spotting a single bigtable tablet under concurrent load. Primitive only — callers land separately. RowKeyLockRegistry helper moved to tests/helpers.py so L2ChunkLock and DownsampleBlockLock tests share it instead of duplicating.
Callers now get Cut(atomic_edges) | PreviewCut(ccs, illegal_split) | SvSplitRequired(sv_remapping) instead of unwrapping-by-convention or catching SupervoxelSplitRequiredError. The exception still unwinds inside LocalMincutGraph — cheapest way to bail out of deep path code — but it's caught once at the run_multicut boundary and never escapes, so callers don't use raise/catch for control flow.
MulticutOperation._apply dispatches on the tagged multicut result and, when an SV split is needed, calls the new edits_sv.split_supervoxels under its surrounding RootLock, refreshes source/sink SV IDs from seg, and retries multicut against the post-split graph. Root lock spans the whole critical section; L2ChunkLock held only around the split loop. This closes two races that existed when split_with_sv_splits handled the flow outside any lock: same-root (root lock now never released between multicut and commit) and cross-root (L2ChunkLock serializes overlapping split regions). split_with_sv_splits is deleted; handle_split calls cg.remove_edges directly.
Pre-compute each rep's bbox from the chunk coords of its CC members in sv_remapping (no coord-padding, no resolution-axis assumption). split_supervoxels builds the union lock set across reps — sparse chunks plus one L2-chunk margin for update_edges's 1-voxel overlap read — acquires once, then loops per-rep splits. _update_chunks surfaces the change_chunks that actually got new SV IDs; write_seg_chunks fires one tensorstore future per change-chunk and awaits together, so only chunks with real label changes hit OCDBT. Gap chunks between CC pieces and neighbor chunks read for the overlap never get rewritten, keeping the delta proportional to the edit. split_supervoxels also threads back the fresh source/sink SV IDs from the in-memory new_seg block (same bytes that just landed on storage), so the retry multicut sees current IDs without an extra seg read. Drops _get_whole_sv (dead since the sv_remapping switch). Adds a high-level architecture doc covering the end-to-end flow, concurrency design, and durable invariants.
Enable opening a CG's OCDBT at a prior commit via the driver's `version` spec field (int generation or ISO-8601 commit_time upper bound). Groundwork for operator-driven replay of failed SV splits, which needs clean pre-op reads against append-only storage. OCDBT stamps commits from absl::Now() with no caller-override hook, so pins will use OperationTimeStamp captured under the L2 chunk lock rather than aligning OCDBT commit times to operation time.
This comment was marked as spam.
This comment was marked as spam.
split_supervoxels is now a pure planner returning a SplitResult (seg_bboxes, source_ids_fresh, sink_ids_fresh, seg_writes, bigtable_rows). No lock acquisition, no writes. The caller (MulticutOperation._apply) holds the L2 chunk locks and fires the consolidated persist — OCDBT chunks + bigtable rows — inside an inner lock scope. seg_writes is a flat list of (voxel_slices, data) pairs across all reps so write_seg_chunks fires every chunk write as one parallel tensorstore batch. Removes the per-rep serialization in the old write_seg_chunks loop. get_seg_source_and_destination_ocdbt gains a pinned_at kwarg, forwarded to build_cg_ocdbt_spec — used later by the recovery path.
New L2 chunk counterpart to IndefiniteRootLock, keyed by chunk row. L2ChunkLock now acquires via lock_by_row_key_with_indefinite so a temporal acquire sees a crashed op's indefinite cell and refuses. IndefiniteL2ChunkLock records its chunk scope on the op-log row's L2ChunkLockScope column at __enter__ and clears it on clean exit, giving recovery a durable scope without a bigtable-wide scan. Both indefinite locks (root and L2 chunk) now short-circuit __exit__ when an exception is propagating: cells stay held, scope stays set. Partial writes may exist after an exception; leaving the cells forces subsequent ops to refuse at lock-acquire and the operator to run recovery explicitly. privileged_mode=True on either lock is the operator recovery escape hatch: skips acquire, pre-populates acquired_keys so __exit__'s value-matched release deletes the crashed op's cells. RowKeyLockRegistry (test helper) gains the three new kvdbclient primitives.
Operator recovery for SV-split ops that crashed mid-write. A worker death inside IndefiniteL2ChunkLock leaves per-chunk indefinite cells set and records the chunk scope on the op-log row. Recovery reverts partial OCDBT writes using a version-pinned read of pre-op voxels, then replays the op normally. list_stuck scans OperationLogs for ops still at CREATED past a min-age threshold. replay(cg, op_id) runs cleanup_partial_writes followed by repair.edits.repair_operation(..., unlock=True); IndefiniteL2ChunkLock's privileged-mode __exit__ deletes the crashed op's pre-existing cells after the replay's writes land. Architecture-level operator guide at docs/sv_splitting_recovery.md, linked from docs/sv_splitting.md's Concurrency section.
Pass the op's timestamp from execute → _apply → split_supervoxels → split_supervoxel → copy_parents_and_add_lineage / add_new_edges so every new-SV bigtable mutation lands at the op's logical write time. Gets atomic visibility under a parent_ts filter and makes replay's override_ts actually control what time-filtered readers see after a repair_operation. Parent-copy and Child-list writes deliberately keep the old cell's timestamp so pre-op readers still see the old hierarchy. Replace the seven-tuple in/out soup in split_supervoxels with named dataclasses: SvSplitTask (plan_sv_splits → split_supervoxel input) and SvSplitOutcome (split_supervoxel's per-task output). Drop the two unused return fields on split_supervoxel.
list_stuck filter switches from Status==CREATED to "L2ChunkLockScope set and Status != SUCCESS past min_age". The authoritative signal for stuck-ness is "scope recorded, not cleared" — worker crash (Status stays CREATED) and Python exception during persist (Status=EXCEPTION after Fix 1) both fall under it. Ops without scope aren't blocking other ops and are outside stuck_ops' concern. replay now verifies each chunk in the recorded scope actually has Concurrency.IndefiniteLock held by this op_id before running cleanup or repair. If cells are missing or held by a different op, raises with a clear error. Protects against double-replay (first run already released cells) and out-of-band clearing (manual bigtable edit, buggy release path) — both would have cleanup_partial_writes revert chunks that aren't ours.
fork_base_manifest is now an explicit step — invoked from the ingest CLI's --ocdbt path or the seg_ocdbt notebook — rather than being auto-triggered on first ws_ocdbt_scales access. ws_ocdbt_scales asserts fork_exists() so a missing fork fails with a clear pointer instead of a tensorstore mismatch/not-found error.
Stuck-op detection keys off L2ChunkLockScope being populated, not Status=CREATED — that filter also covers caught-exception paths (Status=FAILED) where cells are held but the row isn't CREATED. Recovery now verifies each chunk's IndefiniteLock is actually held by the op before cleaning up, so a stale scope can't have us revert chunks another op owns. Reflect both in docs and the list CLI help. Drop "mode-downsample" from the SV-split diagram — tinybrain owns the algorithm; the doc shouldn't pin it.
`_rep_bbox` enveloped every piece of the cross-chunk-connected rep — for physical SVs split into many pieces across chunks, the bbox grew far wider than the cut surface needs. Replace with `_coords_bbox`: envelope of the user-placed source/sink coords plus a one-chunk margin (matches the existing L2 lock margin and 1-voxel shell). After the seg read, `cut_supervoxels` is intersected with the IDs present in seg, so the "whole sv" set names only the rep pieces the bbox actually touches. Pieces of the rep outside the bbox keep their existing IDs — their cross-chunk edges to in-bbox split fragments are routed via the 1-voxel shell, edges between two unsplit pieces don't change. Adds `TestCoordsBbox` covering envelope+margin, volume-bound clipping, and that `plan_sv_splits` returns a tight bbox regardless of how distant the rep's other pieces sit.
handle_supervoxel_id_lookup and id_helpers.get_atomic_ids_from_coords both short-circuit to lookup_svs_from_seg whenever ocdbt_seg is true, regardless of node-id layer. 2D slice clicks send L1 IDs from a view that may be stale after an SV split; 3D mesh clicks send a root and no L1 at all. Either way the current SV is what matters, so we read seg at the click coords and let downstream same/different-root checks surface any staleness with the sv_id->root diagnostic. Also vectorize lookup_svs_from_seg's per-coord indexing into a single advanced-index op.
`_schema_from_src` was passing both `domain` and `shape` to ts.open when cloning the source schema to the destination. For sources with a non-zero `voxel_offset`, `domain` carries absolute bounds (e.g. [17756, 62244)) while `shape` implies an origin of 0, and tensorstore refuses to merge them — base creation fails on any precomputed source that doesn't start at the origin. `domain` already encodes both extent and offset, so passing only it is sufficient and avoids the conflict.
cloudbuild now uses BuildKit registry cache (`:buildcache`) so unchanged stages reuse the prior build's layer artifacts and already-warm nodes skip re-downloading them on pull. The fresh-ingest CLI no longer infers `ocdbt_populate_base` from `base_exists` — manifest presence didn't reflect whether chunks were actually copied. Replaced with an explicit `--populate-base` flag; operator sets it on first ingest and omits on subsequent runs.
Edit operations dump {WATERSHED}/graphene_errors/{cg.graph_id}/{op_id}.json
on AssertionError/RuntimeError/unknown Exception with op type, user,
inputs, exception class+message, traceback. err_dump.read_err_artifact
reads it back. Assertion messages across cutting, edits, edges/stale,
sv_split/edges, sv_split/edits, and operation now carry the values that
disagreed (root→l2_count, parents, duplicates, new_id vs got, chunk
mismatch pairs, stale nodes) and the broken positional logger.error in
CreateParentNodes is rewritten as an f-string so id/parent/root actually
reach the log.
… future work
NOTES.md captures the seg-read-union + subgraph-union dedups (opt-in
when n_tasks > 1, byte-equal single-rep path) so the plan stays
discoverable in-repo. README §9 surfaces the new
{WATERSHED}/graphene_errors/{cg.graph_id}/{op_id}.json artifact path
and the read_err_artifact helper.
Add pychunkedgraph/pipeline/: a workload-agnostic core (grid scatter, per-chunk
Bigtable lock, exit-code contract, worker harness) shared by ingest and meshing
subpackages. Ingest builds L2/parent chunks under a per-chunk lock; meshing runs
marching cubes / sharded stitching, idempotent, plus one-shot mesh-metadata setup.
Self-contained except the chunk-compute functions; dispatch.py is the only
branch-specific shim. Entrypoints: python -m pychunkedgraph.pipeline.{ingest,meshing}[.setup].
3fc8548 to
9f752e5
Compare
Adapt ingest dispatch/setup to pcgv3 chunk builders + graph classes; add migrate + migrate_cleanup (--clean) workloads. Upgrade clean is a function arg, earliest_ts reads cached meta (set in migrate setup); no CLEAN_CHUNKS/EARLIEST_TS env.
9f752e5 to
1a8c48f
Compare
The bigtable.data client leaves a non-daemon channel-refresh thread that atexit join()s forever, so workers hung after finishing a batch until the pod grace period SIGKILLed them (exit 137, breaking exit-42 FailIndex). os._exit with the real return code once stdio is flushed. Co-Authored-By: Claude <noreply@anthropic.com>
lock.py used the classic bigtable client (conditional_row) which the data client doesn't have -> AttributeError on acquire. Rewrite acquire/renew/release on kvdbclient lock_by_row_key, keep the done marker via mutate_row/_read_byte_row; worker passes cg.client. Co-Authored-By: Claude <noreply@anthropic.com>
The package handler propagated to root, so entrypoints with a root handler printed every record twice; chunk coords rendered as np.int64. Co-Authored-By: Claude <noreply@anthropic.com>
The root-layer pod runs the sanity suite as its final step, so every ingest ends verified; existence() now raises instead of only printing diagnostics. A failed check fails the pod without re-opening the chunk. Co-Authored-By: Claude <noreply@anthropic.com>
The .setup one-shots returned normally, so graph I/O left the bigtable.data channel thread hanging the pod (mesh-meta stalled ~20m). Add run_and_exit in pipeline/__init__ — main() code or 0, SystemExit code, else traceback+1, then flush + os._exit — and call it from all six worker/setup entrypoints, replacing the three inline copies. Co-Authored-By: Claude <noreply@anthropic.com>
setup's cg.create() raises ValueError on an existing table; --exist-ok catches it and skips (resume-safe), without it the error surfaces. Co-Authored-By: Claude <noreply@anthropic.com>
Drop the vendored grid/harness/lock/exit_codes for the shared cave_pipeline.distribution package so the operator and every worker compute the same chunk-scatter bijection from one source; workers inject cg_factory and layer_bounds into the generic harness. Co-Authored-By: Claude <noreply@anthropic.com>
Picks up the 1-byte chunk-done marker the ingest workers write. Co-Authored-By: Claude <noreply@anthropic.com>
meta resolution/bounds derive from the watershed info JSON; sv lookup and the seg fallback read voxels via a neuroglancer_precomputed handle. cloud-volume stays a lazy ws_cv hatch (meshing/diagnostics). Co-Authored-By: Claude <noreply@anthropic.com>
nested imports (graph_tool via a _graph_tool shim) keep graph_tool, scipy, pandas, networkx, and cloudfiles off the cold import path; first use pays the load. bump kvdbclient to 0.7.1 (drops its cloud-volume). Co-Authored-By: Claude <noreply@anthropic.com>
…ds_by_label fastremap 1.20.0 emits 6-conn boundary voxels per label natively, so the `_label_boundary_mask` axial-diff pass and the `vol *= mask` mutation both go away. The point_cloud output (and therefore downstream KDTree min-distance queries) is unchanged. Co-Authored-By: Claude <noreply@anthropic.com>
ws_ts_scale(mip) reads the target scale (non-OCDBT mip>0 read mip 0). Mesh block size derives per-axis from the watershed pyramid, so chunk_size leaves mesh_config; setup rejects an out-of-range mip. Tests mock the tensorstore watershed reads. Co-Authored-By: Claude <noreply@anthropic.com>
The pipeline entrypoint carried a verbatim copy of setup_mesh_meta and MeshConfig that still required mesh_config.chunk_size, so mesh-meta failed once chunk_size was dropped from the dataset yaml. Import the single source of truth from pychunkedgraph.meshing instead. Co-Authored-By: Claude <noreply@anthropic.com>
PCG isn't on PyPI, but the image must report the pushed tag. The image pip-installs the package (--no-deps) with the tag fed to setuptools_scm via a cloudbuild build-arg; the hand-bumped literal + bumpversion are gone. Co-Authored-By: Claude <noreply@anthropic.com>
Non-meshing modules (app routes, sv-split profiler, pipeline/ingest entrypoints) imported meshing eagerly, pulling cloudvolume on import. Nest those imports so cv loads only when meshing actually runs. Co-Authored-By: Claude <noreply@anthropic.com>
Meshing splits initial from edited roots at a timestamp boundary, but sampling one root is unreliable: skip connections spread root creation times across layers. Instead, stamp earliest_ts when the root layer is written — the explicit cell timestamp shared by every root, lifted +500ms so the boundary sits strictly above them. get_earliest_timestamp returns it pre-edit; derive_initial_ts consumes it. Migrate no longer clobbers an ingest-stamped value. Co-Authored-By: Claude <noreply@anthropic.com>
setuptools_scm rejects non-PEP-440 strings, so pushing a build-label tag (not a semver) failed the image build. Pass the version build-arg only for version-like tags; other tags build with the Dockerfile default untouched. Co-Authored-By: Claude <noreply@anthropic.com>
A table copied/restored under a new graph_id must not let its meshes alias the source's. Hinge on dynamic_mesh_dir: an explicit graph-suffixed value shares initial meshes, so re-derive only the dynamic subdir; a bare "dynamic" or unset value gives the copy a private per-graph top-level dir. Move the whole rewrite into ChunkedGraphMeta.for_copied_graph so the graph class makes a single call. Co-Authored-By: Claude <noreply@anthropic.com>
Replace setuptools_scm (built 0.0.0 without a reachable semver tag) with a committed _version.py the release workflow bumps, commits, and tags in lockstep; setup.py reads it and the image build drops the version arg. Gate the workflow's Helm-chart update behind an opt-in input and add a workflows README. Versioning is per branch: main 2.x, pcgv3 3.x. Co-Authored-By: Claude <noreply@anthropic.com>
At chunk_layer + 1 == layer_count the parent column is rank-1 and the existing layer_agreement / np.where path returns chunk_layer instead of the root. Short-circuit to layer_count so meshes traversing up from the second-to-top layer reach the root. Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Supervoxel splits in OCDBT-backed segmentations are now atomic under worker failure, with an operator tool for recovering stuck ops.
Locks
Two indefinite locks, both holding their cells on exception:
On unclean exit, both hold and the L2 scope is written to the op-log row.
Stuck-op signal
Non-empty scope field on the op-log row. Covers both
CREATED(worker crash) andFAILED(caught exception) paths; status alone missedFAILED. Clean exit clears it.Cleanup-then-replay
Concurrent ops on non-overlapping chunks advance OCDBT during the outage, so recovery can't use one pinned view.
OCDBT time-travel
Tensorstore
versionfield in the spec (integer generation or ISO-8601Ztimestamp, interpreted ascommit_time ≤ T). Threaded aspinned_at. Pinned handles are read-only.Reader contract
dataset_infonow publishes the full kvstore spec (kvstack layers + OCDBT config + data prefixes) instead of path fragments, so readers pass it verbatim to tensorstore. The multi-scale open asserts fork presence, failing with a clear message instead of a tensorstore internal.Beyond the lock/recovery work, this PR also brings:
Supervoxel splitting
supervoxel-splitter[fast]package; PCG drives it viaPCG_SV_SPLITTER-resolvable class with a defaultGeodesicSplitter(dj3dbackend).update_edges, boundary-only point cloud viafastremap.point_cloud(shell=True), pykdtree replacing scipy cKDTree, parallel arrival fields.pychunkedgraph/graph/sv_split/README.md(design) +NOTES.md(known issues, future work, dedup plan).Pipeline + ingest
cave-pipeline[distribution]>=0.0.3core; pcgv3 ingest dispatch + migration workload routed through it.run_and_exithelper for clean worker exit (Bigtable thread no longer pins the process).ingest layer(OCDBT lifecycle),ingest mesh_meta,purge_layer,--exist-ok, richerstatuspanel.OCDBT
pinned_at) reads + tensorstoreversionfield threaded as the time-travel handle.ocdbt/is now a package;OcdbtConfigdataclass + yaml-driven config; coordinator-managed populate; on-disk config wins on open.dst.write(src).Meshing
publish_edit.unsharded_mesh_dir; copied-table mesh dirs isolated by layout;dynamic_mesh_dirin info JSON.calculate_stop_layershort-circuits at top layer.Observability
pychunkedgraph/graph/err_dump.py: every AssertionError / RuntimeError / unknown Exception in the edit path writes{WATERSHED}/graphene_errors/{cg.graph_id}/{op_id}.jsonwith op type, user, inputs, exception, traceback.read_err_artifact(cg, op_id)reads it back.edits.py,cutting.py,sv_split/edits.py,edges/stale.pynow name the values that disagreed.VERBOSElog level for stage-summary diagnostics;HierarchicalProfileris fork-safe + tree-reported; op-id-tagged log lines.Deploy / build
python:3.14-slim+ conda env viaconda-pack;WORKDIR /app+procps;ENV LD_LIBRARY_PATH=$VIRTUAL_ENV/libso the loader picks conda's libpython over the slim base's.uwsgi.ini:reload-on-rss = 2048(per-worker RSS cap post-request);disable-loggingfor kube-probes;setuptools_scm-derived version stamped from a committed literal.docker buildxwithdocker-containerdriver + registry cache.Deps + env
fastremap>=1.20.0,cave-pipeline[distribution]>=0.0.3,<0.1,supervoxel-splitter[fast]>=0.1.0,pykdtree>=1.4.3,tinybrain,kvdbclient>=0.7.0,messagingclient>0.3.0.