Skip to content

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954

Open
Claudius-Maginificent wants to merge 318 commits into
v3.1-devfrom
feat/platform-wallet-shutdown-join
Open

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
Claudius-Maginificent wants to merge 318 commits into
v3.1-devfrom
feat/platform-wallet-shutdown-join

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Why this PR exists

Problem. The SPV wallet's three sync coordinators (identity / platform-address / shielded) and the event-adapter each ran their own ad-hoc worker lifecycle. The coordinators must run on dedicated OS threads (std::thread + Handle::block_on) because the dash-sdk sync futures are !Send, so a tokio task group cannot host them. The teardown logic was triplicated across the three coordinators plus two parallel bookkeeping structures, and two teardown paths could leave a live coordinator thread detached while the host freed its context.

What breaks without it (concrete):

  1. F1 — use-after-free on shutdown. shutdown()/quiesce() joined a coordinator by taking its JoinHandle by value into a future wrapped in a 30 s tokio::time::timeout. If the timeout fired mid-poll, the future dropped and the handle dropped with it — silently detaching a still-live OS thread without parking it. The next destroy saw background_join == NoneNotRunning → reported clean, so the host freed the callback context while the detached thread could still fire into it.
  2. F2 — shielded-store data-loss / UAF. clear_shielded() wiped the shielded store gated only on the current coordinator quiesce being clean. A prior-generation shielded thread parked after a stop→start race still held Arc refs to the persister being wiped, so the clear could run under a live writer.

Blocking relationship. Builds on the in-branch shutdown-join work. Introduces a generic engine in the shared dash-async crate that rs-dapi could adopt later (no rs-dapi change here).

What was done?

  • New ThreadRegistry<K> in dash-async — one generic engine managing both !Send OS-thread workers and Send tokio tasks behind a uniform API: a generation-match epilogue, reap-or-park orphan tracking, weight-ordered graceful shutdown(), and per-worker async drain hooks. The is_syncing/quiescing domain barrier stays in each coordinator and is handed to the registry as a DrainHook — the registry owns the lifecycle machinery, not the domain semantics.
  • Migrated the three coordinators + the event-adapter onto the registry; deleted the triplicated lifecycle code and both parallel structures (background_* fields, the CoordinatorOrphans type); extracted a CoordinatorLifecycle helper to dedup the remaining shared is_syncing/quiescing protocol — and after the latest review pass, the triplicated coordinator start() poll-loop body itself was hoisted into CoordinatorLifecycle::spawn_periodic_loop (each coordinator now provides only its domain-specific pass thunk).
  • F1 fixed structurally — joins keep the JoinHandle owned by the registry slot (never moved into a cancellable future) and deterministically re-park on timeout; F2 fixed — a single key-scoped any_alive_for() gate over live + parked workers blocks the store wipe.
  • Lifecycle hardening (from three internal review rounds): generation-guarded quiesce, graceful thread-spawn-failure re-park with full slot-config rollback, panic-safe epilogue, a terminal closing latch, a restarted worker's prior parked under the slot lock, clear_shielded's drain bounded by a timeout (no host ANR) and its quiescing gate held continuously across drain → liveness-check → wipe, and the quiescing↔is_syncing handshake made self-fencing (SeqCst).
  • Cancel-only shielded_sync_stop FFI — the bridge no longer raises its own AtomicFlagGuard on quiescing, eliminating a race where a concurrent stop's guard Drop could lower the gate clear_shielded holds continuously. The single join point for shielded teardown is now destroy (or shielded_clear for the clear flow); shielded_sync_stop matches the cancel-only convention of identity_sync_stop and platform_address_sync_stop.
  • Per-key clearing latchThreadRegistry::hold_clearing(key) -> ClearingGuard RAII latch refuses start_thread/start_task for any key being cleared; wired into clear_shielded_inner alongside the quiescing gate. Closes the residual where a full shielded_sync_start() racing clear_shielded could otherwise reopen the gate and re-persist into the just-wiped store.
  • Event-adapter cancel-drainwallet_event_adapter_loop now drains the broadcast::Receiver via bounded try_recv() on cancel before exiting, so an event emitted just before stop (notably TransactionInstantLocked, which P2P does not replay) is still persisted. Bound (CANCEL_DRAIN_BUDGET = 4096) counts every attempt — including Lagged — toward the cap so a flood cannot stall teardown.
  • Type collapse — the wallet-side CoordinatorThreadStatus / CoordinatorExitStatus wrappers were collapsed onto dash_async::WorkerStatus / ShutdownReport<WalletWorker> directly (both were introduced in this PR, never released — no released-API break). shutdown() now returns ShutdownReport<WalletWorker>; consumers re-export via platform_wallet::{ShutdownReport, WorkerStatus}. A never-started worker is encoded as key-absent-from-per_worker rather than a materialized NotRunning slot; the all_clean() verdict is unchanged.
  • #[must_use] on ShutdownReport and CoordinatorExitStatus so silently dropping the UAF-avoidance status now warns at compile time.

How Has This Been Tested?

  • platform-wallet lib: 212 (default) / 313 (--features shielded) passing — at/above baseline, no coverage regression. Net +3 over the baseline includes new T5/T11/cancel_all regressions.
  • dash-async: 35 unit tests (+1 doctest), including the F1/F2 / cancel-drain bound / per-key clearing latch (3 latch tests) regressions. The F1/F2 + lifecycle fixes are proven non-vacuous (reverting each fix makes its regression test fail, then restored).
  • platform-wallet-ffi: 108 (--features shielded) passing.
  • cargo clippy --workspace --features shielded --all-targets -- -D warnings clean; cargo fmt --all --check clean.
  • Validated by a functional/spec QA pass and three full three-lens code-review rounds (security/concurrency, Rust quality, project consistency); the latest round adversarially re-verified the four critical invariants (F1, F2, SEC-001, SEC-002) intact after the cancel-only stop + per-key latch — concurrent stop can no longer lower a continuously-held clear gate, and a concurrent (re)start for the shielded key is no-op'd by the registry for the whole clear flow.
  • All review threads from the two earlier rounds (T1–T11) addressed; their landing commits are referenced inline in the resolved threads.

Breaking Changes

  • PlatformWalletManager::shutdown() returns dash_async::ShutdownReport<WalletWorker> (was ()); FFI destroy maps !all_clean()ErrorShutdownIncomplete — hence the ! in the title. A never-started worker is encoded as absent-from-per_worker rather than NotRunning in a materialized field; all_clean() is unchanged.
  • The wallet-side wrapper types CoordinatorThreadStatus / CoordinatorExitStatus (both introduced in this PR, never released) were collapsed onto the dash_async equivalents; consumers re-export via platform_wallet::{ShutdownReport, WorkerStatus}.
  • New additive public API in dash-async (ThreadRegistry, WorkerStatus, WorkerConfig, ShutdownWeight, ClearingGuard, ShutdownReport, …) and a new dev-only dash-async test-util feature.
  • FFI shielded_sync_stop is now cancel-only (does not return ErrorShutdownIncomplete); the join + orphan-liveness contract for hosts moves to destroy and shielded_clear. Swift PlatformWalletManager.deinit updated accordingly.
  • No breaking change to the remaining platform-wallet public/FFI surface.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section
  • I have made corresponding changes to the documentation if needed

🤖 Co-authored by Claudius the Magnificent AI Agent

lklimek and others added 30 commits May 22, 2026 15:17
…tBytes, never raw bytes (CMT-002)

Add `SecretStore` as the public, never-leaking secrets entry point.
`get` yields a zeroizing `SecretBytes` (a raw `Vec<u8>` never crosses the
boundary); `set` takes `&SecretBytes` so callers cannot pass an unwrapped
buffer. The `File` arm delegates to new inherent typed methods on
`EncryptedFileStore`, returning `FileStoreError` losslessly so
`WrongPassphrase` vs `Corruption` vs `Busy` stay distinct. The `Os` arm
projects `keyring_core::Error` best-effort into the new
`FileStoreError::OsKeyring { kind }` payload-free discriminant. The
internal `CredentialApi`/`CredentialStoreApi` SPI impls are unchanged;
`SecretStore` wraps them.

Docs (SECRETS.md, lib.rs, secrets/mod.rs) present `SecretStore` as the
consumer front door with keyring_core as the internal SPI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed-path error distinction

Drop the boxed-marker recovery in `From<FileStoreError> for
keyring_core::Error`: the SPI seam is now lossy and string-only, with no
`Box<dyn>` round-trip. The lossless `WrongPassphrase`/`Corruption`/`Busy`
distinction lives on the typed `SecretStore` path. Repoint the in-crate
SPI tests that recovered the typed error through `NoStorageAccess` onto
the typed path, asserting only the lossy projection at the seam.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ore NoStorageAccess for lossless SPI recovery

Revert the string-only `From<FileStoreError> for keyring_core::Error`:
`WrongPassphrase` / `Busy` now box the single typed `FileStoreError`
itself into `NoStorageAccess`, so external keyring_core-SPI consumers
recover the variant losslessly via
`source().downcast_ref::<FileStoreError>()`. No second type is
reintroduced (FileStoreFailure stays deleted), satisfying the original
error.rs objection. The `BadStoreFormat` group has no box slot, so it
carries only a secret-free string and stays fully typed on the
`SecretStore` path. Seam tests assert downcast recovery and the
secret-free BadStoreFormat rendering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re __secrets-test-helpers (CMT-008)

The in-RAM MemoryCredentialStore test double had no consumer outside its
own module. Its behaviors (label rejection, namespacing, zeroizing
storage) are already covered by the tempdir-backed EncryptedFileStore
tests, so the store and its dedicated `__secrets-test-helpers` feature
are retired. The dev-dependency self-reference uses `__test-helpers`,
not the secrets one, so nothing else needs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pose_secret guard scan (CMT-012/010)

CMT-012: `hex::decode_to_slice` accepts uppercase, so `parse_service`
now rejects any A-F before decoding — the service string is always
constructed lowercase via `WalletId::to_hex`, making lowercase a clean
parse invariant. Adds a test that an uppercase-hex service is rejected
and the lowercase form of the same bytes is accepted.

CMT-010: the expose_secret leak guard joined only a 2-line window, so a
3+-line `tracing::…(field = expose_secret(), …)` call slipped through.
The scan now groups whole statements (concatenating until parens
balance and a `;`/`{` is seen) so the sink and `expose_secret` land in
one window. Adds a non-vacuous planted 3-line case the widened scan
catches and the old 2-line window would have missed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ader (CMT-013/014)

CMT-013: removes internal finding-ID and rework narration (SEC-00N,
EDIT-N, CMT-NNN, "trimmed fork of", "the defect: used to…") from
comments across src/secrets/, keeping the present-state behavior and
requirement-spec rationale. Comments describe what IS, not the journey.

CMT-014: removes the embedded MIT license-text block atop secret.rs
(first-party, same org, matching license) and replaces the module
header with a one-line doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… failures (Display-only, secret-free)

Library-idiom + security-event logging only; no blanket error! at
routine return sites.

- Swallowed mlock failures in secret.rs (3 sites) move from debug! to
  warn!: they are .ok()-swallowed and caller-invisible, yet
  security-relevant (the secret may be swappable to disk or land in a
  core dump). Display `{e}` only.
- Corruption/tamper detected in get()/rekey() (post-verify AEAD tag
  failure → Corruption): error! with the non-secret wallet-id/label,
  Display only, never the secret or the raw keyring error.
- Vault write failure in write_vault: warn! with the io error's
  Display; paths are caller-supplied non-secret.

Never `{:?}` a keyring_core::Error and never log a secret wrapper; all
new lines use `%` Display, so the EDIT-2 no-debug-format guard still
passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ShieldedChangeSet, SubwalletId, and ShieldedNote now derive
serde::Serialize/Deserialize behind the `serde` feature. Without
them the serde+shielded feature combo failed to compile: the
serde-derived PlatformWalletChangeSet carries a cfg(shielded)
Option<ShieldedChangeSet> field whose type had no serde impl.
This is the combo built by the macOS CI lane, which was red on
this PR head.

The leaf types are plain POD ([u8;32]/u32/u64/bool/Vec<u8>), and
the shielded delta is meant to be persisted for cold-start
rehydration, so deriving (not #[serde(skip)]) is the correct fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_secrets-test-helpers references (QA-002)

The `__secrets-test-helpers` feature and its `secrets::MemoryCredentialStore`
in-RAM test double were removed in the keyring_core SPI rework. Remove the
stale feature row from the README Cargo features table and replace the
obsolete backend bullet in SECRETS.md with the current test pattern: a
tempdir-backed `EncryptedFileStore` (or `SecretStore::file`) constructed via
`tempfile::tempdir()`, available under the default `secrets` feature with no
special flag required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`cargo clippy -p platform-wallet-storage --no-default-features
--features sqlite,cli` failed with three dead-code errors:
`load_state`, `decode_pair_key`, and the non-test `ContactsRecords`
arm. On this PR standalone their only consumer is the
`__test-helpers` wrapper `load_state_for_test`; the production
reader that will consume them ships with the separate rehydration
feature.

Gate the reader, its key decoder, `ContactsRecords`, and the
imports they pull behind `__test-helpers` (collapsing the duplicate
`ContactsRecords` definitions into one). The off-state build now
goes clean; the default `--all-targets` build (which pulls
test-helpers via dev-dep) and the prod `sqlite,cli,secrets` combo
stay clean. Un-gate when the rehydration feature adds the
production consumer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llapse V002 (CMT-001/006)

Rewrite V001 as hand-written CREATE TABLE ... PRIMARY KEY ... FOREIGN KEY
SQL with native ON DELETE CASCADE, run through refinery. Drop the barrel
dependency (it can't emit composite FK clauses for sqlite3) and delete
the FK-emulation triggers plus the whole of V002 — native FK enforces
INSERT / UPDATE re-parenting / DELETE-cascade across all 18 child tables,
including the composite identity_keys/dashpay_profiles -> identities FK.

Add a single open_conn choke-point (src/sqlite/conn.rs) that every
connection-open site routes through; for read-write handles it enables
PRAGMA foreign_keys=ON and reads it back, hard-erroring with the new
WalletStorageError::ForeignKeysNotEnforced if the linked SQLite ignores
the pragma (R-LOW: no compile-time knob can force it).

R2: core_utxos.spent_in_txid clearing to NULL on transaction delete
stays a single trigger — a native composite ON DELETE SET NULL nulls
every FK column and wallet_id is NOT NULL, so SQLite would reject the
delete (1299). The single-column trigger preserves the lazy semantics.

Also harden wallet_metadata.birth_height read against u32 truncation
(CMT-014) and the backup destination against TOCTOU via create_new.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…count UTXOs (CMT-003/011)

core_derived_addresses gains an account_index column, populated from the
derived-address path via the now-pub(crate) accounts::account_index
helper. apply() writes derived addresses BEFORE UTXOs in the same
transaction so the UTXO writer can resolve each UTXO's owning account by
its rendered address (SELECT account_index FROM core_derived_addresses
WHERE wallet_id=? AND address=?), replacing the hardcoded 0. A miss warns
and defaults to 0. The spent-only synthetic path (CMT-011) uses the same
lookup. Sync-height reads now u32::try_from instead of `as u32` (CMT-012).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entity-key writes (CMT-005/004)

platform_addrs::apply rejects an entry whose wallet_id differs from the
flush scope with the typed WalletIdMismatch before the INSERT (the native
FK now also backstops it, but the typed error pinpoints the mismatch).

identity_keys::apply rejects an upsert whose entry (identity_id, key_id,
wallet_id) disagrees with its map key / flush scope (IdentityKeyEntryMismatch),
so the typed columns and the serialized blob can never describe different
rows on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peek errors (CMT-007/009/010)

Remove the delete-wallet subcommand (CLI surface only; the library
delete_wallet API stays). peek_schema_version now returns Result and
distinguishes a fresh DB (no history table -> Ok(None)) from a transient
open/query failure (Err), so `migrate` can no longer print a wrong
`applied: N` from a swallowed error. run_backup defers refuse-to-overwrite
to backup_to's typed BackupDestinationExists instead of a duplicate
is_file() guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i-account, overflow guards)

New tests/sqlite_hardening_3625.rs proves: native FK rejects an orphan
child and an identity_keys row with no identities parent; mixed-wallet
platform-address write fails typed (CMT-005); identity-key entry/key-vs
-blob mismatch is rejected (CMT-004); multi-account UTXOs bucket to their
real account and an undeclared address defaults to 0 (CMT-003/011); the
birth_height and sync_height overflow cases error rather than truncate
(CMT-012/014); a compaction-marker-only wallet survives load (CMT-002).

Update existing tests for the rewritten schema: TC-027 derived-address
insert carries account_index; the error-classification exhaustiveness
table covers the three new variants; TC-072 asserts the delete-wallet
subcommand is gone; FK test header drops the trigger wording. A conn.rs
unit test exercises the foreign-key read-back assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ope conn doc (QA-001/004)

TC-048 now explicitly asserts the UTXO row survives the transaction
delete (COUNT==1) with wallet_id/value/account_index preserved, in
addition to spent_in_txid being NULL — so a future change turning the
single-column trigger into a cascading DELETE fails loudly instead of
passing implicitly.

Scope the conn.rs choke-point doc to library connection paths and note
the CLI's read-only peek_schema_version probe opens directly (no
mutations, and open_conn is pub(crate) so the separate bin target can't
reach it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ency

CMT-003's pure-SQL account_index lookup removed the last reference to
key-wallet-manager; cargo machete (macOS CI) flagged it as unused. No
`.rs` file references key_wallet_manager under any cfg/feature, and the
optional dep carried no extra feature activation, so removal is clean —
not papered over with a machete ignore. Drop the dependency line and its
`dep:key-wallet-manager` wiring from the sqlite feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cOS path symmetry

`backup_to` canonicalizes its returned path; on macOS the temp dir lives
under `/var` (symlink to `/private/var`), so the canonical `written`
path no longer prefixes the un-canonicalized `out_dir` and the
`starts_with` assertion failed (Linux has no such symlink, hence
green locally). Canonicalize the expected dir before the check,
mirroring TC-032. TC-031 is the only affected assertion — TC-032 was
already symmetric, and TC-035 only feeds the canonical path back into
restore_from without comparing it to a temp path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…L/SHM perms (CMT-001/003/004)

CMT-001 (BLOCKING data-loss): restore_from() unlinked <dest>-wal /
-shm BEFORE staging + integrity / schema / max-version checks. A failed
gate left the main DB intact (NamedTempFile drop) but the sidecars were
gone — un-checkpointed pages lost. Move the unlink to immediately
BEFORE tmp.persist() so both succeed atomically or neither runs.

CMT-003 (HIGH security): the create_new() call in run_to() created
backup files at the umask-default mode before apply_secure_permissions
chmodded them to 0o600. Set OpenOptionsExt::mode(0o600) under cfg(unix)
so the file is never world/group readable, even briefly. Keep the
post-open chmod as defense-in-depth (Windows no-op, re-tightens).

CMT-004 (HIGH security): apply_secure_permissions only chmodded the
main DB; WAL/SHM siblings inherited the process umask. Extend the same
helper to chmod <path>-wal / <path>-shm to 0o600 when present (silent
skip when absent, Windows no-op). One function, idempotent — matches
the user preference for a single API surface.

Regressions:
- tests/sqlite_restore_staged_validation::rejected_restore_leaves_wal_shm_siblings_intact
- tests/sqlite_permissions::wal_and_shm_sidecars_are_chmodded_0o600

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ore_from (CMT-010)

restore_from() was streaming the entire source DB into a NamedTempFile
in the destination's parent dir BEFORE running the schema-history /
max-version gate. A caller passing a valid-but-huge incompatible DB
(CLI / FFI / UI import) filled the wallet partition before being
rejected.

Add cheap source-side schema-history presence + MAX(version) checks
against the source handle that was already opened for integrity_check
— failing fast saves the disk. Keep the staged-copy gate as the
TOCTOU-safe final check: a concurrent swap during the restore window
still gets rejected because the second check binds to the bytes
actually persisted.

Factored shared logic into migrations::max_supported_version() +
migrations::assert_schema_version_supported() so the open()-path gate
(CMT-005, next commit) reuses the same code.

Regression: tests/sqlite_restore_staged_validation::forward_version_rejected_before_staging

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
… failure (CMT-002 data-loss)

delete_wallet_inner drained the per-wallet buffered changeset BEFORE
the existence check / auto-backup / DELETE transaction, but never
restored it on any pre-commit failure. The operator's delete correctly
reported the backup or SQL error, yet the wallet's pending writes
vanished from the buffer even though no DELETE actually committed.

Mirror flush_inner's take/restore discipline: hold the drained
changeset in a Cell<Option<...>>, on success consume it AFTER
tx.commit(), on any error path restore it into the buffer via
self.buffer.restore(wallet_id, cs). If the restore itself fails (e.g.
poisoned buffer mutex), log the lost changeset at error level — the
delete error still surfaces (it's the primary cause).

Regression: tests/sqlite_delete_wallet::delete_wallet_restores_buffer_on_backup_failure
points auto_backup_dir at an unwritable path, primes a buffered
changeset, asserts (a) delete returns AutoBackupDirUnwritable and
(b) the buffered writes survive — a follow-up flush lands them.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ic with restore_from (CMT-005)

SqlitePersister::open ran the migration probe (count_pending) and
refinery::run, but never compared the on-disk MAX(version) against the
highest embedded migration. An older binary opening a DB produced by a
newer release saw pending_count == 0 and silently proceeded to decode
forward-schema blob columns — the migration version IS the entire
compatibility boundary for this crate (per blob.rs:4-6).

Add the same SchemaVersionUnsupported gate restore_from enforces,
calling the shared migrations::assert_schema_version_supported helper
introduced in the CMT-010 commit. Sites are now symmetric.

Regression: tests/sqlite_open_version_gate::open_rejects_forward_schema_version
forges a version=1_000_000 row, asserts the second open fails with
SchemaVersionUnsupported.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ent stores (CMT-008)

delete_wallet_inner previously released and re-acquired the connection
Mutex three times (existence-check / auto-backup / DELETE-tx). A
parallel store(wallet_id, cs) could slip in between any pair of those
scopes — in Immediate mode the racing flush landed AFTER the DELETE
commit (wallet reappears), in Manual mode a buffered store survived
the delete and was persisted by the next commit_writes() against a
freshly-recreated parent.

Approach (a): acquire self.conn()? ONCE at the top of
delete_wallet_inner and hold it across the full pipeline so Immediate
stores block on flush_inner's own self.conn(). For Manual mode (where
store doesn't take conn at all), do a post-commit second drain of the
buffer to discard any racing-buffered changeset — the wallet is being
removed, those writes are intentionally void. run_auto_backup already
accepted &Connection so no signature changes were required.

Regression: tests/sqlite_delete_wallet::concurrent_store_does_not_resurrect_deleted_wallet
spawns a worker hammering store() while the main thread deletes,
commits any remaining writes, asserts every PER_WALLET_TABLES row
count for that wallet_id is zero.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…e / typed AssetLockEntryMismatch (CMT-006/007)

CMT-006: blob::decode used bincode's NoLimit config, so an attacker-
controlled length prefix inside a Vec/String/map could drive the
allocator BEFORE the trailing-byte gate ever fired — a crafted backup
that passed restore_from's integrity check could OOM the host on the
next load(). Switch encode + decode to .with_limit::<16 MiB>() (well
above any legitimate per-row payload) and surface the cap-exceeded
case as a NEW typed variant `WalletStorageError::BlobTooLarge {
len_bytes, limit_bytes }` so operators can distinguish hostile/
corrupt oversize from a structural decode failure. Generic decode
failures (trailing bytes, schema mismatch) still surface as the
existing `BlobDecode` per user preference for precise variants.

CMT-007: asset_locks::decode_row decoded the typed outpoint from
op_bytes but then constructed TrackedAssetLock using the BLOB's
out_point + account_index. A torn write or restored corruption that
diverged blob-vs-typed would silently mis-bucket the lock under the
wrong account. Add a cross-check symmetric with IdentityKeyEntryMismatch
under a NEW typed variant `WalletStorageError::AssetLockEntryMismatch {
typed_outpoint, blob_outpoint, typed_account_index, blob_account_index }`.

is_transient() + error_kind_str() updated for both new variants — the
match remains wildcard-free per the existing #[non_exhaustive]-rejection
note in error.rs.

Regressions:
- src/sqlite/schema/blob.rs::tests::decode_rejects_oversize_blob_with_blob_too_large
- tests/sqlite_hardening_3625::asset_lock_typed_vs_blob_mismatch_rejected

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…inner blob (CMT-009)

IdentityKeyWire::into_entry decoded public_key_bincode with
bincode::decode_from_slice but bound the consumed count to `_`. A
corrupt or forward-incompatible inner payload whose IdentityPublicKey
prefix is valid would deserialize successfully even with garbage past
the typed length — leaving identity_keys weaker than every other blob
path. The outer blob::decode already enforces the same gate.

Capture consumed and surface BlobDecode("unexpected trailing bytes in
identity_keys.public_key_bincode") on mismatch (using the existing
blob_decode helper per user preference — trailing-bytes is the
canonical BlobDecode case, not a typed-variant case).

Regression: src/sqlite/schema/identity_keys::tests::into_entry_rejects_trailing_bytes_in_public_key_bincode

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…te (CMT-011)

list_active was byte-for-byte identical to load_state — same SQL,
same decode loop. Per the docstring's own admission, every persisted
status is considered active (consumed locks leave via
AssetLockChangeSet::removed), so the missing WHERE clause was
intentional but made the two functions semantically indistinguishable.

Option (a) per user preference: drop list_active, migrate the one
caller (tests/sqlite_persist_roundtrip.rs:377) to load_state, refresh
load_state's docstring to subsume the consumed-locks comment, and
verify zero remaining references with grep.

Also patches the wildcard-free match in tests/sqlite_error_classification.rs
to cover the new AssetLockEntryMismatch / BlobTooLarge variants added
in the prior CMT-006/007 commit (the exhaustiveness gate is what
this test exists to enforce).

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…MT-012)

The smoke test previously accepted any non-success via
assert!(!out.status.success(), ...) so a runtime failure unrelated to
unknown-subcommand handling (panic during clap setup, OOM, signal
kill) would still pass. Assert the explicit clap usage code Some(2)
before checking stderr content — costs nothing, tightens the contract.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
- sqlite_delete_wallet.rs: drop a useless std::fs::Permissions::from(...).
- sqlite_permissions.rs: silence clippy::field_reassign_with_default
  (consistent with the existing test files' top-level allow).

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…atch_only

load_from_persistor now rebuilds every persisted wallet watch-only from
its keyless AccountRegistrationEntry manifest (Wallet::new_watch_only)
and applies the keyless core-state projection on top. No seed material
is touched on the load path: signing keys are derived on demand later
through the MnemonicResolverHandle sign entrypoints, which carry the
fail-closed wrong-seed gate themselves.

Drops the SeedProvider port + WalletSecret/SecretPhrase/SecretSeed
payloads (and the storage CredentialStoreSeedProvider adapter that fed
them) — load no longer needs the abstraction. WrongSeedForDatabase
stays on PlatformWalletError for the sign-path gate. RT suite reshapes
to RT-WO (watch-only round-trip) + RT-Corrupt (per-row decode skip with
SkipReason::CorruptPersistedRow{kind: CorruptKind::MissingManifest}) +
RT-Z (no key material in any LoadOutcome / SkipReason surface).

apply_persisted_core_state and its F2/F3/F4 fixes are unchanged.

AR-7 residual risk on the load path is eliminated (no Wallet of a
signing type is constructed during load, so its Debug-leak surface is
gone from this path).

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
lklimek and others added 5 commits June 25, 2026 10:35
…iveness, like clear_shielded/destroy

shielded_sync_stop returned Success on status.is_clean() alone, ignoring a prior-generation shielded thread still parked alive as an orphan — asymmetric with clear_shielded/destroy and a misleading contract (the orphan still holds the host callback context). No live UAF today since Swift always does stop->destroy, but Success should imply no live shielded worker/orphan.

Add manager::shielded_worker_alive() (the same shielded-scoped any_alive_for gate clear_shielded consults) and have shielded_sync_stop return ErrorShutdownIncomplete when a parked orphan survives a clean drain. FFI ABI unchanged (same return-code semantics); docstring updated so Success accurately implies no live shielded worker/orphan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…cing gate continuously

SEC-001: clear_shielded's in-flight-pass drain was unbounded and the FFI clear bridge a bare block_on, so a heavy direct pass could hang the host (ANR). Bound it with a SHUTDOWN_JOIN_TIMEOUT_SECS timeout (mirroring shielded_sync_stop); on timeout the clear reports Timeout and aborts BEFORE the wipe, leaving the store intact. Split out clear_shielded_inner(drain_timeout) so the timeout path is testable without the 30s budget.

SEC-002/RUST-002: the gate lapsed between quiesce() returning (its RAII guard lowers the shared flag) and the post-drain re-raise, letting a direct sync_now/sync_wallet slip past any_alive_for and re-persist into the wiped store. Fix: raise+HOLD the gate via clear's own guard BEFORE draining, and drain via a new gate-neutral quiesce_under_held_gate (extracted cancel_join_and_drain shared with quiesce, which stays byte-identical — Fix-1 invariant untouched). The gate now stays raised continuously across drain, liveness check, and wipe; doc softened to note the only residual is a full start() racing clear (per-key-latch follow-up). PROJ-004: clear now calls shielded_worker_alive() instead of re-inlining any_alive_for. Also clarifies the quiesce doc that only shielded gates sync_wallet (platform-address's is intentionally ungated). TDD: both SEC tests proven non-vacuous (revert->fail->restore).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…story-tags

RUST-004: a failed thread spawn left the slot carrying the FAILED start's weight/drain/join_budget (and a bumped generation). Now snapshot the pre-start config and restore ALL of it on the Err path, so the re-installed prior keeps its own teardown config; generation rolls back too (the +1 is only observed under the slot lock and the failed start spawns no thread, so the rollback is net-zero and the externally-visible generation stays monotonic). New regression test asserts the restored config.

RUST-005: trimmed the duplicated park-under-lock rustdoc block in start_thread that repeated park_prior_locked's doc.

RUST-003/PROJ hygiene: removed [F1/F2/F3 FIX] history-tags from committed comments, replaced the 'Why F1 and F2 cannot recur' module section with present-state invariant descriptions, and fixed the glossary to reference the key-scoped any_alive_for (the gate store-wiping paths actually consult) rather than the registry-wide any_alive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…an-trigger contract

shielded_sync_stop now also returns .errorShutdownIncomplete when the drain was clean but a prior-generation shielded thread is still parked alive as an orphan (not only when the in-flight drain times out). Update the manager deinit comment to reflect both triggers; behaviour unchanged (the deinit already leaks one strong ref to the handlers on .errorShutdownIncomplete). Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…fencing (SeqCst)

SEC-003: the pass-gate handshake is a Dekker-style StoreLoad pair across two distinct atomics — quiesce does store(quiescing) … load(is_syncing); begin_pass does CAS(is_syncing) … load(quiescing). Release/Acquire do NOT order StoreLoad across separate locations, so by the annotations alone both sides could miss each other (begin_pass reads a stale quiescing==false and runs a pass past a raised gate while the drain reads a stale is_syncing==false and returns). It was sound only incidentally — registry.quiesce happens to take the slots Mutex (a fence) before returning; a lock-free refactor of that path would make the race live.

Promote the four handshake ops to SeqCst (a single total order guarantees at least one side observes the other): the gate-raise stores in quiesce, hold_quiescing_gate, and the registry drain hook; the is_syncing CAS (success) and quiescing load in begin_pass; and the is_syncing load in the drain. Gate lowering (reopen / RAII drop) and observational reads stay Release/Acquire — a stale-high gate read only makes a pass bail conservatively. Fix-1's gate-before-cancel + never-latched invariant is preserved (SeqCst is strictly stronger than the prior Release). Added a load-bearing-ordering comment at begin_pass. Not unit-testable (ordering); the existing gate/drain handshake tests still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

All three prior findings (911f99f) are resolved at 748c4f8: CoordinatorLifecycle::quiesce now self-raises the quiescing gate and drains in-flight direct passes (coordinator_lifecycle.rs:166-173, with a regression test); the shielded_sync_stop FFI now consults manager.shielded_worker_alive() and returns ErrorShutdownIncomplete when a prior-generation shielded orphan is parked (shielded_sync.rs:127-157); and start_thread now parks the prior worker under the slot lock via park_prior_locked before the out-of-lock reap, with a regression test in the registry. One in-scope cumulative concern remains: the shared quiescing AtomicBool means a concurrent quiesce()'s AtomicFlagGuard drop can lower the gate that clear_shielded's hold_quiescing_gate is supposed to keep raised continuously — the mechanism does not match the doc's continuous-hold promise in clear_shielded_inner. In practice this is gated by the host's UI-thread serialization, so it is a suggestion, not a blocker.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-173: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()
  `clear_shielded_inner` holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across drain → liveness check → wipe, expecting the shielded `quiescing` boolean to stay `true` for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct `sync_now`/`sync_wallet` are held off). But the same `AtomicBool` backs `CoordinatorLifecycle::quiesce`, which builds its own `AtomicFlagGuard` (coordinator_lifecycle.rs:171); `AtomicFlagGuard::drop` unconditionally stores `false` (rs-dash-async/src/atomic.rs:25-29). If `platform_wallet_manager_shielded_sync_stop` is invoked concurrently with `clear_shielded` (the FFI handle store only takes a brief read lock in `with_item`, so two calls on the same handle can interleave), the stop's `quiesce()` returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct `sync_now`/`sync_wallet` then sees `quiescing=false` in `begin_pass`, claims `is_syncing`, and can re-persist into the store `coord.clear()` is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on `clear_shielded` only excludes concurrent `shielded_sync_start`, not `_stop`. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that `begin_pass` also consults, or tighten the host-serialization doc to enumerate `_stop` too.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs Outdated
Comment thread packages/rs-dash-async/src/registry.rs Outdated
}

impl SlotState {
fn dormant() -> Self {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just impl Default?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0e73de7: fn dormant() replaced with a manual impl Default for SlotState. Manual (not derived) — #[derive(Default)] would zero join_budget, but the dormant slot must carry DEFAULT_JOIN_BUDGET = 30s. Both call sites switched to or_default().

(Reply-only — leaving for you to resolve.)

Comment thread packages/rs-dash-async/src/registry.rs Outdated
if slot.cancel.is_some() {
return;
}
let prior = slot.handle.take();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same code in both cases. DRY.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0e73de7: extracted the duplicated 8-line slot-rotation block into SlotState::prepare(cfg) -> (Option<WorkerHandle>, CancellationToken, u64). start_thread keeps its rollback-snapshot lines (captured BEFORE calling prepare) so a spawn failure can restore the prior config; start_task just calls prepare.

(Reply-only — leaving for you to resolve.)

// R2: gate-before-cancel — fully await the drain hook before the
// cancel signal is observed.
if let Some(drain) = drain {
drain().await;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have a timeout here, at least measure time and tracing::debug! it if operation takes more than expected

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0e73de7: drain hook is now timed via Instant::elapsed() with DEBUG/WARN logging in quiesce. WARN fires above DRAIN_HOOK_WARN_THRESHOLD = 10s (1/3 of DEFAULT_JOIN_BUDGET). No hard timeout — the caller still bounds the whole teardown; this is observability for a hung drain.

(Reply-only — leaving for you to resolve.)

let (drain, budget, my_gen) = {
let slots = self.lock_slots();
match slots.get(&key) {
Some(s) if s.cancel.is_some() || s.handle.is_some() => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use slot_alive() ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0e73de7: added an inline comment in quiesce explaining why s.cancel.is_some() || s.handle.is_some() is used rather than slot_alive() — a finished-but-unreaped handle still needs classifying into its terminal status; slot_alive() treats handle.is_finished() as not-alive and would short-circuit to NotRunning, dropping the result.

(Reply-only — leaving for you to resolve.)

///
/// Panic-strategy caveat (same as `AtomicFlagGuard`): the clear-on-panic
/// half relies on `Drop` running while the stack unwinds, so it holds under
/// `panic = "unwind"`. Under `panic = "abort"` a worker panic aborts the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we detect panic mode and log a warning if it's wrong?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added a small follow-up on thepastaclaw:tracker-1583-panic-mode-warning that detects the build panic strategy with cfg(panic = "abort") and emits a one-shot tracing::warn! from ThreadRegistry::with_reap_backstop when the crate is built with abort semantics. It also updates the nearby EpilogueGuard docs and adds a focused idempotence/cfg regression test.

Commit: thepastaclaw@7f353680000c

Validation: cargo test -p dash-async --lib registry::, cargo check -p dash-async --tests, cargo clippy -p dash-async --tests -- -D warnings, cargo fmt --all --check.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panic-strategy caveat is the correct handling: stable Rust has no runtime API to query the active panic strategy — only #[cfg(panic = "...")] is available, and iOS release intentionally uses panic = "abort", so a hard compile_error! gate would be wrong for that target. The doc note stays as a known limitation. No code change.

(Reply-only — leaving open for you to resolve after review.)

lklimek and others added 7 commits June 25, 2026 22:09
…Report + registry ergonomics

T5 (correctness): `shutdown()` discarded the `reap_orphans_impl` status,
so a panicked/errored orphan that finished within the reap grace
(`detached == 0`, status non-clean) silently passed `all_clean()`. Add a
new `orphan_status: WorkerStatus` field to `ShutdownReport`, fold it
through `all_clean()`, and capture it in `shutdown()`. Mirror the
propagation on the wallet boundary: `CoordinatorExitStatus::from_report`
now folds `orphan_status` through `detached_threads` when the survivor
count is zero, so the FFI/Swift surface stays honest. Two registry
regression tests (panicked + clean reaped orphan), two wallet boundary
tests (propagation + non-over-trigger).

T1 (ergonomics): replace `fn dormant()` with a manual `impl Default for
SlotState`. `#[derive(Default)]` would zero `join_budget`, so the impl
is hand-rolled to preserve `DEFAULT_JOIN_BUDGET = 30s`. Both call sites
switch to `or_default()`.

T2 (DRY): extract the duplicated 8-line slot rotation into
`SlotState::prepare(cfg) -> (Option<WorkerHandle>, CancellationToken,
u64)`. `start_thread` keeps its rollback-snapshot lines (captured before
`prepare`) so a spawn failure can restore the prior config.

T3 (observability): wrap `drain().await` in `quiesce()` with an
`Instant`-based timing log — DEBUG by default, WARN when the drain
exceeds `DRAIN_HOOK_WARN_THRESHOLD = 10s` (1/3 of `DEFAULT_JOIN_BUDGET`).
No hard timeout — the caller still bounds teardown.

T4 (invariant comment): document why `quiesce`'s slot-inhabited check
uses the raw `cancel.is_some() || handle.is_some()` rather than
`slot_alive()` — a finished-but-unreaped handle still needs classifying;
`slot_alive()` would false-`NotRunning` it.

Tests: 31/31 dash-async, 212/212 wallet default, 312/312 wallet
--features shielded. cargo clippy --all-targets -- -D warnings clean.

Addresses PR #3954 review threads T5, T1, T2, T3, T4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…ed mid-flight

T8/T9/T10: All three coordinator `stop()` docs claimed a pass already
inside `sync_now` "keeps running to completion". The loop's
`biased; cancel-first` select in `start()` actually drops the
`sync_now` future at its next `.await` the instant cancel fires (see
the existing comments at `start()`). Correct each docstring:

- `identity_sync.rs`: in-flight pass cancelled mid-flight; completed
  `persister.store(...)` calls committed, in-flight store abandoned.
- `platform_address_sync.rs`: in-flight pass cancelled mid-flight;
  `on_platform_address_sync_completed` host callback may not fire.
- `shielded_sync.rs`: in-flight pass cancelled mid-flight; persister-
  callback fan-out for completed stores has fired, in-flight stores
  abandoned.

The functional contract (use `quiesce()` for a real barrier) is
unchanged; only the description of stop's effect on the in-flight pass
is now accurate.

Addresses PR #3954 review threads T8, T9, T10.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…nates concurrent-AtomicFlagGuard race

T11 primary path: `platform_wallet_manager_shielded_sync_stop` ran
`quiesce()` inline, which raises the `quiescing` gate via its own
`AtomicFlagGuard`. A concurrent `shielded_clear` that holds the gate
continuously via `hold_quiescing_gate` would then have stop's
second guard's `Drop` lower the gate, opening a window for a new pass
to slip past the "no new pass" barrier.

Convert stop to cancel-only — matching `platform_address_sync_stop` /
`identity_sync_stop`: signal the loop, return immediately. The join +
orphan-liveness gate that prevents host UAF stays single-sourced on
`destroy` (and `shielded_clear` for the clear flow). With no second
guard ever created by stop, the primary T11 path is closed.

Ripple updates so docs stay accurate:
- `error.rs` `ErrorShutdownIncomplete` doc: removed `shielded_sync_stop`
  from the returners list — now only `destroy` + `shielded_clear`.
- `accessors.rs` `shielded_worker_alive()` doc: same.
- Swift `PlatformWalletManager.deinit`: both stops now `.discard()`;
  leak-on-incomplete logic keyed solely on `destroyCode`.
- Swift `PlatformWalletManagerShieldedSync.stopShieldedSync()` comment
  reflects cancel-only semantics — Swift-side generation bump still
  drops trailing events delivered to `@MainActor`.

DEFERRED (follow-up):
- T11 RESIDUAL (5d569459 per-key clearing latch in ThreadRegistry, for
  a full `start()` racing `clear`) — substantial new mechanism, should
  be its own focused PR.
- security-engineer adversarial verification of SEC-002 gate-continuity
  + F2 orphan-awareness intact under cancel-only — must run before
  merge.

Tests: 108/108 FFI tests pass with shielded; 312/312 wallet --features
shielded; cargo clippy --workspace --features shielded --all-targets
-- -D warnings clean. Swift side not built (no toolchain on host) —
6b6a5a0d Swift `deinit` build-verify remains a separate macOS merge
gate.

Addresses PR #3954 review thread T11 (primary path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…-adapter (T7)

The wallet-event adapter's `tokio::select!` broke on `cancel.cancelled()`
without draining the broadcast::Receiver's buffer. Events emitted just
before stop (e.g. a `TransactionInstantLocked`, which P2P does NOT
replay) were silently lost — the next sync subscribes fresh and only
sees future events.

Drain in the cancel branch via `try_recv()` before exiting. Bounded by
`CANCEL_DRAIN_BUDGET = 4096` (well above the 256-cap broadcast capacity,
so a normal teardown drains everything; pathological flood can't stall
shutdown — respects the SEC-001 bounded-teardown invariant).

Refactor: extract `process_event()` so the live loop and the cancel-
drain cannot drift in behaviour; `drain_buffered_events()` handles
`Lagged` with a warn and reports drained count for trace observability.

Caveat: block-driven events (TransactionDetected, BlockProcessed,
SyncHeightAdvanced) ARE re-derivable on the next SPV resync from
`last_processed_height`, so the primary risk closed here is the
non-replayed IS-lock case. The drain treats all kinds uniformly for
simplicity.

Tests: 312/312 wallet --features shielded pass — no regressions. A
focused unit test for the cancel-drain path requires synthesizing
upstream `WalletEvent`s and a persister mock for which no existing test
infrastructure is set up; tracked as a follow-up TODO.

Addresses PR #3954 review thread T7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…ones, surface deferrals

Phase 3 QA on the 4 PR #3954 review-thread commits surfaced 14 doc-rule
findings (2 MEDIUM, 12 LOW; Marvin + Adams + Smythe). Smythe verified
all 4 critical invariants (SEC-001/002, F1, F2) INTACT. Apply the
trivial fixes here; defer the remaining substantial items.

MEDIUM addressed:
- DOC-001 + PROJ-002: drop "Rationale: before this change..." historical
  paragraph (and its `prior quiesce-here design` cousin in the inline)
  from the `shielded_sync_stop` FFI docstring. The full causal chain is
  in commit f94fed9's body; readers have `git blame` for context. The
  invariant ("a second `AtomicFlagGuard` here would race a continuously-
  held gate in `shielded_clear`") is restated in present tense.
- DOC-002: strip ephemeral `T7:` review-thread ID from the core_bridge
  cancel-arm comment; trim 5 lines → 3.
- PROJ-001: drop stale `(mirroring shielded_sync_stop's timeout)` from
  `clear_shielded_inner` — post-refactor `shielded_sync_stop` has no
  drain and no timeout; the cross-reference is now false.

LOW addressed:
- DOC-003: trim `SlotState::prepare` doc 7 → 4 lines (private fn).
- DOC-004: trim `quiesce` R2 timing comment 5 → 3 lines.
- DOC-006: trim `drain_buffered_events` doc 4 → 3 lines.
- DOC-007: trim `from_report` doc + `detached_threads` inline.
- DOC-008: trim Swift `stopShieldedSync` comment 7 → 4 lines.
- PROJ-005: add `// TODO` marker above `drain_buffered_events` for the
  deferred unit test (no test scaffolding exists for the event adapter
  today). Project convention requires an in-source TODO.
- PROJ-006: correct the Swift `shieldedSyncGeneration` docstring — `stop`
  has no Rust quiesce barrier post-refactor; it cancels the in-flight
  pass, and `clear` is the one with the full quiesce.

Deferred (also surfaced as silent gaps by Adams PROJ-003/004, tracked
separately):
- 18d7acf4 — `CoordinatorThreadStatus`/`CoordinatorExitStatus` collapse
  onto `dash_async::WorkerStatus`/`ShutdownReport` (separate PR; T5 is
  fully fixed without it via the wallet-side `from_report` propagation).
  Tracked as a MemCan todo for follow-up.
- 0d895258 — broader ThreadRegistry comment-hygiene sweep (separate
  audit/sweep PR).
- 5d569459 — T11 RESIDUAL per-key clearing latch (separate PR; Smythe
  verified residual is acceptable MEDIUM — requires host-side
  programming error, data-integrity-only impact).

Verified post-fix: 31/31 dash-async, 312/312 wallet --features shielded,
108/108 FFI shielded tests pass. cargo fmt --all --check + cargo clippy
--workspace --features shielded --all-targets -- -D warnings clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…, PROJ-001/003/004 + Swift doc

Grumpy-review fan-out (smythe / adams / bilby, opus, parallel) over the
full PR delta (vs v3.1-dev, merge-base 1653b89). Smythe: 2 LOW.
Bilby: 2 MED + 7 LOW. Adams: 1 MED + 5 LOW. No CRITICAL/HIGH. All four
critical concurrency invariants (F1/F2/SEC-001/SEC-002) verified
INTACT. Applied the trivial-to-medium findings here; deferrals tracked.

MEDIUM applied:
- SEC-001 (smythe, core_bridge.rs): cancel-drain budget was bypassable
  by a sustained broadcast `Lagged` (the arm logged but never advanced
  `drained`, so the loop could spin past the cap until the outer
  registry budget fired). Fix: bound on an `attempts` counter that
  ticks regardless of try_recv outcome, so Lagged consumes a budget
  slot like everything else. `drained` keeps its semantics (events
  actually persisted) and is reported on the budget-hit warn for
  observability.
- RUST-001 (bilby, core_bridge.rs:423-439): `is_empty_no_records` was
  byte-for-byte identical to `Merge::is_empty` (changeset.rs:236) with
  a self-contradicting docstring ("skips" a field it actually checks).
  Two copies could drift silently and skip persisting a populated
  CoreChangeSet → quiet data loss. Delete the local copy, call
  `Merge::is_empty` via the trait import.
- PROJ-001 (adams, mod.rs:271-282): `SHUTDOWN_JOIN_TIMEOUT_SECS`
  doc still listed the FFI shielded-stop bridge as a quiesce+join
  consumer — but commit f94fed9 made that bridge cancel-only. Drop
  the stale mention; note the cancel-only status inline.

LOW applied:
- RUST-004 (bilby): add `#[must_use]` to `ShutdownReport` and
  `CoordinatorExitStatus` — silently dropping the UAF-avoidance
  `all_clean()` signal now warns at compile time. tc010 test that
  intentionally discards the panic-result is annotated with `let _ =`.
- RUST-005 (bilby): document `start_thread`'s synchronous reap-spin
  (`std::thread::sleep` up to `reap_backstop`) as async-unsafe; should
  be invoked via `spawn_blocking` or a dedicated thread. Footgun
  prevention for the rs-dapi-client adoption that's already in scope.
- RUST-006 (bilby, core_bridge.rs:296-302): trim the
  `ChainLockProcessed` comment's "earlier signal had a gap" history
  narration; keep the present-state rationale + issue citation.
- RUST-007 (bilby, ffi/manager.rs:383-385): drop the redundant `else`
  after the diverging if-return in the FFI destroy handler.
- PROJ-003 (adams): add a focused test for `cancel_all()` — cancel
  three workers in one call, assert none are running, drain each
  cleanly. Was a public method with no in-repo caller. Test count:
  dash-async 31 → 32.
- PROJ-004 (adams, PlatformWalletManager.swift:155-162): deinit
  comment said "signal both" then "all three discard" — count
  mismatch. Reword to explicitly name the two host-driven loops
  (platform-address + shielded) and note identity-sync + event adapter
  are joined inside destroy.
- CALL-001 (adams, swift example app ShieldedService.swift:537-543):
  `clearLocalState` doc claimed `stopShieldedSync` provides the drain
  barrier; post-refactor that's `clearShielded()`. Doc-only; the impl
  was already correct.

DEFERRED / DISMISSED (with rationale):
- SEC-002 (smythe): T11 RESIDUAL — `shielded_sync_start` racing
  `clear_shielded` reopens the held quiescing gate. Documented as a
  host-discipline precondition + in-source TODO; the per-key clearing
  latch is queued as memcan todo 5d569459 (medium priority). Smythe's
  earlier adversarial pass classified the residual as an acceptable
  MEDIUM (host-side programming error to trigger; data-integrity, not
  UAF).
- RUST-002 / PROJ-002: cancel-drain unit test for the IS-lock-loss
  path. No existing test scaffold for `wallet_event_adapter_loop`; the
  in-source TODO above `drain_buffered_events` already marks the work
  as deferred. Follow-up.
- RUST-003 (dead per-coordinator `pub quiesce()`): retained — they are
  intentional public API for parity with the three coordinator
  surfaces, removing them mid-PR would surprise downstream consumers.
- RUST-008 / CODE-001 (ephemeral IDs `R1`/`R2`/`R3`, `F1`/`F2`,
  `SEC-001`/`SEC-002`, `TC-XXX`, `GAP-XXX`): DISMISSED. These are
  project invariant labels documented in module-level docs and used as
  cross-reference anchors throughout the codebase (R1 = lock
  atomicity, R2 = gate-before-cancel, R3 = kind-dispatch; F1 = UAF
  safety; SEC/TC/GAP = security/test-case/gap invariant IDs). They
  are stable convention, not transient review-thread IDs (those were
  T1-T11 and never appeared in committed code; the one `T7:` leak
  Marvin caught earlier is fixed). Grep for `T[0-9]` review IDs and
  `PR#NNN` patterns returned only durable cross-repo issue citations.
- RUST-009 (over-budget internal comment blocks): bilby's own note
  acknowledged the rationale is justified concurrency invariant prose;
  consolidating into module-level docs is broader cleanup tracked
  under the 0d895258 comment-hygiene sweep.

Verified: cargo fmt --all --check + cargo clippy --workspace --features
shielded --all-targets -- -D warnings clean.
Tests: 32 dash-async (31 → 32, +cancel_all_signals_every_worker),
312 wallet --features shielded, 108 wallet-ffi shielded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Incremental review of 748c4f8..6ed5200. Latest delta adds the server-advertised rate-limit ban path in rs-dapi-client + dashmate gateway wiring, simplifies platform_wallet_manager_shielded_sync_stop to cancel-only, drains buffered wallet events on cancel in event-adapter (T7), and threads orphan_status through ShutdownReport. No new blocking defects found. The prior architectural finding around clear_shielded's continuously-held gate vs. CoordinatorLifecycle::quiesce on the same AtomicBool is STILL VALID but now latent: ShieldedSyncManager::quiesce remains pub on the same atomic, however no FFI export or in-tree Rust caller races it against clear_shielded (verified). Carried forward as suggestion only. Model coverage degraded: Codex (general/rust-quality/ffi-engineer) did NOT run this round due to ACP/native auth stall; review is Claude-only.

Model coverage note: Claude general, Claude rust-quality, and Claude ffi-engineer ran. Codex general/rust-quality/ffi-engineer did not run because ACP-Codex authentication stalled during the authenticate handshake and native Codex CLI returned 401/missing bearer; this review is therefore degraded from the usual Claude+Codex matrix.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-249: Public ShieldedSyncManager::quiesce can still lower clear_shielded's continuously-held gate (carry-forward of prior 748c4f82-1)
  `clear_shielded_inner` (manager/mod.rs:548) holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across the entire drain → liveness-check → store-wipe sequence, expecting the shielded `quiescing` boolean to stay raised so no direct `sync_now`/`sync_wallet` slips past and re-persists into the store being wiped (mod.rs docs 515-524). The hold guard at coordinator_lifecycle.rs:243-249 and the RAII guard inside `CoordinatorLifecycle::quiesce` at line 171 both wrap the SAME `Arc<AtomicBool>`, and `AtomicFlagGuard::drop` unconditionally `store(false, Release)`s. A concurrent invocation of `ShieldedSyncManager::quiesce` (which is still `pub` at shielded_sync.rs:280) would lower the gate mid-Clear, opening exactly the window the docstring promises is closed. 

  Reachability on this head: verified there is no FFI export that triggers this — `platform_wallet_manager_shielded_sync_stop` is cancel-only in this delta, `destroy` goes through the registry shutdown path whose drain hook only stores `true` with no RAII guard (lines 121-131), and `shielded_clear` uses `quiesce_under_held_gate`. The only in-tree caller of `shielded_sync().quiesce()` is the internal test at coordinator_lifecycle.rs:346. So the cross-boundary exploit path is closed by surface; the residual is a fragile in-process Rust contract that future callers can break silently. The module docs at mod.rs:515-524 already acknowledge this as a follow-up (per-key clearing latch / refcount on the registry). Suggested cheap structural fixes: narrow `ShieldedSyncManager::quiesce` to `pub(crate)` (no FFI consumer today), refcount the gate, or force every caller through `quiesce_under_held_gate` after `hold_quiescing_gate`. Non-blocking — flagged so the latent fragility does not regrow once new shielded entry points are added.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs Outdated
lklimek and others added 6 commits June 26, 2026 07:21
…orLifecycle::spawn_periodic_loop (d2e1a335)

The three periodic-sync coordinators (identity_sync, platform_address_sync,
shielded_sync) had byte-identical `start()` loops — `reopen_quiescing_gate`
→ `worker_config` → `Handle::current` → `registry.start_thread` with the
same `biased; cancel-first` select + interval sleep body, plus a
near-identical multi-paragraph comment. Hoist the loop into
`CoordinatorLifecycle::spawn_periodic_loop<F, Fut, I>` so each coordinator
keeps only its domain-specific pass body.

Each coordinator's `start()` now collapses to a 10-line thunk: clone two
`Arc<Self>` (one for `pass`, one for `interval`), call
`self.lifecycle.spawn_periodic_loop(...)`. The `pass` closure clones the
captured Arc once per tick (cheap) and `.await`s the coordinator's
`sync_now` — `Fn() -> Fut` semantics with the inner future free to be
`!Send` since it runs under `Handle::block_on` on the OS-thread worker.

Knock-on cleanup: `CoordinatorLifecycle::registry()` and `worker()` had no
remaining callers after the hoist; removed both.

Tests: 312/312 wallet --features shielded pass (no regressions).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…orExitStatus onto dash_async (18d7acf4)

Both wrapper types were new in this PR, so this is API churn within the
PR, not a released-API break. They duplicated dash_async equivalents:

- delete CoordinatorThreadStatus (enum + is_clean + From<WorkerStatus>);
  callers use dash_async::WorkerStatus directly
- delete CoordinatorExitStatus (struct + all_clean + from_report);
  PlatformWalletManager::shutdown now returns ShutdownReport<WalletWorker>

A never-started worker is now an absent per_worker key rather than a
materialized NotRunning slot; all_clean() is unchanged.

Touchpoints:
- rs-platform-wallet/src/manager/mod.rs: delete both types, shutdown
  returns the report, clear_shielded_inner + tests use WorkerStatus
- rs-platform-wallet/src/manager/coordinator_lifecycle.rs: quiesce /
  quiesce_under_held_gate / cancel_join_and_drain return WorkerStatus
- rs-platform-wallet/src/manager/{identity,platform_address,shielded}_sync.rs:
  quiesce returns dash_async::WorkerStatus
- rs-platform-wallet/src/error.rs: ShieldedShutdownIncomplete.status: WorkerStatus
- rs-platform-wallet/src/lib.rs: re-export dash_async::{ShutdownReport, WorkerStatus}
- rs-platform-wallet-ffi/src/error.rs: test uses platform_wallet::WorkerStatus

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…atorExitStatus onto dash_async (18d7acf4)

`CoordinatorThreadStatus` and `CoordinatorExitStatus` (both NEW in this
PR, never released) were a thin, hand-maintained shadow of
`dash_async::WorkerStatus` and `ShutdownReport<K>` — exhaustive 1:1
`From` mapping for `WorkerStatus`, named-field projection for
`ShutdownReport` — and required a per-PR test of the variant table
that drifted on every status addition (T5's `orphan_status` field).

Delete the wrappers and use the dash-async types directly. The wallet
surface keeps the same end-user-visible verdict (`all_clean()`) and
gains the orphan-status fold-through for free (the registry already
owns it post-T5).

Touchpoints (every wallet/FFI/test reference):
- packages/rs-platform-wallet/src/manager/mod.rs — delete the two
  types + their impls; `shutdown()` returns `ShutdownReport<WalletWorker>`
  directly; `clear_shielded_inner`'s `Err(_elapsed)` arm and Detached
  raise use `WorkerStatus::{Timeout, Detached}`; all 11+ unit-test
  blocks rewritten to assert against `report.per_worker.get(&key)` /
  `report.all_clean()` instead of the named-field projections.
- packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs —
  `quiesce`, `quiesce_under_held_gate`, `cancel_join_and_drain` return
  `WorkerStatus`; drop the now-redundant `.into()` at the `quiesce`
  call site.
- packages/rs-platform-wallet/src/manager/{identity_sync,
  platform_address_sync,shielded_sync}.rs — coordinator `quiesce()`
  methods return `WorkerStatus`.
- packages/rs-platform-wallet/src/error.rs — `ShieldedShutdownIncomplete
  { status }` carries `dash_async::WorkerStatus`.
- packages/rs-platform-wallet/src/lib.rs — drop the wallet-type re-
  exports; re-export `dash_async::{ShutdownReport, WorkerStatus}` so the
  PR-introduced public surface stays single-import.
- packages/rs-platform-wallet-ffi/src/error.rs — `is_clean()` and Debug
  on `WorkerStatus` work unchanged; one stale type reference updated.
- packages/rs-dash-async/src/registry.rs — `WorkerStatus` doc was the
  source of the wallet-CoordinatorThreadStatus cross-reference; rewrite
  it standalone (the wallet type it described no longer exists).

Semantic change worth flagging (preserved end-user behavior):
- `CoordinatorExitStatus` materialised never-started workers as
  `NotRunning` (`shielded_sync: Some(NotRunning)` / `None` for non-
  shielded builds). `ShutdownReport` encodes "never started" as
  key-absent-from-`per_worker`. End-user verdict (`all_clean()`) is
  identical; tests that compared `== NotRunning` now check
  `!report.per_worker.contains_key(&WalletWorker::X)`.

Verified: cargo fmt --all --check + cargo clippy --workspace --features
shielded --all-targets -- -D warnings clean.
Tests: 32 dash-async (incl. 1 doctest), 312 wallet --features shielded,
108 wallet-ffi --features shielded — zero regressions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…5d569459)

After the 8fb49ae9 cancel-only `shielded_sync_stop` closed T11's primary
path, a residual remained: a concurrent `shielded_sync_start()` racing
`clear_shielded` calls `reopen_quiescing_gate` (unconditional store-false
on the shared flag), lowering the gate `clear_shielded_inner` holds
continuously. If the start lands between the clear's liveness check and
the store wipe, the fresh sync loop re-persists into the just-wiped
store — a data-integrity desync (not a UAF — the documented host
serialization precondition has held in practice, but in-code defence is
better than relying on host discipline).

Close it with a per-key clearing latch on `ThreadRegistry`:

- New `clearing: Mutex<BTreeSet<K>>` field on the registry — mirrors
  the existing `closing` latch pattern but per-key and resettable.
- `start_thread`/`start_task` honour the latch under the same slot lock
  they honour `closing` under, so a `(re)start` for a key whose owner
  is mid-clear is refused (no slot installed, no worker spawned).
- New `pub fn hold_clearing(self: &Arc<Self>, key: K) -> ClearingGuard<K>`
  returns an RAII guard; the guard's `Drop` removes the key from the
  set on every exit path, including panic unwinding — callers cannot
  leak the latch.
- `Debug` impl picks up the new `clearing` field.

Wired into `PlatformWalletManager::clear_shielded_inner`: the clear
flow now holds BOTH the registry latch (against full `(re)start`s) AND
the existing quiescing gate (against direct `sync_now`/`sync_wallet`
passes that bypass the registry via `begin_pass`).

Tests:
- dash-async 32 → 35 (+`hold_clearing_blocks_starts_for_latched_key_only`,
  `hold_clearing_nested_guards_share_the_latch`,
  `hold_clearing_releases_on_panic_unwind`).
- wallet `--features shielded` 312 → 313
  (+`shielded_sync_start_no_ops_while_clearing_latch_held` — the wallet-
  level integration test that proves T11 residual is closed: holding
  the latch (mirror of `clear_shielded_inner`'s hold), a concurrent
  `shielded_sync_manager.start()` leaves `is_running()==false`; sibling
  coordinators (per-key latch) start cleanly; latch release restores
  startability).
- wallet-ffi `--features shielded` 108 (unchanged).
- cargo fmt + cargo clippy --workspace --features shielded --all-targets
  -- -D warnings clean.

Closes the deferred 5d569459 follow-up flagged by Smythe's adversarial
review of the T11 primary fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…eadRegistry diff (0d895258)

Apply Cross-Cutting Rules (length cap ≤2-3 internal / ≤5-10 public API
doc; present-state only; two-tier audience) to comments INTRODUCED or
MODIFIED by this PR vs v3.1-dev. Load-bearing concurrency rationale
preserved verbatim — the comments that actually carry teardown safety
information (SeqCst handshake, slots→orphans lock-nesting, F1/F2
invariants, panic = "abort" caveats, must_use justifications, clearing-
latch contract) keep their full prose because removing them would let a
maintainer break the invariant without realising it.

Files swept:
- packages/rs-dash-async/src/registry.rs — trimmed several over-long
  internal blocks (most R1/R2/R3 narration condensed; the load-bearing
  bits kept). T-id review-thread leaks in private test fn names
  removed (t5_* → bare descriptive names; 3 fns in this file).
- packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs —
  trimmed comments around spawn_periodic_loop and the gate-hold guards.
- packages/rs-platform-wallet/src/manager/mod.rs — trimmed comments
  around clear_shielded_inner + the panicked-orphan tests; renamed
  the one remaining t5_-prefixed test fn.

Verification post-sweep:
- cargo fmt --all -- --check clean.
- cargo clippy --workspace --features shielded --all-targets -- -D
  warnings clean.
- dash-async 35 / wallet --features shielded 313 / wallet-ffi
  --features shielded 108 — no regressions.
- grep across rs-dash-async + rs-platform-wallet[-ffi] for ephemeral
  T1-T11 review-thread IDs: zero hits (only project-invariant labels
  R1/R2/R3, F1/F2, SEC-001/SEC-002, TC-XXX, GAP-XXX remain — those
  are cross-referenced design anchors, not transient PR IDs).

Closes the deferred 0d895258 follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…nsafe) for Swift 6 strict concurrency

CI Swift build was failing with:

  error: cannot access property 'persistenceHandler' with a non-sendable
  type 'PlatformWalletPersistenceHandler?' from nonisolated deinit
  (and the same for 'eventHandler')

Swift 6 strict concurrency forbids accessing non-`Sendable` properties
from a `nonisolated deinit`, but our deinit must read both handler
references to `passRetained` them on incomplete-shutdown — leaking the
reference is the load-bearing UAF avoidance for a wedged FFI worker
that may still call back through them.

Opt the two properties out of the check with `nonisolated(unsafe)`. The
access is safe: it only runs on the deinit path of the wrapper class
itself, after the Rust side has already signalled the worker did not
drain cleanly (a documented FFI lifetime contract — see the comments
on the handlers). The handler types themselves remain non-`Sendable`
across the rest of the codebase; this is a targeted escape hatch for
the deinit-only access.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Two blocking issues found at HEAD 284752a. (1) ShieldedSyncManager::start()'s spawn_periodic_loop calls reopen_quiescing_gate() before start_thread's clearing-latch check — a racing FFI shielded_sync_start during clear_shielded silently lowers the clear's continuously-held quiescing gate, partially defeating the new latch. (2) The Swift ShieldedService.clearLocalState catches clearShielded()'s ShieldedShutdownIncomplete error and continues to wipe SwiftData, violating the explicit PlatformWalletResult contract ("must NOT commit its own persistence wipe — the Rust store was left intact"). Prior carry-forward finding (public ShieldedSyncManager::quiesce can lower the held gate) is STILL VALID as a latent suggestion. Several lower-severity issues around the ClearingGuard's set-membership semantics and its test coverage are kept as suggestions/nitpicks.

🔴 2 blocking | 🟡 1 suggestion(s) | 💬 1 nitpick(s)

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:131-161: Refused shielded (re)start still lowers clear_shielded's continuously-held quiescing gate
  `spawn_periodic_loop` calls `self.reopen_quiescing_gate()` at line 137 (which `store(false, Release)`s `quiescing`) before invoking `registry.start_thread(...)` at line 142. The new per-key clearing latch correctly refuses the start inside `start_thread` (registry.rs:414-416), but by then the gate has already been lowered. `clear_shielded_inner` (manager/mod.rs:390-415) holds the gate up via `hold_quiescing_gate()` precisely to keep direct `sync_now`/`sync_wallet` callers blocked across drain → liveness-check → wipe, and a single shared `AtomicFlagGuard` always clears the flag on drop — the clear path has no way to re-raise it. So an FFI `platform_wallet_manager_shielded_sync_start` (exported at rs-platform-wallet-ffi/src/shielded_sync.rs:57) landing during a clear will: (a) call `ShieldedSyncManager::start()` → `spawn_periodic_loop` → `reopen_quiescing_gate()` setting `quiescing=false`, (b) be refused by the clearing latch at the registry, (c) leave the gate down for the remainder of the clear. A direct `sync_now`/`sync_wallet` going through `begin_pass` then observes `quiescing=false` and slips past the barrier, re-persisting into the store the clear is about to wipe — the exact race the new latch was added to close. Fix by checking the clearing latch BEFORE reopening the gate (e.g. early-return from `spawn_periodic_loop` if `registry.is_clearing(worker)`), or ordering the gate reopen so it only fires when the registry actually accepts the start (e.g. inside the closure passed to `start_thread`, or only after a non-rejecting outcome). The same hazard exists structurally for the identity-sync and platform-address coordinators if they ever gain a clearing latch.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-628: Swift clear flow wipes SwiftData rows after Rust reports the shielded store was left intact
  `platform_wallet_manager_shielded_clear` maps `PlatformWalletError::ShieldedShutdownIncomplete` to `ErrorShutdownIncomplete` and returns BEFORE touching the Rust shielded store (rs-platform-wallet-ffi/src/shielded_sync.rs:446-461). The Swift error contract states this explicitly: `PlatformWalletResult.swift:188-193` says on `shutdownIncomplete` "the host ... must NOT commit its own persistence wipe — the Rust store was left intact so it can be retried once the pass settles." But `ShieldedService.clearLocalState` catches the thrown error, logs it via `SDKLogger.error`, and falls through into `modelContext.delete(model: PersistentShieldedNote/OutgoingNote/SyncState/Activity)` at lines 624-627. If Rust refused the clear because a coordinator didn't drain in time, SwiftData rows are deleted while the Rust commitment tree remains populated — the exact cross-language desync the new FFI result code was added to prevent. The legacy `// Best-effort — failure logs but doesn't abort the wipe` comment at lines 606-607 predates this contract and is now wrong. The catch must distinguish `.shutdownIncomplete` from other errors and return early without wiping host rows.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: ClearingGuard uses set-membership semantics — nested or overlapping guards release the latch on the first drop
  `hold_clearing` inserts the key into a `BTreeSet<K>` and `ClearingGuard::drop` removes it unconditionally. With two live `ClearingGuard`s for the same key (nested call by future code, OR two concurrent FFI `platform_wallet_manager_shielded_clear` invocations on the same handle — `HandleStorage::with_item` only holds a read lock), the first guard to drop removes the set entry and the surviving guard's protection lapses silently. Today only `clear_shielded_inner` takes the latch and the test `hold_clearing_nested_guards_share_the_latch` (registry.rs:1857) documents the quirk via comment — but the public `pub fn hold_clearing` signature gives no hint to callers, and a concurrent host-thread clear is reachable through the FFI surface. Convert the backing store to `BTreeMap<K, NonZeroUsize>` so guards refcount/compose, or serialize `clear_shielded` per key at the manager boundary so re-entrant clears can't overlap.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
Comment thread packages/rs-dash-async/src/registry.rs
Comment thread packages/rs-dash-async/src/registry.rs Outdated

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta (97e373b) only adds nonisolated(unsafe) to two Swift handler properties for Swift 6 strict-concurrency compliance — that change is narrow and sound. All five prior findings remain unaddressed at HEAD, including two blocking issues: (1) spawn_periodic_loop unconditionally lowers the shared quiescing gate before the per-key clearing latch can refuse the (re)start, opening a window where a direct sync_now/sync_wallet can persist into the store being cleared, and (2) Swift ShieldedService.clearLocalState wipes SwiftData rows even when Rust returns ErrorShutdownIncomplete — the comment now explicitly codifies 'best-effort, failure logs but doesn't abort the wipe', which is the opposite of the documented FFI contract.

🔴 2 blocking | 🟡 1 suggestion(s)

1 additional finding(s) omitted (not in diff).

2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:131-161: [prior-1, carried forward] `spawn_periodic_loop` lowers the shared quiescing gate before the per-key clearing latch can refuse the (re)start
  `spawn_periodic_loop` calls `self.reopen_quiescing_gate()` at line 137 — an unconditional `quiescing.store(false, Release)` on the `Arc<AtomicBool>` — BEFORE handing off to `registry.start_thread(...)` at line 142, where the per-key `hold_clearing(ShieldedSync)` latch refuses the start. During `clear_shielded_inner`, the clear flow continuously holds both the registry clearing latch and the quiescing gate (via `hold_quiescing_gate`, which is a non-restoring RAII guard). Race: (T1) clear holds latch + gate=true; (T2) host calls `shielded_sync_start()` → `spawn_periodic_loop` → `reopen_quiescing_gate` writes `false` → `start_thread` is refused by the latch, but the gate is already down; (T3) a direct `sync_now`/`sync_wallet` enters `begin_pass`, observes `quiescing == false`, takes the `is_syncing` slot, and runs to completion — persisting into the shielded store inside the clear-then-wipe window. The `quiesce_under_held_gate` debug_assert at line 219-222 catches the inconsistency only in debug builds; release builds proceed silently. Fix: only reopen the gate once `start_thread` confirms acceptance — either consult `lock_clearing` in `spawn_periodic_loop` and skip the reopen on refusal, move the reopen inside the registry's slot critical section after the latch check, or use the registry's accept signal to gate the store.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:196-203: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
  `CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. Today the wallet's `shutdown` routes through `registry.shutdown()` rather than this method, but `ShieldedSyncManager::quiesce` remains `pub` and reachable via the `shielded_sync_arc` accessor — the trap is one accidental call away from re-opening the same lapse Clear was hardened to prevent. The structural fix is to give the gate a refcount/holder-aware re-raise (so concurrent owners compose), or to gate the public `quiesce` against the clearing latch and assert no other holder is active.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:606-628: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
  `PlatformWalletFFIResultCode::ErrorShutdownIncomplete` is the explicit signal that the Rust shielded store was left intact: the FFI doc in `packages/rs-platform-wallet-ffi/src/error.rs` states 'the manager is NOT torn down and the store was left intact (Clear aborted before touching it). The host may retry the clear, and must not commit its own persistence wipe — doing so would desync the host's rows from the still-populated shared tree.' The current implementation at lines 608-616 catches the thrown error, logs it via `SDKLogger.error`, then falls through to step 2 (lines 623-628), unconditionally deleting `PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, and `PersistentShieldedActivity`. The comment at line 606-607 even codifies this: 'Best-effort — failure logs but doesn't abort the wipe.' That is the exact split-brain the FFI contract was designed to prevent: when Rust correctly preserves notes (orphan-alive backstop, drain timeout, non-clean drain status), Swift discards its mirror, and the next `bindShielded` + sync repopulates host rows from the surviving Rust state — the user-visible 'Clear' becomes a no-op on the Rust side while host data is lost. Minimum fix: inspect the thrown error code and short-circuit the SwiftData wipe (surfacing the error to the user / scheduler) on `errorShutdownIncomplete`.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
lklimek and others added 4 commits June 26, 2026 09:13
…g gate when start will be refused (thepastaclaw BLOCKING)

5d569459 added a per-key clearing latch on `ThreadRegistry` and wired
`clear_shielded_inner` to hold it across the whole clear. But
`CoordinatorLifecycle::spawn_periodic_loop` called
`reopen_quiescing_gate()` BEFORE delegating to `start_thread`, where the
latch is checked. So a concurrent `shielded_sync().start()` during a
clear had:

  1. `clear_shielded_inner` holding `_clearing_gate` (quiescing=true)
     and `_clearing_latch` (registry refuses ShieldedSync (re)start).
  2. The start path: `reopen_quiescing_gate()` → quiescing.store(false).
  3. Then `start_thread()`: latch refuses the start, returns no-op.
  4. Gate left FALSE despite `_clearing_gate` being held — a direct
     `sync_now`/`sync_wallet` via `begin_pass` would now observe a
     down gate and slip past the clear's "no new pass" barrier,
     re-persisting into the store the clear is about to wipe.

Fix:

- New `ThreadRegistry::is_clearing(key) -> bool` public method exposes
  the latch state so a coordinator can observe it before any wallet-
  side side-effects (the registry only protects calls that go through
  it; the quiescing flag is wallet-side).
- `spawn_periodic_loop` now checks `self.registry.is_clearing(self.worker)`
  and bails BEFORE `reopen_quiescing_gate` — the start is refused without
  touching the gate, preserving `clear_shielded`'s continuous SEC-002
  hold.

Regression test:
`shielded_start_during_clear_preserves_quiescing_gate` raises both
guards `clear_shielded_inner` raises (latch + gate), calls `start()`,
asserts the start is refused AND the gate stays raised. Pre-fix this
asserts false — non-vacuous. New test-only accessors
`CoordinatorLifecycle::quiescing_load_for_test` and
`ShieldedSyncManager::quiescing_load_for_test` expose the flag read for
the assertion.

Verified: 35 dash-async / 314 platform-wallet --features shielded (was
313, +1 for the regression) / 108 platform-wallet-ffi --features
shielded all pass. cargo fmt --check + cargo clippy --workspace
--features shielded --all-targets -- -D warnings clean.

Addresses thepastaclaw BLOCKING finding (3480216817 + carry-forward
3480287460) on PR #3954.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…own-join

Single-merge approach (stack tip `origin/feat/platform-wallet-secret-protection`)
folds all three open PRs in one shot:

- #3692 feat(platform-wallet): watch-only rehydration from persistor
        (seedless load)
- #3828 refactor(platform-wallet)!: retire in-band pool snapshot;
        hardcode core UTXO account_index=0
- #3953 feat(platform-wallet-storage)!: layered opt-in secret protection
        (Tier-2 password envelope)

Incoming: 271 commits, 93 files, +9752 / -3042 vs merge-base e2039e5.

Conflicts (3 content + 1 semantic-only) resolved by hand. Resolution
policy: preserve our ThreadRegistry / CoordinatorLifecycle / shutdown
invariants (F1, F2, SEC-001, SEC-002); accept their domain logic
(rehydration paths, secret-protection storage, in-band pool retirement);
port their additions onto our refactored types (`ShutdownReport<WalletWorker>`,
`WorkerStatus`, the per-key clearing latch, `spawn_periodic_loop` hoist).

- packages/rs-platform-wallet/src/changeset/core_bridge.rs:
  Kept OUR `CANCEL_DRAIN_BUDGET` and the cancel-drain helper plus its
  doc. Kept THEIR `warn_if_non_default_account` (account_index=0 domain
  guard called 2×) and THEIR `CoreChangeSet` projection tests. Dropped
  THEIR re-added `is_empty_no_records` impl — our prior PR refactored
  the call site to `Merge::is_empty()`, so re-adding the helper would
  be dead code (clippy fail).
- packages/rs-platform-wallet/src/manager/platform_address_sync.rs and
  manager/shielded_sync.rs:
  Took OUR `CoordinatorLifecycle` refactor (lifecycle field +
  `spawn_periodic_loop`) over THEIR pre-refactor `background_cancel` /
  `background_generation` atomics. Their generation-guard regression
  test auto-merged and now validates OUR per-key clearing latch — ported
  its doc prose off the deleted `background_cancel` field name.
- Semantic-only (no marker): the auto-merged shielded_sync.rs
  `make_manager()` test helper called the pre-refactor 2-arg
  `ShieldedSyncManager::new(event_manager, coordinator_slot)`. Our
  refactor added a 3rd `registry` param; ported it forward
  (`ThreadRegistry::new()` added as arg #3, mirroring the sibling
  address-sync test helper).
- packages/rs-platform-wallet-ffi/src/core_wallet_types.rs:
  Ported one dangling `is_empty_no_records` comment ref → `is_empty`.

No conflict markers remain anywhere; no dangling refs to
`CoordinatorThreadStatus`, `CoordinatorExitStatus`, `background_cancel`,
or `background_generation` in Rust source.

Verified post-merge:
- cargo fmt --all --check clean.
- cargo clippy --workspace --features shielded --all-targets
  -- -D warnings clean (subsumes `cargo check` for the same feature
  surface — clippy is a strict superset).
- dash-async lib: 35 passed (unchanged).
- platform-wallet lib default: 217 passed (was 212; +5 from rehydration
  test additions).
- platform-wallet lib --features shielded: 319 passed (was 313 pre-merge;
  +6 from the stack's test additions).
- platform-wallet-ffi --features shielded: 108 passed (unchanged).
- Zero test regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
…let-shutdown-join

Pulls in the integration merge prepared in worktree branch
`merge/pr-stack-3692-3828-3953` (commit 9b3d6f8 — see that commit's
body for the full conflict-resolution table and verification).

Sequencing: this merge lands AFTER 432309b (the thepastaclaw
BLOCKING gate-lowering fix on spawn_periodic_loop), so the stack's
post-merge state has BOTH the per-key clearing latch invariants from
the gate-fix AND the rehydration/secret-protection domain code.
…lic quiesce against in-flight clear (thepastaclaw 3480216821/3480287466)

Two remaining thepastaclaw findings on the 5d569459 per-key clearing
latch addressed together.

## (1) ClearingGuard set-membership → refcount

Backing store changes from `Mutex<BTreeSet<K>>` to
`Mutex<BTreeMap<K, NonZeroUsize>>`. `hold_clearing` now increments a
per-key holder count (or inserts `NonZeroUsize::new(1)`); the guard's
`Drop` decrements and removes the entry only when the count reaches
zero.

Why: nested or concurrent holders for the same key previously released
the latch on the FIRST drop (set semantics — `insert` of an existing
key was a no-op, `remove` unconditionally removed). Two concurrent
`shielded_clear` invocations on the same handle (reachable through
`HandleStorage::with_item`'s read lock) would silently lapse one
clear's protection when the other's guard dropped. Refcount composes
the holders so the latch stays raised until the LAST guard drops.

The previous test `hold_clearing_nested_guards_share_the_latch`
documented the lapse via comment but never asserted the lapse —
thepastaclaw correctly flagged that it would pass under any
drop-only implementation. Renamed to
`hold_clearing_inner_drop_does_not_lapse_outer_protection` and rewritten
to assert non-vacuously: with both guards alive a start is refused;
after `drop(inner)` while `outer` is still alive, `is_clearing` MUST
still return true AND a fresh `start_clean` MUST still be refused; only
after `drop(outer)` does the latch fully release and a start succeed.
Pre-refactor this test fails on both the `is_clearing` assertion and
the post-inner-drop `is_running` assertion.

`Drop` guards against an underflow (debug_assert) if the entry is
missing — defensive against out-of-band removal that no current call
site does, but cheap to encode.

## (2) Public `quiesce()` gated against the clearing latch

`CoordinatorLifecycle::quiesce` constructs a local `AtomicFlagGuard` on
the same `quiescing` atomic that `clear_shielded_inner` holds raised
continuously via `hold_quiescing_gate`. The guard's `Drop` would
unconditionally lower the gate — lapsing the clear's "no new pass"
barrier and letting a direct `sync_now`/`sync_wallet` slip past
`begin_pass` to re-persist into the wiping store.

Mirror the `spawn_periodic_loop` fix from 432309b: bail at the
start of `quiesce` if `registry.is_clearing(self.worker)` returns
true. Return `WorkerStatus::NotRunning` — the in-flight clear is
already cancel-joining the worker under its own gate, so there is no
new work for `quiesce` to do. Today only `ShieldedSyncManager::quiesce`
is reachable via `shielded_sync_arc`, but the carry-forward concern is
that the trap is one accidental host call away.

Regression test `shielded_quiesce_during_clear_preserves_quiescing_gate`
raises both guards `clear_shielded_inner` raises (latch + gate), calls
`shielded_sync_manager.quiesce()`, asserts the call returns
`NotRunning` AND the gate stays raised. Pre-fix this fails on the gate
assertion.

Verified post-fix:
- cargo fmt --all --check clean.
- cargo clippy --workspace --features shielded --all-targets
  -- -D warnings clean.
- dash-async lib: 35 passed (renamed test stays in place).
- platform-wallet --features shielded: 321 passed (was 320; +1 quiesce
  regression).
- platform-wallet --lib default: 217 passed (unchanged).

Closes the deferred thepastaclaw 3480216821 (refcount suggestion),
3480216826 (test-doesn't-exercise nitpick — the renamed test now does),
and 3480287466 (prior-3 carry-forward, public quiesce vs. held gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried forward from prior review (still valid at bcf9388): prior-2 (blocking — Swift clearLocalState still wipes SwiftData on ErrorShutdownIncomplete), prior-3 (suggestion — public quiesce() drops AtomicFlagGuard on the shared atomic), prior-4 (suggestion — ClearingGuard uses set membership), prior-5 (nitpick — nested-guards test doesn't exercise the documented lapse). prior-1 is partially addressed: the new is_clearing short-circuit + SEC-002 regression test close the common race, but the check and the gate-write are still two distinct critical sections, leaving a narrow residual TOCTOU window — carried forward as a suggestion. No new defects identified in the latest delta (rs-platform-wallet-storage / secrets / keyless rehydration look careful: fail-closed topology check, transactional rollback, per-row skip taxonomy, structured FFI outcome).

🔴 1 blocking | 🟡 3 suggestion(s) | 💬 1 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
  The Rust FFI contract for `ErrorShutdownIncomplete` on `platform_wallet_manager_shielded_clear` (rs-platform-wallet-ffi/src/error.rs around the `ShieldedShutdownIncomplete → ErrorShutdownIncomplete` mapping, plus shielded_sync.rs:454) is explicit: when the manager surfaces this code, the Rust shielded store was left intact because Clear aborted before touching it (drain non-clean / orphan-alive backstop). The host must retry the clear and must NOT commit its own persistence wipe; doing so desyncs host rows from the still-populated shared tree. ShieldedService.swift:608-616 catches the throw, logs via `SDKLogger.error`, then falls through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the exact split-brain the FFI contract was designed to prevent. The Swift surface already distinguishes this code: `PlatformWalletResultCode.errorShutdownIncomplete` exists (PlatformWalletResult.swift:47) and decodes into `PlatformWalletError.shutdownIncomplete(detail)` (line 240), so the host can branch on it. Minimum fix: on the typed `shutdownIncomplete` case (or any thrown error from `clearShielded()`, since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user instead of proceeding.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:137-148: [prior-1, partial fix landed] Residual TOCTOU between `is_clearing` check and `reopen_quiescing_gate`
  The new `is_clearing(self.worker)` short-circuit at line 145 closes the common race (clear is already holding the latch when spawn arrives), and the SEC-002 regression test in mod.rs:1071-1108 covers that ordering. But the check (line 145) and the gate-lower (line 148) are not done under a single critical section: `is_clearing` takes `lock_clearing`, returns the boolean, releases the lock; then `reopen_quiescing_gate` writes `false` to the shared atomic. If `clear_shielded_inner` acquires the latch AND raises the quiescing gate between those two points, the reopen at 148 lowers a gate Clear believes it holds continuously across drain → liveness → wipe. In that window a direct `sync_now`/`sync_wallet` going through `begin_pass` (lines 306-341) observes `quiescing == false`, claims the `is_syncing` slot, runs the pass to completion, and persists into the store about to be wiped. `quiesce_under_held_gate`'s `debug_assert!` (line 230) catches the inconsistency only in debug builds; release silently proceeds with the drain. The window is microseconds-narrow, but the design invariant is continuous gate-hold. Structural fix: have the registry expose an atomic observe-and-act under one `lock_clearing()` scope (e.g. `with_clearing_check`) so the gate write is sequenced with the latch read, or move the gate write inside the registry slot critical section alongside the latch check.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:207-214: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
  `CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. `ShieldedSyncManager::quiesce` remains `pub` and reachable through the manager's shielded accessor — one accidental call away from re-opening the same lapse Clear was hardened to prevent. Structural fix: refcount/owner-aware re-raise (so concurrent owners compose), or gate the public `quiesce` against the clearing latch and assert no other holder is active before its guard drop is allowed to fire.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: [prior-4, carried forward] `ClearingGuard` uses set-membership — overlapping guards release the latch on the first drop
  `hold_clearing` does `lock_clearing().insert(key)` into a `BTreeSet<K>`, and `ClearingGuard::Drop` (lines 1099-1103) unconditionally removes the key. With two live guards for the same key (a nested or overlapping clear from a future caller, a test, or an accidental shielded re-clear before the first guard drops), the first drop removes the key while the second guard is still alive — silently lapsing the latch protection the still-live guard is supposed to provide. The new `is_clearing()` check (line 620) inherits the same accounting weakness because it reads the same set membership. The test at lines 1859-1883 acknowledges this in its comment but does not fix the type-level mismatch between the RAII guard's compositional appearance and its set-based semantics. Minimum fix: refcount per key (`BTreeMap<K, NonZeroUsize>`), or add a `debug_assert!` that the insert was novel so accidental nesting fails loudly in debug.

Note: GitHub refused to serve this PR's diff via gh pr diff because it exceeds the 20,000-line limit, so this exact-SHA review is posted body-only. Existing inline threads are reused rather than duplicated.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta resolves prior-4 and prior-5: ClearingGuard now refcounts via BTreeMap<K, NonZeroUsize> so nested/concurrent guards compose, and the new test hold_clearing_inner_drop_does_not_lapse_outer_protection is non-vacuous and asserts the lapse window. prior-2 (Swift ShieldedService wipe-on-error) remains unchanged and blocking. prior-3 is partially addressed by the new is_clearing bail in public quiesce(), but the same check-then-act TOCTOU pattern from prior-1 now exists at both spawn_periodic_loop (lines 145-148) and quiesce (lines 217-224); collapsed into a single structural finding.

🔴 1 blocking | 🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the FFI contract
  Unchanged at 6b655aab. The Rust FFI contract for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain/orphan-alive backstop). The host MUST NOT commit its own persistence wipe in that case, or SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is designed to prevent.

  Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it. Minimum fix: on any thrown error from `clearShielded()` (since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1 + residual prior-3] Check-then-act TOCTOU around the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce` call sites
  Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.

  1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).

  2. `quiesce` (217-224): the new fix to bail when `is_clearing` is true closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose `Drop` (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.

  Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.

Note: Posted body-only because normal inline poster failed: command failed 1: python3 scripts/review_poster.py dashpay/platform 3954 6b655aa --dry-run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants