fix(packaged): reclaim stale namespace sidecars before launch (#4441) #4531

Open
YOMXXX wants to merge 1 commit into nexu-io:main from YOMXXX:fix/issue-4441-stale-sidecar-reclaim

Conversation

@YOMXXX YOMXXX commented Jun 18, 2026

Collaborator

Fixes #4441

Why

I hit this while exercising packaged Nightly upgrade/reinstall flows: the app appeared to crash on launch, and the desktop log showed only a generic "timed out waiting for sidecar status" error against web.sock. The real cause was a web sidecar from the previous build (0.10.1) that was still alive and still bound to /tmp/open-design/ipc/release-nightly/web.sock.

The pain: the existing stale-socket cleanup (prepareIpcPath/staleUnixSocketExists) only handles a socket left behind by a dead process — it probes with a connection and, when a leftover process is still alive and answers, judges the socket "healthy" and never unlinks it. The new sidecar then dies with EADDRINUSE, and because the web waitForStatus call passed no child watch, it sat through the full 35s timeout and reported the wrong process. This is the P1/risk-high startup robustness gap in #4441: a live older-version sidecar is semantically stale for the namespace but was never reclaimed.
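To make the gap concrete, here is a minimal sketch of how a connect-only cleanup decides what to do with a socket path. `connectOnlyCleanup`, `ProbeOutcome`, and `SocketAction` are illustrative names, not the actual `prepareIpcPath`/`staleUnixSocketExists` code; the only facts relied on are the error codes a Unix-socket connect attempt reports.

```typescript
// Outcome of a connect() attempt against the socket path.
// `errorCode` is undefined when a listener accepted the connection.
// (Hypothetical shape, for illustration only.)
type ProbeOutcome = { errorCode?: string };

type SocketAction = "nothing-to-clean" | "unlink" | "keep";

// Connect-only cleanup: the socket file is removed only when nobody answers.
function connectOnlyCleanup(outcome: ProbeOutcome): SocketAction {
  if (outcome.errorCode === undefined) {
    // A process answered. The probe cannot tell whether it belongs to the
    // current runtime or to an older build, so the socket looks "healthy".
    return "keep";
  }
  if (outcome.errorCode === "ENOENT") {
    return "nothing-to-clean"; // no socket file at this path
  }
  if (outcome.errorCode === "ECONNREFUSED") {
    return "unlink"; // socket file exists but its owner is gone
  }
  return "keep"; // unknown failure: stay conservative
}

// Dead owner: cleaned up, the new sidecar can bind.
console.log(connectOnlyCleanup({ errorCode: "ECONNREFUSED" }));
// Live leftover from an older build: kept, so the new sidecar hits EADDRINUSE.
console.log(connectOnlyCleanup({}));
```

Under these assumptions the live-leftover case can only be resolved by stopping the old process first, after which the same probe sees a dead owner and unlinks the socket.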

What users will see

Upgrading or reinstalling a packaged build no longer fails to open when an older runtime left a sidecar running. The launcher now reclaims the namespace on startup (stops the leftover daemon/web sidecars and clears their sockets) before spawning fresh ones. If a sidecar still can't start, the desktop log now names the failing process and points at its log instead of a generic 35s timeout.
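The fast-fail diagnostic can be pictured as a race between the status report, the child's exit, and the timeout. The sketch below is an assumption about the general shape, not the real `waitForStatus` signature: `WaitOptions`, `ChildExit`, `describeEarlyExit`, and the log path are hypothetical.

```typescript
type AppLabel = "daemon" | "web";

interface ChildExit {
  code: number | null;
  signal: string | null;
}

// Hypothetical options; the real function in apps/packaged may differ.
interface WaitOptions {
  app: AppLabel;
  logPath: string;
  timeoutMs: number;
  status: Promise<unknown>; // settles when the sidecar reports status
  childExit?: Promise<ChildExit>; // settles if the child exits first
}

// Build a message that names the failing process and its log.
function describeEarlyExit(app: AppLabel, exit: ChildExit, logPath: string): string {
  const how = exit.signal !== null ? `signal ${exit.signal}` : `code ${exit.code}`;
  return `${app} sidecar exited (${how}) before reporting status; see ${logPath}`;
}

function waitForStatus(options: WaitOptions): Promise<unknown> {
  const { app, logPath, timeoutMs, status, childExit } = options;
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out waiting for ${app} sidecar status after ${timeoutMs}ms`)),
      timeoutMs,
    );
    // Whichever event arrives first clears the timer and settles the promise.
    const settle = (finish: () => void) => {
      clearTimeout(timer);
      finish();
    };
    status.then(
      (value) => settle(() => resolve(value)),
      (error) => settle(() => reject(error)),
    );
    // Without a child watch this branch is absent and only the timeout can fire.
    childExit?.then((exit) =>
      settle(() => reject(new Error(describeEarlyExit(app, exit, logPath)))),
    );
  });
}
```

With a child watch supplied, a web sidecar that dies on EADDRINUSE rejects as soon as its exit is observed, and the message carries the `web` label rather than a default one.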

Surface area

  • None — internal packaged-launcher startup robustness fix; no UI/CLI/API/i18n surface. (It does change startup behavior in that a leftover same-namespace sidecar is now stopped — but that is the fix for the crash, not an opt-in behavior, and a leftover desktop process is still left to the single-instance lock.)

Screenshots

N/A — no UI.

Bug fix verification

  • Test path that reproduces the bug: apps/packaged/tests/sidecars.test.ts
    • reclaimStaleNamespaceSidecars > stops leftover same-namespace daemon and web sidecars and clears their sockets (+ different-namespace and self-tree-exclusion cases)
    • waitForStatus app label > names the web sidecar when its child exits before reporting status
  • Did the test go red on main and green on this branch? Yes. On main the tests fail with "reclaimStaleNamespaceSidecars is not a function" and with the exit error naming daemon instead of web; both are green after the fix.
  • End-to-end "packaged app no longer crashes" needs a packaged build to confirm by eye. To reproduce the live-leftover state from the issue: start an older packaged runtime, leave its web sidecar bound to /tmp/open-design/ipc/<namespace>/web.sock, then launch the new build — it should reclaim and start instead of timing out.

Validation

…o#4441)

An older packaged runtime can leave a still-alive daemon or web sidecar
bound to /tmp/open-design/ipc/<namespace>/<app>.sock. Because the leftover
process answers the new sidecar's stale-socket probe, the probe treats the
socket as healthy and never unlinks it, so the new sidecar dies with
EADDRINUSE while the launcher times out on a generic status wait that names
the wrong process.

- Add reclaimStaleNamespaceSidecars(): before spawning, stop leftover
  same-namespace daemon/web sidecars (excluding the current process tree)
  and clear their IPC sockets, so prepareIpcPath can then unlink the
  now-dead socket cleanly. Reuses the platform stamp-discovery primitives.
- Parametrize waitForStatus with an app label and watch the web child, so
  an EADDRINUSE web sidecar surfaces its real failure and log path
  immediately instead of a 35s timeout labelled "daemon".
- Cover both behaviours with packaged unit tests.
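
The selection rule in the first bullet can be sketched as a pure function. This is an illustration under stated assumptions, not the code in apps/packaged/src/sidecars.ts: `SidecarProcess`, `collectProcessTree`, and `selectReclaimTargets` are hypothetical names, the process list stands in for whatever the stamp-discovery primitives return, and "current process tree" is assumed to mean the launcher plus its descendants.

```typescript
// Hypothetical record for a discovered process.
interface SidecarProcess {
  pid: number;
  ppid: number;
  namespace: string;
  app: "daemon" | "web" | "desktop";
}

// The root pid plus every descendant reachable through ppid links.
function collectProcessTree(rootPid: number, processes: SidecarProcess[]): Set<number> {
  const tree = new Set<number>([rootPid]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const proc of processes) {
      if (tree.has(proc.ppid) && !tree.has(proc.pid)) {
        tree.add(proc.pid);
        grew = true;
      }
    }
  }
  return tree;
}

// Leftover daemon/web sidecars in this namespace, outside our own tree.
function selectReclaimTargets(
  processes: SidecarProcess[],
  namespace: string,
  selfPid: number,
): SidecarProcess[] {
  const ownTree = collectProcessTree(selfPid, processes);
  return processes.filter(
    (proc) =>
      proc.namespace === namespace &&
      (proc.app === "daemon" || proc.app === "web") &&
      !ownTree.has(proc.pid),
  );
}

// Example data (made up): the launcher is pid 400.
const discovered: SidecarProcess[] = [
  { pid: 100, ppid: 1, namespace: "release-nightly", app: "daemon" }, // leftover
  { pid: 101, ppid: 1, namespace: "release-nightly", app: "web" }, // leftover
  { pid: 200, ppid: 1, namespace: "other-namespace", app: "web" }, // different namespace
  { pid: 300, ppid: 1, namespace: "release-nightly", app: "desktop" }, // left to the single-instance lock
  { pid: 401, ppid: 400, namespace: "release-nightly", app: "daemon" }, // our own child
];

console.log(selectReclaimTargets(discovered, "release-nightly", 400).map((p) => p.pid));
// prints [ 100, 101 ]
```

Keeping the selection separate from the actual stop-and-unlink step makes the namespace-isolation and self-tree-exclusion cases easy to unit test without spawning processes.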
@YOMXXX YOMXXX requested a review from a team as a code owner June 18, 2026 13:17
@lefarcen

Contributor

Hi @YOMXXX! Thanks for digging into this — the write-up around the stale live-sidecar vs dead-socket gap made the startup failure mode very easy to follow.

I've queued the PR through the normal reviewer path and linked it back to #4441 so the bug trail stays in one place.

💡 To drive this PR to merge hands-free, paste this to your AI coding agent (Claude Code / Codex / opencode / Cursor …):
Take over nexu-io/open-design#4531 until it merges — read https://raw.githubusercontent.com/nexu-io/looper/main/skills/pr-takeover/SKILL.md and follow it.

@lefarcen lefarcen requested a review from PerishCode June 18, 2026 13:20
@lefarcen lefarcen added the size/M, risk/high, and type/bugfix labels Jun 18, 2026

@PerishCode PerishCode left a comment

Contributor

@YOMXXX I reviewed the changed ranges in apps/packaged/src/sidecars.ts and apps/packaged/tests/sidecars.test.ts. The reclaim path stays on the shared sidecar/platform primitives, scopes termination to same-namespace packaged daemon/web sidecars while protecting the current process tree, and the new tests cover the stale sidecar reclaim, namespace isolation, self-tree exclusion, and web fast-fail diagnostic behavior.

I could not rerun the local package commands in this prepared checkout because node_modules is missing (vitest, tsc, and tsx were unavailable), but the live PR checks show the workspace/static gates and unit-test job passing for this head. Thanks for tightening this startup failure mode and making the diagnostics more direct.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

@lefarcen lefarcen added this pull request to the merge queue Jun 19, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 19, 2026

Labels

  • risk/high: High risk (apps/desktop, daemon, auth, migration, workflows, package deps)
  • size/M: PR changes 100-300 lines
  • type/bugfix: Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nightly can crash or exit on launch when stale sidecars or IPC sockets remain

3 participants