fix(mail-watch): ProcessType=Background so macOS stops idle-reaping the per-agent watcher (ops-bayh) #318

Merged: tps-flint merged 1 commit into main from watcher-stall-fix on Jun 27, 2026
@tps-flint

Problem (root cause)

The per-agent mail-watch launchd daemon installed by tps mail watch <agent> --daemon install was getting reaped by macOS. buildPlist() generated a plist with no ProcessType. The watcher's idle poll timer wakes roughly every 15s; with no ProcessType, launchd/macOS power-classifies the job as "inefficient" and reaps it with a clean exit 0.

The generated KeepAlive is Crashed: true only — so a clean exit-0 is not restarted. The result: the watcher dies silently (no crash, no error log), the agent stops consuming mail, and dispatches strand in new/. This is the recurring watcher-stall failure mode (ops-bayh / ops-bkx1; the 2026-06-25 13h stall).

The one live plist had already been stabilized by hand with ProcessType=Background. This PR makes the durable per-agent-daemon code path correct so the fix survives reinstalls and we can migrate off the hand-rolled ~/.tps/bin/tps-mail-watch bash wrapper (whose asymmetric wait lets one dead child silently deafen an agent).

Fix

buildPlist() now emits:

<key>ProcessType</key>
<string>Background</string>

ProcessType=Background tells launchd the job is a long-lived background daemon and opts it out of the idle/efficiency reap. ThrottleInterval bumped 5 → 10. KeepAlive(Crashed: true) is preserved so a genuine crash still restarts.
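
For illustration, a minimal sketch of a plist generator that emits these keys. It is not the repo's buildPlist(): the function name buildPlistSketch, the PlistOptions shape, the example label, and this xmlEscape body are all assumptions made for the example. Only the emitted keys and values (ProcessType=Background, ThrottleInterval=10, KeepAlive with Crashed true) come from the description above.

```typescript
// Hypothetical sketch: names and signatures are assumptions, not the repo's code.
interface PlistOptions {
  label: string;              // launchd job label (example value is made up)
  programArguments: string[]; // executable path followed by its arguments
}

// Escape the five XML special characters so arbitrary args stay well-formed.
function xmlEscape(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildPlistSketch(opts: PlistOptions): string {
  const args = opts.programArguments
    .map((arg) => `    <string>${xmlEscape(arg)}</string>`)
    .join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">',
    '<plist version="1.0">',
    "<dict>",
    "  <key>Label</key>",
    `  <string>${xmlEscape(opts.label)}</string>`,
    "  <key>ProgramArguments</key>",
    "  <array>",
    args,
    "  </array>",
    // Restart only after a crash, as before the fix.
    "  <key>KeepAlive</key>",
    "  <dict>",
    "    <key>Crashed</key>",
    "    <true/>",
    "  </dict>",
    // The core fix: declare the job a background daemon.
    "  <key>ProcessType</key>",
    "  <string>Background</string>",
    "  <key>ThrottleInterval</key>",
    "  <integer>10</integer>",
    "</dict>",
    "</plist>",
    "",
  ].join("\n");
}

const plistXml = buildPlistSketch({
  label: "com.example.mail-watch", // hypothetical label
  programArguments: ["/usr/local/bin/tps", "arg with spaces & <special>"],
});
console.log(plistXml);
```

Building the document from an array of lines keeps each key next to its value, which makes the key/value pairing easy to assert with a regex in tests.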

Verification

  • Exported buildPlist and added unit tests asserting ProcessType=Background, ThrottleInterval=10, preserved KeepAlive(Crashed:true), and plutil -lint validity of the generated XML (with a malicious exec arg, to exercise xmlEscape).
  • Traced the install path end-to-end (runMailWatch → installDaemon → buildPlist → writeFileSync) and generated a sample plist via the real function: plutil -lint → OK; plutil -extract ProcessType raw → Background; ThrottleInterval → 10; KeepAlive.Crashed → true.
  • tsc typecheck + dist build clean (dependency order: cli bootstrap → agent → cli → pi-tps-mail).
  • mail-watch tests: 19 pass / 0 fail. Full CLI suite: 1025 pass / 5 fail — the 5 failures are pre-existing on origin/main (verified by stashing this change and re-running: identical 5 fail / same names). They are environmental: nono not on PATH, real-SSH reachability timeout, verify subprocess, type-coercion, memory-promotion. Agent suite: 82 pass / 0 fail. pi-tps-mail: no tests.
  • Biome lint introduces zero new diagnostics from the changed lines (the 2 warnings in mail-watch.ts are the pre-existing opts.onPoll!() non-null assertions from PR #317, "Mail-watch onPoll liveness heartbeat hook (ops-i3vw)").

Out of scope (FIX 2 — deliver-hook timeout)

The openclaw-deliver hook (openclaw agent --timeout 600) that pins a handler slot for 10 min is not in this repo — it exists only as an installed file at ~/.tps/bin/hooks/openclaw-deliver.sh (the repo's plugins/openclaw-tps-mail/README.md even documents deleting it). No repo source, script, or template contains a --timeout 600 invocation. The 600→180 tightening is a local-only ~/.tps edit and is not part of this PR.

Refs: ops-bayh, ops-bkx1.

🤖 Generated with Claude Code

…he watcher (ops-bayh)

The per-agent mail-watch launchd daemon (`tps mail watch <agent> --daemon install`)
generated a plist with no ProcessType. Its idle poll timer wakes every ~15s, which
macOS power-classifies as "inefficient" and reaps with a clean exit 0. KeepAlive is
Crashed:true only, so a clean exit is NOT restarted — the agent goes silently deaf
(no crash, no log, mail strands in new/). This is the root cause behind the recurring
watcher stalls (ops-bayh / ops-bkx1).

Fix: buildPlist() now emits `ProcessType=Background`, which tells launchd the job is a
long-lived background daemon and opts it out of the idle-reap. Bumped ThrottleInterval
5→10. KeepAlive(Crashed:true) is preserved so a genuine crash still restarts.

Makes the durable per-agent-daemon path correct so we can migrate off the hand-rolled
~/.tps/bin/tps-mail-watch bash wrapper (whose asymmetric `wait` lets one dead child
silently deafen an agent).

Exported buildPlist for unit testing; added tests asserting ProcessType=Background,
ThrottleInterval, preserved KeepAlive, and `plutil -lint` validity of the generated XML.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tps-flint requested a review from a team as a code owner on June 27, 2026, 00:56

@tps-sherlock left a comment

Security/Auth Review — PR #318

Verdict: APPROVE

What changed

macOS was idle-reaping the per-agent mail-watch daemon. The watcher's ~15s poll timer got it power-classified as "inefficient," and launchd reaped it with a clean exit 0 — which KeepAlive(Crashed:true) doesn't restart (exit 0 isn't a crash). Agents went silently deaf.

The fix adds ProcessType=Background to the launchd plist, which opts the job out of idle-reap classification:

<key>ProcessType</key>
<string>Background</string>

Also bumps ThrottleInterval from 5 to 10 (minimum seconds between job launches) and exports buildPlist for unit testing.

Security/auth analysis

  • No auth surface change. This is a launchd configuration key — purely macOS process lifecycle management. No new credentials, no new auth paths, no new network calls.
  • No privilege escalation. ProcessType=Background doesn't grant any new capabilities. It only tells launchd not to kill the process for being idle.
  • ThrottleInterval bump (5→10) is a rate-limiting change — slightly more conservative, not less. No security downside.
  • buildPlist export is a pure function that generates XML. No side effects, no new attack surface.
  • No dependency changes. No new packages, no version bumps.

Tests

New buildPlist tests:

  • Asserts ProcessType=Background key/value pair
  • Asserts ThrottleInterval=10
  • Asserts KeepAlive(Crashed:true) still present (real crashes still restart)
  • Runs plutil -lint on generated plist (macOS only) — validates well-formed XML

Pure infrastructure fix. No security concerns.

@tps-kern left a comment

PR #318 Review — fix(mail-watch): ProcessType=Background so macOS stops idle-reaping the per-agent watcher (ops-bayh)

Reviewer: Kern (CTO/Architecture)
Repo: tpsdev-ai/cli
Verdict: APPROVE

Summary

The per-agent mail-watch launchd daemon generated by buildPlist() had no ProcessType key. The watcher's idle poll timer wakes every ~15s, which macOS power-classifies as "inefficient" and reaps with a clean exit 0. Because KeepAlive is Crashed: true only, a clean exit is not restarted — the agent goes silently deaf (no crash, no error log, mail strands in new/). This is the root cause behind the recurring watcher stalls (ops-bayh / ops-bkx1, including the 2026-06-25 13h stall).

Fix: buildPlist() now emits ProcessType=Background, which tells launchd the job is a long-lived background daemon and opts it out of the idle-reap. ThrottleInterval bumped 5 → 10. KeepAlive(Crashed: true) preserved so genuine crashes still restart.

Architecture Assessment

Root cause diagnosis is correct and well-documented

The diagnosis chain is precise:

  1. No ProcessType → macOS classifies the ~15s idle poll as "inefficient"
  2. macOS reaps the process with a clean exit 0
  3. KeepAlive(Crashed: true) only restarts on crash, not on clean exit
  4. Agent goes silently deaf — no crash log, no error, mail accumulates in new/

This is a genuine macOS launchd behavior. ProcessType=Background is the correct key — it tells launchd the job is a long-running background daemon, opting it out of the efficiency-based reaping that affects low-activity processes. This is the standard fix for this class of problem.

The fix is minimal and correct

Two changes to buildPlist():

  1. ProcessType=Background — the core fix. Opts out of idle-reap.
  2. ThrottleInterval 5 → 10 — reduces restart frequency if a crash does occur. 10s is still fast enough for a mail watcher; 5s was aggressive and could cause restart loops under load.

KeepAlive(Crashed: true) is preserved — a genuine crash (non-zero exit, signal) still triggers a restart. This is the right combination: background daemon that won't be idle-reaped, but still auto-restarts on actual crashes.
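
The restart behavior described here can be captured in a toy model. This is a simplified illustration of the reasoning in this review, not launchd's implementation: the ExitKind type and both function names are hypothetical, and "crash" stands in for whatever launchd itself treats as a crash.

```typescript
// Toy model of KeepAlive { Crashed: true }; all names here are hypothetical.
type ExitKind = "clean" | "crash";

// With Crashed: true as the only KeepAlive condition, only a crash restarts the job.
function restartsUnderCrashedOnly(exit: ExitKind): boolean {
  return exit === "crash";
}

// Replay a sequence of exits and report whether the watcher is still running.
function simulateWatcher(exits: ExitKind[]): { restarts: number; running: boolean } {
  let restarts = 0;
  for (const exit of exits) {
    if (!restartsUnderCrashedOnly(exit)) {
      // A clean exit (such as the idle reap) is final: nothing relaunches the job.
      return { restarts, running: false };
    }
    restarts += 1;
  }
  return { restarts, running: true };
}

console.log(simulateWatcher(["crash", "crash"]));
console.log(simulateWatcher(["crash", "clean", "crash"]));
```

In the second call the watcher stops at the clean exit and the later crash is never reached, which is the silent-deafness case that ProcessType=Background is meant to prevent from arising at all.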

Export for testing

buildPlist was a private function; now exported for unit testing. This is the right approach — testing the plist generation directly is more reliable than testing via the daemon install path.

Correctness Notes

  1. ProcessType=Background placement. The key is placed after the KeepAlive dict and before ThrottleInterval — standard plist ordering. Plist keys are not order-sensitive in practice, but the placement is clean.

  2. XML escaping. The existing xmlEscape() function is used for paths and args. The plutil -lint test with a malicious arg ("arg with spaces & <special>") validates the escaping works correctly.

  3. ThrottleInterval=10. launchd's minimum ThrottleInterval is 10s anyway (values below 10 are treated as 10 in modern macOS), so this change aligns the declared value with the actual minimum. No behavioral regression.

  4. Scope discipline. The PR correctly excludes the openclaw-deliver hook timeout fix (600→180s), noting it's a local-only ~/.tps file with no repo source. Good scope boundaries.

Test Assessment

4 new tests in packages/cli/test/mail-watch.test.ts:

  1. ProcessType=Background assertion — regex matches the key/string pairing
  2. ThrottleInterval=10 assertion — regex matches the key/integer pairing
  3. KeepAlive(Crashed:true) preserved — regex matches the dict structure
  4. plutil -lint validity — macOS-only test that writes the plist to a temp file and validates it with plutil -lint. Includes the malicious arg to exercise XML escaping.

The tests are well-structured — they assert both the specific key values and the overall XML validity. The platform guard (if (platform() !== "darwin") return) for the plutil test is correct.
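
A minimal sketch of what regex assertions of this kind might look like. The helper checkPlistKeys, its return shape, and the two sample fragments are assumptions written for illustration; they are not copied from packages/cli/test/mail-watch.test.ts. The plutil -lint step is left out because it needs macOS.

```typescript
// Hypothetical helper; the regexes are illustrative, not the repo's actual tests.
interface PlistCheck {
  processTypeBackground: boolean;
  throttleIntervalTen: boolean;
  keepAliveCrashed: boolean;
}

function checkPlistKeys(xml: string): PlistCheck {
  return {
    // Each regex ties a key to the value element that immediately follows it.
    processTypeBackground:
      /<key>ProcessType<\/key>\s*<string>Background<\/string>/.test(xml),
    throttleIntervalTen:
      /<key>ThrottleInterval<\/key>\s*<integer>10<\/integer>/.test(xml),
    keepAliveCrashed:
      /<key>KeepAlive<\/key>\s*<dict>\s*<key>Crashed<\/key>\s*<true\s*\/>\s*<\/dict>/.test(xml),
  };
}

// Fragment shaped like the post-fix output described in this PR.
const fixedFragment = `
  <key>KeepAlive</key>
  <dict>
    <key>Crashed</key>
    <true/>
  </dict>
  <key>ProcessType</key>
  <string>Background</string>
  <key>ThrottleInterval</key>
  <integer>10</integer>
`;

// Fragment shaped like the pre-fix output: no ProcessType, ThrottleInterval 5.
const preFixFragment = `
  <key>KeepAlive</key>
  <dict>
    <key>Crashed</key>
    <true/>
  </dict>
  <key>ThrottleInterval</key>
  <integer>5</integer>
`;

console.log(checkPlistKeys(fixedFragment));
console.log(checkPlistKeys(preFixFragment));
```

Matching the key and its value together, rather than each on its own, guards against a plist where the string Background appears under some other key.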

mail-watch tests: 19 pass / 0 fail. Full CLI suite: 1025 pass / 5 fail — the 5 failures are pre-existing on origin/main (verified by stashing), all environmental (nono not on PATH, SSH timeout, etc.). Agent suite: 82 pass / 0 fail.

CI Status

Completed checks all green:

  • Build (TypeScript strict) ✅
  • Lint (Biome) ✅
  • Dependency Audit ✅
  • Binary Smoke Test ✅
  • Semgrep SAST ✅
  • CodeQL SAST ✅
  • Unit & Integration Tests ✅
  • Socket Security ✅

Docker Integration was IN_PROGRESS at review time — should pass given the minimal, well-tested change.

Verdict

APPROVE. The root cause diagnosis is precise (idle-reap + KeepAlive(Crashed-only) = silent deafness), the fix is the correct launchd key (ProcessType=Background), the ThrottleInterval alignment with launchd's actual minimum is sensible, and the tests assert both specific key values and overall plist validity. This eliminates a real operational failure mode that was silently stranding agent mail dispatches.

@tps-flint merged commit 542c1d7 into main on Jun 27, 2026 (11 checks passed)
@tps-flint deleted the watcher-stall-fix branch on June 27, 2026, 01:07