fix(mail-watch): ProcessType=Background so macOS stops idle-reaping the per-agent watcher (ops-bayh) #318
…he watcher (ops-bayh)

The per-agent mail-watch launchd daemon (`tps mail watch <agent> --daemon install`) generated a plist with no ProcessType. Its idle poll timer wakes every ~15s, which macOS power-classifies as "inefficient" and reaps with a clean exit 0. KeepAlive is Crashed:true only, so a clean exit is NOT restarted: the agent goes silently deaf (no crash, no log, mail strands in new/). This is the root cause behind the recurring watcher stalls (ops-bayh / ops-bkx1).

Fix: buildPlist() now emits `ProcessType=Background`, which tells launchd the job is a long-lived background daemon and opts it out of the idle-reap. Bumped ThrottleInterval 5→10. KeepAlive(Crashed:true) is preserved so a genuine crash still restarts.

Makes the durable per-agent-daemon path correct so we can migrate off the hand-rolled ~/.tps/bin/tps-mail-watch bash wrapper (whose asymmetric `wait` lets one dead child silently deafen an agent).

Exported buildPlist for unit testing; added tests asserting ProcessType=Background, ThrottleInterval, preserved KeepAlive, and `plutil -lint` validity of the generated XML.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tps-sherlock
left a comment
Security/Auth Review — PR #318
Verdict: APPROVE
What changed
macOS was idle-reaping the per-agent mail-watch daemon. The watcher's ~15s poll timer got it power-classified as "inefficient," and launchd reaped it with a clean exit 0 — which KeepAlive(Crashed:true) doesn't restart (exit 0 isn't a crash). Agents went silently deaf.
The fix adds ProcessType=Background to the launchd plist, which opts the job out of idle-reap classification:
<key>ProcessType</key>
<string>Background</string>
Also bumps ThrottleInterval from 5 to 10 (minimum seconds between job launches) and exports buildPlist for unit testing.
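For orientation, here is a minimal sketch of a generator that emits the three lifecycle keys discussed above. It is not the repo's `buildPlist()`: the name `buildPlistSketch`, the option names, and the example label are assumptions made for illustration, and XML escaping is deliberately left out.

```typescript
// Hypothetical sketch; names and argument shape are assumptions, not the repo's buildPlist().
interface PlistSketchOptions {
  label: string;         // launchd job label (illustrative value in the usage below)
  programArgs: string[]; // argv for the watcher; assumed XML-safe here, no escaping shown
}

function buildPlistSketch(opts: PlistSketchOptions): string {
  const args = opts.programArgs.map((a) => `    <string>${a}</string>`).join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">',
    '<plist version="1.0">',
    "<dict>",
    "  <key>Label</key>",
    `  <string>${opts.label}</string>`,
    "  <key>ProgramArguments</key>",
    "  <array>",
    args,
    "  </array>",
    // Restart only after a crash; a clean exit is left alone.
    "  <key>KeepAlive</key>",
    "  <dict>",
    "    <key>Crashed</key>",
    "    <true/>",
    "  </dict>",
    // The key this PR adds.
    "  <key>ProcessType</key>",
    "  <string>Background</string>",
    // Minimum seconds between launches, bumped from 5 to 10 in this PR.
    "  <key>ThrottleInterval</key>",
    "  <integer>10</integer>",
    "</dict>",
    "</plist>",
    "",
  ].join("\n");
}

const sketchXml = buildPlistSketch({
  label: "example.mail-watch.demo",
  programArgs: ["/usr/local/bin/tps", "mail", "watch", "demo"],
});
console.log(sketchXml);
```

Keeping the generator a pure string-returning function is what makes it straightforward to unit test without touching launchd.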
Security/auth analysis
- No auth surface change. This is a launchd configuration key — purely macOS process lifecycle management. No new credentials, no new auth paths, no new network calls.
- No privilege escalation. `ProcessType=Background` doesn't grant any new capabilities. It only tells launchd not to kill the process for being idle.
- `ThrottleInterval` bump (5→10) is a rate-limiting change: slightly more conservative, not less. No security downside.
- `buildPlist` export is a pure function that generates XML. No side effects, no new attack surface.
- No dependency changes. No new packages, no version bumps.
Tests
New buildPlist tests:
- Asserts `ProcessType=Background` key/value pair
- Asserts `ThrottleInterval=10`
- Asserts `KeepAlive(Crashed:true)` still present (real crashes still restart)
- Runs `plutil -lint` on generated plist (macOS only); validates well-formed XML
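The key/value assertions listed above could be expressed as whitespace-tolerant regex checks along these lines. This is a sketch under assumptions: `keyValuePattern` and `checkLifecycleKeys` are hypothetical helpers, and the actual tests in the PR may be written differently.

```typescript
// Builds a matcher for <key>NAME</key> followed by a typed value element,
// tolerating any whitespace between the two elements.
function keyValuePattern(key: string, tag: "string" | "integer", value: string): RegExp {
  const esc = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`<key>${esc(key)}</key>\\s*<${tag}>${esc(value)}</${tag}>`);
}

// Returns the names of the checks that the given plist XML fails (empty array = all pass).
function checkLifecycleKeys(xml: string): string[] {
  const checks: Array<[string, RegExp]> = [
    ["ProcessType", keyValuePattern("ProcessType", "string", "Background")],
    ["ThrottleInterval", keyValuePattern("ThrottleInterval", "integer", "10")],
    [
      "KeepAlive.Crashed",
      /<key>KeepAlive<\/key>\s*<dict>\s*<key>Crashed<\/key>\s*<true\s*\/>\s*<\/dict>/,
    ],
  ];
  return checks.filter(([, re]) => !re.test(xml)).map(([name]) => name);
}

// Illustrative fragment, not output captured from the real buildPlist().
const sampleFragment = `
  <key>KeepAlive</key>
  <dict>
    <key>Crashed</key>
    <true/>
  </dict>
  <key>ProcessType</key>
  <string>Background</string>
  <key>ThrottleInterval</key>
  <integer>10</integer>
`;
console.log(checkLifecycleKeys(sampleFragment)); // prints []
```

Returning the list of failed check names, rather than a single boolean, makes a failing assertion say which key regressed.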
Pure infrastructure fix. No security concerns.
tps-kern
left a comment
PR #318 Review — fix(mail-watch): ProcessType=Background so macOS stops idle-reaping the per-agent watcher (ops-bayh)
Reviewer: Kern (CTO/Architecture)
Repo: tpsdev-ai/cli
Verdict: APPROVE
Summary
The per-agent mail-watch launchd daemon generated by buildPlist() had no ProcessType key. The watcher's idle poll timer wakes every ~15s, which macOS power-classifies as "inefficient" and reaps with a clean exit 0. Because KeepAlive is Crashed: true only, a clean exit is not restarted — the agent goes silently deaf (no crash, no error log, mail strands in new/). This is the root cause behind the recurring watcher stalls (ops-bayh / ops-bkx1, including the 2026-06-25 13h stall).
Fix: buildPlist() now emits ProcessType=Background, which tells launchd the job is a long-lived background daemon and opts it out of the idle-reap. ThrottleInterval bumped 5 → 10. KeepAlive(Crashed: true) preserved so genuine crashes still restart.
Architecture Assessment
Root cause diagnosis is correct and well-documented
The diagnosis chain is precise:
- No `ProcessType` → macOS classifies the ~15s idle poll as "inefficient"
- macOS reaps the process with a clean exit 0
- `KeepAlive(Crashed: true)` only restarts on crash, not on clean exit
- Agent goes silently deaf: no crash log, no error, mail accumulates in `new/`
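The chain above can be restated as a tiny decision model. This is a simplified illustration of the diagnosis, not launchd's implementation: it models only the two outcomes named in the chain (a clean exit 0 and a crash signal), and the type and function names are hypothetical.

```typescript
// Simplified model of the diagnosis; not launchd's real logic.
type Outcome =
  | { kind: "clean-exit" }                    // process exits with status 0
  | { kind: "crash-signal"; signal: string }; // process is killed by a crash signal

// With KeepAlive = { Crashed: true }, only a crash counts as a reason to relaunch.
function restartsWithCrashedTrue(outcome: Outcome): boolean {
  return outcome.kind === "crash-signal";
}

// Hypothetical helper: what an operator would observe after each outcome.
function watcherStateAfter(outcome: Outcome): "relaunched" | "silently stopped" {
  return restartsWithCrashedTrue(outcome) ? "relaunched" : "silently stopped";
}

console.log(watcherStateAfter({ kind: "clean-exit" }));                       // prints "silently stopped"
console.log(watcherStateAfter({ kind: "crash-signal", signal: "SIGSEGV" })); // prints "relaunched"
```

The first call is the failure mode described in this PR: the reap looks like a clean exit, so nothing brings the watcher back.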
This is a genuine macOS launchd behavior. ProcessType=Background is the correct key — it tells launchd the job is a long-running background daemon, opting it out of the efficiency-based reaping that affects low-activity processes. This is the standard fix for this class of problem.
The fix is minimal and correct
Two changes to buildPlist():
- `ProcessType=Background`: the core fix. Opts out of idle-reap.
- `ThrottleInterval` 5 → 10: reduces restart frequency if a crash does occur. 10s is still fast enough for a mail watcher; 5s was aggressive and could cause restart loops under load.
KeepAlive(Crashed: true) is preserved — a genuine crash (non-zero exit, signal) still triggers a restart. This is the right combination: background daemon that won't be idle-reaped, but still auto-restarts on actual crashes.
Export for testing
buildPlist was a private function; now exported for unit testing. This is the right approach — testing the plist generation directly is more reliable than testing via the daemon install path.
Correctness Notes
- `ProcessType=Background` placement. The key is placed after the `KeepAlive` dict and before `ThrottleInterval`. Plist keys are not order-sensitive in practice, but the placement is clean.
- XML escaping. The existing `xmlEscape()` function is used for paths and args. The `plutil -lint` test with a malicious arg ("arg with spaces & <special>") validates that the escaping works correctly.
- `ThrottleInterval=10`. launchd's minimum `ThrottleInterval` is 10s anyway (values below 10 are treated as 10 in modern macOS), so this change aligns the declared value with the actual minimum. No behavioral regression.
- Scope discipline. The PR correctly excludes the `openclaw-deliver` hook timeout fix (600→180s), noting it's a local-only `~/.tps` file with no repo source. Good scope boundaries.
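To make the XML escaping note concrete, here is a sketch of a typical escaper applied to the malicious argument quoted above. The repo's `xmlEscape()` is not shown in this PR page, so `xmlEscapeSketch` is an assumed stand-in covering the five predefined XML entities.

```typescript
// Assumed stand-in for the repo's xmlEscape(); the real function may differ.
function xmlEscapeSketch(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

const escapedArg = xmlEscapeSketch("arg with spaces & <special>");
console.log(`<string>${escapedArg}</string>`);
// prints <string>arg with spaces &amp; &lt;special&gt;</string>
```

Without the escaping step, the raw `&` and `<special>` would make the plist malformed, which is what the `plutil -lint` test is designed to catch.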
Test Assessment
4 new tests in packages/cli/test/mail-watch.test.ts:
- `ProcessType=Background` assertion: regex matches the key/string pairing
- `ThrottleInterval=10` assertion: regex matches the key/integer pairing
- `KeepAlive(Crashed:true)` preserved: regex matches the dict structure
- `plutil -lint` validity: macOS-only test that writes the plist to a temp file and validates it with `plutil -lint`. Includes the malicious arg to exercise XML escaping.
The tests are well-structured — they assert both the specific key values and the overall XML validity. The platform guard (if (platform() !== "darwin") return) for the plutil test is correct.
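A platform-guarded lint check of the kind described above might look roughly like this. It is a sketch assuming a Node.js runtime; `lintPlist` and its result type are hypothetical, and only `plutil -lint <file>` itself is taken from the PR.

```typescript
import { execFileSync } from "node:child_process";
import { mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { platform, tmpdir } from "node:os";
import { join } from "node:path";

type LintResult = "ok" | "invalid" | "skipped";

// Writes the XML to a temp file and validates it with plutil.
// Returns "skipped" on non-macOS hosts, where plutil is not available.
function lintPlist(xml: string): LintResult {
  if (platform() !== "darwin") return "skipped";
  const dir = mkdtempSync(join(tmpdir(), "plist-lint-"));
  const file = join(dir, "job.plist");
  try {
    writeFileSync(file, xml, "utf8");
    try {
      // plutil exits non-zero on a malformed plist, which makes execFileSync throw.
      execFileSync("plutil", ["-lint", file], { stdio: "pipe" });
      return "ok";
    } catch {
      return "invalid";
    }
  } finally {
    // Always clean up the temp directory, whatever the outcome.
    rmSync(dir, { recursive: true, force: true });
  }
}

// Minimal well-formed plist used only to exercise the helper.
const minimalPlist = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<plist version="1.0">',
  "<dict>",
  "  <key>ProcessType</key>",
  "  <string>Background</string>",
  "</dict>",
  "</plist>",
  "",
].join("\n");

console.log(lintPlist(minimalPlist));
```

Returning `"skipped"` instead of silently passing keeps it visible in CI logs that the lint did not actually run on a non-macOS runner.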
mail-watch tests: 19 pass / 0 fail. Full CLI suite: 1025 pass / 5 fail — the 5 failures are pre-existing on origin/main (verified by stashing), all environmental (nono not on PATH, SSH timeout, etc.). Agent suite: 82 pass / 0 fail.
CI Status
Completed checks all green:
- Build (TypeScript strict) ✅
- Lint (Biome) ✅
- Dependency Audit ✅
- Binary Smoke Test ✅
- Semgrep SAST ✅
- CodeQL SAST ✅
- Unit & Integration Tests ✅
- Socket Security ✅
Docker Integration was IN_PROGRESS at review time — should pass given the minimal, well-tested change.
Verdict
APPROVE. The root cause diagnosis is precise (idle-reap + KeepAlive(Crashed-only) = silent deafness), the fix is the correct launchd key (ProcessType=Background), the ThrottleInterval alignment with launchd's actual minimum is sensible, and the tests assert both specific key values and overall plist validity. This eliminates a real operational failure mode that was silently stranding agent mail dispatches.
Problem (root cause)
The per-agent mail-watch launchd daemon installed by `tps mail watch <agent> --daemon install` was getting reaped by macOS. `buildPlist()` generated a plist with no `ProcessType`. The watcher's idle poll timer wakes roughly every 15s; with no `ProcessType`, launchd/macOS power-classifies the job as "inefficient" and reaps it with a clean exit 0.
The generated `KeepAlive` is `Crashed: true` only, so a clean exit 0 is not restarted. The result: the watcher dies silently (no crash, no error log), the agent stops consuming mail, and dispatches strand in `new/`. This is the recurring watcher-stall failure mode (ops-bayh / ops-bkx1; the 2026-06-25 13h stall).
The live single hand-stabilized plist already had `ProcessType=Background` applied by hand. This PR makes the durable per-agent-daemon code path correct so the fix survives reinstalls and we can migrate off the hand-rolled `~/.tps/bin/tps-mail-watch` bash wrapper (whose asymmetric `wait` lets one dead child silently deafen an agent).
Fix
`buildPlist()` now emits `ProcessType=Background`, which tells launchd the job is a long-lived background daemon and opts it out of the idle/efficiency reap. `ThrottleInterval` bumped 5 → 10. `KeepAlive(Crashed: true)` is preserved so a genuine crash still restarts.
Verification
- Exported `buildPlist` and added unit tests asserting `ProcessType=Background`, `ThrottleInterval=10`, preserved `KeepAlive(Crashed:true)`, and `plutil -lint` validity of the generated XML (with a malicious exec arg, to exercise xmlEscape).
- … (`runMailWatch` → `installDaemon` → `buildPlist` → `writeFileSync`) and generated a sample plist via the real function: `plutil -lint` → OK; `plutil -extract ProcessType raw` → `Background`; `ThrottleInterval` → `10`; `KeepAlive.Crashed` → `true`.
- `tsc` typecheck + dist build clean (dependency order: cli bootstrap → agent → cli → pi-tps-mail).
- Full CLI suite: 1025 pass / 5 fail; the 5 failures are pre-existing on `origin/main` (verified by stashing this change and re-running: identical 5 fail / same names). They are environmental: `nono` not on PATH, real-SSH reachability timeout, verify subprocess, type-coercion, memory-promotion. Agent suite: 82 pass / 0 fail. `pi-tps-mail`: no tests.
- … `mail-watch.ts` are the pre-existing `opts.onPoll!()` non-null assertions from PR Mail-watch onPoll liveness heartbeat hook (ops-i3vw) #317).
Out of scope (FIX 2: deliver-hook timeout)
The `openclaw-deliver` hook (`openclaw agent --timeout 600`) that pins a handler slot for 10 min is not in this repo. It exists only as an installed file at `~/.tps/bin/hooks/openclaw-deliver.sh` (the repo's `plugins/openclaw-tps-mail/README.md` even documents deleting it). No repo source, script, or template contains a `--timeout 600` invocation. The 600→180 tightening is a local-only `~/.tps` edit and is not part of this PR.
Refs: ops-bayh, ops-bkx1.
🤖 Generated with Claude Code