feat(daemon): split opaque execution_failed into close-reason sub-details #4502

Open
lefarcen wants to merge 1 commit into main from fix/classify-execution-failed-subtypes

@lefarcen

Contributor

Why

I'm taking over the run-reliability lane (#3408). Profiling current-version (0.10.0+) failures in PostHog shows that the single largest failure_detail is the opaque execution_failed, at ~4.5k/week across providers (opencode, codex_cli, claude_code, gemini, …). It's the AGENT_EXECUTION_FAILED catch-all that fires when no text pattern matched, so on the dashboard it reads as "the agent failed and we don't know why."

But the daemon does already know more: the runtime_close diagnostic carries rpc_close_reason, which splits this bucket into three genuinely distinct shapes — mid-stream agent error (stream_error, ~51%), bare non-zero exit (exit_nonzero, ~44%), and ACP fatal close (fatal_rpc_error). That split only lived on PostHog's rpc_close_reason field, so the canonical failure_detail — and therefore the Langfuse sink, which classifies through the same classifyRunFailure — stayed opaque, and the two sinks disagreed per run.

What users will see

No user-facing UI change. This is internal failure-analytics: dashboards and Langfuse traces keyed on failure_detail now see stream_error / exit_nonzero / fatal_rpc_error instead of one undifferentiated execution_failed, so the largest failure bucket becomes actionable.

How

classifyRunFailure already receives the run's events, so it reads the latest runtime_close diagnostic and promotes the generic execution_failed detail to the matching close reason. Only the opaque catch-all is refined; specific details (exit_code, cli_not_installed, …) and all retry behavior are untouched, so the change is observational-only. Three values are added to TrackingRunFailureDetail.
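The refinement step described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: the `RunEvent` shape, the `kind` field name, and the `refineFailureDetail` helper are assumptions made for the example, and `FailureDetail` lists only the detail values named in this PR.

```typescript
// Hypothetical types: the real event and detail types live in the daemon and
// in packages/contracts and are not shown in this PR description.
type CloseReason = "stream_error" | "exit_nonzero" | "fatal_rpc_error";

type FailureDetail =
  | "execution_failed"
  | "exit_code"
  | "cli_not_installed"
  | CloseReason;

interface RunEvent {
  kind: string; // assumed discriminator, e.g. "runtime_close"
  rpc_close_reason?: string; // carried by the runtime_close diagnostic
}

// Accept only the three close reasons that map onto new failure_detail values.
function isCloseReason(value: string | undefined): value is CloseReason {
  return (
    value === "stream_error" ||
    value === "exit_nonzero" ||
    value === "fatal_rpc_error"
  );
}

// Promote the generic execution_failed detail to the latest close reason.
// Any other detail is returned unchanged, so specific details are never relabeled.
function refineFailureDetail(
  detail: FailureDetail,
  events: readonly RunEvent[],
): FailureDetail {
  if (detail !== "execution_failed") return detail;

  let reason: string | undefined;
  for (const event of events) {
    // Later runtime_close events overwrite earlier ones, so the latest wins.
    if (event.kind === "runtime_close") reason = event.rpc_close_reason;
  }
  // A missing or unknown close reason keeps the opaque catch-all.
  return isCloseReason(reason) ? reason : detail;
}
```

Keeping the guard on `detail !== "execution_failed"` as the first statement is what makes the change observational-only in this sketch: nothing outside the catch-all bucket can change value.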

This is the first step of the #3408 follow-up; mining the stream_error texts (1,082/wk carry a langfuse_trace_id) for real provider/tool error patterns is a Langfuse-backed follow-up PR.

Surface area

  • API / contract — added stream_error / exit_nonzero / fatal_rpc_error to TrackingRunFailureDetail in packages/contracts
  • UI / Keyboard shortcut / CLI / Extension point / i18n / new dependency
  • Default behavior change — none for users; changes the failure_detail value emitted for a subset of failed runs (analytics only)
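For consumers keyed on failure_detail, the contract change amounts to a wider union. The sketch below assumes a reduced version of `TrackingRunFailureDetail` (only the values named in this PR; the real union in packages/contracts has more members) and a hypothetical `describeFailureDetail` helper that a dashboard might use. The `never` check shows how an exhaustive switch would flag the three new values at compile time if they were left unhandled.

```typescript
// Assumed subset of TrackingRunFailureDetail, for illustration only.
type TrackingRunFailureDetail =
  | "execution_failed"
  | "exit_code"
  | "cli_not_installed"
  | "stream_error"
  | "exit_nonzero"
  | "fatal_rpc_error";

// Hypothetical consumer: maps each detail to a human-readable label.
function describeFailureDetail(detail: TrackingRunFailureDetail): string {
  switch (detail) {
    case "stream_error":
      return "mid-stream agent error";
    case "exit_nonzero":
      return "bare non-zero exit";
    case "fatal_rpc_error":
      return "ACP fatal close";
    case "execution_failed":
      return "unclassified execution failure";
    case "exit_code":
      return "process exit code";
    case "cli_not_installed":
      return "agent CLI not installed";
    default: {
      // Compile-time exhaustiveness check: adding a union member without a
      // case above makes this assignment a type error.
      const unreachable: never = detail;
      return unreachable;
    }
  }
}
```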

Validation

  • pnpm --dir apps/daemon exec vitest run -c vitest.config.ts tests/run-failure-classification.test.ts — 44/44 (6 new red specs: each close reason → its detail, plus guards that a missing/unknown diagnostic stays execution_failed and a specific detail is never relabeled)
  • pnpm --filter @open-design/contracts typecheck, pnpm --filter @open-design/daemon typecheck
  • pnpm guard

Relates to #3408.

…ails

`AGENT_EXECUTION_FAILED` whose text matches no classifier pattern currently
collapses into a single opaque `failure_detail: execution_failed`. PostHog shows
this is the largest current-version failure detail (~4.5k/wk across providers),
and it hides three genuinely distinct shapes that the daemon already
distinguishes via the `runtime_close` diagnostic's `rpc_close_reason`:
mid-stream agent error (stream_error, ~51%), bare non-zero exit (exit_nonzero,
~44%), and ACP fatal close (fatal_rpc_error).

That split lived only on PostHog's `rpc_close_reason` field, so the canonical
`failure_detail` — and therefore the Langfuse sink, which classifies through the
same `classifyRunFailure` — stayed opaque and the two sinks disagreed.

`classifyRunFailure` already receives the run events, so it reads the
`runtime_close` diagnostic and promotes the generic `execution_failed` detail to
the specific close reason. Only the opaque catch-all is refined; specific
details (exit_code, cli_not_installed, …) and retry behavior are untouched, so
this is observational-only.

This is the first step of the #3408 follow-up that turns the opaque bucket into
named sub-buckets; mining the stream_error texts for real provider/tool error
patterns is a Langfuse-backed follow-up.

Relates to #3408.
lefarcen requested a review from PerishCode on June 18, 2026 05:00
lefarcen added the size/M, risk/high, and type/feature labels on Jun 18, 2026

@PerishCode PerishCode left a comment

Contributor

@lefarcen I reviewed the changed classifier path, the added daemon regression coverage, and the analytics contract enum extension. The implementation keeps the refinement scoped to the generic AGENT_EXECUTION_FAILED / execution_failed bucket, preserves specific process-exit details, and aligns the new failure_detail values with the existing runtime_close close-reason vocabulary. CI is green; my local focused test/typecheck attempt could not run because this prepared worktree has no installed dependencies. Nice work tightening the analytics signal without broadening runtime behavior.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

lefarcen pushed a commit that referenced this pull request Jun 18, 2026
…n_failed)

Combines the P0-b fix_config fix and the P1 process_exit/execution_failed
deepening into one spec under 'engineering-view failure reduction' (the ~7% we
can actually fix). Slice 1 = fix_config: codex writes service_tier="default"
(Langfuse-confirmed), but codex-config-normalize.ts only handles "priority" —
generalize to normalize any value not in {fast,flex}. Slice 2 = execution_failed
deepening (#4502 did the close-reason split; next is Langfuse-mining the
stream_error texts to add real classifier patterns). Slice 3 = the already-named
real bugs in process_exit (spawn_ebadf/eperm 149/wk, agent_protocol_error 255,
fabricated_role_marker 310). Grounded with the 7d process_exit breakdown + the
confirmed 'default' value + code anchors.
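
Slice 1 of the commit message above (generalizing the service_tier normalization) could look roughly like the sketch below. Everything here is an assumption for illustration: the commit message does not show codex-config-normalize.ts, so the `CodexConfig` shape, the `normalizeServiceTier` name, and the choice to drop an unsupported key (rather than rewrite it) are all hypothetical.

```typescript
// Hypothetical config shape; the real Codex config type is not shown here.
type CodexConfig = Record<string, unknown>;

// The commit message names {fast, flex} as the values to leave alone.
const SUPPORTED_SERVICE_TIERS = ["fast", "flex"];

// Assumption: "normalize" means removing an unsupported service_tier so the
// CLI falls back to its own default. The real fix may rewrite the value instead.
function normalizeServiceTier(config: CodexConfig): CodexConfig {
  const tier = config["service_tier"];
  if (tier === undefined) return config;
  if (typeof tier === "string" && SUPPORTED_SERVICE_TIERS.indexOf(tier) !== -1) {
    return config;
  }
  // Copy first so the caller's object is not mutated.
  const normalized: CodexConfig = { ...config };
  delete normalized["service_tier"];
  return normalized;
}
```

Checking against an allow-list instead of a single known-bad value ("priority") means a newly observed value such as "default" is handled without another code change.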
lefarcen added this pull request to the merge queue on Jun 19, 2026
github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks on Jun 19, 2026

Labels

  • risk/high: High risk (apps/desktop, daemon, auth, migration, workflows, package deps)
  • size/M: PR changes 100-300 lines
  • type/feature: New feature

2 participants