feat(daemon): split opaque execution_failed into close-reason sub-details by lefarcen · Pull Request #4502 · nexu-io/open-design

lefarcen · 2026-06-18T04:57:44Z

Why

I'm taking over the run-reliability lane (#3408). Profiling current-version (0.10.0+) failures in PostHog shows that the single largest failure_detail is the opaque execution_failed: ~4.5k/week across providers (opencode, codex_cli, claude_code, gemini, …). It's the AGENT_EXECUTION_FAILED catch-all that fires when no text pattern matched, so on the dashboard it reads as "the agent failed and we don't know why."

But the daemon does already know more: the runtime_close diagnostic carries rpc_close_reason, which splits this bucket into three genuinely distinct shapes — mid-stream agent error (stream_error, ~51%), bare non-zero exit (exit_nonzero, ~44%), and ACP fatal close (fatal_rpc_error). That split only lived on PostHog's rpc_close_reason field, so the canonical failure_detail — and therefore the Langfuse sink, which classifies through the same classifyRunFailure — stayed opaque, and the two sinks disagreed per run.

What users will see

No user-facing UI change. This is internal failure-analytics: dashboards and Langfuse traces keyed on failure_detail now see stream_error / exit_nonzero / fatal_rpc_error instead of one undifferentiated execution_failed, so the largest failure bucket becomes actionable.

How

classifyRunFailure already receives the run's events, so it reads the latest runtime_close diagnostic and promotes the generic execution_failed detail to the matching close reason. Only the opaque catch-all is refined — specific details (exit_code, cli_not_installed, …) and all retry behavior are untouched, so the change is observational-only. Three values added to TrackingRunFailureDetail.
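
A minimal sketch of the promotion step described above, not the actual daemon code. The `RunEvent` shape, the `refineFailureDetail` helper, and the `type` field are hypothetical names for illustration; only `runtime_close`, `rpc_close_reason`, `execution_failed`, and the three close-reason values come from this PR.

```typescript
// Hypothetical event shape; the real daemon event types are not shown in this PR.
interface RunEvent {
  type: string;              // assumed discriminator, e.g. "runtime_close"
  rpc_close_reason?: string; // assumed to be carried by the runtime_close diagnostic
}

type CloseReason = "stream_error" | "exit_nonzero" | "fatal_rpc_error";

const CLOSE_REASONS: readonly string[] = [
  "stream_error",
  "exit_nonzero",
  "fatal_rpc_error",
];

function isCloseReason(value: string | undefined): value is CloseReason {
  return value !== undefined && CLOSE_REASONS.indexOf(value) !== -1;
}

// Promote only the generic catch-all; every other detail passes through unchanged.
function refineFailureDetail(detail: string, events: readonly RunEvent[]): string {
  if (detail !== "execution_failed") return detail;

  // Walk backwards so the latest runtime_close diagnostic decides.
  for (let i = events.length - 1; i >= 0; i--) {
    const event = events[i];
    if (event !== undefined && event.type === "runtime_close") {
      const reason = event.rpc_close_reason;
      // An unknown or missing reason keeps the opaque detail.
      return isCloseReason(reason) ? reason : detail;
    }
  }
  return detail; // no runtime_close diagnostic at all
}
```

Because the helper returns early for anything other than `execution_failed`, a specific detail such as `exit_code` can never be relabeled, which mirrors the guard specs listed under Validation.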

This is the first step of the #3408 follow-up; mining the stream_error texts (1,082/wk carry a langfuse_trace_id) for real provider/tool error patterns is a Langfuse-backed follow-up PR.

Surface area

API / contract — added stream_error / exit_nonzero / fatal_rpc_error to TrackingRunFailureDetail in packages/contracts
UI / Keyboard shortcut / CLI / Extension point / i18n / new dependency
Default behavior change — none for users; changes the failure_detail value emitted for a subset of failed runs (analytics only)
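
To illustrate the contract change, here is a hedged sketch of what extending the union could look like from a consumer's point of view. The union below is partial and lists only values named in this PR; the real `TrackingRunFailureDetail` in packages/contracts has other members not shown here, and `describeFailureDetail` is a hypothetical consumer.

```typescript
// Partial, illustrative union: only values mentioned in this PR are listed.
type TrackingRunFailureDetail =
  | "execution_failed"
  | "exit_code"
  | "cli_not_installed"
  | "stream_error"     // added: mid-stream agent error
  | "exit_nonzero"     // added: bare non-zero exit
  | "fatal_rpc_error"; // added: ACP fatal close

// Hypothetical consumer, e.g. a dashboard label mapper.
function describeFailureDetail(detail: TrackingRunFailureDetail): string {
  switch (detail) {
    case "stream_error":
      return "mid-stream agent error";
    case "exit_nonzero":
      return "bare non-zero exit";
    case "fatal_rpc_error":
      return "ACP fatal close";
    case "execution_failed":
      return "unclassified execution failure";
    case "exit_code":
    case "cli_not_installed":
      return "specific detail (unchanged by this PR)";
    default: {
      // Compile-time exhaustiveness check: adding a union member without
      // handling it here makes this assignment a type error.
      const unreachable: never = detail;
      return unreachable;
    }
  }
}
```

A consumer written this way fails typechecking when the union grows, which is one reason a contract typecheck is a useful part of validating an enum extension.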

Validation

pnpm --dir apps/daemon exec vitest run -c vitest.config.ts tests/run-failure-classification.test.ts — 44/44 (6 new red specs: each close reason → its detail, plus guards that a missing/unknown diagnostic stays execution_failed and a specific detail is never relabeled)
pnpm --filter @open-design/contracts typecheck, pnpm --filter @open-design/daemon typecheck
pnpm guard

Relates to #3408.

…ails `AGENT_EXECUTION_FAILED` whose text matches no classifier pattern currently collapses into a single opaque `failure_detail: execution_failed`. PostHog shows this is the largest current-version failure detail (~4.5k/wk across providers), and it hides three genuinely distinct shapes that the daemon already distinguishes via the `runtime_close` diagnostic's `rpc_close_reason`: mid-stream agent error (stream_error, ~51%), bare non-zero exit (exit_nonzero, ~44%), and ACP fatal close (fatal_rpc_error). That split lived only on PostHog's `rpc_close_reason` field, so the canonical `failure_detail` — and therefore the Langfuse sink, which classifies through the same `classifyRunFailure` — stayed opaque and the two sinks disagreed. `classifyRunFailure` already receives the run events, so it reads the `runtime_close` diagnostic and promotes the generic `execution_failed` detail to the specific close reason. Only the opaque catch-all is refined; specific details (exit_code, cli_not_installed, …) and retry behavior are untouched, so this is observational-only. This is the first step of the #3408 follow-up that turns the opaque bucket into named sub-buckets; mining the stream_error texts for real provider/tool error patterns is a Langfuse-backed follow-up. Relates to #3408.

PerishCode

@lefarcen I reviewed the changed classifier path, the added daemon regression coverage, and the analytics contract enum extension. The implementation keeps the refinement scoped to the generic AGENT_EXECUTION_FAILED / execution_failed bucket, preserves specific process-exit details, and aligns the new failure_detail values with the existing runtime_close close-reason vocabulary. CI is green; my local focused test/typecheck attempt could not run because this prepared worktree has no installed dependencies. Nice work tightening the analytics signal without broadening runtime behavior.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

…n_failed) Combines the P0-b fix_config fix and the P1 process_exit/execution_failed deepening into one spec under 'engineering-view failure reduction' (the ~7% we can actually fix). Slice 1 = fix_config: codex writes service_tier="default" (Langfuse-confirmed), but codex-config-normalize.ts only handles "priority"; generalize it to normalize any value not in {fast,flex}. Slice 2 = execution_failed deepening (#4502 did the close-reason split; next is Langfuse-mining the stream_error texts to add real classifier patterns). Slice 3 = the already-named real bugs in process_exit (spawn_ebadf/eperm 149/wk, agent_protocol_error 255, fabricated_role_marker 310). Grounded with the 7d process_exit breakdown + the confirmed 'default' value + code anchors.
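
A rough sketch of the Slice 1 generalization, under a loudly stated assumption: the commit message does not say what "normalize" does to an unsupported value, so this example assumes it means dropping the `service_tier` key. `CodexConfig` and `normalizeServiceTier` are hypothetical names and do not reflect the actual contents of codex-config-normalize.ts.

```typescript
// Assumption: the only values that should survive are the two named in the spec.
const SUPPORTED_SERVICE_TIERS = new Set<string>(["fast", "flex"]);

// Hypothetical config shape: an arbitrary key/value bag.
type CodexConfig = Record<string, unknown>;

// Assumption: "normalizing" an unsupported tier means removing the key so the
// tool falls back to its own default. The real behavior may differ.
function normalizeServiceTier(config: CodexConfig): CodexConfig {
  const tier = config["service_tier"];
  if (tier === undefined) return config;
  if (typeof tier === "string" && SUPPORTED_SERVICE_TIERS.has(tier)) return config;

  // Covers "priority", "default", and any other value outside the allow-list.
  const copy: CodexConfig = { ...config };
  delete copy["service_tier"];
  return copy;
}
```

An allow-list handles `"default"`, `"priority"`, and any future unexpected value in one branch, whereas a check against a single known-bad value has to be extended each time a new one appears.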

lefarcen requested a review from PerishCode June 18, 2026 05:00

lefarcen added the size/M (PR changes 100-300 lines), risk/high (High risk: apps/desktop, daemon, auth, migration, workflows, package deps), and type/feature (New feature) labels Jun 18, 2026

PerishCode approved these changes Jun 18, 2026


lefarcen mentioned this pull request Jun 18, 2026

Reliability push: classify run failures, correlate PostHog with Langfuse, and add safe retries #3408

Open

lefarcen added this pull request to the merge queue Jun 19, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 19, 2026


lefarcen wants to merge 1 commit into main from fix/classify-execution-failed-subtypes
