docs(spec): stability & performance — current state + fix/optimization plan by lefarcen · Pull Request #4504 · nexu-io/open-design

lefarcen · 2026-06-18T05:08:57Z

Why

Taking over the run-reliability/performance lane (#3408). The most-felt latency in the product is the ~10s of dead time before any token appears on an AMR or claude_code turn — and it repeats on every message, not just the first. This spec defines how we profile and fix it. Per the request, it is a plan with grounding data first, not code.

What users will see

Nothing yet — this is a checked-in spec under specs/current/. The eventual outcome is faster time-to-first-token on continued turns.

What the data already shows (in the spec)

PostHog startup timing: the entire ~10s lives in the opaque spawn_to_first_token segment (AMR p50 10.4s / claude_code 8.2s; p90 24–34s). Queue + spawn + prompt-build are ~0.5s combined.
A turn-ordinal cut proves claude_code re-pays a full cold start every message (turn 1 = 8.2s, turn 2+ = 9.3s — no reuse benefit), which is exactly Claude Code adapter spawns a new session per message instead of resuming (token waste + lost edit state) #3380. AMR shows only marginal reuse.

The plan

Phase 1 (observability only) — sub-instrument spawn_to_first_token into cli_ready_ms / session_init_ms / model_first_token_ms so we can tell cold-start re-init apart from context-growth and provider first-token latency.
Phase 2 (experiments) — PostHog re-cut on the sub-segments + a local A/B (fresh vs resumed session), with a decision rule: proceed only if session_init_ms is a material share.
Phase 3 (fix, gated on Phase 2) — claude_code session resume (Claude Code adapter spawns a new session per message instead of resuming (token waste + lost edit state) #3380) / adopt fix: reuse ACP sessions per conversation #3535's ACP session reuse; validate via TTFT drop on continued turns, keeping correctness (no lost edit state).

Surface area

UI / CLI / API / contract / i18n / dependency — none
None — spec document only

The spec also documents why this line is out of scope for the #3545 QA gate: it removes dead time without changing model inputs/outputs, so there is no quality-regression surface (unlike token/context reduction, #3547).

Relates to #3408, #3380.

Proposes the plan for the ~10s pre-first-token latency users feel on every AMR/claude_code turn. PostHog timing shows the entire cost lives in the opaque spawn_to_first_token segment (AMR p50 10.4s / claude_code 8.2s), and a turn- ordinal cut proves claude_code re-pays a full cold start every message (no session reuse) — i.e. #3380. The spec defines sub-instrumentation to split that segment (cli_ready / session_init / model_first_token), a data + local A/B experiment phase to confirm session init dominates before any fix, and the session-resume fix gated on that evidence. Notes why this line is out of scope for the #3545 QA gate (no model-input/output change, no quality surface). Relates to #3408, #3380.

lefarcen · 2026-06-18T05:12:48Z

This is nicely grounded already — the PostHog split plus the turn-ordinal cut make the spec direction easy to follow.

One small PR-body follow-up before pool review: could you add a ## Validation section to the PR description summarizing what was checked for this spec PR (for example: doc-only change, no runtime behavior changed, no commands/tests run)? The spec file itself has the QA-boundary section already; this is just to make the PR template complete in the body.

nettee

I found three spec blockers in the changed ranges: the cli_ready_ms contract is still undefined across runtimes, the Phase 2 go/no-go rule is not yet testable as written, and the blanket QA-gate exemption conflicts with the spec's own correctness risk for session reuse.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

nettee

Two spec gaps need to be closed before this plan is safe to implement: the new timing fields need to stay on the existing PostHog/Langfuse contract path, and the QA-boundary section should not exempt Phase 3 session resume from correctness validation.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

Add the mandatory Sources / 事实源 section (repo, branch, exact file:line anchors for the timing pipeline and the #3380 session locus, how to pull, the PostHog queries behind the tables, and access prereqs) so reviewers can ground- check against real code. Add Why (用例/痛点 split), Goals/Non-goals, Alternatives considered, Risks & mitigations, and Implementation slices per the template. Optimization method unchanged: profile-first (sub-instrument spawn_to_first_token) → confirm session_init dominates → session reuse (#3380).

nettee

I found three blocking spec gaps in the changed ranges: the Phase 1 cli_ready_ms contract is still undefined across runtimes, the Phase 2 go/no-go gate is not reproducible as written, and the QA-boundary section still exempts Phase 3 even though the same spec treats session resume as correctness-sensitive.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

lefarcen · 2026-06-18T05:31:54Z

Hey @lefarcen — @nettee's current-head review now covers the blocking work clearly: the cli_ready_ms contract still needs one auditable definition across runtimes, the Phase 2 go/no-go rule needs concrete pass criteria, and the QA-boundary section needs to narrow the exemption so Phase 3 keeps correctness validation in scope.

Once those points are addressed — plus the earlier ## Validation section in the PR body — this should be in much better shape for the next pass.

💡 To drive this PR to merge hands-free, paste this to your AI coding agent (Claude Code / Codex / opencode / Cursor …):
Take over nexu-io/open-design#4504 until it merges — read https://raw.githubusercontent.com/nexu-io/looper/main/skills/pr-takeover/SKILL.md and follow it.

…audit Fold in the root cause of the low cache-hit rate, verified against latest origin/main: (1) volatile system blocks (MCP list / memory / file list / run context) sit in the prompt prefix BEFORE the stable append-only transcript (server.ts:7670), so any change drifts the prefix and invalidates the whole cache; (2) buildDaemonTranscript flattens history to content-only and drops the native thinking/tool_use/tool_result structure, feeding a fresh session each turn (daemon.ts:222); (3) the 5-min cache TTL expires between human turns. Split the fix into a cheap prefix-stabilization win (3a, no gate) and the bigger ACP-native session reuse (3b, needs vela session/load). Honest audit: the PostHog cache data is real production-client telemetry; the earlier local `claude -p` split bypassed the daemon and is dropped.

nettee

I found one blocking issue in the changed range: the new Phase 3 plan now includes prompt-shape and session-lifecycle changes, but the QA-boundary section still treats this line as if it never changes model inputs or behavior. That mismatch would let the implementation skip the validation this repo already expects for prompt/session changes.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

lefarcen · 2026-06-18T06:04:09Z

Quick follow-up on the new head: @nettee's latest blocking review has narrowed to one remaining issue — the QA-boundary section still over-exempts Phase 3 even though the current plan now includes prompt-shape and session-lifecycle changes.

If you tighten that section so Phase 1/2 can stay measurement-only while Phase 3 explicitly keeps targeted correctness validation in scope — and add the ## Validation section in the PR body — this should be ready for another pass.

…oduction - Phase 3a guard: a structural invariant built on the existing prompt-stack telemetry (prompt-telemetry.ts per-section fingerprints + SECTION_PRIORITY). Classify each PromptTelemetrySectionKind STABLE vs VOLATILE; assert the cacheable-prefix fingerprint is byte-stable when only volatile inputs change, and that every kind has a classification — so a future feature that adds an unclassified prompt section fails the test until it declares its side. Makes prefix stability self-enforcing as features evolve, not a one-time reorder. - Reproduction appendix: the exact PostHog HogQL queries behind every table, the Langfuse trace-fetch method, and an explicit warning not to extrapolate from a bare `claude -p` run (it bypasses the daemon).

nettee

I found one blocking issue in the changed range: the QA-boundary section still exempts the whole line from #3545 even though the current Phase 3 plan now changes prompt ordering and cross-turn session state, which this repo already treats as correctness-sensitive.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

Dedicated AMR optimization spec, root-caused to verified code: vela acp_runtime.go advertises loadSession:false + single-session + no session/load handler, so AMR cannot reuse a session and re-sends the flattened transcript every turn, re-processing ~100-153k uncached tokens per turn. Two levers: (A) session reuse — universal must-fix (vela: loadSession true + session/load + session persistence; daemon: one long-lived ACP connection per conversation), helps every caching model; (B) cache_control passthrough + stable prefix — only for explicit-cache models (Claude/Gemini); auto-cache models (DeepSeek — AMR's top model — and OpenAI) need only a stable prefix. Cross-user shared cache (AMR shared backend) makes the universal turn-1 prefix TTL-immune; turn-2+ uses the 1h extended TTL. Expected: AMR continued-turn TTFT ~11s -> ~6-7s (~40%, ~90% of runs) plus a token-cost cut that eases insufficient_balance (#4455). Includes the per-model cache table, pricing model, regression guard, and reproduction queries.

nettee

I found four blocking spec gaps that still leave the implementation path ambiguous on the current head: Phase 1 still does not define one auditable cli_ready_ms contract, the Phase 2 go/no-go gate is not reproducible, and both specs still over-exempt prompt/session changes from the correctness validation this repo already expects.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

lefarcen · 2026-06-18T08:33:40Z

Quick pass on the new head: @nettee's current-head review now spans both spec files, and the remaining blockers are broader again — the cli_ready_ms contract still needs one auditable definition, the Phase 2 gate still needs reproducible pass criteria, and the QA-boundary / correctness-validation story now needs to stay aligned across both specs as Phase 3 expands.

If you tighten those points in the two spec docs — plus add the ## Validation section in the PR body — this should be in much better shape for the next review round.

Caching already runs at the Vela Link gateway (prompt_cache.go, ephemeral/no TTL); no ACP cache_control passthrough needed. Corrections: DeepSeek read is 0.5x (vela billing), not 0.1x; AMR head models are DeepSeek/GLM/Gemini. Session reuse is an architecture change (vela tears down opencode temp home per turn; opencode durable session reload unverified), not a config flip. Reordered: Step 1 = enable 1h TTL + stable prefix at the gateway (smaller); Step 2 = session reuse (a cross-repo project). Marks this optimization as a project, not low- hanging fruit.

nettee

I found four blocking spec gaps that still leave the implementation path ambiguous on the current head: cli_ready_ms still lacks one auditable cross-runtime definition, the Phase 2 gate is still not reproducible, and both specs still over-exempt Phase 3 prompt/session changes from correctness validation.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

lefarcen · 2026-06-18T09:03:14Z

Quick follow-up on the current head: @nettee's blocking review is still pointing at the same four must-clarify spec points before implementation is safe to start — one auditable cli_ready_ms contract, a reproducible Phase 2 go/no-go rule, and narrower QA-boundary language in both specs so Phase 3 prompt/session changes keep explicit correctness validation in scope.

If you tighten those points in the two spec docs — plus add the ## Validation section in the PR body — this should be ready for another pass.

…n plan Master living doc for #3408. Part 1 (现状, reviewer background): how we measure (product-view ~13.5% vs engineering-view ~7% failure rate; the raw ~22% is inflated by user-action + old-version noise ~3x), the failure map (who's at fault), the perf map (TTFT mostly model-first-byte + measurement artifacts), and user-reach as a priority axis. Part 2 (计划): the full prioritized plan — P0-a metric calibration (done), P0-b fix_config codex normalizer gap, P1 process_exit deepening + reduce_context total-budget truncation, P2 switch_model catalog hygiene + TTFT metric semantics, P3 AMR latency (vela cross-repo project). Plus the explicitly-excluded phantoms (auth preflight, bun install, etc.) with rationale. Each item drills into its own spec.

…n_failed) Combines the P0-b fix_config fix and the P1 process_exit/execution_failed deepening into one spec under 'engineering-view failure reduction' (the ~7% we can actually fix). Slice 1 = fix_config: codex writes service_tier="default" (Langfuse-confirmed), but codex-config-normalize.ts only handles "priority" — generalize to normalize any value not in {fast,flex}. Slice 2 = execution_failed deepening (#4502 done the close-reason split; next is Langfuse-mining the stream_error texts to add real classifier patterns). Slice 3 = the already-named real bugs in process_exit (spawn_ebadf/eperm 149/wk, agent_protocol_error 255, fabricated_role_marker 310). Grounded with the 7d process_exit breakdown + the confirmed 'default' value + code anchors.

Align all four #3408 stability/performance specs with the repo's English convention. Translation + light design-level restructure (via codex GPT-5.5, human-verified): all numeric data, file:line anchors, IDs, SQL/command snippets, and table structures preserved verbatim; only prose and table text translated. Each doc now marked 'Design doc (human/reviewer-facing); implementation runbooks per slice are written separately at build time.'

…MR step 0/1/2, QA-gate corrections)

…/OpenAI/Gemini confirmed, DeepSeek lead model unverified)

…achedReadTokens) + within-turn vs cross-turn cache framing

…events data (turn-2+ first call ~21% hit, ~24.5k uncached, ~12s)

…R with production cross-turn cache data

nettee

I found three blocking spec issues that still make the implementation path ambiguous on the current head: Phase 1 still does not define one auditable cli_ready_ms contract, the Phase 2 gate is still not reproducible enough to drive a consistent go/no-go call, and the AMR acceptance section still points at the superseded ~153k metric instead of the corrected first-call baseline.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

lefarcen · 2026-06-18T13:33:28Z

Hey @lefarcen — @nettee's latest current-head review now pins the remaining blockers down pretty clearly: the Phase 1 cli_ready_ms contract still needs one auditable cross-runtime definition, the Phase 2 go/no-go rule still needs reproducible pass criteria, and the AMR acceptance text should switch off the superseded ~153k baseline to the corrected turn-2+ first-call cohort the spec now treats as authoritative.

Once those three points are tightened — plus the still-missing ## Validation section in the PR body — this should be in much better shape for the next pass.

…tal change set) + address review (cli_ready_ms/Phase2 gate contract, purge stale 153k baseline)

… cross-turn data (153k=context size not uncached)

nettee

A couple of the new spec sections still contradict later decisions in the same document set, so I'd tighten those before anyone uses this PR as the implementation guide.

_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

nettee · 2026-06-19T04:15:35Z

+
+1. **Phase 1 — Add subsegment instrumentation to `spawn_to_first_token` (pure observability, ship first)**: Add two timestamps between `processSpawnedAt`→`firstTokenAt`: `cliReadyAt` and `sessionInitDoneAt`, and emit `cli_ready_ms` / `session_init_ms` / `model_first_token_ms` on `run_finished` (sum of the three ≈ `spawn_to_first_token_ms`; also emit `remainder` for audit). **`cli_ready_ms` contract — one auditable marker per runtime family (no per-owner ambiguity)**: `claude-stream-json` = first JSONL line parsed from stdout; `acp-json-rpc` (AMR/devin/hermes) = first well-formed ACP JSON-RPC message received; `json-event-stream` (codex/gemini/cursor) = first decoded stream event; `plain` = first non-empty stdout chunk. `sessionInitDoneAt` = the resume/`session/new` handshake ack for ACP, or the first model-bound request for stream agents. These markers are declared in one place (the runtime def) so two owners cannot record different events. The new fields ride the **existing shared analytics contract + Langfuse mirror** (same path as the other `run_finished` reliability fields — not a divergent surface). No behavior change.
+2. **Phase 2 — Confirm experimentally**: Rerun turn-1 vs turn-2+ in PostHog with the new subsegments (expected: `session_init_ms` stays flat across turns = cold start repaid, `model_first_token_ms` grows with context); local A/B (same conversation fresh vs resumed, data injected only through production HTTP APIs). **Reproducible go/no-go gate**: population = successful `claude_code` runs, last 7d, turn-2+ only; statistic = **both p50 and p90** of `session_init_ms / spawn_to_first_token_ms`; threshold = enter Phase 3 if **p50 ≥ 30%** (or p90 ≥ 40%); local A/B = ≥ 30 paired fresh-vs-resumed runs. **Tie-break**: if the PostHog cut and the local A/B disagree, the PostHog production cut decides (A/B only sizes the effect). If `cli_ready_ms` dominates instead, choose a warmed process pool; otherwise classify as provider-side and stop with a recorded conclusion.
+3. **Phase 3a — Stabilize cacheable prefix (cheap, independent, requires QA/state-continuity gate)**: Move or freeze volatile system blocks (file list / MCP list / personal memory / run context) from **before** the stable transcript to after it, so the append-only transcript becomes a prefix that can really hit; confirm vela/opencode gateway passes through `cache_control`. Prefix order is model input, so this gives immediate benefit (lower cache-miss rate) only after quality/state-continuity acceptance.


This Phase 3a step still says to "confirm vela/opencode gateway passes through cache_control", but the sibling AMR spec in this same PR has already retired that premise: amr-latency-session-reuse-prompt-cache.md:162-166 says Vela Link already injects cache directives and that the remaining gap is TTL support / provider coverage, not ACP passthrough. Leaving both versions in the checked-in plan matters because one implementer can read this file and spend time on a transport path the other file has already ruled out, while another implementer follows the AMR feasibility section and skips it. Please align this file to the later feasibility result, for example by replacing the passthrough check with the real open work (1h TTL where supported, explicit-cache coverage, and stable-prefix ordering), or by scoping the passthrough note to non-Vela paths only.
_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

nettee · 2026-06-19T04:15:35Z

+
+## Open questions
+
+- What exact signal should be the reliable "CLI ready" marker for each runtime? (claude_code / ACP / plain stdout differ) — this is the critical dependency for Phase 1.


The Phase 1 contract at line 86 now defines exact per-runtime cli_ready_ms markers, but this open-question bullet still says the exact signal is unresolved. That contradiction matters because it re-opens one of the main blockers this PR already closed: an implementer or reviewer can point at this section and treat the contract as unsettled even though the earlier section now makes it auditable. Please either remove this bullet or narrow it to the remaining real question, such as whether any adapter family still needs a null/timing_unavailable_reason fallback because the defined marker cannot actually be observed in practice.
_{🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.}

lefarcen requested a review from nettee June 18, 2026 05:12

lefarcen added size/M PR changes 100-300 lines risk/medium Medium risk: regular code changes type/docs Documentation changes only labels Jun 18, 2026

This was referenced Jun 18, 2026

Reliability push: classify run failures, correlate PostHog with Langfuse, and add safe retries #3408

Open

Claude Code adapter spawns a new session per message instead of resuming (token waste + lost edit state) #3380

Open

nettee requested changes Jun 18, 2026

View reviewed changes

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

nettee requested changes Jun 18, 2026

View reviewed changes

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

nettee requested changes Jun 18, 2026

View reviewed changes

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

nettee requested changes Jun 18, 2026

View reviewed changes

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

nettee requested changes Jun 18, 2026

View reviewed changes

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

nettee requested changes Jun 18, 2026

View reviewed changes

lefarcen added 3 commits June 18, 2026 17:44

lefarcen changed the title ~~docs(spec): agent startup latency profiling and session-reuse~~ docs(spec): stability & performance — current state + fix/optimization plan Jun 18, 2026

lefarcen added 2 commits June 18, 2026 18:36

docs(spec): apply spec-battle review fixes (fix_config remove-line, A…

b3e985e

…MR step 0/1/2, QA-gate corrections)

docs(spec): pin cross-user cache feasibility to provider docs (Claude…

6d40a8b

…/OpenAI/Gemini confirmed, DeepSeek lead model unverified)

lefarcen added 3 commits June 18, 2026 19:32

docs(spec): correct AMR usage telemetry (vela #277/#288 now reports c…

7e6f9a8

…achedReadTokens) + within-turn vs cross-turn cache framing

docs(spec): replace 153k-uncached premise with production link.usage_…

fb1cc5a

…events data (turn-2+ first call ~21% hit, ~24.5k uncached, ~12s)

docs(spec): master-plan — add P2 prefix-stabilization + correct P3 AM…

cb25d72

…R with production cross-turn cache data

nettee requested changes Jun 18, 2026

View reviewed changes

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/agent-startup-latency-profiling.md Outdated

Comment thread specs/current/amr-latency-session-reuse-prompt-cache.md Outdated

lefarcen added 2 commits June 19, 2026 12:08

docs(spec): move #3408 stability/perf specs to specs/change (incremen…

bd1cb19

…tal change set) + address review (cli_ready_ms/Phase2 gate contract, purge stale 153k baseline)

docs(spec): align remaining AMR perf-map/verified-facts to first-call…

c31d2be

… cross-turn data (153k=context size not uncached)

lefarcen requested a review from nettee June 19, 2026 04:09

nettee reviewed Jun 19, 2026

View reviewed changes


		## Open questions

		- What exact signal should be the reliable "CLI ready" marker for each runtime? (claude_code / ACP / plain stdout differ) — this is the critical dependency for Phase 1.

Conversation

lefarcen commented Jun 18, 2026

Why

What users will see

What the data already shows (in the spec)

The plan

Surface area

Uh oh!

lefarcen commented Jun 18, 2026

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lefarcen commented Jun 18, 2026

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lefarcen commented Jun 18, 2026

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lefarcen commented Jun 18, 2026

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lefarcen commented Jun 18, 2026

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lefarcen commented Jun 18, 2026

Uh oh!

nettee left a comment

Choose a reason for hiding this comment

Uh oh!

nettee Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

nettee Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development