docs(spec): stability & performance — current state + fix/optimization plan#4504
docs(spec): stability & performance — current state + fix/optimization plan#4504lefarcen wants to merge 16 commits into
Conversation
Proposes the plan for the ~10s pre-first-token latency users feel on every AMR/claude_code turn. PostHog timing shows the entire cost lives in the opaque spawn_to_first_token segment (AMR p50 10.4s / claude_code 8.2s), and a turn- ordinal cut proves claude_code re-pays a full cold start every message (no session reuse) — i.e. #3380. The spec defines sub-instrumentation to split that segment (cli_ready / session_init / model_first_token), a data + local A/B experiment phase to confirm session init dominates before any fix, and the session-resume fix gated on that evidence. Notes why this line is out of scope for the #3545 QA gate (no model-input/output change, no quality surface). Relates to #3408, #3380.
|
This is nicely grounded already — the PostHog split plus the turn-ordinal cut make the spec direction easy to follow. One small PR-body follow-up before pool review: could you add a |
nettee
left a comment
There was a problem hiding this comment.
I found three spec blockers in the changed ranges: the cli_ready_ms contract is still undefined across runtimes, the Phase 2 go/no-go rule is not yet testable as written, and the blanket QA-gate exemption conflicts with the spec's own correctness risk for session reuse.
nettee
left a comment
There was a problem hiding this comment.
Two spec gaps need to be closed before this plan is safe to implement: the new timing fields need to stay on the existing PostHog/Langfuse contract path, and the QA-boundary section should not exempt Phase 3 session resume from correctness validation.
🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.Add the mandatory Sources / 事实源 section (repo, branch, exact file:line anchors for the timing pipeline and the #3380 session locus, how to pull, the PostHog queries behind the tables, and access prereqs) so reviewers can ground- check against real code. Add Why (用例/痛点 split), Goals/Non-goals, Alternatives considered, Risks & mitigations, and Implementation slices per the template. Optimization method unchanged: profile-first (sub-instrument spawn_to_first_token) → confirm session_init dominates → session reuse (#3380).
nettee
left a comment
There was a problem hiding this comment.
I found three blocking spec gaps in the changed ranges: the Phase 1 cli_ready_ms contract is still undefined across runtimes, the Phase 2 go/no-go gate is not reproducible as written, and the QA-boundary section still exempts Phase 3 even though the same spec treats session resume as correctness-sensitive.
|
Hey @lefarcen — @nettee's current-head review now covers the blocking work clearly: the Once those points are addressed — plus the earlier
|
…audit Fold in the root cause of the low cache-hit rate, verified against latest origin/main: (1) volatile system blocks (MCP list / memory / file list / run context) sit in the prompt prefix BEFORE the stable append-only transcript (server.ts:7670), so any change drifts the prefix and invalidates the whole cache; (2) buildDaemonTranscript flattens history to content-only and drops the native thinking/tool_use/tool_result structure, feeding a fresh session each turn (daemon.ts:222); (3) the 5-min cache TTL expires between human turns. Split the fix into a cheap prefix-stabilization win (3a, no gate) and the bigger ACP-native session reuse (3b, needs vela session/load). Honest audit: the PostHog cache data is real production-client telemetry; the earlier local `claude -p` split bypassed the daemon and is dropped.
nettee
left a comment
There was a problem hiding this comment.
I found one blocking issue in the changed range: the new Phase 3 plan now includes prompt-shape and session-lifecycle changes, but the QA-boundary section still treats this line as if it never changes model inputs or behavior. That mismatch would let the implementation skip the validation this repo already expects for prompt/session changes.
🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.|
Quick follow-up on the new head: @nettee's latest blocking review has narrowed to one remaining issue — the QA-boundary section still over-exempts Phase 3 even though the current plan now includes prompt-shape and session-lifecycle changes. If you tighten that section so Phase 1/2 can stay measurement-only while Phase 3 explicitly keeps targeted correctness validation in scope — and add the |
…oduction - Phase 3a guard: a structural invariant built on the existing prompt-stack telemetry (prompt-telemetry.ts per-section fingerprints + SECTION_PRIORITY). Classify each PromptTelemetrySectionKind STABLE vs VOLATILE; assert the cacheable-prefix fingerprint is byte-stable when only volatile inputs change, and that every kind has a classification — so a future feature that adds an unclassified prompt section fails the test until it declares its side. Makes prefix stability self-enforcing as features evolve, not a one-time reorder. - Reproduction appendix: the exact PostHog HogQL queries behind every table, the Langfuse trace-fetch method, and an explicit warning not to extrapolate from a bare `claude -p` run (it bypasses the daemon).
nettee
left a comment
There was a problem hiding this comment.
I found one blocking issue in the changed range: the QA-boundary section still exempts the whole line from #3545 even though the current Phase 3 plan now changes prompt ordering and cross-turn session state, which this repo already treats as correctness-sensitive.
Dedicated AMR optimization spec, root-caused to verified code: vela acp_runtime.go advertises loadSession:false + single-session + no session/load handler, so AMR cannot reuse a session and re-sends the flattened transcript every turn, re-processing ~100-153k uncached tokens per turn. Two levers: (A) session reuse — universal must-fix (vela: loadSession true + session/load + session persistence; daemon: one long-lived ACP connection per conversation), helps every caching model; (B) cache_control passthrough + stable prefix — only for explicit-cache models (Claude/Gemini); auto-cache models (DeepSeek — AMR's top model — and OpenAI) need only a stable prefix. Cross-user shared cache (AMR shared backend) makes the universal turn-1 prefix TTL-immune; turn-2+ uses the 1h extended TTL. Expected: AMR continued-turn TTFT ~11s -> ~6-7s (~40%, ~90% of runs) plus a token-cost cut that eases insufficient_balance (#4455). Includes the per-model cache table, pricing model, regression guard, and reproduction queries.
nettee
left a comment
There was a problem hiding this comment.
I found four blocking spec gaps that still leave the implementation path ambiguous on the current head: Phase 1 still does not define one auditable cli_ready_ms contract, the Phase 2 go/no-go gate is not reproducible, and both specs still over-exempt prompt/session changes from the correctness validation this repo already expects.
|
Quick pass on the new head: @nettee's current-head review now spans both spec files, and the remaining blockers are broader again — the If you tighten those points in the two spec docs — plus add the |
Caching already runs at the Vela Link gateway (prompt_cache.go, ephemeral/no TTL); no ACP cache_control passthrough needed. Corrections: DeepSeek read is 0.5x (vela billing), not 0.1x; AMR head models are DeepSeek/GLM/Gemini. Session reuse is an architecture change (vela tears down opencode temp home per turn; opencode durable session reload unverified), not a config flip. Reordered: Step 1 = enable 1h TTL + stable prefix at the gateway (smaller); Step 2 = session reuse (a cross-repo project). Marks this optimization as a project, not low- hanging fruit.
nettee
left a comment
There was a problem hiding this comment.
I found four blocking spec gaps that still leave the implementation path ambiguous on the current head: cli_ready_ms still lacks one auditable cross-runtime definition, the Phase 2 gate is still not reproducible, and both specs still over-exempt Phase 3 prompt/session changes from correctness validation.
|
Quick follow-up on the current head: @nettee's blocking review is still pointing at the same four must-clarify spec points before implementation is safe to start — one auditable If you tighten those points in the two spec docs — plus add the |
…n plan Master living doc for #3408. Part 1 (现状, reviewer background): how we measure (product-view ~13.5% vs engineering-view ~7% failure rate; the raw ~22% is inflated by user-action + old-version noise ~3x), the failure map (who's at fault), the perf map (TTFT mostly model-first-byte + measurement artifacts), and user-reach as a priority axis. Part 2 (计划): the full prioritized plan — P0-a metric calibration (done), P0-b fix_config codex normalizer gap, P1 process_exit deepening + reduce_context total-budget truncation, P2 switch_model catalog hygiene + TTFT metric semantics, P3 AMR latency (vela cross-repo project). Plus the explicitly-excluded phantoms (auth preflight, bun install, etc.) with rationale. Each item drills into its own spec.
…n_failed)
Combines the P0-b fix_config fix and the P1 process_exit/execution_failed
deepening into one spec under 'engineering-view failure reduction' (the ~7% we
can actually fix). Slice 1 = fix_config: codex writes service_tier="default"
(Langfuse-confirmed), but codex-config-normalize.ts only handles "priority" —
generalize to normalize any value not in {fast,flex}. Slice 2 = execution_failed
deepening (#4502 done the close-reason split; next is Langfuse-mining the
stream_error texts to add real classifier patterns). Slice 3 = the already-named
real bugs in process_exit (spawn_ebadf/eperm 149/wk, agent_protocol_error 255,
fabricated_role_marker 310). Grounded with the 7d process_exit breakdown + the
confirmed 'default' value + code anchors.
Align all four #3408 stability/performance specs with the repo's English convention. Translation + light design-level restructure (via codex GPT-5.5, human-verified): all numeric data, file:line anchors, IDs, SQL/command snippets, and table structures preserved verbatim; only prose and table text translated. Each doc now marked 'Design doc (human/reviewer-facing); implementation runbooks per slice are written separately at build time.'
…MR step 0/1/2, QA-gate corrections)
…/OpenAI/Gemini confirmed, DeepSeek lead model unverified)
…achedReadTokens) + within-turn vs cross-turn cache framing
…events data (turn-2+ first call ~21% hit, ~24.5k uncached, ~12s)
…R with production cross-turn cache data
nettee
left a comment
There was a problem hiding this comment.
I found three blocking spec issues that still make the implementation path ambiguous on the current head: Phase 1 still does not define one auditable cli_ready_ms contract, the Phase 2 gate is still not reproducible enough to drive a consistent go/no-go call, and the AMR acceptance section still points at the superseded ~153k metric instead of the corrected first-call baseline.
|
Hey @lefarcen — @nettee's latest current-head review now pins the remaining blockers down pretty clearly: the Phase 1 Once those three points are tightened — plus the still-missing |
…tal change set) + address review (cli_ready_ms/Phase2 gate contract, purge stale 153k baseline)
… cross-turn data (153k=context size not uncached)
nettee
left a comment
There was a problem hiding this comment.
A couple of the new spec sections still contradict later decisions in the same document set, so I'd tighten those before anyone uses this PR as the implementation guide.
🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.|
|
||
| 1. **Phase 1 — Add subsegment instrumentation to `spawn_to_first_token` (pure observability, ship first)**: Add two timestamps between `processSpawnedAt`→`firstTokenAt`: `cliReadyAt` and `sessionInitDoneAt`, and emit `cli_ready_ms` / `session_init_ms` / `model_first_token_ms` on `run_finished` (sum of the three ≈ `spawn_to_first_token_ms`; also emit `remainder` for audit). **`cli_ready_ms` contract — one auditable marker per runtime family (no per-owner ambiguity)**: `claude-stream-json` = first JSONL line parsed from stdout; `acp-json-rpc` (AMR/devin/hermes) = first well-formed ACP JSON-RPC message received; `json-event-stream` (codex/gemini/cursor) = first decoded stream event; `plain` = first non-empty stdout chunk. `sessionInitDoneAt` = the resume/`session/new` handshake ack for ACP, or the first model-bound request for stream agents. These markers are declared in one place (the runtime def) so two owners cannot record different events. The new fields ride the **existing shared analytics contract + Langfuse mirror** (same path as the other `run_finished` reliability fields — not a divergent surface). No behavior change. | ||
| 2. **Phase 2 — Confirm experimentally**: Rerun turn-1 vs turn-2+ in PostHog with the new subsegments (expected: `session_init_ms` stays flat across turns = cold start repaid, `model_first_token_ms` grows with context); local A/B (same conversation fresh vs resumed, data injected only through production HTTP APIs). **Reproducible go/no-go gate**: population = successful `claude_code` runs, last 7d, turn-2+ only; statistic = **both p50 and p90** of `session_init_ms / spawn_to_first_token_ms`; threshold = enter Phase 3 if **p50 ≥ 30%** (or p90 ≥ 40%); local A/B = ≥ 30 paired fresh-vs-resumed runs. **Tie-break**: if the PostHog cut and the local A/B disagree, the PostHog production cut decides (A/B only sizes the effect). If `cli_ready_ms` dominates instead, choose a warmed process pool; otherwise classify as provider-side and stop with a recorded conclusion. | ||
| 3. **Phase 3a — Stabilize cacheable prefix (cheap, independent, requires QA/state-continuity gate)**: Move or freeze volatile system blocks (file list / MCP list / personal memory / run context) from **before** the stable transcript to after it, so the append-only transcript becomes a prefix that can really hit; confirm vela/opencode gateway passes through `cache_control`. Prefix order is model input, so this gives immediate benefit (lower cache-miss rate) only after quality/state-continuity acceptance. |
There was a problem hiding this comment.
This Phase 3a step still says to "confirm vela/opencode gateway passes through cache_control", but the sibling AMR spec in this same PR has already retired that premise: amr-latency-session-reuse-prompt-cache.md:162-166 says Vela Link already injects cache directives and that the remaining gap is TTL support / provider coverage, not ACP passthrough. Leaving both versions in the checked-in plan matters because one implementer can read this file and spend time on a transport path the other file has already ruled out, while another implementer follows the AMR feasibility section and skips it. Please align this file to the later feasibility result, for example by replacing the passthrough check with the real open work (1h TTL where supported, explicit-cache coverage, and stable-prefix ordering), or by scoping the passthrough note to non-Vela paths only.
|
|
||
| ## Open questions | ||
|
|
||
| - What exact signal should be the reliable "CLI ready" marker for each runtime? (claude_code / ACP / plain stdout differ) — this is the critical dependency for Phase 1. |
There was a problem hiding this comment.
The Phase 1 contract at line 86 now defines exact per-runtime cli_ready_ms markers, but this open-question bullet still says the exact signal is unresolved. That contradiction matters because it re-opens one of the main blockers this PR already closed: an implementer or reviewer can point at this section and treat the contract as unsettled even though the earlier section now makes it auditable. Please either remove this bullet or narrow it to the remaining real question, such as whether any adapter family still needs a null/timing_unavailable_reason fallback because the defined marker cannot actually be observed in practice.
Why
Taking over the run-reliability/performance lane (#3408). The most-felt latency in the product is the ~10s of dead time before any token appears on an AMR or
claude_codeturn — and it repeats on every message, not just the first. This spec defines how we profile and fix it. Per the request, it is a plan with grounding data first, not code.What users will see
Nothing yet — this is a checked-in spec under
specs/current/. The eventual outcome is faster time-to-first-token on continued turns.What the data already shows (in the spec)
spawn_to_first_tokensegment (AMR p50 10.4s / claude_code 8.2s; p90 24–34s). Queue + spawn + prompt-build are ~0.5s combined.claude_codere-pays a full cold start every message (turn 1 = 8.2s, turn 2+ = 9.3s — no reuse benefit), which is exactly Claude Code adapter spawns a new session per message instead of resuming (token waste + lost edit state) #3380. AMR shows only marginal reuse.The plan
spawn_to_first_tokenintocli_ready_ms/session_init_ms/model_first_token_msso we can tell cold-start re-init apart from context-growth and provider first-token latency.session_init_msis a material share.claude_codesession resume (Claude Code adapter spawns a new session per message instead of resuming (token waste + lost edit state) #3380) / adopt fix: reuse ACP sessions per conversation #3535's ACP session reuse; validate via TTFT drop on continued turns, keeping correctness (no lost edit state).Surface area
The spec also documents why this line is out of scope for the #3545 QA gate: it removes dead time without changing model inputs/outputs, so there is no quality-regression surface (unlike token/context reduction, #3547).
Relates to #3408, #3380.