Skip to content

docs(spec): stability & performance — current state + fix/optimization plan#4504

Open
lefarcen wants to merge 16 commits into
mainfrom
spec/agent-startup-latency
Open

docs(spec): stability & performance — current state + fix/optimization plan#4504
lefarcen wants to merge 16 commits into
mainfrom
spec/agent-startup-latency

Conversation

@lefarcen

Copy link
Copy Markdown
Contributor

Why

Taking over the run-reliability/performance lane (#3408). The most-felt latency in the product is the ~10s of dead time before any token appears on an AMR or claude_code turn — and it repeats on every message, not just the first. This spec defines how we profile and fix it. Per the request, it is a plan with grounding data first, not code.

What users will see

Nothing yet — this is a checked-in spec under specs/current/. The eventual outcome is faster time-to-first-token on continued turns.

What the data already shows (in the spec)

The plan

  1. Phase 1 (observability only) — sub-instrument spawn_to_first_token into cli_ready_ms / session_init_ms / model_first_token_ms so we can tell cold-start re-init apart from context-growth and provider first-token latency.
  2. Phase 2 (experiments) — PostHog re-cut on the sub-segments + a local A/B (fresh vs resumed session), with a decision rule: proceed only if session_init_ms is a material share.
  3. Phase 3 (fix, gated on Phase 2)claude_code session resume (Claude Code adapter spawns a new session per message instead of resuming (token waste + lost edit state) #3380) / adopt fix: reuse ACP sessions per conversation #3535's ACP session reuse; validate via TTFT drop on continued turns, keeping correctness (no lost edit state).

Surface area

  • UI / CLI / API / contract / i18n / dependency — none
  • None — spec document only

The spec also documents why this line is out of scope for the #3545 QA gate: it removes dead time without changing model inputs/outputs, so there is no quality-regression surface (unlike token/context reduction, #3547).

Relates to #3408, #3380.

Proposes the plan for the ~10s pre-first-token latency users feel on every
AMR/claude_code turn. PostHog timing shows the entire cost lives in the opaque
spawn_to_first_token segment (AMR p50 10.4s / claude_code 8.2s), and a turn-
ordinal cut proves claude_code re-pays a full cold start every message (no
session reuse) — i.e. #3380. The spec defines sub-instrumentation to split that
segment (cli_ready / session_init / model_first_token), a data + local A/B
experiment phase to confirm session init dominates before any fix, and the
session-resume fix gated on that evidence. Notes why this line is out of scope
for the #3545 QA gate (no model-input/output change, no quality surface).

Relates to #3408, #3380.
@lefarcen

Copy link
Copy Markdown
Contributor Author

This is nicely grounded already — the PostHog split plus the turn-ordinal cut make the spec direction easy to follow.

One small PR-body follow-up before pool review: could you add a ## Validation section to the PR description summarizing what was checked for this spec PR (for example: doc-only change, no runtime behavior changed, no commands/tests run)? The spec file itself has the QA-boundary section already; this is just to make the PR template complete in the body.

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found three spec blockers in the changed ranges: the cli_ready_ms contract is still undefined across runtimes, the Phase 2 go/no-go rule is not yet testable as written, and the blanket QA-gate exemption conflicts with the spec's own correctness risk for session reuse.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two spec gaps need to be closed before this plan is safe to implement: the new timing fields need to stay on the existing PostHog/Langfuse contract path, and the QA-boundary section should not exempt Phase 3 session resume from correctness validation.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Add the mandatory Sources / 事实源 section (repo, branch, exact file:line
anchors for the timing pipeline and the #3380 session locus, how to pull, the
PostHog queries behind the tables, and access prereqs) so reviewers can ground-
check against real code. Add Why (用例/痛点 split), Goals/Non-goals,
Alternatives considered, Risks & mitigations, and Implementation slices per the
template. Optimization method unchanged: profile-first (sub-instrument
spawn_to_first_token) → confirm session_init dominates → session reuse (#3380).

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found three blocking spec gaps in the changed ranges: the Phase 1 cli_ready_ms contract is still undefined across runtimes, the Phase 2 go/no-go gate is not reproducible as written, and the QA-boundary section still exempts Phase 3 even though the same spec treats session resume as correctness-sensitive.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
@lefarcen

Copy link
Copy Markdown
Contributor Author

Hey @lefarcen@nettee's current-head review now covers the blocking work clearly: the cli_ready_ms contract still needs one auditable definition across runtimes, the Phase 2 go/no-go rule needs concrete pass criteria, and the QA-boundary section needs to narrow the exemption so Phase 3 keeps correctness validation in scope.

Once those points are addressed — plus the earlier ## Validation section in the PR body — this should be in much better shape for the next pass.

💡 To drive this PR to merge hands-free, paste this to your AI coding agent (Claude Code / Codex / opencode / Cursor …):
Take over nexu-io/open-design#4504 until it merges — read https://raw.githubusercontent.com/nexu-io/looper/main/skills/pr-takeover/SKILL.md and follow it.

…audit

Fold in the root cause of the low cache-hit rate, verified against latest
origin/main: (1) volatile system blocks (MCP list / memory / file list / run
context) sit in the prompt prefix BEFORE the stable append-only transcript
(server.ts:7670), so any change drifts the prefix and invalidates the whole
cache; (2) buildDaemonTranscript flattens history to content-only and drops the
native thinking/tool_use/tool_result structure, feeding a fresh session each
turn (daemon.ts:222); (3) the 5-min cache TTL expires between human turns. Split
the fix into a cheap prefix-stabilization win (3a, no gate) and the bigger
ACP-native session reuse (3b, needs vela session/load). Honest audit: the
PostHog cache data is real production-client telemetry; the earlier local
`claude -p` split bypassed the daemon and is dropped.

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in the changed range: the new Phase 3 plan now includes prompt-shape and session-lifecycle changes, but the QA-boundary section still treats this line as if it never changes model inputs or behavior. That mismatch would let the implementation skip the validation this repo already expects for prompt/session changes.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
@lefarcen

Copy link
Copy Markdown
Contributor Author

Quick follow-up on the new head: @nettee's latest blocking review has narrowed to one remaining issue — the QA-boundary section still over-exempts Phase 3 even though the current plan now includes prompt-shape and session-lifecycle changes.

If you tighten that section so Phase 1/2 can stay measurement-only while Phase 3 explicitly keeps targeted correctness validation in scope — and add the ## Validation section in the PR body — this should be ready for another pass.

…oduction

- Phase 3a guard: a structural invariant built on the existing prompt-stack
  telemetry (prompt-telemetry.ts per-section fingerprints + SECTION_PRIORITY).
  Classify each PromptTelemetrySectionKind STABLE vs VOLATILE; assert the
  cacheable-prefix fingerprint is byte-stable when only volatile inputs change,
  and that every kind has a classification — so a future feature that adds an
  unclassified prompt section fails the test until it declares its side. Makes
  prefix stability self-enforcing as features evolve, not a one-time reorder.
- Reproduction appendix: the exact PostHog HogQL queries behind every table,
  the Langfuse trace-fetch method, and an explicit warning not to extrapolate
  from a bare `claude -p` run (it bypasses the daemon).

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in the changed range: the QA-boundary section still exempts the whole line from #3545 even though the current Phase 3 plan now changes prompt ordering and cross-turn session state, which this repo already treats as correctness-sensitive.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Dedicated AMR optimization spec, root-caused to verified code: vela
acp_runtime.go advertises loadSession:false + single-session + no session/load
handler, so AMR cannot reuse a session and re-sends the flattened transcript
every turn, re-processing ~100-153k uncached tokens per turn. Two levers:
(A) session reuse — universal must-fix (vela: loadSession true + session/load +
session persistence; daemon: one long-lived ACP connection per conversation),
helps every caching model; (B) cache_control passthrough + stable prefix — only
for explicit-cache models (Claude/Gemini); auto-cache models (DeepSeek — AMR's
top model — and OpenAI) need only a stable prefix. Cross-user shared cache (AMR
shared backend) makes the universal turn-1 prefix TTL-immune; turn-2+ uses the
1h extended TTL. Expected: AMR continued-turn TTFT ~11s -> ~6-7s (~40%, ~90% of
runs) plus a token-cost cut that eases insufficient_balance (#4455). Includes the
per-model cache table, pricing model, regression guard, and reproduction queries.

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found four blocking spec gaps that still leave the implementation path ambiguous on the current head: Phase 1 still does not define one auditable cli_ready_ms contract, the Phase 2 go/no-go gate is not reproducible, and both specs still over-exempt prompt/session changes from the correctness validation this repo already expects.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/amr-latency-session-reuse-prompt-cache.md Outdated
@lefarcen

Copy link
Copy Markdown
Contributor Author

Quick pass on the new head: @nettee's current-head review now spans both spec files, and the remaining blockers are broader again — the cli_ready_ms contract still needs one auditable definition, the Phase 2 gate still needs reproducible pass criteria, and the QA-boundary / correctness-validation story now needs to stay aligned across both specs as Phase 3 expands.

If you tighten those points in the two spec docs — plus add the ## Validation section in the PR body — this should be in much better shape for the next review round.

Caching already runs at the Vela Link gateway (prompt_cache.go, ephemeral/no
TTL); no ACP cache_control passthrough needed. Corrections: DeepSeek read is
0.5x (vela billing), not 0.1x; AMR head models are DeepSeek/GLM/Gemini. Session
reuse is an architecture change (vela tears down opencode temp home per turn;
opencode durable session reload unverified), not a config flip. Reordered: Step
1 = enable 1h TTL + stable prefix at the gateway (smaller); Step 2 = session
reuse (a cross-repo project). Marks this optimization as a project, not low-
hanging fruit.

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found four blocking spec gaps that still leave the implementation path ambiguous on the current head: cli_ready_ms still lacks one auditable cross-runtime definition, the Phase 2 gate is still not reproducible, and both specs still over-exempt Phase 3 prompt/session changes from correctness validation.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/amr-latency-session-reuse-prompt-cache.md Outdated
@lefarcen

Copy link
Copy Markdown
Contributor Author

Quick follow-up on the current head: @nettee's blocking review is still pointing at the same four must-clarify spec points before implementation is safe to start — one auditable cli_ready_ms contract, a reproducible Phase 2 go/no-go rule, and narrower QA-boundary language in both specs so Phase 3 prompt/session changes keep explicit correctness validation in scope.

If you tighten those points in the two spec docs — plus add the ## Validation section in the PR body — this should be ready for another pass.

lefarcen added 3 commits June 18, 2026 17:44
…n plan

Master living doc for #3408. Part 1 (现状, reviewer background): how we measure
(product-view ~13.5% vs engineering-view ~7% failure rate; the raw ~22% is
inflated by user-action + old-version noise ~3x), the failure map (who's at
fault), the perf map (TTFT mostly model-first-byte + measurement artifacts), and
user-reach as a priority axis. Part 2 (计划): the full prioritized plan —
P0-a metric calibration (done), P0-b fix_config codex normalizer gap, P1
process_exit deepening + reduce_context total-budget truncation, P2 switch_model
catalog hygiene + TTFT metric semantics, P3 AMR latency (vela cross-repo
project). Plus the explicitly-excluded phantoms (auth preflight, bun install,
etc.) with rationale. Each item drills into its own spec.
…n_failed)

Combines the P0-b fix_config fix and the P1 process_exit/execution_failed
deepening into one spec under 'engineering-view failure reduction' (the ~7% we
can actually fix). Slice 1 = fix_config: codex writes service_tier="default"
(Langfuse-confirmed), but codex-config-normalize.ts only handles "priority" —
generalize to normalize any value not in {fast,flex}. Slice 2 = execution_failed
deepening (#4502 done the close-reason split; next is Langfuse-mining the
stream_error texts to add real classifier patterns). Slice 3 = the already-named
real bugs in process_exit (spawn_ebadf/eperm 149/wk, agent_protocol_error 255,
fabricated_role_marker 310). Grounded with the 7d process_exit breakdown + the
confirmed 'default' value + code anchors.
Align all four #3408 stability/performance specs with the repo's English
convention. Translation + light design-level restructure (via codex GPT-5.5,
human-verified): all numeric data, file:line anchors, IDs, SQL/command snippets,
and table structures preserved verbatim; only prose and table text translated.
Each doc now marked 'Design doc (human/reviewer-facing); implementation runbooks
per slice are written separately at build time.'
@lefarcen lefarcen changed the title docs(spec): agent startup latency profiling and session-reuse docs(spec): stability & performance — current state + fix/optimization plan Jun 18, 2026
lefarcen added 2 commits June 18, 2026 18:36
…/OpenAI/Gemini confirmed, DeepSeek lead model unverified)
lefarcen added 3 commits June 18, 2026 19:32
…achedReadTokens) + within-turn vs cross-turn cache framing
…events data (turn-2+ first call ~21% hit, ~24.5k uncached, ~12s)

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found three blocking spec issues that still make the implementation path ambiguous on the current head: Phase 1 still does not define one auditable cli_ready_ms contract, the Phase 2 gate is still not reproducible enough to drive a consistent go/no-go call, and the AMR acceptance section still points at the superseded ~153k metric instead of the corrected first-call baseline.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/agent-startup-latency-profiling.md Outdated
Comment thread specs/current/amr-latency-session-reuse-prompt-cache.md Outdated
@lefarcen

Copy link
Copy Markdown
Contributor Author

Hey @lefarcen@nettee's latest current-head review now pins the remaining blockers down pretty clearly: the Phase 1 cli_ready_ms contract still needs one auditable cross-runtime definition, the Phase 2 go/no-go rule still needs reproducible pass criteria, and the AMR acceptance text should switch off the superseded ~153k baseline to the corrected turn-2+ first-call cohort the spec now treats as authoritative.

Once those three points are tightened — plus the still-missing ## Validation section in the PR body — this should be in much better shape for the next pass.

lefarcen added 2 commits June 19, 2026 12:08
…tal change set) + address review (cli_ready_ms/Phase2 gate contract, purge stale 153k baseline)
… cross-turn data (153k=context size not uncached)
@lefarcen lefarcen requested a review from nettee June 19, 2026 04:09

@nettee nettee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of the new spec sections still contradict later decisions in the same document set, so I'd tighten those before anyone uses this PR as the implementation guide.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.


1. **Phase 1 — Add subsegment instrumentation to `spawn_to_first_token` (pure observability, ship first)**: Add two timestamps between `processSpawnedAt`→`firstTokenAt`: `cliReadyAt` and `sessionInitDoneAt`, and emit `cli_ready_ms` / `session_init_ms` / `model_first_token_ms` on `run_finished` (sum of the three ≈ `spawn_to_first_token_ms`; also emit `remainder` for audit). **`cli_ready_ms` contract — one auditable marker per runtime family (no per-owner ambiguity)**: `claude-stream-json` = first JSONL line parsed from stdout; `acp-json-rpc` (AMR/devin/hermes) = first well-formed ACP JSON-RPC message received; `json-event-stream` (codex/gemini/cursor) = first decoded stream event; `plain` = first non-empty stdout chunk. `sessionInitDoneAt` = the resume/`session/new` handshake ack for ACP, or the first model-bound request for stream agents. These markers are declared in one place (the runtime def) so two owners cannot record different events. The new fields ride the **existing shared analytics contract + Langfuse mirror** (same path as the other `run_finished` reliability fields — not a divergent surface). No behavior change.
2. **Phase 2 — Confirm experimentally**: Rerun turn-1 vs turn-2+ in PostHog with the new subsegments (expected: `session_init_ms` stays flat across turns = cold start repaid, `model_first_token_ms` grows with context); local A/B (same conversation fresh vs resumed, data injected only through production HTTP APIs). **Reproducible go/no-go gate**: population = successful `claude_code` runs, last 7d, turn-2+ only; statistic = **both p50 and p90** of `session_init_ms / spawn_to_first_token_ms`; threshold = enter Phase 3 if **p50 ≥ 30%** (or p90 ≥ 40%); local A/B = ≥ 30 paired fresh-vs-resumed runs. **Tie-break**: if the PostHog cut and the local A/B disagree, the PostHog production cut decides (A/B only sizes the effect). If `cli_ready_ms` dominates instead, choose a warmed process pool; otherwise classify as provider-side and stop with a recorded conclusion.
3. **Phase 3a — Stabilize cacheable prefix (cheap, independent, requires QA/state-continuity gate)**: Move or freeze volatile system blocks (file list / MCP list / personal memory / run context) from **before** the stable transcript to after it, so the append-only transcript becomes a prefix that can really hit; confirm vela/opencode gateway passes through `cache_control`. Prefix order is model input, so this gives immediate benefit (lower cache-miss rate) only after quality/state-continuity acceptance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Phase 3a step still says to "confirm vela/opencode gateway passes through cache_control", but the sibling AMR spec in this same PR has already retired that premise: amr-latency-session-reuse-prompt-cache.md:162-166 says Vela Link already injects cache directives and that the remaining gap is TTL support / provider coverage, not ACP passthrough. Leaving both versions in the checked-in plan matters because one implementer can read this file and spend time on a transport path the other file has already ruled out, while another implementer follows the AMR feasibility section and skips it. Please align this file to the later feasibility result, for example by replacing the passthrough check with the real open work (1h TTL where supported, explicit-cache coverage, and stable-prefix ordering), or by scoping the passthrough note to non-Vela paths only.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.


## Open questions

- What exact signal should be the reliable "CLI ready" marker for each runtime? (claude_code / ACP / plain stdout differ) — this is the critical dependency for Phase 1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Phase 1 contract at line 86 now defines exact per-runtime cli_ready_ms markers, but this open-question bullet still says the exact signal is unresolved. That contradiction matters because it re-opens one of the main blockers this PR already closed: an implementer or reviewer can point at this section and treat the contract as unsettled even though the earlier section now makes it auditable. Please either remove this bullet or narrow it to the remaining real question, such as whether any adapter family still needs a null/timing_unavailable_reason fallback because the defined marker cannot actually be observed in practice.

🔁 Powered by Looper · runner=reviewer · agent=codex · An autonomous AI dev team for your GitHub repos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

risk/medium Medium risk: regular code changes size/M PR changes 100-300 lines type/docs Documentation changes only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants