[P1] validate_setup must fail fast when claude-code main/fallback provider can't spawn in a nested Claude session #12

@anombyte93

Problem

When the main (or fallback) model role is configured with the claude-code provider and the engine runs inside a Claude Code / MCP session, validate_setup (the SETUP gate, namespaced mcp__atlas-engine__validate_setup) returns ready=true and the pipeline advances SETUP -> DISCOVER. The failure surfaces only much later, at parse_prd, as an opaque "Claude Code process exited with code 1" error with empty stderr. The mechanism: claude-code is a CLI-borrowing provider. For the model-spawning roles it shells out to claude -p "<prompt>", and a nested Claude session refuses the recursive launch because the child inherits CLAUDECODE=1. Setup should fail fast here, with a clear diagnostic and a remediation, before any spawn is attempted.
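
The inheritance step in that mechanism can be illustrated with a minimal sketch. It assumes nothing about the claude CLI itself: a child Python interpreter stands in for the spawned process, and child_sees is a hypothetical helper name used only for this illustration.

```python
import os
import subprocess
import sys


def child_sees(var: str, env: dict) -> str:
    """Spawn a child interpreter with `env` and return its value of `var` ("" if unset)."""
    code = f"import os; print(os.environ.get({var!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Simulate a parent that is itself a Claude Code session.
nested_env = {**os.environ, "CLAUDECODE": "1"}
inherited = child_sees("CLAUDECODE", nested_env)

# Contrast: the same spawn from an environment without the variable.
plain_env = {k: v for k, v in nested_env.items() if k != "CLAUDECODE"}
not_inherited = child_sees("CLAUDECODE", plain_env)
```

The child receives whatever the parent passes down, so a claude -p launched by task-master from inside a session would start with CLAUDECODE already set.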

Current Behavior

validate_setup() lives at mcp-server/capabilities.py:298-450 and runs EXACTLY 6 checks: binary, version, project, config, provider_main, provider_research.

  • Check 5 (provider_main, lines 386-413) parses .taskmaster/config.json (loaded at line 393) and reads only models.<role>.modelId:
    • main_model = models.get("main", {}).get("modelId") (line 395)
    • research_model = models.get("research", {}).get("modelId") (line 396)
    • fallback_model = models.get("fallback", {}).get("modelId") (line 397)
    • provider_ok = bool(main_model) (line 398)
  • It never reads models.main.provider. A config with provider="claude-code" + modelId="sonnet" passes provider_main (modelId is truthy), so critical_failures=0 and ready=true. The word "provider" in the check id is cosmetic — no provider VALUE is inspected.
  • There is zero env-var inspection anywhere in the plugin: a grep over mcp-server/ finds no reference to CLAUDECODE or CLAUDE_CODE_CHILD_SESSION, and capabilities.py does not even import os. The only os.environ touch is taskmaster.py:14 (env = os.environ.copy() to strip TASK_MASTER_PROJECT_ROOT before spawning the CLI — unrelated). No "am I a nested Claude session?" detection exists.
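
The gap described in the list above can be reproduced in isolation. This is a hedged sketch, not the code in capabilities.py: provider_main_passes is a hypothetical stand-in that mirrors only the modelId reads quoted above, and it assumes the config keeps its roles under a top-level models key.

```python
def provider_main_passes(config: dict) -> bool:
    """Mirror the quoted Check 5 logic: only models.main.modelId is consulted."""
    models = config.get("models", {})
    main_model = models.get("main", {}).get("modelId")
    return bool(main_model)


# The provider value is present in the config but never inspected.
config = {"models": {"main": {"provider": "claude-code", "modelId": "sonnet"}}}
passes = provider_main_passes(config)

# Only a missing modelId makes the check fail.
fails_without_model = provider_main_passes(
    {"models": {"main": {"provider": "claude-code"}}}
)
```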

preflight() (mcp-server/pipeline.py:179-234, namespaced mcp__atlas-engine__engine_preflight) reads only pipeline.json and the taskmaster tag state, and recommends parse_prd (lines 205-206). It performs no provider or env inspection and does not load config.json.

The SETUP skill exit gate (skills/setup/SKILL.md, Step 4 probe + Exit gate) advances SETUP -> DISCOVER on validate_setup readiness, passing the result dict as evidence — so the bad config flows straight through to parse_prd. parse_prd's actual claude -p spawn is in the external task-master-ai CLI, not in this repo, so the fix cannot live at the spawn site; it must be a fail-fast refusal in the SETUP gate.

(Naming note: there is no Python function literally named engine_preflight. The MCP tools are preflight (pipeline.py) and validate_setup (capabilities.py); the atlas-engine runtime namespaces them as engine_preflight / validate_setup.)

Expected Behavior

When the main or fallback role uses a CLI-spawning provider (claude-code and siblings) AND a nested-Claude environment signal is present (CLAUDECODE and/or CLAUDE_CODE_CHILD_SESSION set), validate_setup must emit a critical (non-warning) check that fails. This flips ready=false via the existing critical_failures aggregation (lines 432-436) and blocks the SETUP -> DISCOVER advance. The check detail must explain the recursive-spawn refusal, and the fix must steer to a non-spawning provider (anthropic/perplexity) or to running from a plain shell outside Claude Code — never to --set-main claude-code. The research role (plain-HTTP, e.g. perplexity) must NOT trip the guard.
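
How a single failed critical check flips readiness can be sketched as follows. The real aggregation at lines 432-436 is not reproduced in this issue, so this is an assumption about its shape: summarize is a hypothetical name, and the "warning" key is an assumed way of marking non-critical checks.

```python
def summarize(checks: list) -> dict:
    """Count failed non-warning checks; ready only when there are none."""
    critical_failures = sum(
        1
        for check in checks
        if not check["passed"] and not check.get("warning", False)  # "warning" key is assumed
    )
    return {
        "ready": critical_failures == 0,
        "critical_failures": critical_failures,
        "checks": checks,
    }


# provider_main still passes on modelId, but the new critical check fails.
blocked = summarize([
    {"id": "provider_main", "passed": True},
    {"id": "provider_spawnable", "passed": False},
])

# A failed warning-level check alone would not block the gate.
warned = summarize([{"id": "provider_research", "passed": False, "warning": True}])
```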

Files to Touch

  • mcp-server/capabilities.py — PRIMARY. In/near Check 5 (after config is loaded at line 393), read models.main.provider and models.fallback.provider and check os.environ for CLAUDECODE / CLAUDE_CODE_CHILD_SESSION. Add a new critical check provider_spawnable (keep the 6 existing ids intact). Add import os at the top (lines 11-15) — it is not currently imported.
  • mcp-server/server.py — update the validate_setup tool docstring (line 73) from "Run the 6 Phase-0 SETUP checks" to 7 if a new check id is added.
  • tests/test_capabilities.py — update the hard-coded id list at line 183 (["binary","version","project","config","provider_main","provider_research"]) to include the new id.
  • mcp-server/pipeline.py — OPTIONAL secondary. preflight()'s recommended_action ladder (lines 203-219) could short-circuit parse_prd (line 205-206) via a shared helper for callers that bypass the SETUP gate. Only via a shared helper to avoid duplicating the config read.
  • skills/setup/SKILL.md — OPTIONAL doc note that a nested-Claude claude-code provider is a hard stop (the gate already blocks automatically once the critical check exists).

Researched Fix Approaches

1. [Recommended] — Env+config fail-fast inside validate_setup() (confidence: 92%)

  • Library/Config: task-master-ai@0.43.1 config key models.<role>.provider; Python os.environ (stdlib)
  • Pattern: After config.json loads (capabilities.py:393), read models.main.provider / models.fallback.provider (NOT modelId). Define a spawning-provider set {"claude-code","codex-cli","gemini-cli"}. Compute nested = bool(os.environ.get("CLAUDECODE") or os.environ.get("CLAUDE_CODE_CHILD_SESSION")). If a spawning provider is the main OR fallback role AND nested, append a NEW non-warning check provider_spawnable with passed=False. Leave provider_main's modelId logic intact (the 6 ids keep meaning); ADD the 7th. The existing critical_failures aggregation (lines 432-436) flips ready=false and the SETUP skill gate blocks before parse_prd spawns. Exclude the research role so perplexity never trips.
  • Why: Lowest-surface change — validate_setup() already owns the config.json parse and the critical_failures contract, and it auto-blocks the skill gate with no skill edit required.
  • Risk: tests/test_capabilities.py:183 hard-codes the 6-id list and server.py:73 says "6 Phase-0 SETUP checks" — both must update to 7. tests/test_integration.py asserts no fix string contains --set-main claude-code; the new fix must steer to anthropic/perplexity or a non-Claude shell, NOT --claude-code. CLAUDE_CODE_ENTRYPOINT alone can be sdk/mcp in non-nested contexts, so gate on CLAUDECODE / CLAUDE_CODE_CHILD_SESSION as primary signals and treat ENTRYPOINT as corroborating only.
  • Implementation hint:
import os  # not currently imported by capabilities.py

# Fragment for validate_setup(); `models` and `checks` come from the existing Check 5 scope.
SPAWNING_PROVIDERS = {"claude-code", "codex-cli", "gemini-cli"}
nested = bool(os.environ.get("CLAUDECODE") or os.environ.get("CLAUDE_CODE_CHILD_SESSION"))
main_provider = models.get("main", {}).get("provider")
fb_provider = models.get("fallback", {}).get("provider")
# Name every offending role so a fallback-only failure is not reported as "main".
offenders = [
    f"{role} provider '{provider}'"
    for role, provider in (("main", main_provider), ("fallback", fb_provider))
    if provider in SPAWNING_PROVIDERS
]
bad = nested and bool(offenders)
checks.append({
    "id": "provider_spawnable",
    "name": "Main/fallback provider can spawn in this environment",
    "passed": not bad,
    "detail": (
        f"{' and '.join(offenders)} must spawn a CLI, but a nested Claude Code "
        "session was detected (CLAUDECODE/CLAUDE_CODE_CHILD_SESSION set); recursive "
        "spawn is refused (exit 1, empty stderr)"
        if bad else f"main={main_provider}, fallback={fb_provider}; nested={nested}"
    ),
    "fix": (
        "Switch the main/fallback role off claude-code: set ANTHROPIC_API_KEY and run "
        "`task-master models --set-main sonnet --anthropic`, OR run task-master from a "
        "plain shell outside any Claude Code session."
        if bad else None
    ),
})
# verified env dump from a real nested run: CLAUDECODE=1, CLAUDE_CODE_CHILD_SESSION=1

2. [Alternative] — Add the guard to preflight() (pipeline.py) for an earlier refusal (confidence: 70%)

  • Library/Config: same key models.<role>.provider; os.environ (stdlib)
  • Pattern: Add the nested+spawning-provider detection to preflight()'s recommended_action ladder (pipeline.py:203-219). Before recommending parse_prd (lines 205-206), return recommended_action="fix_provider" with the diagnostic, so agents that call engine_preflight directly (bypassing the SETUP gate) still fail fast.
  • Why: preflight() is the tool literally named in the bug and is the recommender that today points at parse_prd (including into a polluted prior tag that already holds done tasks). Belt-and-suspenders with Approach 1.
  • Risk: preflight() does NOT currently load .taskmaster/config.json (only pipeline.json + tasks/state), so adding the check here duplicates the config read — DRY violation. Must be factored as a shared helper (e.g. spawn_block_reason(models) -> str | None in capabilities.py) imported by both. Do this IN ADDITION to Approach 1, not instead of it.
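
One possible shape for the shared helper named in the Risk bullet is sketched below. The function name comes from the bullet; the optional env parameter and the NESTED_SIGNALS constant are additions assumed here so the helper can be tested without touching the real environment.

```python
import os
from typing import Mapping, Optional

SPAWNING_PROVIDERS = frozenset({"claude-code", "codex-cli", "gemini-cli"})
NESTED_SIGNALS = ("CLAUDECODE", "CLAUDE_CODE_CHILD_SESSION")


def spawn_block_reason(
    models: Mapping, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return a diagnostic if main/fallback would spawn a CLI in a nested session, else None."""
    env = os.environ if env is None else env
    signals = [name for name in NESTED_SIGNALS if env.get(name)]
    if not signals:
        return None  # not nested: spawning providers are fine

    offenders = []
    for role in ("main", "fallback"):  # research is deliberately excluded
        provider = (models.get(role) or {}).get("provider")
        if isinstance(provider, str) and provider in SPAWNING_PROVIDERS:
            offenders.append(f"{role}={provider}")
    if not offenders:
        return None

    return (
        f"{', '.join(offenders)} must spawn a CLI, but a nested Claude Code session "
        f"was detected ({', '.join(signals)} set); the recursive spawn is refused"
    )
```

Under this sketch, validate_setup() would turn a non-None reason into the failed provider_spawnable check, and preflight() would return recommended_action="fix_provider" with the same string, so the config-and-env logic lives in one place.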

3. [Fallback] — Active spawn-probe: actually run claude -p "ok" (confidence: 58%)

  • Library/Config: the claude CLI -p/--print non-interactive flag (task-master-ai exposes NO doctor/health/test/dry-run command — verified against installed v0.43.1)
  • Pattern: When main/fallback provider is claude-code, run subprocess.run(["claude","-p","ping"], capture_output=True, text=True, timeout=10) inheriting the current env, mirroring what task-master will do; passed = rc == 0.
  • Why: Strictly more accurate than env-sniffing — also catches missing CLI, version mismatch, broken auth.
  • Risk: The probe ITSELF spawns claude -p inside the nested session — the very thing that crashes — and reports say a nested launch "will crash all active sessions", so probing could destabilise the parent. Adds latency/cost to every validate_setup. Only safe as a NON-nested augmentation: never probe when CLAUDECODE/CLAUDE_CODE_CHILD_SESSION is set (fail immediately per Approach 1); probe only to catch auth/version failures in a non-nested shell. Keep as a follow-on.
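
The "never probe when nested" rule from the Risk bullet could look like the sketch below. It is an illustration under assumptions: probe_claude_cli and its injectable run parameter are hypothetical, the returned dict shape is invented for this example, and the command is the one quoted in the Pattern bullet.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional

NESTED_SIGNALS = ("CLAUDECODE", "CLAUDE_CODE_CHILD_SESSION")


def probe_claude_cli(
    env: Optional[Mapping[str, str]] = None,
    run: Callable = subprocess.run,
    timeout: float = 10.0,
) -> dict:
    """Probe the claude CLI only outside a nested session; never spawn when nested."""
    env = os.environ if env is None else env
    if any(env.get(name) for name in NESTED_SIGNALS):
        # Fail immediately without spawning: the probe would hit the same refusal.
        return {"probed": False, "passed": False,
                "detail": "nested Claude Code session detected; probe skipped"}
    try:
        result = run(["claude", "-p", "ping"], capture_output=True, text=True,
                     timeout=timeout, env=dict(env))
    except FileNotFoundError:
        return {"probed": True, "passed": False, "detail": "claude CLI not found on PATH"}
    except subprocess.TimeoutExpired:
        return {"probed": True, "passed": False,
                "detail": f"claude -p did not finish within {timeout}s"}
    return {"probed": True, "passed": result.returncode == 0,
            "detail": f"exit code {result.returncode}"}
```

Injecting run keeps the probe testable without a real CLI, and the early return guarantees the nested case costs no process launch.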

Reference

How task-master-ai (claude-task-master by eyaltoledano) handles this today: it does NOT detect or guard against it at all. The claude-code provider is CLI-borrowing: for the main/fallback roles it spawns claude -p "<prompt>" as a child process to reuse the authenticated session (no HTTP/API call), which is why the perplexity research role (plain HTTP) is unaffected.

There is NO provider-callability test surface, verified against installed task-master-ai@0.43.1: task-master --help exposes no doctor/health/validate/test/dry-run command (only an unrelated validate-dependencies), and task-master models --help exposes only --set-main/--set-research/--set-fallback/--setup plus per-provider allow flags, with no --test/--dry-run.

The failure is a known, OPEN upstream bug: eyaltoledano/claude-task-master#1509 ("Claude Code and Codex CLI Provider Failures in parse-prd": exit code 1, AI_APICallError, "fails silently", un-root-caused) and #928. The mechanism is documented in anthropics/claude-agent-sdk-python#573 ("Subprocess inherits CLAUDECODE=1 env var, preventing SDK usage from Claude Code hooks/plugins") and in the Claude Code 2 nested-session guard ("Claude Code cannot be launched inside another Claude Code session... unset the CLAUDECODE environment variable").

Upstream's only workaround is to unset CLAUDECODE at the CLI spawn site, which is external to this plugin, so the correct in-repo fix is a fail-fast preflight refusal. Peers (claude-agent-sdk, clay #161, paperclip #560) all hit the identical nesting wall and handle it by detecting the env var and erroring early, which is exactly Approach 1.

Acceptance Criteria

  • When .taskmaster/config.json has models.main.provider == "claude-code" AND CLAUDECODE=1 (or CLAUDE_CODE_CHILD_SESSION=1) is in the environment, validate_setup() returns ready=false with critical_failures >= 1 and a failed check whose detail names the nested-Claude / recursive-spawn refusal.
  • The same condition with models.fallback.provider == "claude-code" (main set to anthropic) also fails the new check.
  • With NO nested-Claude env signal (CLAUDECODE and CLAUDE_CODE_CHILD_SESSION unset), a claude-code main provider passes the new check (ready=true if all other checks pass) — the guard fires only inside a nested session.
  • A research role using a non-spawning provider (e.g. perplexity) never trips the new check, regardless of nested env.
  • The remediation fix string for the failing check does NOT contain --set-main claude-code / --set-research claude-code / --set-fallback claude-code (passes tests/test_integration.py regression guard) and steers to anthropic/perplexity or a non-Claude shell.
  • tests/test_capabilities.py:183 id-list assertion is updated and passes; server.py validate_setup docstring count matches the new number of checks.
  • The SETUP -> DISCOVER advance in skills/setup/SKILL.md is blocked (advance_phase not called / fails its evidence gate) when the new critical check fails.
  • With a non-spawning provider configured (e.g. anthropic main), the end-to-end SETUP -> parse_prd path completes from inside a Claude Code / MCP session: parse_prd succeeds and writes tasks to a fresh tag, without the exit code 1 failure. (Inverse: with claude-code main in a nested session, setup refuses BEFORE parse_prd is reached.)
  • npm test / the Python test suite (pytest tests/) passes after the id-list and docstring updates.
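
The remediation-string criterion above can be checked with a small predicate. This is a sketch of the idea only: tests/test_integration.py is not reproduced in this issue, so fix_is_safe and FORBIDDEN_FIX_FRAGMENTS are hypothetical names, and proposed_fix is the fix text from the implementation hint.

```python
FORBIDDEN_FIX_FRAGMENTS = (
    "--set-main claude-code",
    "--set-research claude-code",
    "--set-fallback claude-code",
)


def fix_is_safe(fix: str) -> bool:
    """True when the remediation never steers a role back onto claude-code."""
    return not any(fragment in fix for fragment in FORBIDDEN_FIX_FRAGMENTS)


proposed_fix = (
    "Switch the main/fallback role off claude-code: set ANTHROPIC_API_KEY and run "
    "`task-master models --set-main sonnet --anthropic`, OR run task-master from a "
    "plain shell outside any Claude Code session."
)
```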

Complexity: S

Trust Level: HINT (not specification)

The researched approaches above are starting points. Before implementing:

  1. Verify the library/config exists as stated (e.g. task-master models, read .taskmaster/config.json for the models.main.provider key).
  2. Check that imports/keys match reality (capabilities.py does NOT currently import os — add it).
  3. Try the recommended approach — if it works in 1-2 attempts, use it.
  4. If it fails, do NOT keep retrying — research why, explore alternatives.
  5. The acceptance criteria are the real spec, not the approach.
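
For step 1, a quick way to confirm that the models.<role>.provider keys exist in a given project is sketched below. role_providers is a hypothetical helper, and it assumes the config file parses as JSON with a top-level models object.

```python
import json
from pathlib import Path


def role_providers(project_root: str = ".") -> dict:
    """Return the configured provider per role, or None where the key is absent."""
    config_path = Path(project_root) / ".taskmaster" / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    models = config.get("models", {})
    return {
        role: (models.get(role) or {}).get("provider")
        for role in ("main", "research", "fallback")
    }
```

Calling role_providers() from the project root and seeing a non-None value for main would confirm the key the recommended approach depends on; a FileNotFoundError means the project has no .taskmaster/config.json yet.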

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), config (Provider / config handling), dx (Developer experience), preflight (Preflight / setup validation)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
