Problem
When the main (or fallback) model role is configured with the claude-code provider and the engine runs inside a Claude Code / MCP session, validate_setup (the SETUP gate, namespaced mcp__atlas-engine__validate_setup) returns ready=true and the pipeline advances SETUP -> DISCOVER. The failure only surfaces much later, at parse_prd, as an opaque "Claude Code process exited with code 1" error with empty stderr. The mechanism: the claude-code provider is a CLI-borrowing provider. For the model-spawning roles it shells out to claude -p "<prompt>", and a nested Claude session refuses the recursive launch (the child inherits CLAUDECODE=1). Setup should fail fast here with a clear diagnostic and a remediation, before any spawn is attempted.
Current Behavior
validate_setup() lives at mcp-server/capabilities.py:298-450 and runs EXACTLY 6 checks: binary, version, project, config, provider_main, provider_research.
- Check 5 (provider_main, lines 386-413) parses .taskmaster/config.json (loaded at line 393) and reads only models.<role>.modelId:
  - main_model = models.get("main", {}).get("modelId") (line 395)
  - research_model = models.get("research", {}).get("modelId") (line 396)
  - fallback_model = models.get("fallback", {}).get("modelId") (line 397)
  - provider_ok = bool(main_model) (line 398)
- It never reads models.main.provider. A config with provider="claude-code" + modelId="sonnet" passes provider_main (modelId is truthy), so critical_failures=0 and ready=true. The word "provider" in the check id is cosmetic: no provider VALUE is inspected.
- There is zero env-var inspection anywhere in the plugin: a grep over mcp-server/ finds no reference to CLAUDECODE or CLAUDE_CODE_CHILD_SESSION, and capabilities.py does not even import os. The only os.environ touch is taskmaster.py:14 (env = os.environ.copy(), used to strip TASK_MASTER_PROJECT_ROOT before spawning the CLI, which is unrelated). No "am I a nested Claude session?" detection exists.
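To make the gap concrete, the sketch below reproduces the modelId-only logic quoted above as a standalone function. The function name provider_main_passes and the sample config values are illustrative assumptions, not code taken from capabilities.py.

```python
# Standalone reproduction of the modelId-only logic quoted above.
# provider_main_passes is a hypothetical name used for illustration only.

def provider_main_passes(config: dict) -> bool:
    models = config.get("models", {})
    main_model = models.get("main", {}).get("modelId")
    # The provider value is never consulted, so any truthy modelId passes.
    return bool(main_model)


# Sample config (assumed shape and values): a CLI-spawning main provider
# paired with a perfectly valid modelId.
nested_unsafe_config = {
    "models": {
        "main": {"provider": "claude-code", "modelId": "sonnet"},
        "research": {"provider": "perplexity", "modelId": "sonar"},
    }
}

print(provider_main_passes(nested_unsafe_config))
```

Under these assumptions the check passes for a claude-code main role no matter what environment it runs in, which is the behavior the new guard is meant to close.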
preflight() (mcp-server/pipeline.py:179-234, namespaced mcp__atlas-engine__engine_preflight) reads only pipeline.json + taskmaster tag state and recommends parse_prd (lines 205-206); it performs no provider or env inspection and does not load config.json.
The SETUP skill exit gate (skills/setup/SKILL.md, Step 4 probe + Exit gate) advances SETUP -> DISCOVER on validate_setup readiness, passing the result dict as evidence — so the bad config flows straight through to parse_prd. parse_prd's actual claude -p spawn is in the external task-master-ai CLI, not in this repo, so the fix cannot live at the spawn site; it must be a fail-fast refusal in the SETUP gate.
(Naming note: there is no Python function literally named engine_preflight. The MCP tools are preflight (pipeline.py) and validate_setup (capabilities.py); the atlas-engine runtime namespaces them as engine_preflight / validate_setup.)
Expected Behavior
When the main or fallback role uses a CLI-spawning provider (claude-code and siblings) AND a nested-Claude environment signal is present (CLAUDECODE and/or CLAUDE_CODE_CHILD_SESSION set), validate_setup must emit a critical (non-warning) check that fails. This flips ready=false via the existing critical_failures aggregation (lines 432-436) and blocks the SETUP -> DISCOVER advance. The check detail must explain the recursive-spawn refusal, and the fix must steer to a non-spawning provider (anthropic/perplexity) or to running from a plain shell outside Claude Code — never to --set-main claude-code. The research role (plain-HTTP, e.g. perplexity) must NOT trip the guard.
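A rough sketch of why a single failing critical check is enough to block the gate is shown here. It assumes each check is a dict with a "passed" key and an optional "warning" flag; the real field names and the aggregation code at lines 432-436 may differ, and summarize is a hypothetical name.

```python
# Sketch of how one failing critical check flips ready to False.
# Assumption: warning-level checks carry a truthy "warning" key; the actual
# structure used in capabilities.py is not shown in this ticket.

def summarize(checks: list[dict]) -> dict:
    critical_failures = sum(
        1 for c in checks if not c["passed"] and not c.get("warning", False)
    )
    return {"ready": critical_failures == 0, "critical_failures": critical_failures}


checks = [
    {"id": "provider_main", "passed": True},
    # Treated as warning-only here purely for illustration.
    {"id": "provider_research", "passed": False, "warning": True},
    # Critical: no warning flag, so it counts toward critical_failures.
    {"id": "provider_spawnable", "passed": False},
]
result = summarize(checks)
```

Because the new check is appended without a warning flag, no change to the aggregation itself should be needed under this assumption.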
Files to Touch
mcp-server/capabilities.py: PRIMARY. In/near Check 5 (after config is loaded at line 393), read models.main.provider and models.fallback.provider and check os.environ for CLAUDECODE / CLAUDE_CODE_CHILD_SESSION. Add a new critical check provider_spawnable (keep the 6 existing ids intact). Add import os at the top (lines 11-15); it is not currently imported.
mcp-server/server.py: update the validate_setup tool docstring (line 73) from "Run the 6 Phase-0 SETUP checks" to 7 if a new check id is added.
tests/test_capabilities.py: update the hard-coded id list at line 183 (["binary","version","project","config","provider_main","provider_research"]) to include the new id.
mcp-server/pipeline.py: OPTIONAL secondary. preflight()'s recommended_action ladder (lines 203-219) could short-circuit parse_prd (lines 205-206) for callers that bypass the SETUP gate, but only via a shared helper, to avoid duplicating the config read.
skills/setup/SKILL.md: OPTIONAL doc note that a nested-Claude claude-code provider is a hard stop (the gate already blocks automatically once the critical check exists).
Researched Fix Approaches
1. [Recommended] — Env+config fail-fast inside validate_setup() (confidence: 92%)
- Library/Config: task-master-ai@0.43.1 config key models.<role>.provider; Python os.environ (stdlib)
- Pattern: After config.json loads (capabilities.py:393), read models.main.provider / models.fallback.provider (NOT modelId). Define a spawning-provider set {"claude-code","codex-cli","gemini-cli"}. Compute nested = bool(os.environ.get("CLAUDECODE") or os.environ.get("CLAUDE_CODE_CHILD_SESSION")). If a spawning provider is the main OR fallback role AND nested, append a NEW non-warning check provider_spawnable with passed=False. Leave provider_main's modelId logic intact (the 6 ids keep meaning); ADD the 7th. The existing critical_failures aggregation (lines 432-436) flips ready=false and the SETUP skill gate blocks before parse_prd spawns. Exclude the research role so perplexity never trips.
- Why: Lowest-surface change. validate_setup() already owns the config.json parse and the critical_failures contract, and it auto-blocks the skill gate with no skill edit required.
- Risk: tests/test_capabilities.py:183 hard-codes the 6-id list and server.py:73 says "6 Phase-0 SETUP checks"; both must update to 7. tests/test_integration.py asserts no fix string contains --set-main claude-code; the new fix must steer to anthropic/perplexity or a non-Claude shell, NOT --claude-code. CLAUDE_CODE_ENTRYPOINT alone can be sdk/mcp in non-nested contexts, so gate on CLAUDECODE / CLAUDE_CODE_CHILD_SESSION as primary signals and treat ENTRYPOINT as corroborating only.
- Implementation hint:
    # Fragment for validate_setup(), after models is parsed from config.json.
    # Requires `import os` at the top of capabilities.py.
    SPAWNING_PROVIDERS = {"claude-code", "codex-cli", "gemini-cli"}
    # Either variable marks a nested Claude Code session.
    nested = bool(os.environ.get("CLAUDECODE") or os.environ.get("CLAUDE_CODE_CHILD_SESSION"))
    main_provider = models.get("main", {}).get("provider")
    fb_provider = models.get("fallback", {}).get("provider")
    # The research role is deliberately excluded: it is plain HTTP and never spawns a CLI.
    bad = nested and (main_provider in SPAWNING_PROVIDERS or fb_provider in SPAWNING_PROVIDERS)
    checks.append({
        "id": "provider_spawnable",
        "name": "Main/fallback provider can spawn in this environment",
        "passed": not bad,
        # Report both roles so a fallback-only offender is not misattributed to main.
        "detail": (
            f"main='{main_provider}', fallback='{fb_provider}': a CLI-spawning provider "
            "is configured but a nested Claude Code session was detected "
            "(CLAUDECODE/CLAUDE_CODE_CHILD_SESSION set); the recursive spawn is refused "
            "(exit 1, empty stderr)"
            if bad else f"main={main_provider}, fallback={fb_provider}; nested={nested}"
        ),
        "fix": (
            "Switch the main/fallback role off claude-code: set ANTHROPIC_API_KEY and run "
            "`task-master models --set-main sonnet --anthropic`, OR run task-master from a "
            "plain shell outside any Claude Code session."
            if bad else None
        ),
    })
    # verified env dump from a real nested run: CLAUDECODE=1, CLAUDE_CODE_CHILD_SESSION=1
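The Risk item for this approach notes that tests/test_integration.py asserts no fix string contains --set-main claude-code. A guard of that shape might look like the sketch below; the regex, the fix_is_safe name, and the sample strings are assumptions for illustration and are not copied from the real test file.

```python
import re

# Hypothetical guard in the spirit of the tests/test_integration.py assertion:
# no remediation string may steer a role back onto the claude-code provider.
FORBIDDEN = re.compile(r"--set-(main|research|fallback)\s+claude-code")


def fix_is_safe(fix):
    """Return True when the fix text never recommends a claude-code role."""
    return fix is None or FORBIDDEN.search(fix) is None


# The remediation text proposed in the implementation hint.
good_fix = (
    "Switch the main/fallback role off claude-code: set ANTHROPIC_API_KEY and run "
    "`task-master models --set-main sonnet --anthropic`, OR run task-master from a "
    "plain shell outside any Claude Code session."
)
# A counter-example that the guard should reject.
bad_fix = "Run `task-master models --set-main claude-code`."
```

Running the proposed remediation text through such a guard before committing it is a cheap way to avoid tripping the existing regression test.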
2. [Alternative] — Add the guard to preflight() (pipeline.py) for an earlier refusal (confidence: 70%)
- Library/Config: same key models.<role>.provider; os.environ (stdlib)
- Pattern: Add the nested+spawning-provider detection to preflight()'s recommended_action ladder (pipeline.py:203-219); before recommending parse_prd (lines 205-206), return recommended_action="fix_provider" with the diagnostic, so agents that call engine_preflight directly (bypassing the SETUP gate) still fail fast.
- Why: preflight() is the tool literally named in the bug and is the recommender that today points at parse_prd (including into a polluted prior tag that already holds done tasks). Belt-and-suspenders with Approach 1.
- Risk: preflight() does NOT currently load .taskmaster/config.json (only pipeline.json + tasks/state), so adding the check here duplicates the config read, a DRY violation. It must be factored as a shared helper (e.g. spawn_block_reason(models) -> str | None in capabilities.py) imported by both. Do this IN ADDITION to Approach 1, not instead of it.
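One possible shape for the shared helper named in the Risk item is sketched here. The optional env parameter is an added assumption (it keeps the helper testable without touching the real environment), and the snippet at the bottom only imitates how preflight() might consume the result; neither reflects existing code.

```python
import os
from typing import Mapping, Optional

SPAWNING_PROVIDERS = {"claude-code", "codex-cli", "gemini-cli"}
NESTED_ENV_VARS = ("CLAUDECODE", "CLAUDE_CODE_CHILD_SESSION")


def spawn_block_reason(
    models: Mapping[str, Mapping[str, str]],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return a diagnostic when a spawning provider would run nested, else None."""
    env = os.environ if env is None else env
    signals = [name for name in NESTED_ENV_VARS if env.get(name)]
    if not signals:
        return None
    # Only main and fallback are inspected; research is plain HTTP.
    offenders = []
    for role in ("main", "fallback"):
        provider = models.get(role, {}).get("provider")
        if provider in SPAWNING_PROVIDERS:
            offenders.append(f"{role}={provider}")
    if not offenders:
        return None
    roles = ", ".join(offenders)
    seen = ", ".join(signals)
    return f"{roles} must spawn a CLI, but a nested Claude Code session was detected ({seen} set)"


# Imitation of the preflight() ladder step that precedes the parse_prd recommendation.
models = {"main": {"provider": "anthropic"}, "fallback": {"provider": "claude-code"}}
reason = spawn_block_reason(models, env={"CLAUDECODE": "1"})
recommended_action = "fix_provider" if reason else "parse_prd"
```

Returning a reason string rather than a boolean lets validate_setup() reuse the same text as the check detail, so both call sites report the failure identically.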
3. [Fallback] — Active spawn-probe: actually run claude -p "ok" (confidence: 58%)
- Library/Config: the claude CLI -p/--print non-interactive flag (task-master-ai exposes NO doctor/health/test/dry-run command; verified against installed v0.43.1)
- Pattern: When main/fallback provider is claude-code, run subprocess.run(["claude","-p","ping"], capture_output=True, text=True, timeout=10) inheriting the current env, mirroring what task-master will do; passed = rc == 0.
- Why: Strictly more accurate than env-sniffing; it also catches missing CLI, version mismatch, broken auth.
- Risk: The probe ITSELF spawns claude -p inside the nested session, the very thing that crashes, and reports say a nested launch "will crash all active sessions", so probing could destabilise the parent. It adds latency/cost to every validate_setup. Only safe as a NON-nested augmentation: never probe when CLAUDECODE/CLAUDE_CODE_CHILD_SESSION is set (fail immediately per Approach 1); probe only to catch auth/version failures in a non-nested shell. Keep as a follow-on.
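If the probe is ever added as a follow-on, the "never probe when nested" rule could be enforced as in the sketch below. The function name probe_claude_cli and the injectable runner and which parameters are assumptions introduced so the guard can be exercised without a real claude binary.

```python
import os
import shutil
import subprocess

NESTED_ENV_VARS = ("CLAUDECODE", "CLAUDE_CODE_CHILD_SESSION")


def probe_claude_cli(env=None, runner=subprocess.run, which=shutil.which, timeout=10):
    """Return (passed, detail); never spawns when a nested signal is present."""
    env = dict(os.environ) if env is None else env
    if any(env.get(name) for name in NESTED_ENV_VARS):
        # Fail immediately: spawning here is the very thing that crashes.
        return False, "nested Claude Code session detected; probe skipped"
    if which("claude", path=env.get("PATH")) is None:
        return False, "claude CLI not found on PATH"
    try:
        proc = runner(
            ["claude", "-p", "ping"],
            capture_output=True, text=True, timeout=timeout, env=env,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"probe failed to run: {exc}"
    return proc.returncode == 0, f"claude -p exited with code {proc.returncode}"
```

The nested check comes first on purpose, so the env-based refusal from Approach 1 remains the only behavior inside a Claude Code session and the probe adds coverage only in a plain shell.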
Reference
How task-master-ai (claude-task-master by eyaltoledano) handles this today: it does NOT detect or guard against it at all. The claude-code provider is CLI-borrowing: for the main/fallback roles it spawns claude -p "<prompt>" as a child process to reuse the authenticated session (no HTTP/API call), which is why the perplexity research role (plain HTTP) is unaffected.
There is NO provider-callability test surface. Verified against installed task-master-ai@0.43.1: task-master --help exposes no doctor/health/validate/test/dry-run (only an unrelated validate-dependencies), and task-master models --help exposes only --set-main/--set-research/--set-fallback/--setup plus per-provider allow flags, no --test/--dry-run.
The failure is a known, OPEN upstream bug: eyaltoledano/claude-task-master#1509 ("Claude Code and Codex CLI Provider Failures in parse-prd": exit code 1, AI_APICallError, "fails silently", un-root-caused) and #928. The mechanism is documented in anthropics/claude-agent-sdk-python#573 ("Subprocess inherits CLAUDECODE=1 env var, preventing SDK usage from Claude Code hooks/plugins") and the Claude Code 2 nested-session guard ("Claude Code cannot be launched inside another Claude Code session... unset the CLAUDECODE environment variable").
Upstream's only workaround is to unset CLAUDECODE at the CLI spawn site, which is external to this plugin, so the correct in-repo fix is a fail-fast preflight refusal. Peers (claude-agent-sdk, clay #161, paperclip #560) all hit the identical nesting wall and handle it by detecting the env var and erroring early, exactly as in Approach 1.
Acceptance Criteria
- When .taskmaster/config.json has models.main.provider == "claude-code" AND CLAUDECODE=1 (or CLAUDE_CODE_CHILD_SESSION=1) is in the environment, validate_setup() returns ready=false with critical_failures >= 1 and a failed check whose detail names the nested-Claude / recursive-spawn refusal.
- A config with models.fallback.provider == "claude-code" (main set to anthropic) also fails the new check.
- Outside a nested session (CLAUDECODE and CLAUDE_CODE_CHILD_SESSION unset), a claude-code main provider passes the new check (ready=true if all other checks pass); the guard fires only inside a nested session.
- A research role using a non-spawning provider (e.g. perplexity) never trips the new check, regardless of nested env.
- The fix string for the failing check does NOT contain --set-main claude-code / --set-research claude-code / --set-fallback claude-code (passes tests/test_integration.py regression guard) and steers to anthropic/perplexity or a non-Claude shell.
- The tests/test_capabilities.py:183 id-list assertion is updated and passes; server.py validate_setup docstring count matches the new number of checks.
- The SETUP -> DISCOVER advance in skills/setup/SKILL.md is blocked (advance_phase not called / fails its evidence gate) when the new critical check fails.
- parse_prd succeeds and writes tasks to a fresh tag when invoked from inside a Claude Code / MCP session; i.e. with a non-spawning provider (anthropic main) configured, the end-to-end SETUP -> parse_prd path completes without the exit code 1 failure. (Inverse: with claude-code main in a nested session, setup refuses BEFORE parse_prd is reached.)
- npm test / the Python test suite (pytest tests/) passes after the id-list and docstring updates.
Complexity: S
Trust Level: HINT (not specification)
The researched approaches above are starting points. Before implementing:
- Verify the library/config exists as stated (e.g. task-master models, read .taskmaster/config.json for the models.main.provider key).
- Check that imports/keys match reality (capabilities.py does NOT currently import os; add it).
- Try the recommended approach; if it works in 1-2 attempts, use it.
- If it fails, do NOT keep retrying; research why and explore alternatives.
- The acceptance criteria are the real spec, not the approach.