
[P0] parse_prd/expand_tasks fail (exit 1, empty stderr) under claude-code provider inside a nested Claude Code/MCP session — add nested-context preflight + provider re-route #11

@anombyte93


Problem

When the Atlas engine runs parse_prd / expand_tasks from inside an existing Claude Code or MCP session (the normal autonomous-build path), they fail with exit 1 and empty stderr — no tasks are written and the error is opaque. The research-role path (perplexity) still works, which is why the failure looks selective and confusing.

Root cause (confirmed): the upstream task-master-ai CLI's claude-code provider spawns claude -p "<prompt>" as a child process to borrow the authenticated session. Inside an already-nested Claude Code session that recursive headless spawn is hard-refused ("Claude Code cannot be launched inside another Claude Code session"), so the child dies with exit 1 / empty stderr. This plugin cannot fix the upstream spawn — but it owns the preflight/validation/skill guards and currently has none: a repo-wide grep for CLAUDECODE / CLAUDE_CODE_ENTRYPOINT / claude -p / nested / headless returns zero matches in source. Worse, the plugin actively steers users toward claude-code via --claude-code fix hints. This is P0 because it silently breaks the core parse→expand pipeline in exactly the headless/MCP context Atlas is designed to run in.

Current Behavior

Verified against the source at /home/anombyte/Shade_Gen/Projects/prd-taskmaster-plugin:

  • No nested-context detection anywhere. mcp-server/taskmaster.py:_build_env (lines 11-15) is the only place the subprocess env is assembled — it does env = os.environ.copy(); env.pop("TASK_MASTER_PROJECT_ROOT", None) and never inspects CLAUDECODE / CLAUDE_CODE_ENTRYPOINT or the configured provider.
  • validate_setup only checks that a model is configured, not runnable. In mcp-server/capabilities.py:validate_setup (lines 298-450) Check 5 sets provider_ok = bool(main_model) — it never checks whether the provider can host a child spawn in the current context, and never key-checks ANTHROPIC_API_KEY / PERPLEXITY_API_KEY. Its fix hints actively recommend task-master models --set-main sonnet --claude-code and --set-research opus --claude-code.
  • The parse/expand skill steps have no spawn-failure branch. skills/generate/SKILL.md Step 4 (parse, lines ~189-206) routes through the main role (the claude-code child spawn) via mcp__task-master-ai__parse_prd / CLI task-master parse-prd; Step 6 (expand, lines ~232-278) shells task-master expand --all. The "Patience under slow providers" block (lines ~270-278) anticipates claude-code as the provider but only tells the agent to wait — there is no branch for the spawn failing outright. The listed "MCP fallback" mcp__plugin_atlas-go_go__tm_parse_prd points at a tool that does not exist (see below).
  • server.py does not implement the parse/expand tools. mcp-server/server.py (lines 30-224) registers an 18-tool surface (preflight, validate_setup, init_taskmaster, calc_tasks, …) and does not define parse_prd, expand_tasks, tm_parse_prd, tm_analyze_complexity, or tm_parallel_expand — confirming the spawn is upstream and the SKILL.md tm_parse_prd fallback is dead.
  • Stale-tag pollution risk (secondary). mcp-server/pipeline.py:preflight (lines 118-234) reads currentTag from .taskmaster/state.json and counts tasks per tag, but the parse_prd recommendation only fires when tasks_count == 0 (line ~209). A stale-but-nonempty master tag full of prior done tasks does not recommend a fresh tag — parse_prd would append into the polluted tag.
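
A minimal sketch of the stale-tag check described in the last bullet, assuming each task is a dict with a "status" field and that "done" is the only status counted as finished. The names is_stale_tag and recommend_tag, the fresh_tag argument, and the "continue" action value are hypothetical and not taken from pipeline.py:

```python
from typing import Dict, List


def is_stale_tag(tasks: List[dict]) -> bool:
    """True when the tag holds tasks and every one of them is already done."""
    return bool(tasks) and all(t.get("status") == "done" for t in tasks)


def recommend_tag(current_tag: str, tasks: List[dict], fresh_tag: str) -> Dict[str, str]:
    """Pick where parse_prd should write, given the tasks under current_tag."""
    if not tasks:
        # Existing behaviour per the issue: an empty tag recommends parse_prd.
        return {"recommended_action": "parse_prd", "recommended_tag": current_tag}
    if is_stale_tag(tasks):
        # Non-empty but fully done: parse into a fresh tag instead of appending.
        return {"recommended_action": "parse_prd", "recommended_tag": fresh_tag}
    # Work is still in progress under this tag ("continue" is a placeholder value).
    return {"recommended_action": "continue", "recommended_tag": current_tag}
```

How the fresh tag name is chosen is left to the caller here; the sketch only separates the three cases (empty, stale, active).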

VERSION-SKEW WARNING (read before coding): the live tools mcp__atlas-engine__parse_prd / expand_tasks / tm_parallel_expand / engine_preflight are exposed by a newer/divergent engine build that is NOT the source in this repo (this repo's server.py lacks them). Confirm which server.py backs the live mcp__atlas-engine__* tools before landing any spawn-routing change. The setup / validate / skill-level guards in this task land in this repo; the model-spawn routing (if changed) may need to land in the newer build.

Expected Behavior

When parse_prd / expand_tasks are invoked from inside a nested Claude Code / MCP session:

  1. The engine detects the nested context (CLAUDECODE set or CLAUDE_CODE_ENTRYPOINT == "cli") before any model-spawning call.
  2. If the configured main/fallback provider is claude-code and ANTHROPIC_API_KEY is absent, the engine does one of: (a) re-route main+fallback to a key-based / HTTP provider (anthropic if keyed, else the already-keyed perplexity), or (b) surface a clear, actionable error instead of the opaque exit-1/empty-stderr — never silently produce no tasks.
  3. validate_setup reports the configured provider as not runnable in this context (downgraded readiness) rather than passing on bool(main_model), and its fix hint stops recommending bare --claude-code in a nested context.
  4. parse_prd then succeeds and writes tasks to a fresh tag when invoked from inside a Claude Code / MCP session.
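
The detection in step 1 can be sketched as a small pure function. The name is_nested_claude_session and the optional env parameter (added so the check can be tested without touching the real environment) are assumptions, not existing code in this repo:

```python
import os
from typing import Mapping, Optional


def is_nested_claude_session(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when CLAUDECODE is set (non-empty) or CLAUDE_CODE_ENTRYPOINT == "cli"."""
    env = os.environ if env is None else env
    return bool(env.get("CLAUDECODE")) or env.get("CLAUDE_CODE_ENTRYPOINT") == "cli"
```

Called with no argument it inspects os.environ; an empty CLAUDECODE value is treated as not nested, matching the bool(...) test in the implementation hint further down.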

Files to Touch

  • mcp-server/capabilities.py — in validate_setup (298-450): add a "provider runnable in this context" check. Detect CLAUDECODE/CLAUDE_CODE_ENTRYPOINT + models.main.provider == "claude-code" + missing ANTHROPIC_API_KEY → mark provider_main failed (or downgraded), set ready=false, and replace the --claude-code fix hints (lines ~412, ~425) with a key-based re-route command.
  • skills/setup/SKILL.md — Step 3 provider config (98-128): in a headless/nested/MCP context, prefer a key-based provider (anthropic if ANTHROPIC_API_KEY is set, else perplexity) for main+fallback; stop offering --claude-code (line ~125) as the default in that context.
  • skills/generate/SKILL.md — Step 4 parse (189-206) and Step 6 expand (232-278): add a nested-session + claude-code preflight before the spawn, and an explicit error-path that converts exit-1/empty-stderr into an actionable message or a provider re-route. Remove/repair the dead mcp__plugin_atlas-go_go__tm_parse_prd fallback ref (line ~201).
  • mcp-server/taskmaster.py:_build_env (11-15): the single subprocess-env chokepoint; centralize nested-context detection + provider/env adjustment here if a guarded Python wrapper is added.
  • mcp-server/pipeline.py:preflight / tag accounting (118-234, esp. recommended_action / recommended_tag at ~203-233): detect a stale-but-nonempty current tag and recommend a fresh tag so parse_prd does not append into a polluted master.
  • mcp-server/server.py (30-224) — only if the chosen fix is a guarded Python parse_prd wrapper; that is where it would be registered. (Confirm version-skew first.)
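
As a rough illustration of the "provider runnable in this context" check proposed for validate_setup, the sketch below returns one check result. The function name, the returned dict shape (id / passed / detail / fix) and the config argument are assumptions; the real validate_setup structure may differ. The re-route command is the perplexity variant from the implementation hint in Approach 1:

```python
import os
from typing import Mapping, Optional


def check_provider_runnable(config: dict, env: Optional[Mapping[str, str]] = None) -> dict:
    """Report whether the configured main provider can run in this context."""
    env = os.environ if env is None else env
    main = config.get("models", {}).get("main", {})
    nested = bool(env.get("CLAUDECODE")) or env.get("CLAUDE_CODE_ENTRYPOINT") == "cli"
    has_key = bool(env.get("ANTHROPIC_API_KEY"))

    if nested and main.get("provider") == "claude-code" and not has_key:
        return {
            "id": "provider_main",
            "passed": False,
            "detail": "claude-code cannot spawn a child claude CLI inside a nested session",
            # Key-based re-route; deliberately no --claude-code flag here.
            "fix": "task-master models --set-main sonar-pro --set-fallback sonar",
        }
    # Outside the failing combination, keep the existing "a model is configured" test.
    return {"id": "provider_main", "passed": bool(main.get("modelId")), "detail": "", "fix": ""}
```

A caller would set ready=false whenever passed is False; the modelIds in the fix string should be pinned against the installed task-master models --list before use.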

Researched Fix Approaches

1. [Recommended] — Nested-context preflight guard + re-route main/fallback to a key-based provider (confidence: 88%)

  • Library/Config: task-master-ai CLI task-master models --set-main=<modelId> / --set-fallback=<modelId> (provider INFERRED from modelId); config key .taskmaster/config.json → models.{main,fallback}.{provider,modelId}. Env markers CLAUDECODE, CLAUDE_CODE_ENTRYPOINT.
  • Pattern: In validate_setup (capabilities.py) and the setup/generate skills, add a "provider runnable in this context" check. If nested AND main.provider == "claude-code" AND no ANTHROPIC_API_KEY → the child spawn WILL fail; downgrade readiness and re-route main+fallback to anthropic (if keyed) else perplexity (already keyed). Keep research=perplexity.
  • Why: This is the only path that works headless/nested — perplexity/anthropic-API are plain HTTP (no claude -p child). Matches the run evidence exactly (exit 1, empty stderr, ANTHROPIC_API_KEY unset, perplexity research unaffected). The fixable surface is this plugin's preflight/validation, not the upstream spawn.
  • Risk: perplexity sonar models are research-tuned — main-role PRD-structuring quality may drop vs sonnet. modelId strings drift across TM versions (sonar-pro vs sonar-reasoning-pro) — pin against the installed task-master models --list. Confirm WHICH server.py backs the live mcp__atlas-engine__* before landing spawn-routing there.
  • Implementation hint:
# detection (e.g. in capabilities.py / taskmaster._build_env)
import os

# cfg: parsed .taskmaster/config.json, assumed to be loaded by the caller
nested = bool(os.environ.get("CLAUDECODE")) or os.environ.get("CLAUDE_CODE_ENTRYPOINT") == "cli"
main_provider = cfg.get("models", {}).get("main", {}).get("provider")
if nested and main_provider == "claude-code":
    # re-route (non-interactive, provider inferred from modelId);
    # choose the target by which key is available
    if os.environ.get("ANTHROPIC_API_KEY"):
        # anthropic API path, no spawn
        cmd = ["task-master", "models", "--set-main", "claude-3-5-sonnet-20241022",
               "--set-fallback", "claude-3-haiku-20240307"]
    else:
        # no Anthropic key: fall back to the already-keyed perplexity
        cmd = ["task-master", "models", "--set-main", "sonar-pro", "--set-fallback", "sonar"]
    # DO NOT pass --claude-code in a nested context

2. [Alternative] — Provide ANTHROPIC_API_KEY so task-master uses the Anthropic HTTP API instead of the spawn (confidence: 72%)

  • Library/Config: ANTHROPIC_API_KEY env var; per upstream TM issue #1256, when ANTHROPIC_API_KEY is present TM uses the Anthropic API and bypasses the claude-code CLI spawn. Centralize in taskmaster.py:_build_env.
  • Pattern: Keep main as anthropic (or claude-code), but guarantee ANTHROPIC_API_KEY is in the subprocess env. The "bug" (#1256: key takes precedence over claude-code config) is the cure in a nested context — HTTP call, no recursive claude -p.
  • Why: Lowest-code-change unblock when an Anthropic key is available; eliminates the failing spawn without rewriting config.
  • Risk: In the observed run ANTHROPIC_API_KEY was UNSET, so this only helps if a key can be supplied — otherwise fall to Approach 1's perplexity re-route. Incurs Anthropic API cost (vs free subscription). Behavior confirmed only via issue thread, not a versioned doc — version-sensitive.
  • Implementation hint:
# fix message when nested + claude-code + no key:
# "export ANTHROPIC_API_KEY=... (Anthropic API used instead of the claude-code spawn), or re-route main to perplexity"
# to set explicitly with key in env:
#   task-master models --set-main claude-3-5-sonnet-20241022
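
One way to turn the opaque exit-1 / empty-stderr result into the fix message above is a small classifier run on the finished subprocess. This is a sketch under assumptions: the function name explain_spawn_failure and its parameters are hypothetical, and the suggested re-route command reuses the modelIds from Approach 1:

```python
from typing import Mapping, Optional


def explain_spawn_failure(
    returncode: int,
    stderr: str,
    env: Mapping[str, str],
    main_provider: Optional[str],
) -> Optional[str]:
    """Return an actionable message for the nested claude-code failure, else None."""
    nested = bool(env.get("CLAUDECODE")) or env.get("CLAUDE_CODE_ENTRYPOINT") == "cli"
    opaque = returncode != 0 and not stderr.strip()
    if not (opaque and nested and main_provider == "claude-code"):
        return None  # some other failure (or success): leave it to normal handling
    return (
        f"task-master exited {returncode} with empty stderr inside a nested "
        "Claude Code session: the claude-code provider cannot spawn a child "
        "claude CLI here. Export ANTHROPIC_API_KEY, or re-route main/fallback "
        "to a key-based provider, e.g. "
        "task-master models --set-main sonar-pro --set-fallback sonar"
    )
```

It would be called with proc.returncode and proc.stderr after subprocess.run(..., capture_output=True, text=True); a non-None result replaces the empty error shown to the agent.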

3. [Fallback] — Strip CLAUDECODE/CLAUDE_CODE_ENTRYPOINT from the spawn env (confidence: 28%)

  • Library/Config: env override in taskmaster.py:_build_env; CLAUDECODE, CLAUDE_CODE_ENTRYPOINT.
  • Pattern: Before spawning task-master, env.pop("CLAUDECODE", None); env.pop("CLAUDE_CODE_ENTRYPOINT", None) so the child claude -p no longer self-detects a parent session (mirrors the claude-agent-sdk env={'CLAUDECODE':''} workaround).
  • Why: Most direct attempt to keep the free claude-code subscription path working headless; env-inheritance mechanism is verified.
  • Risk: STRONGLY DISCOURAGED as a sole fix. (a) With CLAUDECODE unset inside a session, claude --print can produce NO output (silent empty result) — reproducing the exact "no tasks written" symptom (claude-code#29543). (b) The bundled cli.js has ADDITIONAL nested detection (parent-process scan / lockfile) beyond the env var, so it can still hang or exit empty. (c) Nested sessions "share runtime resources and will crash all active sessions" per the CLI's own error — risking the PARENT session. Only use gated behind a fallback to Approach 1.
  • Implementation hint:
import os

env = os.environ.copy()
# drop the nesting markers so the child claude -p does not see a parent session
env.pop("CLAUDECODE", None)
env.pop("CLAUDE_CODE_ENTRYPOINT", None)
# subprocess.run(["task-master", ...], env=env)  # NEVER rely on this alone
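
If this fallback is kept at all, it could sit behind an explicit opt-in at the _build_env chokepoint. The sketch below assumes a standalone build_env helper with a base parameter and a strip_nesting_markers flag, both hypothetical; the TASK_MASTER_PROJECT_ROOT pop mirrors what the issue says _build_env does today:

```python
import os
from typing import Dict, Mapping, Optional

NESTING_MARKERS = ("CLAUDECODE", "CLAUDE_CODE_ENTRYPOINT")


def build_env(
    base: Optional[Mapping[str, str]] = None,
    strip_nesting_markers: bool = False,
) -> Dict[str, str]:
    """Copy the environment for the task-master subprocess."""
    env = dict(os.environ if base is None else base)
    env.pop("TASK_MASTER_PROJECT_ROOT", None)  # existing behaviour
    if strip_nesting_markers:
        # Opt-in only: the child may still produce empty output or detect
        # nesting by other means, so callers need the re-route as a fallback.
        for name in NESTING_MARKERS:
            env.pop(name, None)
    return env
```

Defaulting the flag to False keeps the discouraged path off unless a caller asks for it.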

Reference

task-master-ai (claude-task-master by eyaltoledano) selects providers via .taskmaster/config.json → models.{main,research,fallback}.{provider,modelId,maxTokens,temperature}. Roles are set non-interactively with task-master models --set-main=<modelId> / --set-research=<modelId> / --set-fallback=<modelId>; the provider is INFERRED from a known modelId or forced with flags (--ollama, --openrouter, --codex-cli, --claude-code). The claude-code provider is special: it uses no API key and makes no HTTP call, and instead spawns the local claude CLI to borrow the Pro/Max session (docs/examples/claude-code-usage.md). That spawn is what breaks inside a nested session: the child inherits CLAUDECODE=1 and the CLI hard-refuses ("Claude Code cannot be launched inside another Claude Code session … unset the CLAUDECODE environment variable"), producing exit 1 / empty stderr / no tasks. TM gives the Anthropic API path precedence when ANTHROPIC_API_KEY is set (issue #1256), so a present key sidesteps the spawn. The clean headless/MCP pattern: detect nesting via CLAUDECODE/CLAUDE_CODE_ENTRYPOINT and prefer a key-based/HTTP provider for main+fallback, never claude-code.

Sources: claude-task-master docs/command-reference.md, docs/configuration.md, docs/examples/claude-code-usage.md, issues #705/#1193/#1256; anthropics/claude-agent-sdk-python#573 (subprocess inherits CLAUDECODE=1, exact rejection, env={'CLAUDECODE':''} workaround); anthropics/claude-code#32618 (nesting env vars) and #29543 (claude --print empty output when CLAUDECODE unset, which is why the unset hack is unreliable).

Acceptance Criteria

  • parse_prd succeeds and writes tasks to a fresh tag when invoked from inside a Claude Code / MCP session (i.e. with CLAUDECODE=1 in the environment): .taskmaster/tasks/tasks.json is created with the expected number of tasks under a non-polluted tag, and the call exits 0.
  • expand_tasks / task-master expand --all likewise completes and lands subtasks (tasks.json mtime advances, subtasks present) when run nested.
  • A new test asserts the nested-context detector returns True when CLAUDECODE is set OR CLAUDE_CODE_ENTRYPOINT == "cli", and False when neither is set.
  • validate_setup, run with CLAUDECODE=1 set, models.main.provider == "claude-code", and ANTHROPIC_API_KEY unset, returns ready: false with the provider_main (or a new provider-runnable) check failed, and the fix hint is a key-based re-route command — NOT bare task-master models --set-main sonnet --claude-code.
  • When the nested guard re-routes, .taskmaster/config.json models.main/models.fallback are rewritten to a key-based/HTTP provider (anthropic if ANTHROPIC_API_KEY present, else perplexity), models.research stays perplexity, and the new modelIds exist in task-master models --list.
  • If no re-route is possible (no usable key/provider), the parse/expand path surfaces an actionable error message naming the nested-session + claude-code cause, instead of exit-1/empty-stderr with no tasks.
  • pipeline.py:preflight recommends a fresh tag (recommended_action/recommended_tag) when the current tag is non-empty but fully done (stale master), so parse_prd does not append into a polluted tag.
  • The dead mcp__plugin_atlas-go_go__tm_parse_prd reference in skills/generate/SKILL.md is either removed or backed by an implemented tool in server.py.
  • grep -rE 'CLAUDECODE|CLAUDE_CODE_ENTRYPOINT' mcp-server/ returns at least one match in the guard implementation (it currently returns zero).

Complexity: M

Trust Level: HINT (not specification)

The researched approaches above are starting points. Before implementing:

  1. Verify the library/config exists as stated (task-master models --list, read .taskmaster/config.json, confirm --set-main/--set-fallback flags).
  2. Resolve the VERSION-SKEW first: confirm which server.py backs the live mcp__atlas-engine__* tools (this repo's server.py lacks parse_prd/expand_tasks). The setup/validate/skill guards land in THIS repo; spawn-routing may land in the newer build.
  3. Try the recommended approach (perplexity/anthropic re-route) — if it works in 1-2 attempts, use it.
  4. If it fails, do NOT keep retrying — research why (modelId drift, provider inference), explore the alternatives. Avoid Approach 3 as a sole fix (silent-empty output, risks the parent session).
  5. The acceptance criteria are the real spec, not the approach.


Labels: bug (Something isn't working), config (Provider / config handling), dx (Developer experience), preflight (Preflight / setup validation)
