LLM completion calls need a universal wall-clock timeout (follow-up to #2994) #2995

@Vasilije1990

Background

PR #2994 shipped a low-risk interim fix for intermittent CI hangs (notably the LanceDB Tests job, which exited with code 137 in roughly 50% of runs after ~25 min of silence). Root cause: LLM completion calls had no per-request timeout, so a stalled provider could block an in-flight await for up to litellm's 600 s-per-attempt HTTP fallback. The adapter's tenacity retry (stop_after_delay(128)) cannot interrupt an attempt mid-await, so the result was multi-minute hangs ending in SIGKILL.

The interim fix (#2994) injects a default timeout into llm_args (LLM_CALL_TIMEOUT_SECONDS, default 120), honored by the litellm/OpenAI/Anthropic-backed adapters.
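
The default injection described above could look roughly like the following. This is a minimal sketch, not the code merged in #2994: the helper name with_default_timeout is hypothetical, and both reading LLM_CALL_TIMEOUT_SECONDS from the environment and the precedence order (explicit llm_args value first, then the setting, then 120) are assumptions made for illustration.

```python
import os

# Default taken from the issue text (LLM_CALL_TIMEOUT_SECONDS, default 120).
DEFAULT_LLM_CALL_TIMEOUT_SECONDS = 120.0


def with_default_timeout(llm_args=None, env=None):
    """Return a copy of llm_args that always carries a 'timeout' entry.

    Hypothetical helper: an explicit caller-supplied timeout wins; otherwise
    the value comes from LLM_CALL_TIMEOUT_SECONDS (assumed here to be an
    environment variable), falling back to 120 seconds.
    """
    env = os.environ if env is None else env
    merged = dict(llm_args or {})
    if "timeout" not in merged:
        merged["timeout"] = float(
            env.get("LLM_CALL_TIMEOUT_SECONDS", DEFAULT_LLM_CALL_TIMEOUT_SECONDS)
        )
    return merged
```

Copying llm_args rather than mutating it keeps a shared configuration dict from silently acquiring a timeout for adapters that reject the kwarg.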

Why this follow-up is needed

The interim fix does not cover every path, and it does not tightly bound the worst case:

  1. Native-SDK paths are not covered by llm_args.timeout in the same way:
    • anthropic (raw AsyncAnthropic via instructor.patch)
    • ollama (OpenAI SDK)
    • azure managed-identity (native OpenAI SDK)
    • mistral native transcription
    • llama_cpp (explicitly excluded from the interim fix because its local mode forwards kwargs into an in-process call that rejects timeout)
    • transcription/image call sites across adapters
  2. Compounded retries (instructor max_retries=2 + litellm internal num_retries + the adapter's tenacity @retry) mean that even with a per-request timeout, the total can run to several minutes on a persistent stall.
  3. asyncio.wait_for does NOT cleanly cancel instructor's retry loop at the deadline. This was verified empirically while validating #2994: an outer wait_for(60) did not cut a stalled call at 60 s; the timeout surfaced at ~175 s. A naive outer wrapper is therefore insufficient and needs care.
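
One way an outer wait_for can overshoot, as in item 3, is sketched below. asyncio.wait_for cancels the inner work and then waits for it to finish, so a retry loop that catches the cancellation and starts another attempt keeps the caller blocked. This is an illustration of that general mechanism only; it is not a claim about how instructor's retry loop is implemented, and retry_swallowing_cancel and measure_overshoot are hypothetical names.

```python
import asyncio
import time


async def retry_swallowing_cancel(attempts, stall):
    """Stand-in for a retry loop whose broad except also catches cancellation."""
    for _ in range(attempts):
        try:
            await asyncio.sleep(stall)  # simulated stalled provider request
        except BaseException:
            # CancelledError is a BaseException, so the cancellation sent by
            # wait_for is swallowed here and the loop simply retries.
            continue
    raise RuntimeError("retries exhausted")


async def measure_overshoot(timeout, attempts, stall):
    """Return how long wait_for actually blocked the caller, in seconds."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(retry_swallowing_cancel(attempts, stall), timeout)
    except Exception:
        pass  # which exception surfaces is not the point; the elapsed time is
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = asyncio.run(measure_overshoot(timeout=0.05, attempts=3, stall=0.1))
    print(f"deadline 0.05 s, caller unblocked after {elapsed:.2f} s")
```

The caller is released only after the remaining attempts have run their course, which is the same shape as the 60 s deadline surfacing at ~175 s.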

Proposed work

  • Add a universal wall-clock cap for every LLM call site (completion + transcription + image) across all adapters, including the native-SDK paths, in a way that actually interrupts a stalled call.
  • Decide the interaction with the existing tenacity @retry and litellm/instructor num_retries (likely constrain num_retries so one attempt ≈ the timeout, and let the cap be the hard backstop).
  • Handle llama_cpp (and any SDK that rejects a timeout kwarg) via its own SDK timeout param or a cancellation-safe wrapper.
  • Mirror the existing embeddings precedent (LiteLLMEmbeddingEngine already wraps calls in asyncio.wait_for).
  • Add an offline regression test using a stalling-endpoint harness.
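
A cap that interrupts the caller even when the inner call ignores cancellation could be sketched as follows. It runs the call as its own task, waits with asyncio.wait (which returns at the deadline without waiting for the task), requests cancellation, and raises immediately. All names here (call_with_wall_clock_cap, stubborn_stall, timed_capped_stall) are hypothetical, and stubborn_stall is only a stand-in for the stalling-endpoint harness. The trade-off is that an uncooperative task is abandoned rather than stopped: it may keep running in the background until it finishes, and a blocking in-process call (such as llama_cpp local mode) would need a thread or process boundary instead.

```python
import asyncio
import time


def _consume_outcome(task):
    """Retrieve a late exception so the abandoned task does not log a warning."""
    if not task.cancelled():
        task.exception()


async def call_with_wall_clock_cap(coro, cap_seconds):
    """Await coro, but release the caller after cap_seconds at the latest."""
    task = asyncio.ensure_future(coro)
    try:
        # Unlike wait_for, asyncio.wait returns at the timeout without
        # waiting for the task to acknowledge cancellation.
        done, _pending = await asyncio.wait({task}, timeout=cap_seconds)
    except asyncio.CancelledError:
        task.cancel()  # caller was cancelled: pass the request down
        raise
    if task in done:
        return task.result()
    task.cancel()  # best effort; the task may ignore it
    task.add_done_callback(_consume_outcome)
    raise asyncio.TimeoutError(f"LLM call exceeded {cap_seconds} s wall-clock cap")


async def stubborn_stall(attempts, stall):
    """Offline stand-in for a stalled provider that swallows cancellation."""
    for _ in range(attempts):
        try:
            await asyncio.sleep(stall)
        except asyncio.CancelledError:
            continue
    return "late"


async def timed_capped_stall(cap_seconds=0.05):
    """Regression-style check: how long does the caller wait on a stall?"""
    start = time.monotonic()
    try:
        await call_with_wall_clock_cap(stubborn_stall(3, 0.3), cap_seconds)
    except asyncio.TimeoutError:
        pass
    return time.monotonic() - start
```

In a regression test, asserting that timed_capped_stall returns well under the total stall time checks the bound without any network access.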

Acceptance criteria

  • A stalled provider on any adapter raises/aborts within a bounded, configurable wall-clock limit (not 600 s × retries).
  • llama_cpp local mode and other native SDKs do not regress (no TypeError from an unsupported kwarg).
  • Slow-but-legitimate backends remain configurable (raise the limit).
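
The no-TypeError criterion could be met by separating the timeout from the forwarded kwargs for providers known to reject it, as in this minimal sketch. The function split_timeout and the PROVIDERS_REJECTING_TIMEOUT_KWARG set are hypothetical; only llama_cpp is listed because it is the one case this issue names. The returned cap would then be enforced outside the SDK call.

```python
# Hypothetical registry; llama_cpp is the only provider the issue names as
# rejecting a timeout kwarg in its local mode.
PROVIDERS_REJECTING_TIMEOUT_KWARG = frozenset({"llama_cpp"})


def split_timeout(provider, llm_args):
    """Return (kwargs_to_forward, wall_clock_cap) for a provider.

    For providers that reject 'timeout', the value is removed from the
    forwarded kwargs and handed back so an outer cap can enforce it.
    """
    forwarded = dict(llm_args)
    if provider in PROVIDERS_REJECTING_TIMEOUT_KWARG:
        cap = forwarded.pop("timeout", None)
        return forwarded, cap
    return forwarded, forwarded.get("timeout")
```

Raising the limit for a slow-but-legitimate backend then remains a single configuration change, since the same timeout value feeds both the SDK kwarg and the outer cap.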

Tracked from #2994.
