Background
PR #2994 shipped a low-risk interim fix for intermittent CI hangs (notably the LanceDB Tests job, where roughly 50% of runs exited with code 137 after ~25 min of silence). Root cause: LLM completion calls had no per-request timeout, so a stalled provider blocked an in-flight await for up to litellm's 600 s-per-attempt HTTP fallback. The adapter's tenacity retry (stop_after_delay(128)) cannot interrupt an await that is already in flight, so the result was multi-minute hangs and eventually SIGKILL.
The interim fix (#2994) injects a default timeout into llm_args (LLM_CALL_TIMEOUT_SECONDS, default 120), honored by the litellm/OpenAI/Anthropic-backed adapters.
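As a rough illustration of the interim behaviour, a default can be merged into llm_args so that an explicit caller value still wins. The helper name with_default_timeout and the choice to read LLM_CALL_TIMEOUT_SECONDS from the process environment are assumptions made for this sketch; it is not the code shipped in #2994.

```python
import os
from typing import Any, Dict, Optional

# Default taken from the issue text (LLM_CALL_TIMEOUT_SECONDS, default 120).
DEFAULT_LLM_CALL_TIMEOUT_SECONDS = 120.0


def with_default_timeout(llm_args: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of llm_args with a per-request timeout filled in.

    Hypothetical helper: reading the value from an environment variable
    is an assumption of this sketch.
    """
    merged = dict(llm_args or {})
    raw = os.environ.get("LLM_CALL_TIMEOUT_SECONDS")
    default = float(raw) if raw else DEFAULT_LLM_CALL_TIMEOUT_SECONDS
    # setdefault keeps a timeout the caller set explicitly.
    merged.setdefault("timeout", default)
    return merged
```

Using setdefault rather than plain assignment keeps slow-but-legitimate backends configurable per call.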
Why this follow-up is needed
The interim fix neither covers every call path nor tightly bounds the worst case:
Native-SDK paths are not covered by llm_args.timeout in the same way:
anthropic (raw AsyncAnthropic via instructor.patch)
ollama (OpenAI SDK), azure managed-identity (native OpenAI SDK)
mistral native transcription, llama_cpp (explicitly excluded in the interim because its local mode forwards kwargs into an in-process call that rejects timeout)
Transcription/image call sites across adapters.
Compounded retries (instructor max_retries=2 + litellm internal num_retries + the adapter's tenacity @retry) mean even with a per-request timeout the total can run several minutes on a persistent stall.
asyncio.wait_for does NOT cleanly cancel instructor's retry loop at the deadline. This was verified empirically while validating #2994 (docs(llm): document bounding a stalled LLM provider via LLM_ARGS): an outer wait_for(60) did not cut a stalled call at 60 s; the timeout surfaced at ~175 s. A naive outer wrapper is therefore insufficient on its own, and any wrapper needs care.
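One documented reason a wait_for deadline can be overshot is that asyncio.wait_for waits until the wrapped call has actually finished cancelling, so the total wait can exceed the timeout. The sketch below reproduces only that general mechanism with a toy coroutine; it is not a diagnosis of what instructor's retry loop does, and stubborn_call is a hypothetical stand-in.

```python
import asyncio
import time


async def stubborn_call() -> None:
    """Toy stand-in for a stalled call that is slow to unwind."""
    try:
        await asyncio.sleep(10)  # the "stalled provider"
    except asyncio.CancelledError:
        # Hypothetical slow cleanup that runs after cancellation is requested.
        await asyncio.sleep(0.3)
        raise


async def measure_overshoot(deadline: float) -> float:
    """Return how long wait_for really took for the given deadline."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(stubborn_call(), timeout=deadline)
    except asyncio.TimeoutError:
        pass
    return time.monotonic() - start


# The deadline is 0.05 s, but wait_for returns only after the cleanup ends.
elapsed = asyncio.run(measure_overshoot(0.05))
```

Here elapsed comes out well above the 0.05 s deadline because the cleanup time is added on top of it.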
Proposed work
Add a universal wall-clock cap for every LLM call site (completion + transcription + image) across all adapters, including the native-SDK paths, in a way that actually interrupts a stalled call.
Decide the interaction with the existing tenacity @retry and litellm/instructor num_retries (likely constrain num_retries so one attempt ≈ the timeout, and let the cap be the hard backstop).
Handle llama_cpp (and any SDK that rejects a timeout kwarg) via its own SDK timeout param or a cancellation-safe wrapper.
Mirror the existing embeddings precedent (LiteLLMEmbeddingEngine already wraps calls in asyncio.wait_for).
Regression test offline using a stalling-endpoint harness.
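A minimal sketch of what a cancellation-tolerant cap and an offline stalling harness could look like, assuming the cap is allowed to abandon a call that is slow to unwind. All names here (LLMCallTimeout, call_with_hard_cap, stalling_provider, run_harness) are hypothetical and do not exist in the codebase; the approach uses asyncio.wait, which returns at its timeout without waiting for cancellation to complete.

```python
import asyncio
import time
from typing import Any, Awaitable


class LLMCallTimeout(Exception):
    """Hypothetical error raised when the wall-clock cap is hit."""


def _consume_result(task: "asyncio.Future[Any]") -> None:
    # Retrieve the outcome of an abandoned task so asyncio does not
    # log an "exception was never retrieved" warning later.
    if not task.cancelled():
        task.exception()


async def call_with_hard_cap(call: Awaitable[Any], cap_seconds: float) -> Any:
    task = asyncio.ensure_future(call)
    try:
        # asyncio.wait returns at the timeout; it never cancels the task.
        done, _pending = await asyncio.wait({task}, timeout=cap_seconds)
    except asyncio.CancelledError:
        task.cancel()  # propagate cancellation from our own caller
        raise
    if task in done:
        return task.result()
    task.add_done_callback(_consume_result)
    task.cancel()  # request cancellation, but do not wait for it
    raise LLMCallTimeout(f"LLM call exceeded {cap_seconds} s")


async def stalling_provider() -> str:
    """Offline stand-in for a provider that stalls and unwinds slowly."""
    try:
        await asyncio.sleep(30)
    except asyncio.CancelledError:
        await asyncio.sleep(0.5)
        raise
    return "unreachable"


async def run_harness(cap_seconds: float) -> float:
    start = time.monotonic()
    try:
        await call_with_hard_cap(stalling_provider(), cap_seconds)
    except LLMCallTimeout:
        return time.monotonic() - start
    raise AssertionError("expected LLMCallTimeout")


elapsed_at_cap = asyncio.run(run_harness(0.1))
```

The trade-off is that an abandoned call keeps running in the background until it unwinds, so any underlying connection may outlive the cap. Whether that is acceptable for each adapter is a design decision this sketch does not settle.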
Acceptance criteria
A stalled provider on any adapter raises/aborts within a bounded, configurable wall-clock limit (not 600 s × retries).
llama_cpp local mode and other native SDKs do not regress (no TypeError from an unsupported kwarg).
Slow-but-legitimate backends remain configurable (raise the limit).
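For the second criterion, one possible guard is to drop keyword arguments that the target callable does not declare before forwarding them. The sketch below assumes signature inspection is acceptable; supported_kwargs and local_completion are hypothetical names, and local_completion merely stands in for an in-process call that rejects timeout.

```python
import inspect
from typing import Any, Callable, Dict


def supported_kwargs(func: Callable[..., Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keyword arguments that func declares."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        # func accepts **kwargs, so nothing can be ruled out from its signature.
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


def local_completion(prompt: str, max_tokens: int = 16) -> str:
    # Hypothetical in-process call with no timeout parameter.
    return prompt[:max_tokens]


llm_args = {"max_tokens": 4, "timeout": 120}
result = local_completion("hello world", **supported_kwargs(local_completion, llm_args))
```

Signature inspection only helps when applied to the callable that finally receives the arguments. If an intermediate layer accepts **kwargs and forwards them, as the llama_cpp local mode is described as doing, an explicit per-adapter exclusion or the SDK's own timeout parameter may be the safer route.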
Tracked from #2994.