LLM completion calls need a universal wall-clock timeout (follow-up to #2994)

## Background

PR #2994 shipped a **low-risk interim fix** for intermittent CI hangs (notably the LanceDB Tests job: ~50% `exit 137` after ~25 min of silence). Root cause: LLM completion calls had **no per-request timeout**, so a stalled provider could block an in-flight `await` for up to litellm's 600 s-per-attempt HTTP fallback. The adapter's `tenacity` retry (`stop_after_delay(128)`) cannot interrupt a call mid-await, because its stop condition is only checked between attempts. The result was multi-minute hangs ending in SIGKILL.

The interim fix (#2994) injects a default `timeout` into `llm_args` (`LLM_CALL_TIMEOUT_SECONDS`, default 120), honored by the litellm/OpenAI/Anthropic-backed adapters.
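As a rough illustration of that default-injection idea (not the actual code from #2994), the sketch below fills in `timeout` only when the caller has not set one. The helper name `with_default_timeout`, its `provider` argument, and the skip set are hypothetical; only `LLM_CALL_TIMEOUT_SECONDS` and the default of 120 come from the description above.

```python
import os

DEFAULT_TIMEOUT_SECONDS = 120.0

# Hypothetical skip list: providers whose call path rejects a `timeout` kwarg.
PROVIDERS_REJECTING_TIMEOUT = frozenset({"llama_cpp"})


def with_default_timeout(llm_args, provider, environ=os.environ):
    """Return a copy of llm_args with a default per-request timeout added."""
    args = dict(llm_args or {})
    if provider in PROVIDERS_REJECTING_TIMEOUT:
        # Leave the kwargs untouched so an in-process call does not fail
        # on an unexpected keyword argument.
        return args
    if "timeout" not in args:
        raw = environ.get("LLM_CALL_TIMEOUT_SECONDS", DEFAULT_TIMEOUT_SECONDS)
        args["timeout"] = float(raw)
    return args
```

An explicit `timeout` passed by the caller wins over the environment default, which keeps slow-but-legitimate backends configurable per call.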

## Why this follow-up is needed

The interim fix does **not** cover every path, and it does not tightly bound the worst case:

1. **Native-SDK paths are not covered by `llm_args.timeout`** in the same way:
   - `anthropic` (raw `AsyncAnthropic` via instructor.patch)
   - `ollama` (OpenAI SDK), `azure` managed-identity (native OpenAI SDK)
   - `mistral` native transcription, `llama_cpp` (explicitly *excluded* in the interim because its local mode forwards kwargs into an in-process call that rejects `timeout`)
   - Transcription/image call sites across adapters.
2. **Compounded retries** (instructor `max_retries=2` + litellm internal `num_retries` + the adapter's `tenacity @retry`) mean that, even with a per-request timeout, the total can run to several minutes on a persistent stall.
3. **`asyncio.wait_for` does NOT cleanly cancel instructor's retry loop at the deadline.** This was verified empirically while validating #2994: an outer `wait_for(60)` did not cut a stalled call at 60 s; the timeout surfaced at ~175 s. A naive outer wrapper is therefore insufficient on its own and needs care.
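The issue does not pin down why `wait_for` overran. One general way this can happen is that `wait_for` only *requests* cancellation, and code that catches `CancelledError` and keeps going will delay the return. The sketch below reproduces that effect with a hypothetical stand-in retry loop; it is an assumption for illustration, not a trace of instructor's actual behavior.

```python
import asyncio
import time


async def stalled_call_with_retries(attempts=3, stall=0.2):
    """Hypothetical retry loop that swallows cancellation and retries."""
    for _ in range(attempts):
        try:
            await asyncio.sleep(stall)  # stands in for a stalled request
        except asyncio.CancelledError:
            # Treating cancellation like any other failure keeps the loop alive.
            continue
    return "done"


async def measure_overrun(timeout=0.05):
    """Return how long wait_for actually took to hand control back."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(stalled_call_with_retries(), timeout)
    except asyncio.TimeoutError:
        pass
    return time.monotonic() - start
```

With these numbers the call returns well after the 0.05 s deadline, because the remaining attempts still run to completion once the single cancellation has been absorbed.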

## Proposed work

- Add a **universal wall-clock cap** for every LLM call site (completion + transcription + image) across **all** adapters, including the native-SDK paths, in a way that actually interrupts a stalled call.
- Decide the interaction with the existing `tenacity @retry` and litellm/instructor `num_retries` (likely constrain `num_retries` so one attempt ≈ the timeout, and let the cap be the hard backstop).
- Handle `llama_cpp` (and any SDK that rejects a `timeout` kwarg) via its own SDK timeout param or a cancellation-safe wrapper.
- Mirror the existing embeddings precedent (`LiteLLMEmbeddingEngine` already wraps calls in `asyncio.wait_for`).
- Add an offline regression test using a stalling-endpoint harness.
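One possible shape for a cap that "actually interrupts" is sketched below, assuming it is acceptable to abandon a call that ignores cancellation. The helper `call_with_hard_deadline` is hypothetical. It waits on the task with `asyncio.wait`, which returns at the deadline without waiting for the cancellation to be honored.

```python
import asyncio

# Keep strong references so abandoned tasks are not garbage-collected mid-flight.
_abandoned = set()


def _release(task):
    _abandoned.discard(task)
    # Retrieve the outcome so a late failure is not logged as
    # "Task exception was never retrieved".
    if not task.cancelled():
        task.exception()


def _abandon(task):
    task.cancel()  # best effort; the task may ignore it
    _abandoned.add(task)
    task.add_done_callback(_release)


async def call_with_hard_deadline(coro, seconds):
    """Run coro, raising TimeoutError once `seconds` of wall-clock time pass."""
    task = asyncio.ensure_future(coro)
    try:
        done, _pending = await asyncio.wait({task}, timeout=seconds)
    except asyncio.CancelledError:
        _abandon(task)  # propagate outer cancellation to the inner call
        raise
    if done:
        return task.result()  # re-raises the call's own exception, if any
    _abandon(task)
    raise asyncio.TimeoutError(f"LLM call exceeded the {seconds} s wall-clock cap")
```

The trade-off is that an abandoned call may keep running in the background until it finishes or the loop shuts down, so underlying connections can stay open past the deadline. For SDKs that run synchronously in-process, a task-level wrapper like this cannot preempt a blocking call, so those paths would still need the SDK's own timeout parameter.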

## Acceptance criteria

- A stalled provider on **any** adapter raises/aborts within a bounded, configurable wall-clock limit (not 600 s × retries).
- `llama_cpp` local mode and other native SDKs do not regress (no `TypeError` from an unsupported kwarg).
- Slow-but-legitimate backends remain supported: the limit is configurable and can be raised.
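A minimal sketch of an offline stalling-endpoint harness for checking the first criterion is shown below, using only the standard library. The local server accepts a connection, reads the request, and never replies. The raw socket client here is a placeholder: in a real regression test the adapter under test would be pointed at the port instead, and how each adapter's base URL is overridden is not specified in this issue.

```python
import asyncio


async def _accept_and_stall(reader, writer):
    # Read until the client disconnects, but never send a byte back.
    try:
        await reader.read()
    finally:
        writer.close()


async def stalled_request_times_out(deadline=0.1):
    """Return True if a request to the stalling server aborts at the deadline."""
    server = await asyncio.start_server(_accept_and_stall, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]  # OS-assigned free port
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        # Illustrative request line; the server ignores its content.
        writer.write(b"POST /v1/chat/completions HTTP/1.1\r\nHost: test\r\n\r\n")
        await writer.drain()
        try:
            await asyncio.wait_for(reader.read(1), deadline)
        except asyncio.TimeoutError:
            return True
        return False
    finally:
        writer.close()
        server.close()
```

Because the server binds to port 0 on loopback, the test needs no network access and no provider credentials.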

Tracked from #2994.
