LanceDB Tests CI flake: process hangs at exit (subprocess-worker teardown), not an LLM stall #2997

## Symptom

`Vector DB Tests / LanceDB Tests` (`uv run python ./cognee/tests/test_lancedb.py`) intermittently (~50%) ends in **exit 137 (SIGKILL) after ~25–31 min of log silence**. Per @Igorilic: *"the test finishes but CI never ends — more likely closing of subprocesses / subprocess-mode teardown."*

## Confirmed: it's NOT the LLM

An earlier hypothesis (stalled LLM completion with no timeout) was **ruled out**:
- The job still hangs for ~30 min even with a per-request `timeout` and `num_retries=0` applied (an earlier iteration of PR #2994).
- "Test finishes but process never exits" points at interpreter-shutdown teardown, not at a call made mid-test.

## Local narrowing (no LLM)

The following probes all **exit cleanly** (~3–6 s), so the bug is NOT in basic subprocess setup/teardown:
- LanceDB subprocess op only → clean.
- Graph (ladybug) + vector (lancedb) subprocess ops + `prune_system(metadata=True)` → clean.
- Ladybug op with engine **left in the cache, no prune/close** → clean (idle leaked engine reaps fine).

So the hang appears to require the **full loaded `cognify` + multi-search flow** (real data + concurrency), which couldn't be reproduced locally without LLM creds. Notably, `test_lancedb.py::main` ends with `test_vector_engine_search_none_limit()` (add + cognify + search) **after** the last `prune`, so the graph/vector engines are still in the cache at process exit.
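Because the hang has only been seen in CI, one way to collect evidence there is to arm a watchdog right before `main` returns, so that a stuck exit writes every thread's stack to the job log. The sketch below uses the standard-library `faulthandler` module. `arm_exit_watchdog`, `disarm_exit_watchdog` and their parameters are hypothetical names, not part of cognee, and the grace period is an arbitrary choice.

```python
import faulthandler
import sys


def arm_exit_watchdog(grace_s, file=None, hard_exit=True):
    """Dump all thread stacks if the process is still alive after grace_s seconds.

    Hypothetical helper (not part of cognee). With hard_exit=True the process
    is terminated via _exit(1) right after the dump, so the CI job fails fast.
    """
    target = file if file is not None else sys.stderr
    # faulthandler writes from a C-level watchdog thread directly to the file
    # descriptor, so it does not depend on Python threads making progress.
    faulthandler.dump_traceback_later(grace_s, exit=hard_exit, file=target)


def disarm_exit_watchdog():
    """Cancel a pending dump, e.g. when the caller no longer needs it."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_exit_watchdog(60)` as the last statement of `main` should, if the teardown diagnosis is right, show which thread is blocked about a minute into the silent period instead of leaving ~30 min of empty log.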

## Candidate causes (for whoever picks this up)

1. **`closing_lru_cache` has no `atexit` hook.** An engine still cached at interpreter exit never gets `close()`d. For the ladybug adapter that means its per-instance `ThreadPoolExecutor` (`cognee/infrastructure/databases/graph/ladybug/adapter.py:184`) is never shut down. `concurrent.futures` joins non-daemon executor threads at exit (`_python_exit`), which **hangs if a worker thread is blocked** mid-operation against a dead/stuck subprocess (idle threads exit fine, which is why the simple probes pass).
2. **Reaper coverage:** `_reap_all_sessions_atexit` (`cognee_db_workers/harness.py:520`) force-terminates the `SubprocessSession`s in `_all_sessions`, but registration only happens after `wait_for_ready` (`harness.py:725`). A session caught mid-startup, or an `mp.Queue` feeder thread with unflushed data, may not be reaped.
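3. **atexit ordering** between `concurrent.futures.thread._python_exit`, `multiprocessing.util._exit_function`, and cognee's `_reap_all_sessions_atexit` can determine whether workers are killed before something tries to join a thread blocked on them. Plain `atexit` handlers run LIFO, but on CPython 3.9+ `_python_exit` is registered through `threading._register_atexit` and runs during `threading._shutdown()`, i.e. before any `atexit` handler, so the reaper may fire too late to unblock an executor join.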
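The mechanism in cause 1 can be shown without cognee at all. The following sketch reproduces only that mechanism, not the actual CI failure: a child interpreter leaves a `ThreadPoolExecutor` worker blocked (a never-set `threading.Event` stands in for a call into a dead subprocess), finishes its main code, and then fails to exit. All names are illustrative.

```python
import subprocess
import sys
import textwrap

# Child script: submit one task, print, then fall off the end of the module
# without calling executor.shutdown().
CHILD_TEMPLATE = textwrap.dedent(
    """
    import threading
    from concurrent.futures import ThreadPoolExecutor

    gate = threading.Event()
    if not {block}:
        gate.set()  # control case: the worker returns immediately
    executor = ThreadPoolExecutor(max_workers=1)
    executor.submit(gate.wait)
    print("main finished", flush=True)
    """
)


def hangs_at_exit(block, timeout_s=3.0):
    """Return True if the child is still alive timeout_s after being started."""
    script = CHILD_TEMPLATE.format(block=block)
    try:
        subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            timeout=timeout_s,  # on expiry the child is killed and reaped
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

With `block=True` the child prints `main finished` and then sits in the exit-time join until it is killed, which matches the "test finishes but CI never ends" symptom. With `block=False` the idle worker exits and so does the process, matching the clean probes.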

## Suggested direction

- Register an `atexit` hook in `closing_lru_cache` that `cache_clear()`s all caches (closing every engine → shutting down ladybug executors / subprocess sessions deterministically), and verify that it actually runs **before** the `concurrent.futures`/`multiprocessing` exit-time joins; and/or make the ladybug `ThreadPoolExecutor` use daemon threads / register its shutdown.
- Add a CI-side `timeout-minutes` to the LanceDB job so a regression fails fast instead of burning ~30 min.
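A minimal sketch of the first suggestion, assuming nothing about how `closing_lru_cache` is actually implemented: engines are tracked in a registry, and one function closes them all. `FakeEngine`, `_open_engines` and `close_all_engines` are hypothetical stand-ins, not cognee APIs.

```python
import atexit
from concurrent.futures import ThreadPoolExecutor

_open_engines = []  # hypothetical registry of live cached engines


class FakeEngine:
    """Stand-in for an adapter that owns a per-instance executor."""

    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.closed = False
        _open_engines.append(self)

    def close(self):
        # Idle workers return promptly, so waiting here is safe in this sketch.
        self.executor.shutdown(wait=True)
        self.closed = True


def close_all_engines():
    """Close every tracked engine once; return how many were closed."""
    closed = 0
    while _open_engines:
        engine = _open_engines.pop()
        try:
            engine.close()
            closed += 1
        except Exception as exc:  # best effort: keep closing the rest
            print(f"close failed: {exc!r}")
    return closed


# Fallback only: an explicit call at the end of main is the deterministic path.
atexit.register(close_all_engines)
```

In a test entry point, `close_all_engines()` would be called explicitly in a `finally:` block after the last search, with the `atexit` registration kept as a safety net. Note that `shutdown(wait=True)` does not unblock a worker that is already stuck, so the engine's `close()` would still need to terminate the subprocess the worker is waiting on.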

cc @Igorilic — flagging since this is the subprocess-worker layer you know best.
