LanceDB Tests CI flake: process hangs at exit (subprocess-worker teardown), not an LLM stall #2997

@Vasilije1990

Description

Symptom

Vector DB Tests / LanceDB Tests (uv run python ./cognee/tests/test_lancedb.py) intermittently (~50%) ends in exit 137 (SIGKILL) after ~25–31 min of log silence. Per @igorilic: "the test finishes but CI never ends — more likely closing of subprocesses / subprocess-mode teardown."

Confirmed: it's NOT the LLM

An earlier hypothesis (stalled LLM completion with no timeout) was ruled out:

Local narrowing (no LLM)

The following probes all exit cleanly (~3–6 s), so the bug is NOT in basic subprocess setup/teardown:

  • LanceDB subprocess op only → clean.
  • Graph (ladybug) + vector (lancedb) subprocess ops + prune_system(metadata=True) → clean.
  • Ladybug op with engine left in the cache, no prune/close → clean (idle leaked engine reaps fine).

So the hang requires the full, loaded cognify + multi-search flow (real data + concurrency), which could not be reproduced locally without LLM credentials. Notably, test_lancedb.py::main ends with test_vector_engine_search_none_limit() (add + cognify + search), which runs after the last prune, so graph/vector engines are still in the cache at process exit.
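Since the hang only shows up under the full flow in CI, one way to see where the process is stuck is a watchdog from the standard-library faulthandler module. The sketch below is an assumption about how it could be wired in, not something the test does today; arm_exit_watchdog and disarm_exit_watchdog are hypothetical helper names. The watchdog runs outside the Python-level threads, so it should still fire while the main thread is blocked in an exit-time join, but that is worth confirming on the CI Python version.

```python
import faulthandler
import sys


def arm_exit_watchdog(seconds, file=None):
    """After `seconds`, dump every thread's traceback and hard-exit the process.

    `file` must expose a real file descriptor; it defaults to sys.stderr.
    """
    target = file if file is not None else sys.stderr
    # exit=True makes faulthandler call _exit() after dumping, so the job
    # fails quickly instead of waiting for the runner to SIGKILL it.
    faulthandler.dump_traceback_later(seconds, exit=True, file=target)


def disarm_exit_watchdog():
    """Cancel a pending watchdog, e.g. if more work is about to start."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_exit_watchdog(120) as the last statement of main would, under these assumptions, turn ~30 min of log silence into a traceback showing which thread is blocked and on what.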

Candidate causes (for whoever picks this up)

  1. closing_lru_cache has no atexit hook, so an engine still cached at interpreter exit is never close()d. For the ladybug adapter, that means its per-instance ThreadPoolExecutor (cognee/infrastructure/databases/graph/ladybug/adapter.py:184) never has shutdown() called. concurrent.futures joins non-daemon executor threads at exit (_python_exit), and that join hangs if a worker thread is blocked mid-operation against a dead/stuck subprocess. Idle threads exit fine, which is why the simple probes pass.
  2. Reaper coverage: _reap_all_sessions_atexit (cognee_db_workers/harness.py:520) force-terminates SubprocessSessions in _all_sessions, but registration only happens after wait_for_ready (harness.py:725), so a session caught mid-startup, or an mp.Queue feeder thread with unflushed data, may not be reaped.
  3. atexit ordering between concurrent.futures._python_exit, multiprocessing._exit_function, and cognee's _reap_all_sessions_atexit (LIFO) can determine whether workers are killed before something tries to join a thread blocked on them.
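The mechanism in candidate 1 can be shown without cognee or an LLM. This is a minimal sketch, not the actual failure: a worker blocked on threading.Event().wait() stands in for a thread stuck on a dead subprocess, and each case runs in a fresh interpreter with a timeout so the demo itself cannot hang. The names BLOCKED, IDLE and exits_cleanly are illustrative.

```python
import subprocess
import sys
import textwrap

# Child script: one pool worker blocks forever, then the main thread returns.
BLOCKED = textwrap.dedent("""
    import threading
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=1)
    pool.submit(threading.Event().wait)  # never returns
    print("main done", flush=True)
""")

# Child script: the worker finishes its task and is idle at interpreter exit.
IDLE = textwrap.dedent("""
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=1)
    pool.submit(sum, [1, 2, 3]).result()
    print("main done", flush=True)
""")


def exits_cleanly(script, timeout):
    """Run `script` in a fresh interpreter; False if it outlives `timeout`."""
    try:
        subprocess.run(
            [sys.executable, "-c", script],
            timeout=timeout,
            capture_output=True,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing leaks.
        return False
    return True


if __name__ == "__main__":
    print("idle worker exits cleanly:   ", exits_cleanly(IDLE, timeout=30))
    print("blocked worker exits cleanly:", exits_cleanly(BLOCKED, timeout=3))
```

In both children the main thread reaches the end of the script; only the blocked case fails to terminate, which matches the "test finishes but CI never ends" symptom.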

Suggested direction

  • Register an atexit hook in closing_lru_cache that calls cache_clear() on all caches (closing every engine, which shuts down ladybug executors and subprocess sessions deterministically) before the concurrent.futures/multiprocessing exit-time joins run; and/or make the ladybug ThreadPoolExecutor use daemon threads or register its own shutdown.
  • Add a CI-side timeout-minutes to the LanceDB job so a regression fails fast instead of burning ~30 min.
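A rough sketch of the first suggestion, assuming nothing about how closing_lru_cache is actually implemented: ClosingCache and close_all_caches are hypothetical stand-ins that only illustrate the "track every cache, close everything at exit" shape. One caveat to verify before relying on it: on CPython 3.9+, concurrent.futures registers its exit-time join through the private threading._register_atexit, which runs during thread shutdown, before regular atexit callbacks. A plain atexit hook may therefore run too late to unblock an executor join, so an explicit call at the end of main is the safer path.

```python
import atexit
import threading

_registry = []                     # every ClosingCache created so far
_registry_lock = threading.Lock()


class ClosingCache:
    """Hypothetical stand-in for closing_lru_cache: memoizes factory(*key)
    and close()s every cached value when the cache is cleared."""

    def __init__(self, factory):
        self._factory = factory
        self._values = {}
        self._lock = threading.Lock()
        with _registry_lock:
            _registry.append(self)

    def __call__(self, *key):
        with self._lock:
            if key not in self._values:
                self._values[key] = self._factory(*key)
            return self._values[key]

    def cache_clear(self):
        # Swap the dict out under the lock, then close outside it so a slow
        # close() cannot block other callers of the cache.
        with self._lock:
            values, self._values = list(self._values.values()), {}
        for value in values:
            try:
                value.close()
            except Exception:
                pass  # best effort: one failing close() must not stop the rest


def close_all_caches():
    """Clear every registered cache, closing all cached engines."""
    with _registry_lock:
        caches = list(_registry)
    for cache in caches:
        cache.cache_clear()


# Safety net only; see the ordering caveat above.
atexit.register(close_all_caches)
```

In the test this would mean wrapping main in try/finally and calling close_all_caches() in the finally block, so engines left cached by the final search test are closed before interpreter shutdown begins.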

cc @igorilic — flagging since this is the subprocess-worker layer you know best.
