Symptom
Vector DB Tests / LanceDB Tests (uv run python ./cognee/tests/test_lancedb.py) intermittently (~50%) ends in exit 137 (SIGKILL) after ~25–31 min of log silence. Per @igorilic: "the test finishes but CI never ends — more likely closing of subprocesses / subprocess-mode teardown."
Confirmed: it's NOT the LLM
An earlier hypothesis (stalled LLM completion with no timeout) was ruled out:
Local narrowing (no LLM)
Probes that exit cleanly (~3–6 s), so the bug is NOT in basic subprocess setup/teardown:
- LanceDB subprocess op only → clean.
- Graph (ladybug) + vector (lancedb) subprocess ops +
prune_system(metadata=True) → clean.
- Ladybug op with engine left in the cache, no prune/close → clean (idle leaked engine reaps fine).
So the hang requires the full loaded cognify + multi-search flow (real data + concurrency), which couldn't be reproduced without LLM creds. Notably, test_lancedb.py::main ends with test_vector_engine_search_none_limit() (add + cognify + search) after the last prune, so graph/vector engines are left in the cache at process exit.
Candidate causes (for whoever picks this up)
closing_lru_cache has no atexit hook. An engine still cached at interpreter exit never gets close()d. For the ladybug adapter that means its per-instance ThreadPoolExecutor (cognee/infrastructure/databases/graph/ladybug/adapter.py:184) is never shutdown(). concurrent.futures joins non-daemon executor threads at exit (_python_exit) — which hangs if a worker thread is blocked mid-operation against a dead/stuck subprocess (idle threads exit fine, which is why the simple probes pass).
- Reaper coverage:
_reap_all_sessions_atexit (cognee_db_workers/harness.py:520) force-terminates SubprocessSessions in _all_sessions, but registration only happens after wait_for_ready (harness.py:725) — a session caught mid-startup, or a mp.Queue feeder thread with unflushed data, may not be reaped.
- atexit ordering between
concurrent.futures._python_exit, multiprocessing._exit_function, and cognee's _reap_all_sessions_atexit (LIFO) can determine whether workers are killed before something tries to join a thread blocked on them.
Suggested direction
- Register an
atexit in closing_lru_cache that cache_clear()s all caches (closing every engine → shutting down ladybug executors / subprocess sessions deterministically) before concurrent.futures/multiprocessing atexit joins run; and/or make the ladybug ThreadPoolExecutor use daemon threads / register its shutdown.
- Add a CI-side
timeout-minutes to the LanceDB job so a regression fails fast instead of burning ~30 min.
cc @igorilic — flagging since this is the subprocess-worker layer you know best.
Symptom
Vector DB Tests / LanceDB Tests(uv run python ./cognee/tests/test_lancedb.py) intermittently (~50%) ends in exit 137 (SIGKILL) after ~25–31 min of log silence. Per @igorilic: "the test finishes but CI never ends — more likely closing of subprocesses / subprocess-mode teardown."Confirmed: it's NOT the LLM
An earlier hypothesis (stalled LLM completion with no timeout) was ruled out:
timeout+num_retries=0applied (PR docs(llm): document bounding a stalled LLM provider via LLM_ARGS #2994's earlier iteration).Local narrowing (no LLM)
Probes that exit cleanly (~3–6 s), so the bug is NOT in basic subprocess setup/teardown:
prune_system(metadata=True)→ clean.So the hang requires the full loaded
cognify+ multi-search flow (real data + concurrency), which couldn't be reproduced without LLM creds. Notably,test_lancedb.py::mainends withtest_vector_engine_search_none_limit()(add + cognify + search) after the lastprune, so graph/vector engines are left in the cache at process exit.Candidate causes (for whoever picks this up)
closing_lru_cachehas noatexithook. An engine still cached at interpreter exit never getsclose()d. For the ladybug adapter that means its per-instanceThreadPoolExecutor(cognee/infrastructure/databases/graph/ladybug/adapter.py:184) is nevershutdown().concurrent.futuresjoins non-daemon executor threads at exit (_python_exit) — which hangs if a worker thread is blocked mid-operation against a dead/stuck subprocess (idle threads exit fine, which is why the simple probes pass)._reap_all_sessions_atexit(cognee_db_workers/harness.py:520) force-terminatesSubprocessSessions in_all_sessions, but registration only happens afterwait_for_ready(harness.py:725) — a session caught mid-startup, or amp.Queuefeeder thread with unflushed data, may not be reaped.concurrent.futures._python_exit,multiprocessing._exit_function, and cognee's_reap_all_sessions_atexit(LIFO) can determine whether workers are killed before something tries to join a thread blocked on them.Suggested direction
atexitinclosing_lru_cachethatcache_clear()s all caches (closing every engine → shutting down ladybug executors / subprocess sessions deterministically) beforeconcurrent.futures/multiprocessingatexit joins run; and/or make the ladybugThreadPoolExecutoruse daemon threads / register its shutdown.timeout-minutesto the LanceDB job so a regression fails fast instead of burning ~30 min.cc @igorilic — flagging since this is the subprocess-worker layer you know best.