feat: add ingestion recovery and concurrent add pipeline #104

Open
gwokhou wants to merge 2 commits into VectifyAI:main from gwokhou:feat/concurrent-ingestion-recovery

gwokhou (Contributor) commented Jun 18, 2026

Summary

Follow-up to #86. That PR added cooperative KB mutation locks and atomic state writes, and left concurrent ingestion / transactional recovery for follow-up work. This PR adds per-ingest recovery journals plus concurrent directory preparation, while keeping live KB commits serialized under the mutation lock.

Changes

  • Add mutation snapshots, journals, rollback, and startup recovery for interrupted add mutations.
  • Stage converted artifacts before publishing them into the live KB.
  • Add configurable parallel file preparation for openkb add <directory>.
  • Keep commit/index/compile/registry/log mutation serialized for deterministic ordering and readable logs.
  • Preserve stable doc names for same-stem files and legacy registry entries.
  • Include PageIndex SQLite sidecars (pageindex.db-wal, pageindex.db-shm, pageindex.db-journal) in long-document rollback coverage.
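
The snapshot, journal, rollback, and recovery items above can be illustrated with a minimal sketch. This is not the PR's implementation: `with_sidecars`, `begin_mutation`, `mark_committed`, `recover`, and the `journal.json` layout are all hypothetical names chosen for illustration, and the sketch only handles regular files. It shows why the SQLite sidecars need to be in the snapshot set: restoring `pageindex.db` alone would leave a stale `-wal` or `-shm` file next to it.

```python
import json
import shutil
from pathlib import Path

# Hypothetical names throughout; the PR's real helpers may differ.
SQLITE_SIDECAR_SUFFIXES = ("-wal", "-shm", "-journal")


def with_sidecars(path: Path) -> list[Path]:
    """Return the database path plus its possible SQLite sidecar files."""
    return [path] + [path.with_name(path.name + s) for s in SQLITE_SIDECAR_SUFFIXES]


def begin_mutation(journal_dir: Path, paths: list[Path]) -> Path:
    """Back up each live file and write a pending journal describing it."""
    backup_dir = journal_dir / "backup"
    backup_dir.mkdir(parents=True)
    entries = []
    for i, path in enumerate(paths):
        existed = path.exists()
        if existed:
            shutil.copy2(path, backup_dir / str(i))
        entries.append({"path": str(path), "backup": str(i), "existed": existed})
    journal = journal_dir / "journal.json"
    journal.write_text(json.dumps({"state": "pending", "entries": entries}))
    return journal


def mark_committed(journal: Path) -> None:
    """Flip the journal state so recovery knows not to roll back."""
    data = json.loads(journal.read_text())
    data["state"] = "committed"
    journal.write_text(json.dumps(data))


def recover(journal: Path) -> str:
    """Roll back a pending journal; a committed one only needs cleanup."""
    data = json.loads(journal.read_text())
    if data["state"] == "pending":
        for entry in data["entries"]:
            path = Path(entry["path"])
            if entry["existed"]:
                shutil.copy2(journal.parent / "backup" / entry["backup"], path)
            else:
                # The file was created by the interrupted mutation.
                path.unlink(missing_ok=True)
    shutil.rmtree(journal.parent)
    return data["state"]
```

In this sketch a journal that is still `pending` at startup is treated as an interrupted mutation and rolled back, while a `committed` journal only has its backup removed.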

Add pipeline

Before (#86 baseline):

  file A:
    [KB mutation lock]
      convert directly into live raw/ + wiki/sources/
      index / compile
      update hashes.json
      append wiki/log.md
    [/KB mutation lock]

  file B:
    same full serialized path

After (this PR):

  reservation:
    [KB mutation lock]
      recover pending journals
      reserve stable doc_name for each file
    [/KB mutation lock]

  prepare, parallel and outside live KB:
    worker 1: file A -> staging/raw + staging/wiki/sources
    worker 2: file B -> staging/raw + staging/wiki/sources
    worker 3: file C -> staging/raw + staging/wiki/sources

  commit, serialized per prepared file:
    [KB mutation lock]
      recover pending journals
      create mutation journal + snapshot live KB paths
      publish staged raw/source into live KB
      index / compile
      update hashes.json
      mark journal committed
    [/KB mutation lock]
    append wiki/log.md best-effort
    remove journal/backup best-effort

Only file-local preparation is parallel. Any live KB mutation remains serialized under the KB mutation lock, and each committed file gets its own recovery journal.
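
A rough sketch of the reserve / prepare / commit split, for readers who prefer code to the diagram. It assumes a thread pool and an in-process `threading.Lock` as a stand-in for the cooperative KB mutation lock; `prepare`, `add_directory`, and the uppercase "conversion" are placeholders, not code from this PR. Journaling, indexing, and same-stem name handling are left out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

kb_lock = threading.Lock()  # stand-in for the cooperative KB mutation lock


def prepare(src: Path, doc_name: str, staging: Path) -> Path:
    """File-local work: write the converted artifact into staging only."""
    staged = staging / f"{doc_name}.md"
    staged.write_text(src.read_text().upper())  # placeholder "conversion"
    return staged


def add_directory(files: list[Path], staging: Path, live: Path, workers: int = 3) -> list[str]:
    # Reservation: assign doc names under the lock, in input order.
    with kb_lock:
        reserved = [(src, src.stem) for src in files]

    # Prepare: parallel, touching only the staging directory.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(prepare, src, name, staging) for src, name in reserved]

    # Commit: one prepared file at a time, in reservation order.
    committed = []
    for (src, name), future in zip(reserved, futures):
        staged = future.result()
        with kb_lock:
            staged.replace(live / staged.name)  # publish into the live KB
            committed.append(name)
    return committed
```

Committing in reservation order rather than completion order is what keeps the result deterministic regardless of which worker finishes first.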

Related work / merge notes

Scope boundaries

  • Not a general database transaction system; the boundary is one ingest mutation and the live KB paths touched by this add path.
  • wiki/log.md is post-commit best-effort: successful ingests remain visible even if log append fails.
  • Cross-process directory adds may race on reserved doc names; commit detects the conflict and fails that file without corrupting the KB.
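
The last two boundaries can be sketched as follows. This is an assumption-laden illustration: the registry is modeled as a plain dict from doc name to content hash (the real `hashes.json` layout is not shown in this PR description), and `DocNameConflict`, `commit_doc`, and `append_log_best_effort` are hypothetical names.

```python
from pathlib import Path


class DocNameConflict(Exception):
    """Raised when a reserved doc name was claimed by a different file."""


def commit_doc(registry: dict[str, str], doc_name: str, file_hash: str) -> None:
    # Runs under the KB mutation lock in the real pipeline (not modeled here).
    owner = registry.get(doc_name)
    if owner is not None and owner != file_hash:
        # Fail this file only; the registry is left untouched.
        raise DocNameConflict(f"{doc_name} already belongs to {owner}")
    registry[doc_name] = file_hash


def append_log_best_effort(log_path: Path, line: str) -> bool:
    """Return False instead of raising, so a log failure never undoes a commit."""
    try:
        with log_path.open("a", encoding="utf-8") as handle:
            handle.write(line + "\n")
        return True
    except OSError:
        return False
```

The conflict check runs before any write, so a losing file leaves the existing entry as it was, and the log helper reports failure through its return value so the caller can continue.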

Validation

  • git diff --check upstream/main..HEAD
  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_add_command.py tests/test_mutation.py tests/test_converter.py

Result: 53 passed. Existing tests still emit coroutine warnings from compile_short_doc; no tests failed.

gwokhou marked this pull request as ready for review June 18, 2026 16:15