feat: add ingestion recovery and concurrent add pipeline by gwokhou · Pull Request #104 · VectifyAI/OpenKB

gwokhou · 2026-06-18T15:59:42Z

Summary

Follow-up to #86. That PR added cooperative KB mutation locks and atomic state writes, and left concurrent ingestion / transactional recovery for follow-up work. This PR adds per-ingest recovery journals plus concurrent directory preparation, while keeping live KB commits serialized under the mutation lock.

Changes

Add mutation snapshots, journals, rollback, and startup recovery for interrupted add mutations.
Stage converted artifacts before publishing them into the live KB.
Add configurable parallel file preparation for openkb add <directory>.
Keep commit/index/compile/registry/log mutation serialized for deterministic ordering and readable logs.
Preserve stable doc names for same-stem files and legacy registry entries.
Include PageIndex SQLite sidecars (pageindex.db-wal, pageindex.db-shm, pageindex.db-journal) in long-document rollback coverage.
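The snapshot and rollback idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `with_sidecars`, `snapshot`, and `rollback` are hypothetical names, and the only detail taken from the description is the set of sidecar suffixes.

```python
import shutil
from pathlib import Path
from typing import Dict, List, Optional

# Sidecar suffixes taken from the file names listed above.
SQLITE_SIDECAR_SUFFIXES = ("-wal", "-shm", "-journal")


def with_sidecars(db_path: Path) -> List[Path]:
    """Return the database path plus its possible SQLite sidecar paths."""
    sidecars = [db_path.with_name(db_path.name + s) for s in SQLITE_SIDECAR_SUFFIXES]
    return [db_path] + sidecars


def snapshot(paths: List[Path], backup_dir: Path) -> Dict[Path, Optional[Path]]:
    """Copy each existing file into backup_dir; record None for absent files."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    saved: Dict[Path, Optional[Path]] = {}
    for i, path in enumerate(paths):
        if path.exists():
            target = backup_dir / f"{i}-{path.name}"
            shutil.copy2(path, target)
            saved[path] = target
        else:
            saved[path] = None
    return saved


def rollback(saved: Dict[Path, Optional[Path]]) -> None:
    """Restore snapshotted files and delete files that did not exist before."""
    for path, backup in saved.items():
        if backup is None:
            # The file was created by the interrupted mutation: remove it.
            path.unlink(missing_ok=True)
        else:
            shutil.copy2(backup, path)
```

Recording absent paths as `None` is what lets rollback delete a sidecar that appeared only during the failed mutation, rather than leaving a stale `-wal` file next to a restored database.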

Add pipeline

Before (#86 baseline):

  file A:
    [KB mutation lock]
      convert directly into live raw/ + wiki/sources/
      index / compile
      update hashes.json
      append wiki/log.md
    [/KB mutation lock]

  file B:
    same full serialized path

After (this PR):

  reservation:
    [KB mutation lock]
      recover pending journals
      reserve stable doc_name for each file
    [/KB mutation lock]

  prepare, parallel and outside live KB:
    worker 1: file A -> staging/raw + staging/wiki/sources
    worker 2: file B -> staging/raw + staging/wiki/sources
    worker 3: file C -> staging/raw + staging/wiki/sources

  commit, serialized per prepared file:
    [KB mutation lock]
      recover pending journals
      create mutation journal + snapshot live KB paths
      publish staged raw/source into live KB
      index / compile
      update hashes.json
      mark journal committed
    [/KB mutation lock]
    append wiki/log.md best-effort
    remove journal/backup best-effort

Only file-local preparation is parallel. Any live KB mutation remains serialized under the KB mutation lock, and each committed file gets its own recovery journal.
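The three phases above can be sketched as follows. This is an illustrative outline under stated assumptions, not the PR's implementation: `kb_lock` is an in-process `threading.Lock` standing in for the cooperative KB mutation lock, and `prepare` and `add_directory` are hypothetical names. Journals, snapshots, and recovery are omitted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Stand-in for the cooperative KB mutation lock (assumption: in-process only).
kb_lock = threading.Lock()


def prepare(doc_name: str) -> str:
    """File-local work: convert into a staging area, never touching the live KB."""
    return f"staged/{doc_name}"


def add_directory(files: List[str], workers: int = 3) -> List[str]:
    live_kb: List[str] = []

    # Phase 1: reserve doc names under the lock.
    with kb_lock:
        reserved = list(files)

    # Phase 2: prepare in parallel; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        staged = list(pool.map(prepare, reserved))

    # Phase 3: commit one prepared file at a time under the lock.
    for item in staged:
        with kb_lock:
            live_kb.append(item)
    return live_kb
```

Because `ThreadPoolExecutor.map` returns results in submission order, the serialized commit phase sees files in a deterministic order regardless of which worker finishes first.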

Related work / merge notes

performance #36: this PR improves directory-add throughput but does not claim million-document scalability; live KB mutation remains serialized.
feat(cli): import existing PageIndex Cloud indices via add --from-pageindex-cloud (closes #88) #97 / [Feature] Support importing existing PageIndex Cloud indices #88: cloud import is out of scope. If #97 lands first, its pageindex_cloud path should be rebased through this recovery/commit boundary or explicitly kept separate.
Add SQLite-backed registry with JSON migration support #15: likely conflict. This PR keeps the JSON HashRegistry model and adds only HashRegistry.memory() for reservation. If #15 lands first, this branch needs a registry-abstraction rebase and snapshot coverage for hashes.db*.
feat: openkb visualize — interactive 3D knowledge graph #103: low-risk same-file overlap in openkb/cli.py.

Scope boundaries

Not a general database transaction system; the boundary is one ingest mutation and the live KB paths touched by this add path.
wiki/log.md is post-commit best-effort: successful ingests remain visible even if log append fails.
Cross-process directory adds may race on reserved doc names; commit detects the conflict and fails that file without corrupting the KB.
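The reservation and commit-time conflict check might look roughly like this. It is a sketch under assumptions: the numeric suffix scheme for same-stem files, and the names `reserve_doc_names`, `DocNameConflict`, and `check_reservation`, are hypothetical and not taken from the PR.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set


class DocNameConflict(Exception):
    """Raised when a reserved doc name is already present in the live KB."""


def reserve_doc_names(files: List[str], taken: Iterable[str]) -> Dict[str, str]:
    """Assign a unique doc name per file, suffixing same-stem collisions."""
    used: Set[str] = set(taken)
    names: Dict[str, str] = {}
    for f in files:
        stem = Path(f).stem
        candidate, n = stem, 2
        while candidate in used:
            # Hypothetical scheme: report, report-2, report-3, ...
            candidate = f"{stem}-{n}"
            n += 1
        used.add(candidate)
        names[f] = candidate
    return names


def check_reservation(doc_name: str, live_names: Set[str]) -> None:
    """Commit-time check: fail this file if another process took the name."""
    if doc_name in live_names:
        raise DocNameConflict(doc_name)
```

Checking again at commit time is what turns a cross-process race into a per-file failure: the losing file is rejected before anything is published into the live KB.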

Validation

git diff --check upstream/main..HEAD
UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_add_command.py tests/test_mutation.py tests/test_converter.py

Result: 53 passed. Existing tests still emit compile_short_doc coroutine warning noise; no test failed.

gwokhou added 2 commits June 18, 2026 23:55

feat: add ingestion mutation recovery helpers

7086c0f

feat: add concurrent ingestion pipeline

3cfd681

gwokhou marked this pull request as ready for review June 18, 2026 16:15

gwokhou wants to merge 2 commits into VectifyAI:main from gwokhou:feat/concurrent-ingestion-recovery
