fix(pipelines): bound run_info data to prevent unbounded pipeline_runs growth #3075

Open
P7AC1D wants to merge 1 commit into topoteretes:dev from P7AC1D:fix/bound-pipeline-run-info-data

P7AC1D commented Jun 15, 2026

Description

While debugging a local out-of-memory failure, I traced it to the pipeline_runs table in the SQLite store, which had grown to several GB (~23k rows, ~280 KB each). Every pipeline run logs a run_info row, and log_pipeline_run_start / complete / error store the input payload via the str(data) fallback whenever data is not a list of Data records. In my case, raw text passed to add / remember was stored verbatim on every run.
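As a rough back-of-envelope check of the figures above, the sketch below multiplies the reported row count by the reported payload size, then repeats the estimate with the 512-character cap this PR proposes. It assumes one byte per character and a marker overhead of about 40 bytes, both of which are illustrative assumptions, and it ignores every other column in the row.

```python
# Back-of-envelope estimate only; all constants besides ROWS, the 280 KB
# payload size, and the 512-char cap are assumptions made for illustration.
ROWS = 23_000                      # ~23k rows, as reported
BYTES_PER_ROW_BEFORE = 280 * 1000  # ~280 KB of payload per row, as reported
CAP_CHARS = 512                    # cap proposed in this PR
MARKER_OVERHEAD = 40               # assumed upper bound for the truncation marker


def table_bytes(rows: int, bytes_per_row: int) -> int:
    """Total payload bytes, assuming one byte per stored character."""
    return rows * bytes_per_row


before = table_bytes(ROWS, BYTES_PER_ROW_BEFORE)
after = table_bytes(ROWS, CAP_CHARS + MARKER_OVERHEAD)

print(f"before: {before / 1e9:.1f} GB")  # before: 6.4 GB
print(f"after:  {after / 1e6:.1f} MB")   # after:  12.7 MB
```

Under these assumptions the payload column alone accounts for the "several GB" observed, and the capped version would be on the order of tens of MB for the same number of runs.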

That column is never read back from the database anywhere (the apparent readers operate on the in-memory PipelineRunInfo returned by cognify() and only use .status / .pipeline_run_id), so it is write-only audit data that grows without limit, and opening the store pulls it into process memory.

This is the same failure mode #2549 fixed for the queries / results search-history tables; pipeline_runs was not covered there.

The fix extracts the shared branch into summarize_run_info_data() and caps the stringified payload at 512 chars with a truncation marker that records the original length. Empty input still maps to "None", and lists of Data records still reduce to their ids, so existing behaviour is unchanged for those cases.
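The behaviour described above can be sketched roughly as follows. This is not the code from the PR: the `Data` class here is a hypothetical stand-in for cognee's Data record, the constant name `MAX_RUN_INFO_DATA_CHARS` is made up for illustration, and the exact marker spacing is an assumption based on the wording in this description.

```python
from dataclasses import dataclass
from typing import Any
from uuid import UUID, uuid4

MAX_RUN_INFO_DATA_CHARS = 512  # cap stated in the PR description (name is hypothetical)


@dataclass
class Data:
    """Hypothetical stand-in for cognee's Data record; only `id` is used here."""

    id: UUID


def summarize_run_info_data(data: Any) -> Any:
    """Return a bounded summary of a pipeline input for run_info["data"]."""
    # Empty input keeps the existing "None" string.
    if not data:
        return "None"
    # A list made up entirely of Data records reduces to their ids.
    if isinstance(data, list) and all(isinstance(item, Data) for item in data):
        return [str(item.id) for item in data]
    # Anything else falls back to str(data), now capped.
    text = str(data)
    if len(text) <= MAX_RUN_INFO_DATA_CHARS:
        return text
    # Keep the first 512 chars and record the original length in the marker.
    return f"{text[:MAX_RUN_INFO_DATA_CHARS]} [truncated, {len(text)} chars total]"
```

With this sketch, a 280,000-character string is stored as its first 512 characters plus a short marker, while `None`, a short string, and a list of `Data` records produce the same values as before the cap.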

Closes #3074.

Acceptance Criteria

  • run_info["data"] no longer stores unbounded payloads: large inputs are truncated to 512 chars followed by a [truncated, N chars total] marker.
  • Existing behaviour preserved for empty input ("None") and lists of Data (list of ids).
  • New unit test covers empty / Data-list / small / large payloads.

Local test run:

cognee/tests/unit/modules/pipelines/test_summarize_run_info_data.py ....   [4 passed]

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

…s growth

log_pipeline_run_start/complete/error stored the full stringified input
payload in run_info["data"] on every run via the str(data) fallback. This
column is never read back from the database, so for large inputs (e.g. raw
text passed to add/cognify) the pipeline_runs table grows without bound.

Extract the shared summarisation into summarize_run_info_data() and cap the
stringified payload at 512 chars with a truncation marker, mirroring the
intent of topoteretes#2549 for the search-history tables. Empty input and lists of Data
records are unchanged.
@P7AC1D P7AC1D requested a review from Vasilije1990 as a code owner June 15, 2026 08:32
@Vasilije1990

Contributor

@P7AC1D does it make sense to truncate? This data is useful for async runs, when you want to get the status of the pipeline and understand how and where things are in the flow.

We'll check internally and see whether we can simplify this, but in general, running SQLite is not recommended for prod workloads.
