fix(pipelines): bound run_info data to prevent unbounded pipeline_runs growth #3075
Open
P7AC1D wants to merge 1 commit into
…s growth

log_pipeline_run_start/complete/error stored the full stringified input payload in run_info["data"] on every run via the str(data) fallback. This column is never read back from the database, so for large inputs (e.g. raw text passed to add/cognify) the pipeline_runs table grows without bound.

Extract the shared summarisation into summarize_run_info_data() and cap the stringified payload at 512 chars with a truncation marker, mirroring the intent of topoteretes#2549 for the search-history tables. Empty input and lists of Data records are unchanged.
Contributor
@P7AC1D does it make sense to truncate? This is useful for async runs when you want to get the status of the pipeline and understand how and where things are in the flow. We'll check internally and see whether this can be simplified, but in general, running SQLite is not recommended for prod workloads.
Description
While debugging a local out-of-memory I traced it to the pipeline_runs table in the SQLite store growing to several GB (~23k rows, ~280 KB each). Every pipeline run logs a run_info row, and log_pipeline_run_start/complete/error store the input payload via the str(data) fallback whenever data is not a list of Data records. In my case raw text passed to add/remember was being stored verbatim on every run.

That column is never read back from the database anywhere (the apparent readers operate on the in-memory PipelineRunInfo returned by cognify() and only use .status/.pipeline_run_id), so it is write-only audit data that grows without limit, and opening the store pulls it into process memory.

This is the same failure mode #2549 fixed for the queries/results search-history tables; pipeline_runs was not covered there.

The fix extracts the shared branch into summarize_run_info_data() and caps the stringified payload at 512 chars with a truncation marker that records the original length. Empty input still maps to "None", and lists of Data records still reduce to their ids, so existing behaviour is unchanged for those cases.

Closes #3074.
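For readers without the diff at hand, the behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `Data` dataclass is a hypothetical stand-in for the project's record type, the `max_chars` parameter, the `MAX_RUN_INFO_DATA_CHARS` constant and the `if not data` emptiness check are assumptions, and so is returning ids as strings. Only the 512-char cap, the `"None"` mapping, the reduction to ids and the `[truncated, N chars total]` marker come from the description.

```python
from dataclasses import dataclass
from typing import Any
from uuid import UUID, uuid4

# Cap stated in the PR description; the constant name is an assumption.
MAX_RUN_INFO_DATA_CHARS = 512


@dataclass
class Data:
    """Hypothetical stand-in for the project's Data record; only `id` is used."""

    id: UUID


def summarize_run_info_data(data: Any, max_chars: int = MAX_RUN_INFO_DATA_CHARS) -> Any:
    """Return a bounded summary of a pipeline input for run_info["data"]."""
    # Empty input maps to the string "None" (emptiness check is an assumption).
    if not data:
        return "None"

    # A list of Data records reduces to their ids.
    if isinstance(data, list) and all(isinstance(item, Data) for item in data):
        return [str(item.id) for item in data]

    # Anything else falls back to str(data), capped at max_chars.
    text = str(data)
    if len(text) <= max_chars:
        return text

    # The marker records the original length so the truncation is visible.
    return f"{text[:max_chars]} [truncated, {len(text)} chars total]"


if __name__ == "__main__":
    payload = "x" * 280_000  # roughly the per-row size reported above
    summary = summarize_run_info_data(payload)
    print(len(payload), "->", len(summary))
    print(summarize_run_info_data([Data(id=uuid4())]))
```

Under these assumptions a ~280 KB text payload is stored as a little over 512 characters per row, while short inputs, empty input and lists of `Data` pass through exactly as before.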
Acceptance Criteria
run_info["data"] no longer stores unbounded payloads: large inputs are truncated to 512 chars followed by a [truncated, N chars total] marker.
Existing behaviour is unchanged for empty input ("None") and lists of Data (list of ids).

Local test run:
Type of Change
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.