feat(cli): add `re prune` to back up and delete old CM data across datasets by joe-prosser · Pull Request #447 · reinfer/cli

joe-prosser · 2026-06-09T14:45:23Z

What

Adds re prune: it deletes two kinds of old data across a set of datasets —
comments (from the datasets' sources) and emails (from those sources'
buckets) — older than a cutoff, writing a verified, write-once backup to disk
first so the run is recoverable.

re prune --datasets D1,D2 --older-than-days N --backup-dir DIR \
         [--mailbox M] [--include-annotated] [--dry-run] [--yes]

It backs up both the comments and the emails, verifies every backup file against
a checksum, and only then deletes — sourcing every deleted id from the verified
backup. Backups are written in re create comments / re create emails format,
so a run restores end-to-end. Annotated comments are kept by default;
--dry-run backs up without deleting.
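
As a rough illustration of the selection rule described above (not the code in this PR), the sketch below turns `--older-than-days` into a cutoff and decides whether a comment belongs in the deletion set. The `Candidate` struct, both function names, and the choice of "strictly before the cutoff" are assumptions made for the example.

```rust
/// Hypothetical per-comment inputs; the field names are illustrative only.
pub struct Candidate {
    /// Comment timestamp as seconds since the Unix epoch.
    pub timestamp_secs: i64,
    /// Whether the comment's uid appears in the annotation backup.
    pub annotated: bool,
}

/// Converts an `--older-than-days N` value into a cutoff relative to `now_secs`.
pub fn cutoff_secs(now_secs: i64, older_than_days: u32) -> i64 {
    now_secs - i64::from(older_than_days) * 86_400
}

/// Returns true when the comment is in the deletion set.
/// Assumption: "older than" means strictly before the cutoff.
pub fn in_deletion_set(c: &Candidate, cutoff: i64, include_annotated: bool) -> bool {
    c.timestamp_secs < cutoff && (include_annotated || !c.annotated)
}
```

With `include_annotated` set to `false`, an old annotated comment is kept; setting it to `true` puts the same comment in the deletion set, while anything at or after the cutoff is kept either way.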

🤖 Generated with Claude Code

…tasets

`re prune --datasets D1,D2 --older-than-days N --backup-dir DIR` runs a single auditable backup -> verify -> delete sequence so the deleted set is provably a subset of a verified, write-once backup.

Phases: resolve scope (datasets -> sources -> buckets); best-effort shared-source check (abort if an in-scope source belongs to a visible dataset not listed); write-once backup dir; stream annotation + deletion-set backups (CRC32 + line count per file, ids never held in memory); verify all backups against their checksums; two-pass verify-then-delete in batches of 32.

Backups are written in `re create comments` / `re create emails` formats (full AnnotatedComment / full Email incl. mime_content), proven restorable end to end. Supports `--dry-run`, `--older-than-days` / `--before`, and `--yes` (required non-interactively).

`--include-annotated` (default off) keeps annotated comments in the default mode; on, deletes them too (annotations still backed up). "Annotated" is scoped to the datasets the operator can see: a comment is kept only if its uid appears in the annotation backup, replacing the global `comment.has_annotations` flag. A comment annotated only in a dataset the operator can't access is treated as un-annotated and deleted in the default mode -- disclosed in the confirmation prompt.

`--mailbox` scopes a run to a single mailbox: emails are filtered server-side by mailbox name, and comments are matched (case-insensitively) against the `Mailbox ID` user property that email parsing writes. A comment without that property is never matched, so a mailbox-scoped run only deletes comments it can positively attribute to the mailbox.

A single up-front confirmation lists the limitations (comment attachment content not backed up; annotations in invisible datasets not backed up; whole-bucket-or-mailbox email deletion; the backup is the only undo and holds personal data) before any backup runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
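
To make the "CRC32 + line count per file" step concrete, here is a minimal sketch of computing both while streaming, so only the counters stay in memory. It is an illustration under assumptions, not the PR's implementation: `Summary`, `summarize`, and `verify` are hypothetical names, and the checksum is the common reflected CRC-32 (polynomial 0xEDB88320) written out by hand to keep the example dependency-free.

```rust
use std::io::{self, BufRead};

/// Summary recorded per backup file: line count plus CRC-32 of its bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Summary {
    pub lines: u64,
    pub crc32: u32,
}

/// Feeds `bytes` into a running CRC-32 state (reflected polynomial 0xEDB88320).
/// The state starts at 0xFFFF_FFFF and is inverted once at the very end.
fn crc32_update(mut state: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        state ^= u32::from(b);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (state & 1).wrapping_neg();
            state = (state >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    state
}

/// Streams the input line by line, keeping only the counters in memory.
pub fn summarize<R: BufRead>(mut reader: R) -> io::Result<Summary> {
    let mut state = 0xFFFF_FFFF_u32;
    let mut lines = 0_u64;
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until keeps the trailing newline, so the CRC covers every byte.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        state = crc32_update(state, &buf);
        lines += 1;
    }
    Ok(Summary { lines, crc32: !state })
}

/// Re-reads a backup and compares it with the summary recorded at write time.
pub fn verify<R: BufRead>(reader: R, expected: Summary) -> io::Result<bool> {
    Ok(summarize(reader)? == expected)
}
```

Recording the line count alongside the checksum gives a cheap second signal: a truncated file usually fails both checks, and a count mismatch says how many records are missing.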

tommilligan

LGTM - I like the back-up-then-read-from-backup style for ensuring we have the data locally before actioning.

I was a bit confused about the hash-after-action pattern (see comments inline), but I think it's fine as currently used; it's just potentially a bit of a footgun for future refactorings.

tommilligan · 2026-06-12T15:50:48Z

+        on_batch(&batch)?;
+    }
+
+    check_integrity(path, expected, count, crc32)


Here, your check_integrity runs after you've actioned each batch, not before. So theoretically, I guess your ids could be corrupted in the file, and you would delete them and only realise the file was corrupt afterwards.

I would have expected an integrity check to be the first thing you do before any actions.

I see in the comments there are some notes about "this function should be called first with a no-op" - I'd say it's easier to just have hash verification as a completely separate function distinct from this batch-style processing, unless there's a reason to combine them.
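
A rough sketch of that separation, assuming a hypothetical `verify_then_process` helper (neither the name nor the signature comes from this PR): the integrity check is its own step that must succeed before any batch reaches `on_batch`, so a corrupt file aborts the run with nothing actioned.

```rust
/// Runs `verify` to completion and, only if it succeeds, hands `ids` to
/// `on_batch` in chunks of at most `batch_size`. Returns the number of ids
/// that were passed to `on_batch`.
pub fn verify_then_process<V, I, F>(
    verify: V,
    ids: I,
    batch_size: usize,
    mut on_batch: F,
) -> Result<usize, String>
where
    V: FnOnce() -> Result<(), String>,
    I: IntoIterator<Item = String>,
    F: FnMut(&[String]) -> Result<(), String>,
{
    if batch_size == 0 {
        return Err("batch_size must be non-zero".to_string());
    }
    // Integrity check first: a failure here returns before any batch runs.
    verify()?;

    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut processed = 0;
    for id in ids {
        batch.push(id);
        if batch.len() == batch_size {
            on_batch(&batch[..])?;
            processed += batch.len();
            batch.clear();
        }
    }
    // Flush the final, possibly short, batch.
    if !batch.is_empty() {
        on_batch(&batch[..])?;
        processed += batch.len();
    }
    Ok(processed)
}
```

Because `verify` is a separate closure, the batch loop cannot be reached with an unchecked file, which removes the need to remember a no-op first pass.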

joe-prosser self-assigned this Jun 9, 2026

joe-prosser requested a review from tommilligan June 9, 2026 14:50

tommilligan approved these changes Jun 15, 2026

joe-prosser added 4 commits June 29, 2026 18:16

cargo fmt

bfe2e0c

add verify_backup_records

9f1730c

fix test_prune_keeps_dismissed_only_reviewed_comment

0954362

de-flake test_prune_mailbox_filters_comments_by_user_property

95fdefd

joe-prosser wants to merge 5 commits into master from feat/RE-12394-prune
