feat(cli): add re prune to back up and delete old CM data across datasets #447

Open
joe-prosser wants to merge 5 commits into master from feat/RE-12394-prune

Conversation

@joe-prosser

Collaborator

What

Adds re prune: it deletes two kinds of old data across a set of datasets —
comments (from the datasets' sources) and emails (from those sources'
buckets) — older than a cutoff, writing a verified, write-once backup to disk
first so the run is recoverable.

re prune --datasets D1,D2 --older-than-days N --backup-dir DIR \
         [--mailbox M] [--include-annotated] [--dry-run] [--yes]

It backs up both the comments and the emails, verifies every backup file against
a checksum, and only then deletes — sourcing every deleted id from the verified
backup. Backups are written in re create comments / re create emails format,
so a run restores end-to-end. Annotated comments are kept by default;
--dry-run backs up without deleting.

🤖 Generated with Claude Code

…tasets

`re prune --datasets D1,D2 --older-than-days N --backup-dir DIR` runs a
single auditable backup -> verify -> delete sequence so the deleted set is
provably a subset of a verified, write-once backup.

Phases: resolve scope (datasets -> sources -> buckets); best-effort
shared-source check (abort if an in-scope source belongs to a visible
dataset not listed); write-once backup dir; stream annotation + deletion-set
backups (CRC32 + line count per file, ids never held in memory); verify all
backups against their checksums; two-pass verify-then-delete in batches of 32.
Backups are written in `re create comments` / `re create emails` formats
(full AnnotatedComment / full Email incl. mime_content), proven restorable
end to end. Supports `--dry-run`, `--older-than-days` / `--before`, and
`--yes` (required non-interactively).
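The per-file integrity record described above (CRC32 plus line count, computed while streaming) could be sketched roughly as follows. This is an illustration only, not the code in this PR: `Integrity`, `crc32_update`, and `compute_integrity` are hypothetical names, and the CRC is written out bitwise so the sketch has no dependencies.

```rust
use std::io::{self, BufRead};

/// Bitwise CRC-32 (reflected polynomial 0xEDB88320), updated incrementally.
fn crc32_update(mut crc: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        crc ^= u32::from(b);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Hypothetical per-file record: line count plus CRC32 of the raw bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Integrity {
    lines: u64,
    crc32: u32,
}

/// Streams the reader one line at a time, so only the running count and
/// checksum are kept in memory, never the ids themselves.
fn compute_integrity<R: BufRead>(mut reader: R) -> io::Result<Integrity> {
    let mut crc = 0xFFFF_FFFFu32;
    let mut lines = 0u64;
    let mut buf = Vec::new();
    loop {
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        crc = crc32_update(crc, &buf);
        lines += 1;
    }
    Ok(Integrity { lines, crc32: !crc })
}
```

Verification would then amount to recomputing this record from the file on disk and comparing it with the one captured at write time.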

`--include-annotated` is off by default, which keeps annotated comments; when
on, it deletes them too (annotations are still backed up). "Annotated" is
scoped to the datasets the operator can see: a comment is kept only if its
uid appears in the annotation backup, replacing the global
`comment.has_annotations` flag. A comment annotated only in a dataset the
operator can't access is treated as un-annotated and deleted in the default
mode -- disclosed in the confirmation prompt.
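As a rough sketch of the rule above, assuming the uids from the annotation backup are available as a set: `Comment` and `should_delete` are hypothetical names for illustration, and the real types carry far more than a uid.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down comment record.
struct Comment {
    uid: String,
}

/// A comment is kept only when annotated comments are being preserved and
/// its uid appears in the annotation backup; everything else is deleted.
fn should_delete(
    comment: &Comment,
    annotated_uids: &HashSet<String>,
    include_annotated: bool,
) -> bool {
    include_annotated || !annotated_uids.contains(&comment.uid)
}
```

Because the set is built only from datasets the operator can see, a comment annotated elsewhere is absent from it and falls into the deletion set, which matches the limitation disclosed in the prompt.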

`--mailbox` scopes a run to a single mailbox: emails are filtered server-side
by mailbox name, and comments are matched (case-insensitively) against the
`Mailbox ID` user property that email parsing writes. A comment without that
property is never matched, so a mailbox-scoped run only deletes comments it
can positively attribute to the mailbox.
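The comment-side match might look something like this minimal sketch. It assumes user properties are a plain string map keyed by `Mailbox ID`; the real key format and property type may differ, and `matches_mailbox` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Assumed property key, taken from the name used in the commit message.
const MAILBOX_PROPERTY: &str = "Mailbox ID";

/// True only when the property exists and equals the requested mailbox,
/// ignoring case. A comment without the property never matches.
fn matches_mailbox(user_properties: &HashMap<String, String>, mailbox: &str) -> bool {
    user_properties
        .get(MAILBOX_PROPERTY)
        .map_or(false, |value| value.to_lowercase() == mailbox.to_lowercase())
}
```

Defaulting to "no match" when the property is missing is what keeps a mailbox-scoped run from deleting comments it cannot attribute.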

A single up-front confirmation lists the limitations (comment attachment
content not backed up; annotations in invisible datasets not backed up;
whole-bucket-or-mailbox email deletion; the backup is the only undo and holds
personal data) before any backup runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@joe-prosser joe-prosser self-assigned this Jun 9, 2026
@joe-prosser joe-prosser requested a review from tommilligan June 9, 2026 14:50

@tommilligan tommilligan left a comment

Contributor

LGTM - I like the back-up-then-read-from-backup style for ensuring we have the data locally before actioning.

I was a bit confused about the hash-after-action pattern (see comments inline), but I think it's fine as currently used; it's just potentially a bit of a footgun for future refactorings.

Comment thread cli/src/commands/prune.rs Outdated
on_batch(&batch)?;
}

check_integrity(path, expected, count, crc32)

Contributor

Here, your check_integrity runs after you've actioned each batch, not before. So theoretically, I guess your ids could be corrupted in the file: you action deleting them and only realise the file was corrupt afterwards.

I would have expected an integrity check to be the first thing you do before any actions.

Contributor

I see in the comments there are some notes about "this function should be called first with a no-op" - I'd say it's easier to just have hash verification as a completely separate function distinct from this batch-style processing, unless there's a reason to combine them.
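One possible shape for that split is sketched below under stated assumptions. The function names are hypothetical, and a wrapping byte sum stands in for CRC32 purely to keep the example short; the point is only the ordering, with a side-effect-free verification pass finishing before any batch is actioned.

```rust
use std::io::{self, BufRead};

const BATCH_SIZE: usize = 32;

/// Pass 1: read the whole backup and compare it with the expected record.
/// No callbacks, no side effects. The byte sum is a stand-in for CRC32.
fn verify_backup<R: BufRead>(reader: R, expected_lines: u64, expected_sum: u32) -> io::Result<()> {
    let (mut lines, mut sum) = (0u64, 0u32);
    for line in reader.lines() {
        let line = line?;
        lines += 1;
        sum = line.bytes().fold(sum, |acc, b| acc.wrapping_add(u32::from(b)));
    }
    if lines == expected_lines && sum == expected_sum {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "backup failed integrity check"))
    }
}

/// Pass 2: hand ids to the callback in fixed-size batches.
fn for_each_batch<R: BufRead>(
    reader: R,
    mut on_batch: impl FnMut(&[String]) -> io::Result<()>,
) -> io::Result<()> {
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    for line in reader.lines() {
        batch.push(line?);
        if batch.len() == BATCH_SIZE {
            on_batch(&batch)?;
            batch.clear();
        }
    }
    if !batch.is_empty() {
        on_batch(&batch)?; // final partial batch
    }
    Ok(())
}

/// Verification must succeed before the first batch is actioned.
fn verify_then_process<R: BufRead>(
    open: impl Fn() -> R,
    expected_lines: u64,
    expected_sum: u32,
    on_batch: impl FnMut(&[String]) -> io::Result<()>,
) -> io::Result<()> {
    verify_backup(open(), expected_lines, expected_sum)?;
    for_each_batch(open(), on_batch)
}
```

With this layout a caller cannot action a batch from a file that fails its check, and there is no need to remember to call the batch routine first with a no-op.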
