feat(cli): add `re prune` to back up and delete old CM data across datasets #447
joe-prosser wants to merge 5 commits into
…tasets

`re prune --datasets D1,D2 --older-than-days N --backup-dir DIR` runs a single auditable backup -> verify -> delete sequence so the deleted set is provably a subset of a verified, write-once backup.

Phases:
- resolve scope (datasets -> sources -> buckets)
- best-effort shared-source check (abort if an in-scope source belongs to a visible dataset not listed)
- write-once backup dir
- stream annotation + deletion-set backups (CRC32 + line count per file, ids never held in memory)
- verify all backups against their checksums
- two-pass verify-then-delete in batches of 32

Backups are written in `re create comments` / `re create emails` formats (full AnnotatedComment / full Email incl. mime_content), proven restorable end to end.

Supports `--dry-run`, `--older-than-days` / `--before`, and `--yes` (required non-interactively).

With `--include-annotated` off (the default), annotated comments are kept; with it on, they are deleted too (annotations are still backed up). "Annotated" is scoped to the datasets the operator can see: a comment is kept only if its uid appears in the annotation backup, replacing the global `comment.has_annotations` flag. A comment annotated only in a dataset the operator can't access is treated as un-annotated and deleted in the default mode; this is disclosed in the confirmation prompt.

`--mailbox` scopes a run to a single mailbox: emails are filtered server-side by mailbox name, and comments are matched (case-insensitively) against the `Mailbox ID` user property that email parsing writes. A comment without that property is never matched, so a mailbox-scoped run only deletes comments it can positively attribute to the mailbox.

A single up-front confirmation lists the limitations (comment attachment content not backed up; annotations in invisible datasets not backed up; whole-bucket-or-mailbox email deletion; the backup is the only undo and holds personal data) before any backup runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
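The keep-or-delete rule for comments described in the commit message can be sketched as a single predicate. This is only an illustration under stated assumptions: `Comment`, `PruneOptions`, and `should_delete` are hypothetical names, not the PR's types, the age cutoff is left out, and the annotated uids are held in an in-memory set purely to keep the sketch short (the PR states that ids are streamed rather than held in memory).

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of a comment; field names are illustrative only.
struct Comment {
    uid: String,
    /// Value of the `Mailbox ID` user property, if email parsing wrote one.
    mailbox_id: Option<String>,
}

/// Hypothetical run options mirroring `--include-annotated` and `--mailbox`.
struct PruneOptions {
    include_annotated: bool,
    mailbox: Option<String>,
}

/// Returns true when the comment belongs in the deletion set.
/// `annotated_uids` stands in for "uids that appear in the annotation backup".
fn should_delete(
    comment: &Comment,
    annotated_uids: &HashSet<String>,
    opts: &PruneOptions,
) -> bool {
    // Mailbox scoping: a comment without the property is never matched,
    // and the comparison ignores case.
    if let Some(wanted) = &opts.mailbox {
        match &comment.mailbox_id {
            Some(actual) if actual.to_lowercase() == wanted.to_lowercase() => {}
            _ => return false,
        }
    }
    // Default mode keeps any comment whose uid is in the annotation backup;
    // `--include-annotated` deletes those as well.
    opts.include_annotated || !annotated_uids.contains(&comment.uid)
}
```

Because the check consults only the annotation backup, a comment annotated in a dataset the operator cannot see falls through to deletion in the default mode, which matches the limitation the confirmation prompt discloses.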
tommilligan left a comment
LGTM - I like the back-up-then-read-from-the-backup style for ensuring we have the data locally before actioning.
I was a bit confused about the hash-after-action pattern (see comments inline), but I think it's fine as currently used; it's just potentially a bit of a footgun for future refactorings.
            on_batch(&batch)?;
        }

        check_integrity(path, expected, count, crc32)
Here, your check_integrity runs after you've actioned each batch, not before. So theoretically, I guess the ids could be corrupted in the file, you action deleting them, and you only realise the file was corrupt afterwards.
I would have expected an integrity check to be the first thing you do, before any actions.
I see in the comments there are some notes about "this function should be called first with a no-op" - I'd say it's easier to just have hash verification as a completely separate function distinct from this batch-style processing, unless there's a reason to combine them.
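A minimal sketch of the separation suggested here, assuming the backup is a newline-delimited id file with a recorded line count and CRC32. `verify_backup`, `for_each_batch`, and `verify_then_process` are hypothetical names, not functions from the PR, and the bitwise CRC-32 is hand-rolled only so the example has no dependencies.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Read};
use std::path::Path;

/// Incremental bitwise CRC-32 (reflected polynomial 0xEDB88320).
/// Start from 0xFFFF_FFFF and XOR the final value with 0xFFFF_FFFF.
fn crc32_update(mut crc: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Step 1: read the whole file and compare line count and CRC-32.
/// Has no side effects, so it can run before anything is deleted.
fn verify_backup(path: &Path, expected_lines: u64, expected_crc32: u32) -> io::Result<()> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 8192];
    let mut crc = 0xFFFF_FFFFu32;
    let mut lines = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        crc = crc32_update(crc, &buf[..n]);
        lines += buf[..n].iter().filter(|&&b| b == b'\n').count() as u64;
    }
    if lines != expected_lines || (crc ^ 0xFFFF_FFFF) != expected_crc32 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "backup failed integrity check",
        ));
    }
    Ok(())
}

/// Step 2: stream ids in fixed-size batches; does no integrity checking itself.
fn for_each_batch<F>(path: &Path, batch_size: usize, mut on_batch: F) -> io::Result<()>
where
    F: FnMut(&[String]) -> io::Result<()>,
{
    let reader = BufReader::new(File::open(path)?);
    let mut batch = Vec::with_capacity(batch_size);
    for line in reader.lines() {
        batch.push(line?);
        if batch.len() == batch_size {
            on_batch(&batch)?;
            batch.clear();
        }
    }
    if !batch.is_empty() {
        on_batch(&batch)?;
    }
    Ok(())
}

/// Verification always completes before the first batch is actioned.
fn verify_then_process<F>(
    path: &Path,
    expected_lines: u64,
    expected_crc32: u32,
    batch_size: usize,
    on_batch: F,
) -> io::Result<()>
where
    F: FnMut(&[String]) -> io::Result<()>,
{
    verify_backup(path, expected_lines, expected_crc32)?;
    for_each_batch(path, batch_size, on_batch)
}
```

The cost of this split is that the file is read twice; in exchange, a corrupt file is rejected before `on_batch` is ever called, and no caller has to remember to make a no-op first pass.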
What

Adds `re prune`, which deletes two kinds of old data across a set of datasets: comments (from the datasets' sources) and emails (from those sources' buckets). Only data older than a cutoff is deleted, and a verified, write-once backup is written to disk first so the run is recoverable.

It backs up both the comments and the emails, verifies every backup file against a checksum, and only then deletes, sourcing every deleted id from the verified backup. Backups are written in `re create comments` / `re create emails` format, so a run restores end-to-end. Annotated comments are kept by default; `--dry-run` backs up without deleting.

🤖 Generated with Claude Code