From 85731f9fbccb2bfc1425a3d6b50f84b7cf3817f5 Mon Sep 17 00:00:00 2001
From: "cortex-ai-agents[bot]" <279748396+cortex-ai-agents[bot]@users.noreply.github.com>
Date: Mon, 29 Jun 2026 07:49:49 +0000
Subject: [PATCH] docs(failover): document postgres data-loss safeguard

Co-Authored-By: Claude Opus 4.7
---
 docs/reservations/failover-reservations.md | 25 ++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/docs/reservations/failover-reservations.md b/docs/reservations/failover-reservations.md
index c2af22a8e..cd642d46c 100644
--- a/docs/reservations/failover-reservations.md
+++ b/docs/reservations/failover-reservations.md
@@ -26,15 +26,36 @@ The controller has two reconciliation modes:
 ```mermaid
 flowchart TD
     P1[List Hypervisors from K8s]
+    P1b["Build active-VM set from<br/>Hypervisor CRD Status.Instances"]
     P2["List VMs from Postgres<br/>(vm_source.go)"]
-    P3["Remove Invalid VMs from reservations<br/>(e.g., vm:host mapping wrong or vm deleted)"]
+    P3["Remove Invalid VMs from reservations<br/>(e.g., vm:host mapping wrong or vm deleted)<br/>with postgres data-loss safeguard"]
     P4["Remove Non-eligible VMs from reservations<br/>(via eligibility rules, reservation_eligibility.go)"]
     P5[Delete Empty Reservations]
     P6["Create/Assign Reservations<br/>(reservation_scheduling.go)"]
-    P1 --> P2 --> P3 --> P4 --> P5 --> P6
+    P1 --> P1b --> P2 --> P3 --> P4 --> P5 --> P6
 ```
 
+#### Postgres Data-Loss Safeguard
+
+Before removing a VM from a failover reservation because it is missing from the
+postgres-derived VM source, the controller cross-checks the Hypervisor CRD
+`Status.Instances`. If the VM is still reported as active on any hypervisor, its
+allocation is preserved in the reservation.
+
+This safeguard prevents a postgres data loss or restore event from cascading
+into the mass deletion of all failover reservations. Without it, a wiped or
+partially restored Nova database would make every VM appear "deleted," causing
+the controller to empty and then garbage-collect all reservations -- leaving the
+entire fleet without failover coverage until postgres recovers and the
+reservations are rebuilt.
+
+The active-VM set (`vmsOnHypervisor`) is built once per reconciliation cycle by
+iterating over all Hypervisor CRD `Status.Instances` entries that are marked
+active. During the "Remove Invalid VMs" step, if a VM UUID is absent from the
+postgres VM list but present in this set, the allocation is kept and a log
+message is emitted.
+
 ### Watch-based Reconciliation
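
The safeguard documented in this patch can be sketched in Go as follows. This is a minimal illustration, not the controller's actual code: the `Instance`, `Hypervisor`, and `Allocation` types and the `buildActiveVMSet` and `filterAllocations` functions are hypothetical names introduced here. Only `vmsOnHypervisor` and `Status.Instances` come from the documentation. The sketch covers only the "missing from postgres" case and leaves out the vm:host mapping check.

```go
package main

// Instance and Hypervisor are hypothetical stand-ins for the Hypervisor CRD;
// the Instances field plays the role of Status.Instances.
type Instance struct {
	ID     string
	Active bool
}

type Hypervisor struct {
	Name      string
	Instances []Instance
}

// Allocation is a hypothetical, simplified reservation entry for one VM.
type Allocation struct {
	VMUUID string
	Host   string
}

// buildActiveVMSet collects the UUIDs of all instances marked active on any
// hypervisor. It is intended to be called once per reconciliation cycle.
func buildActiveVMSet(hypervisors []Hypervisor) map[string]struct{} {
	vmsOnHypervisor := make(map[string]struct{})
	for _, hv := range hypervisors {
		for _, inst := range hv.Instances {
			if inst.Active {
				vmsOnHypervisor[inst.ID] = struct{}{}
			}
		}
	}
	return vmsOnHypervisor
}

// filterAllocations returns the allocations to keep, plus the UUIDs that were
// kept only because of the safeguard (absent from postgres, but still active
// on a hypervisor). A real controller would log the latter.
func filterAllocations(
	allocs []Allocation,
	postgresVMs, vmsOnHypervisor map[string]struct{},
) (kept []Allocation, preserved []string) {
	for _, a := range allocs {
		if _, ok := postgresVMs[a.VMUUID]; ok {
			kept = append(kept, a) // VM is known to postgres: nothing to decide
			continue
		}
		if _, ok := vmsOnHypervisor[a.VMUUID]; ok {
			kept = append(kept, a) // safeguard: hypervisor still reports the VM
			preserved = append(preserved, a.VMUUID)
			continue
		}
		// Absent from both sources: the allocation is dropped.
	}
	return kept, preserved
}
```

Under these assumptions, calling `filterAllocations` with an empty `postgresVMs` map (a wiped database) still keeps every allocation whose VM is active on some hypervisor. Only VMs missing from both sources are removed.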