[rel-v0.62] Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines and prevent machines getting stuck in creation flow beyond CreationTimeout#1111
Conversation
…eservation state.
…hase. Ignore node not found errors.
…ines is not modified by `manageMachinePreservation`
…ss function calls.
…n manageMachinePreservation: This call is made to uncordon nodes that are preserved with preserve=now and the machine has recovered to Running. Preservation is not stopped in this case.
063f8db to
f8d53be
Compare
… reduced when possible
…edByMCM` to `PreserveMachineAnnotationValueAutoPreserved`
f8d53be to
200ce00
Compare
| } else if err != nil { | ||
| return | ||
| } |
There was a problem hiding this comment.
Similar here. err is populated only by c.getPreserveStateInfo(), and there we return if err != nil
| nodeValue string | ||
| machineValue string | ||
| lastAppliedNodeValue string | ||
| PreserveExpiryTimeSet bool |
There was a problem hiding this comment.
Need not be exported
| PreserveExpiryTimeSet bool | |
| preserveExpiryTimeSet bool |
| _, err := c.clearMachinePreserveExpiryTime(ctx, machine) | ||
| klog.Warningf("Node %q of machine %q not found. Proceeding to stop preservation on machine.", nodeName, machine.Name) | ||
| // Node not found, proceed to delete annotations and clear preserveExpiryTime on machine | ||
| machine, err = c.removePreserveAnnotationOnMachine(ctx, machine) |
There was a problem hiding this comment.
Is there a reason this is not gated with removePreservationAnnotations? I see that every other invocation is removePreserveAnnotationOnMachine() within this function is gated
| // PreventAutoPreserveAnnotationValues contains the values to check if a machine is already annotated for preservation, | ||
| // in which case it should not be auto-preserved. | ||
| var PreventAutoPreserveAnnotationValues = sets.New(PreserveMachineAnnotationValueNow, PreserveMachineAnnotationValueWhenFailed, PreserveMachineAnnotationValuePreservedByMCM, PreserveMachineAnnotationValueFalse) | ||
| var AllowedPreserveAnnotationValues = sets.New(PreserveMachineAnnotationValueNow, PreserveMachineAnnotationValueWhenFailed, PreserveMachineAnnotationValueAutoPreserved, PreserveMachineAnnotationValueFalse) |
There was a problem hiding this comment.
Yes! There is no use for preserve="". The user can either delete the annotation, or set it to false.
| machineAnnotationValue: machineutils.PreserveMachineAnnotationValuePreservedByMCM, | ||
| }, | ||
| }), | ||
| Entry("when node is annotated and preservation times out, should stop preservation", testCase{ |
There was a problem hiding this comment.
Why is this test removed? This looks like a valid case
There was a problem hiding this comment.
The test following this one covers this test case and also checks for annotation removal.
| retry: machineutils.LongRetry, | ||
| }, | ||
| }), | ||
| Entry("when invalid preserve annotation is added on node of un-preserved machine, should do nothing ", testCase{ |
There was a problem hiding this comment.
This looks like a valid case too
There was a problem hiding this comment.
The test following this one is a duplicate of this one.
| return true | ||
| } | ||
|
|
||
| func (c *controller) getPreserveStateInfo(machine *v1alpha1.Machine) (*preserveStateInfo, error) { |
There was a problem hiding this comment.
Please consider writing some unit tests for the new helper methods that you introduced
| if err != nil { | ||
| if apierrors.IsConflict(err) { | ||
| if apierrors.IsNotFound(err) { | ||
| klog.Warningf("Error during preservation flow:%v", err.Error()) |
There was a problem hiding this comment.
nit
| klog.Warningf("Error during preservation flow:%v", err.Error()) | |
| klog.Warningf("Error during preservation flow: %v", err.Error()) |
- nits - if not found, machine annotation value must be enforced - if machine annotation value is invalid, log and continue assuming it does not exist - gate `removePreserveAnnotationOnMachine` when node is not found
69b9eb6 to
cb105e7
Compare
d2f0d8f to
91b9476
Compare
r4mek
left a comment
There was a problem hiding this comment.
Thanks for the PR! I have posted few comments, some of which are more related to code cleanup which you can choose to do now or defer to a later PR.
…encing happens in future edits. Add removed test.
|
LGTM label has been added. DetailsGit tree hash: 880cd48b1d8ae1ea643961376d69f163bd383428 |
|
/lgtm with the understanding that code-simplification changes suggested will be deferred to another PR along with integration tests. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aaronfern The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
The previous implementation had several correctness issues in the machine preservation flow:
Runningmachine with a cordoned backing node — even one cordoned for reasons unrelated to preservation (e.g. Cluster Autoscaler scale-down) — would be unconditionally uncordoned on every reconcile pass.LastAppliedNodePreserveValueAnnotationsync in the defer compared annotations on a clone that had already been updated via UpdateStatus calls insidepreserveMachine/stopPreservationIfActive. Because UpdateStatus returns the server's stored metadata (not the in-memory mutation), the annotation diff was never detected and the sync was silently dropped. On paths where the sync did fire, it was using a stale ResourceVersion, producing spurious conflict errors.This PR fixes all of the above by:
preserveStateInfo structpopulated bygetPreserveStateInfoat the start of each reconcile, reading node and machine annotation state in a single pass and storing it for the rest of the function. This eliminates the double lister read and decouples the annotation sync from the defer.isMachinePreservationBoundgate: machines with no preservation state (noPreserveExpiryTime, no preserve annotation on machine or node, noLastAppliedNodePreserveValueAnnotation) skip the entire preservation flow andreturn LongRetry, nilimmediately.LastAppliedNodePreserveValueAnnotationsync to an explicitupdatePreserveAnnotationOnMachinecall at the end of the function viashouldAnnotationsBeUpdatedOnMachine, rather than a defer that compared potentially stale annotation maps.stopPreservationIfActive(step 4), so it fires only when preservation is actively being stopped for aRunning machine— not on every reconcile of any active machine. ThemanageMachinePreservationpath retains a narrower uncordon call gated on!preservationStopped && PreserveExpiryTime != nil && Phase == Runningto handle the case wherepreserveMachineis called on aRunningmachine (e.g.preserve=nowset by user).IsNotFoundhandling in the defer so transient not-found errors during preservation (e.g. machine deleted mid-reconcile) returnLongRetryrather thanShortRetry.removePreservationRelatedAnnotationsOnNode(now returns*corev1.Node) so that the subsequent uncordon step instopPreservationIfActiveoperates on a current node object rather than a potentially stale one.PreventAutoPreserveAnnotationValuesset in machineutils (it was identical toAllowedPreserveAnnotationValuesminus "");manageAutoPreservationOfFailedMachinesin machineset.go now usesAllowedPreserveAnnotationValuesdirectly.Which issue(s) this PR fixes:
Fixes #1110
Special notes for your reviewer:
Release note: