Skip to content

[rel-v0.62] Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines and prevent machines getting stuck in creation flow beyond CreationTimeout#1111

Merged
gardener-prow[bot] merged 20 commits into
gardener:rel-v0.62from
thiyyakat:preservation-hot-fix
Jul 1, 2026
Merged

Conversation

@thiyyakat

@thiyyakat thiyyakat commented Jun 24, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:
The previous implementation had several correctness issues in the machine preservation flow:

  1. Unconditional uncordon on every reconcile: uncordonNodeIfCordoned was called whenever IsMachineActive returned true and a node name was present. This meant any Running machine with a cordoned backing node — even one cordoned for reasons unrelated to preservation (e.g. Cluster Autoscaler scale-down) — would be unconditionally uncordoned on every reconcile pass.
  2. No gate for non-preservation-bound machines: Every machine was processed through the full preservation logic on every reconcile, even machines with no preservation state at all (no annotations, no PreserveExpiryTime). This caused unnecessary API calls on non-preserved machines.
  3. Stale clone causing conflict errors: The LastAppliedNodePreserveValueAnnotation sync in the defer compared annotations on a clone that had already been updated via UpdateStatus calls inside preserveMachine/stopPreservationIfActive. Because UpdateStatus returns the server's stored metadata (not the in-memory mutation), the annotation diff was never detected and the sync was silently dropped. On paths where the sync did fire, it was using a stale ResourceVersion, producing spurious conflict errors.

This PR fixes all of the above by:

  • Introducing preserveStateInfo struct populated by getPreserveStateInfo at the start of each reconcile, reading node and machine annotation state in a single pass and storing it for the rest of the function. This eliminates the double lister read and decouples the annotation sync from the defer.
  • Adding an explicit isMachinePreservationBound gate: machines with no preservation state (no PreserveExpiryTime, no preserve annotation on machine or node, no LastAppliedNodePreserveValueAnnotation) skip the entire preservation flow and return LongRetry, nil immediately.
  • Moving the LastAppliedNodePreserveValueAnnotation sync to an explicit updatePreserveAnnotationOnMachine call at the end of the function via shouldAnnotationsBeUpdatedOnMachine, rather than a defer that compared potentially stale annotation maps.
  • Moving uncordon logic into stopPreservationIfActive (step 4), so it fires only when preservation is actively being stopped for a Running machine — not on every reconcile of any active machine. The manageMachinePreservation path retains a narrower uncordon call gated on !preservationStopped && PreserveExpiryTime != nil && Phase == Running to handle the case where preserveMachine is called on a Running machine (e.g. preserve=now set by user).
  • Adding IsNotFound handling in the defer so transient not-found errors during preservation (e.g. machine deleted mid-reconcile) return LongRetryrather than ShortRetry.
  • Propagating the updated node object from removePreservationRelatedAnnotationsOnNode (now returns *corev1.Node) so that the subsequent uncordon step in stopPreservationIfActive operates on a current node object rather than a potentially stale one.
  • Removing the redundant PreventAutoPreserveAnnotationValues set in machineutils (it was identical to AllowedPreserveAnnotationValues minus ""); manageAutoPreservationOfFailedMachines in machineset.go now uses AllowedPreserveAnnotationValues directly.

Which issue(s) this PR fixes:
Fixes #1110

Special notes for your reviewer:

Release note:

Fix preservation reconcile loop to avoid unconditionally uncordoning nodes unrelated to machine preservation, prevent spurious writes on non-preserved machines, and eliminate conflict errors caused by stale machine object comparison in the defer annotation sync.

@thiyyakat thiyyakat requested a review from a team as a code owner June 24, 2026 10:19
@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 24, 2026
@thiyyakat thiyyakat changed the title Preservation hot fix Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines Jun 24, 2026
@thiyyakat thiyyakat added the kind/bug Bug label Jun 24, 2026
@gardener-prow gardener-prow Bot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jun 24, 2026
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from 063f8db to f8d53be Compare June 24, 2026 10:23
Comment thread pkg/util/provider/machineutils/utils.go Outdated
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from f8d53be to 200ce00 Compare June 24, 2026 10:57
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment on lines +794 to +796
} else if err != nil {
return
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here. err is populated only by c.getPreserveStateInfo(), and there we return if err != nil

@thiyyakat thiyyakat Jun 25, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cb105e7

nodeValue string
machineValue string
lastAppliedNodeValue string
PreserveExpiryTimeSet bool

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need not be exported

Suggested change
PreserveExpiryTimeSet bool
preserveExpiryTimeSet bool

@thiyyakat thiyyakat Jun 25, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cb105e7.

_, err := c.clearMachinePreserveExpiryTime(ctx, machine)
klog.Warningf("Node %q of machine %q not found. Proceeding to stop preservation on machine.", nodeName, machine.Name)
// Node not found, proceed to delete annotations and clear preserveExpiryTime on machine
machine, err = c.removePreserveAnnotationOnMachine(ctx, machine)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is not gated with removePreservationAnnotations? I see that every other invocation is removePreserveAnnotationOnMachine() within this function is gated

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cb105e7

// PreventAutoPreserveAnnotationValues contains the values to check if a machine is already annotated for preservation,
// in which case it should not be auto-preserved.
var PreventAutoPreserveAnnotationValues = sets.New(PreserveMachineAnnotationValueNow, PreserveMachineAnnotationValueWhenFailed, PreserveMachineAnnotationValuePreservedByMCM, PreserveMachineAnnotationValueFalse)
var AllowedPreserveAnnotationValues = sets.New(PreserveMachineAnnotationValueNow, PreserveMachineAnnotationValueWhenFailed, PreserveMachineAnnotationValueAutoPreserved, PreserveMachineAnnotationValueFalse)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "" removed intentionally?

@thiyyakat thiyyakat Jun 25, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! There is no use for preserve="". The user can either delete the annotation, or set it to false.

machineAnnotationValue: machineutils.PreserveMachineAnnotationValuePreservedByMCM,
},
}),
Entry("when node is annotated and preservation times out, should stop preservation", testCase{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test removed? This looks like a valid case

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test following this one covers this test case and also checks for annotation removal.

retry: machineutils.LongRetry,
},
}),
Entry("when invalid preserve annotation is added on node of un-preserved machine, should do nothing ", testCase{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a valid case too

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test following this one is a duplicate of this one.

return true
}

func (c *controller) getPreserveStateInfo(machine *v1alpha1.Machine) (*preserveStateInfo, error) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider writing some unit tests for the new helper methods that you introduced

@thiyyakat thiyyakat Jun 25, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cb105e7

if err != nil {
if apierrors.IsConflict(err) {
if apierrors.IsNotFound(err) {
klog.Warningf("Error during preservation flow:%v", err.Error())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
klog.Warningf("Error during preservation flow:%v", err.Error())
klog.Warningf("Error during preservation flow: %v", err.Error())

@thiyyakat thiyyakat Jun 25, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cb105e7

@gagan16k gagan16k left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes!

Comment thread pkg/util/provider/machinecontroller/machine_util.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/controller/deployment_machineset_util.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/node.go
@gardener-prow gardener-prow Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 25, 2026
- nits
- if not found, machine annotation value must be enforced
- if machine annotation value is invalid, log and continue assuming it does not exist
- gate `removePreserveAnnotationOnMachine` when node is not found
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from 69b9eb6 to cb105e7 Compare June 25, 2026 13:08

@gagan16k gagan16k left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments, PTAL

Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/node.go
Comment thread pkg/util/provider/machinecontroller/machine.go
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from d2f0d8f to 91b9476 Compare June 26, 2026 03:17

@r4mek r4mek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have posted few comments, some of which are more related to code cleanup which you can choose to do now or defer to a later PR.

Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine_util.go
Comment thread pkg/util/provider/machinecontroller/machine_util.go
Comment thread pkg/util/provider/machinecontroller/machine_util.go
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine_util.go
…encing happens in future edits. Add removed test.
@thiyyakat thiyyakat changed the title Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines [rel-v0.62] Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines Jun 29, 2026
@thiyyakat thiyyakat changed the title [rel-v0.62] Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines [rel-v0.62] Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines and prevent machines getting stuck in creation flow beyond CreationTimeout Jul 1, 2026

@aaronfern aaronfern left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-prow gardener-prow Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026
@gardener-prow

gardener-prow Bot commented Jul 1, 2026

Copy link
Copy Markdown

LGTM label has been added.

DetailsGit tree hash: 880cd48b1d8ae1ea643961376d69f163bd383428

@elankath

elankath commented Jul 1, 2026

Copy link
Copy Markdown
Member

/lgtm with the understanding that code-simplification changes suggested will be deferred to another PR along with integration tests.

@gardener-prow

gardener-prow Bot commented Jul 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aaronfern

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2026
@gardener-prow gardener-prow Bot merged commit 19f5500 into gardener:rel-v0.62 Jul 1, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/bug Bug lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants