
fix: netplan: upgrade 1.1.2 -> 1.2.1 (Fedora f44 import) to fix grubazl4 rollback PID1 generator deadlock #17815

Open
bfjelds wants to merge 1 commit into 4.0 from user/bfjelds/netplan-1.2.1-upgrade


bfjelds (Member) commented Jun 26, 2026

Summary

Upgrade netplan in Azure Linux 4.0 from 1.1.2 to 1.2.1 by moving the package's Fedora dist-git import pointer to the f44 branch head (which ships netplan 1.2.1-2). netplan 1.2.1 contains the upstream "split generate/configure" refactor that makes the boot-time systemd generator validation-only, eliminating a PID 1 self-deadlock that can freeze azurelinux boots with netplan configured.

Background: the boot freeze this fixes

azurelinux images with netplan configured freeze during boot with:

Failed to fork off sandboxing environment for executing generators: Protocol error
Freezing execution

Deadlock analysis

The freeze is a PID 1 self-deadlock between the netplan systemd generator and systemd's userdb service:

  1. If netplan generate is invoked ...
  2. On the next boot, PID 1 runs system generators synchronously and blocking, including netplan's generator (/usr/lib/systemd/system-generators/netplan).
  3. netplan's networkd backend resolves the systemd-network group. nsswitch.conf routes the group lookup through nss-systemd, which issues a varlink request (io.systemd.UserDatabase.GetGroupRecord / GetMemberships) to /run/systemd/userdb/io.systemd.DynamicUser.
  4. That userdb/DynamicUser varlink service is served by PID 1 itself — which is synchronously blocked inside manager_run_generators() waiting for this very generator batch to finish. The varlink round-trips stall.
  5. When the 90s generator-batch timeout fires, the executor is killed; PID 1 logs the misleading Protocol error and then Freezing execution.
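
The shape of steps 2 to 5 can be reproduced in miniature. The following is a toy Python model, not systemd code: the two queues stand in for the userdb varlink socket, and the function name and the short timeout are invented for illustration. As a simplification, the timeout is applied inside the generator thread, whereas in the real sequence PID 1's batch timeout kills the executor.

```python
import queue
import threading


def run_generator_batch(batch_timeout: float) -> str:
    """Toy model: a manager blocks on a generator that needs the manager."""
    requests: queue.Queue = queue.Queue()  # stands in for the userdb socket
    replies: queue.Queue = queue.Queue()   # replies the manager would send
    outcome = []

    def generator() -> None:
        # The generator asks the manager to resolve a group, then waits.
        requests.put("GetGroupRecord systemd-network")
        try:
            replies.get(timeout=batch_timeout)
            outcome.append("resolved")
        except queue.Empty:
            outcome.append("timed out")

    worker = threading.Thread(target=generator)
    worker.start()
    # The manager waits synchronously for the generator to finish. It never
    # reaches any code that reads `requests`, so no reply is ever produced.
    worker.join()
    return outcome[0]


print(run_generator_batch(batch_timeout=0.2))
```

In this model the request sits unread for the whole wait, because the only party that could answer it is the one blocked in `join()`.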

The freeze is host-speed-sensitive (a race against the 90s budget): on byte-identical images, slow or nested-virtualization hosts freeze, while faster CI hosts service the varlink request in time. The trigger is always netplan, the only generator that does a systemd-network group lookup.

Three conditions must coincide: (1) netplan's generator-phase getgrnam("systemd-network") (present since netplan 1.0.1), (2) authselect's files [SUCCESS=merge] systemd group line (new in AZL4), (3) PID 1 unable to answer its own userdb varlink during generators.
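
Condition (2) can be checked mechanically. Below is a minimal sketch, assuming the standard nsswitch.conf layout of `database: source [STATUS=action] source ...`. The helper names are hypothetical and the parser is deliberately simplistic: it only reports whether `systemd` is listed as a source for the group database.

```python
import re


def group_sources(nsswitch_text: str) -> list[str]:
    """Return the NSS sources listed for the `group` database, in order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line.startswith("group:"):
            continue
        spec = line[len("group:"):]
        spec = re.sub(r"\[[^\]]*\]", " ", spec)  # drop [STATUS=action] items
        return spec.split()
    return []


def group_lookup_reaches_nss_systemd(nsswitch_text: str) -> bool:
    """True if the group database lists the `systemd` source."""
    return "systemd" in group_sources(nsswitch_text)


sample = "passwd: files systemd\ngroup: files [SUCCESS=merge] systemd\n"
print(group_lookup_reaches_nss_systemd(sample))
```

It could be run against the contents of /etc/nsswitch.conf on an affected image to confirm that group lookups are routed through nss-systemd.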

Why not a systemd-side fix

A minimal systemd-side mitigation exists (bypass nss-systemd in the generator environment, e.g. SYSTEMD_NSS_DYNAMIC_BYPASS=1 in build_generator_environment(), ~20 lines). However, systemd's own documented contract places the fault on the generator, not on systemd.

systemd.generator(7), "Notes about writing generators" (https://www.freedesktop.org/software/systemd/man/latest/systemd.generator.html):

Generators are run very early at boot and cannot rely on any external services. They may not talk to any other process. [...] generators are executed synchronously and hence delay the entire boot if they are slow.

An NSS lookup that dispatches to nss-systemd is IPC to another process (PID 1's userdb). netplan calling getgrnam in a generator therefore violates the generator contract, so the upstream-systemd position is that this belongs in netplan, not systemd. The deadlock-prone design is also structurally unchanged in current systemd (verified v258 vs main/262~devel), and upstream systemd carries no targeted fix. A systemd workaround would be a non-upstreamable local divergence, whereas netplan already fixes it upstream.

The upstream netplan fix

netplan upstream split the generate and configure stages so the boot generator no longer writes networkd files (and makes no NSS call); the file writing + chown (the getgrnam) moved into a new netplan-configure.service ordered after boot, when PID 1's event loop is free.

Key change: PR #552 "Split generate/configure stages for sd-generator compliance" (canonical/netplan) — merged 2025-12-16, first released in v1.2, present in 1.2.1.

The generator / configure / util sources are byte-identical between netplan 1.2.1 and current main, so 1.2.1 carries the complete deadlock-avoiding refactor.

Why upgrade rather than patch 1.1.2

Backporting the refactor onto 1.1.2 is impractical:

  • PR #552 itself is 44 commits, 40 files, +2268 / -1335: new C sources (configure.c, gen-networkd.c, gen-openvswitch.c, gen-sriov.c), heavy rewrites of generate.c / networkd.c / openvswitch.c / sriov.c, a new systemd unit, plus meson / spec / CLI changes. Not a clean cherry-pick onto 1.1.2.
  • The full 1.1.2 -> main delta is 83 files / +3141 -1655 (~7,625 patch lines).
  • A smaller custom patch (resolve the group from /etc/group via fgetgrent instead of getgrnam, ~15-30 lines) is possible but diverges from upstream and we would own it indefinitely.
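
For illustration only, the idea behind the smaller custom patch in the last bullet (read the group file directly rather than going through NSS) looks roughly like this in Python. This is an analogue, not the patch itself: a real patch would be C against netplan's sources, and the function name here is hypothetical.

```python
from typing import Optional


def gid_from_group_file(name: str, path: str = "/etc/group") -> Optional[int]:
    """Return the GID for `name` by reading `path` directly, or None."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split(":")
            # group(5) format: name:password:GID:member,member,...
            if len(fields) >= 3 and fields[0] == name:
                try:
                    return int(fields[2])
                except ValueError:
                    return None  # malformed GID field
    return None
```

Because it never calls into NSS, this lookup cannot dispatch to nss-systemd. The trade-off is that a group defined only by another NSS source is not found and the function returns None.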

Moving the Fedora import pointer to f44 gets the released, upstream-maintained 1.2.1 with no vendored divergence.

What this PR changes

  • Adds base/comps/netplan/netplan.comp.toml pinning the Fedora f44 import (upstream-distro = fedora 44, upstream-commit = 66c31bcd3e9aeb8d15a5b4184009e57d799b0158).
  • Removes the inline [components.netplan] entry from base/comps/components.toml.
  • Regenerated locks/netplan.lock and rendered specs/n/netplan/ (now 1.2.1). The f43-era Fedora status_fail_cleanly.patch drops out (not present in f44); netplan-fallback-renderer.patch (Fedora Patch1001) is retained.

Adoption note (important for image consumers)

The refactor defers virtual-device creation (dummy / bridge / bond / vlan / SR-IOV) into netplan-configure.service, which Fedora ships preset: disabled. Images that rely on netplan applying such config at boot must enable netplan-configure.service (e.g. imagecustomizer services.enable), otherwise netplan config silently never applies at boot (empty /run/systemd/network).
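
A quick way to spot the symptom described above is to look at the networkd runtime directory after boot. This is a small sketch with a hypothetical helper name. It only lists the directory, so an empty result is a hint to check whether netplan-configure.service is enabled, not proof of the cause.

```python
import os


def networkd_runtime_config(run_dir: str = "/run/systemd/network") -> list[str]:
    """List entries in the networkd runtime directory; empty if it is missing."""
    try:
        return sorted(os.listdir(run_dir))
    except FileNotFoundError:
        return []
```

On an image where netplan config is expected to apply at boot, an empty list from `networkd_runtime_config()` matches the "silently never applies" case.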

Validation

netplan 1.2.1 RPMs were built locally with azldev from this branch (netplan-1.2.1-4.azl4 plus subpackages). Verified the package ships /usr/libexec/netplan/configure, /usr/lib/systemd/system/netplan-configure.service, and a validation-only /usr/libexec/netplan/generate.
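
The file check above can be scripted. Here is a minimal sketch, assuming the package file listing is available as text with one path per line (for example, captured from `rpm -ql`). The helper name is hypothetical, and which subpackage ships each path is not assumed.

```python
EXPECTED = (
    "/usr/libexec/netplan/configure",
    "/usr/lib/systemd/system/netplan-configure.service",
    "/usr/libexec/netplan/generate",
)


def missing_files(file_listing: str, expected=EXPECTED) -> list[str]:
    """Return the expected paths that do not appear in the listing."""
    shipped = {line.strip() for line in file_listing.splitlines() if line.strip()}
    return [path for path in expected if path not in shipped]
```

An empty return value means all three paths named in the validation are present in the listing.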

Local: Injected the locally built 1.2.1 RPMs into a Trident image build on stock systemd 258.4-4 (no systemd change), with netplan-configure.service enabled, and ran the Trident rollback update tests. Serial logs confirm that netplan ran (the 1.2.1-only netplan-configure.service started), and the tests passed.

Pipeline: The same locally built 1.2.1 RPMs were validated through the work-in-progress AZL4 Trident pipelines.

Copilot AI review requested due to automatic review settings June 26, 2026 18:37

Copilot AI (Contributor) left a comment

Pull request overview

This PR upgrades netplan in Azure Linux 4.0 from 1.1.2 to 1.2.1 by repointing the Fedora dist-git import from the global f43 snapshot to a pinned Fedora f44 commit (which ships netplan 1.2.1-2). The motivation is to pull in netplan's upstream "split generate/configure" refactor (canonical/netplan PR #552), which makes the boot-time systemd generator validation-only and defers virtual-device creation to a new netplan-configure.service. This eliminates a PID 1 self-deadlock (generator-phase getgrnam("systemd-network") NSS lookup blocking on PID 1's own userdb varlink) that can freeze grubazl4 A/B rollback boots. The change is a faithful upstream import — no vendored backport — consistent with the repo's "minimal divergence from upstream" principle.

Changes:

  • Adds a dedicated base/comps/netplan/netplan.comp.toml pinning the Fedora f44 import (upstream-distro = fedora 44, upstream-commit = 66c31bcd…) with a thorough rationale comment, and removes the inline [components.netplan] entry from components.toml.
  • Regenerates locks/netplan.lock, the rendered specs/n/netplan/netplan.spec (now 1.2.1, adds configure binary + netplan-configure.service), and the sources SHA512; the f43-only status_fail_cleanly.patch is dropped while netplan-fallback-renderer.patch is retained.

I verified the comp pin syntax matches existing f44 pins (e.g., bash, libseccomp, stringtemplate4), the alphabetical ordering in components.toml is preserved, there are no dangling references to the deleted status_fail_cleanly.patch, and no duplicate [components.netplan] definitions remain. No concrete code-level issues were found.

One operational consideration (already documented by the author): the refactor defers dummy/bridge/bond/vlan/SR-IOV creation into netplan-configure.service, which Fedora ships preset-disabled — images relying on netplan applying such config at boot must enable that service or config silently never applies. This is inherent to the upstream change and not fixable within this component diff, but it warrants downstream image awareness.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| base/comps/netplan/netplan.comp.toml | New dedicated component pinning the Fedora f44 import for the 1.2.1 upgrade, with rationale comment. |
| base/comps/components.toml | Removes the inline [components.netplan] entry (now defined in its dedicated file); ordering preserved. |
| locks/netplan.lock | Regenerated lock pinning the new f44 upstream-commit and updated fingerprints. |
| specs/n/netplan/netplan.spec | Rendered spec for 1.2.1: drops Patch1002, adds configure binary, netplan-configure.service, and python3-setuptools BR. |
| specs/n/netplan/sources | Updates SHA512 to the 1.2.1 source tarball. |
| specs/n/netplan/status_fail_cleanly.patch | Deleted: the f43-only patch is absent in f44 and no longer referenced. |

bfjelds changed the title from "netplan: upgrade 1.1.2 -> 1.2.1 (Fedora f44 import) to fix grubazl4 rollback PID1 generator deadlock" to "fix: netplan: upgrade 1.1.2 -> 1.2.1 (Fedora f44 import) to fix grubazl4 rollback PID1 generator deadlock" on Jun 26, 2026
bfjelds marked this pull request as ready for review June 26, 2026 21:46
bfjelds requested a review from a team as a code owner June 26, 2026 21:46