fix: netplan: upgrade 1.1.2 -> 1.2.1 (Fedora f44 import) to fix grubazl4 rollback PID1 generator deadlock #17815
bfjelds wants to merge 1 commit
Pull request overview
This PR upgrades netplan in Azure Linux 4.0 from 1.1.2 to 1.2.1 by repointing the Fedora dist-git import from the global f43 snapshot to a pinned Fedora f44 commit (which ships netplan 1.2.1-2). The motivation is to pull in netplan's upstream "split generate/configure" refactor (canonical/netplan PR #552), which makes the boot-time systemd generator validation-only and defers virtual-device creation to a new netplan-configure.service. This eliminates a PID 1 self-deadlock (generator-phase getgrnam("systemd-network") NSS lookup blocking on PID 1's own userdb varlink) that can freeze grubazl4 A/B rollback boots. The change is a faithful upstream import — no vendored backport — consistent with the repo's "minimal divergence from upstream" principle.
Changes:
- Adds a dedicated `base/comps/netplan/netplan.comp.toml` pinning the Fedora f44 import (`upstream-distro = fedora 44`, `upstream-commit = 66c31bcd…`) with a thorough rationale comment, and removes the inline `[components.netplan]` entry from `components.toml`.
- Regenerates `locks/netplan.lock`, the rendered `specs/n/netplan/netplan.spec` (now 1.2.1, adds `configure` binary + `netplan-configure.service`), and the `sources` SHA512; the f43-only `status_fail_cleanly.patch` is dropped while `netplan-fallback-renderer.patch` is retained.
I verified the comp pin syntax matches existing f44 pins (e.g., bash, libseccomp, stringtemplate4), the alphabetical ordering in components.toml is preserved, there are no dangling references to the deleted status_fail_cleanly.patch, and no duplicate [components.netplan] definitions remain. No concrete code-level issues were found.
One operational consideration (already documented by the author): the refactor defers dummy/bridge/bond/vlan/SR-IOV creation into netplan-configure.service, which Fedora ships preset-disabled — images relying on netplan applying such config at boot must enable that service or config silently never applies. This is inherent to the upstream change and not fixable within this component diff, but it warrants downstream image awareness.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `base/comps/netplan/netplan.comp.toml` | New dedicated component pinning the Fedora f44 import for the 1.2.1 upgrade, with rationale comment. |
| `base/comps/components.toml` | Removes the inline `[components.netplan]` entry (now defined in its dedicated file); ordering preserved. |
| `locks/netplan.lock` | Regenerated lock pinning the new f44 upstream-commit and updated fingerprints. |
| `specs/n/netplan/netplan.spec` | Rendered spec for 1.2.1: drops `Patch1002`, adds `configure` binary, `netplan-configure.service`, and `python3-setuptools` BR. |
| `specs/n/netplan/sources` | Updates SHA512 to the 1.2.1 source tarball. |
| `specs/n/netplan/status_fail_cleanly.patch` | Deleted; the f43-only patch is absent in f44 and no longer referenced. |
Summary
Upgrade `netplan` in Azure Linux 4.0 from 1.1.2 to 1.2.1 by moving the package's Fedora dist-git import pointer to the f44 branch head (which ships `netplan 1.2.1-2`). netplan 1.2.1 contains the upstream "split generate/configure" refactor that makes the boot-time systemd generator validation-only, eliminating a PID 1 self-deadlock that can freeze azurelinux boots with netplan configured.
Background: the boot freeze this fixes
azurelinux images with netplan configured freeze during boot with:
Deadlock analysis
The freeze is a PID 1 self-deadlock between the netplan systemd generator and systemd's userdb service:
- `netplan generate` is invoked (`.../usr/lib/systemd/system-generators/netplan`).
- The generator looks up the `systemd-network` group.
- `nsswitch.conf` routes the group lookup through nss-systemd, which issues a varlink request (`io.systemd.UserDatabase.GetGroupRecord` / `GetMemberships`) to `/run/systemd/userdb/io.systemd.DynamicUser`.
- PID 1, which has to answer that request, is blocked in `manager_run_generators()` waiting for this very generator batch to finish. The varlink round-trips stall.
- The boot ends with `Protocol error` and then `Freezing execution`.
The freeze is host-speed-sensitive (a race against the 90s budget): slow or nested-virt hosts freeze, while faster CI hosts service the varlink in time on byte-identical images. The trigger is always netplan, the only generator doing a `systemd-network` group lookup.
Three conditions must coincide: (1) netplan's generator-phase `getgrnam("systemd-network")` (present since netplan 1.0.1), (2) authselect's `files [SUCCESS=merge] systemd` group line (new in AZL4), (3) PID 1 unable to answer its own userdb varlink during generators.
Why not a systemd-side fix
A minimal systemd-side mitigation exists (bypass nss-systemd in the generator environment, e.g. `SYSTEMD_NSS_DYNAMIC_BYPASS=1` in `build_generator_environment()`, ~20 lines). However, systemd's own documented contract places the fault on the generator, not on systemd. `systemd.generator(7)`, "Notes about writing generators" (https://www.freedesktop.org/software/systemd/man/latest/systemd.generator.html), states that generators run very early at boot, cannot rely on any external services, and may not talk to any other process.
An NSS lookup that dispatches to nss-systemd is IPC to another process (PID 1's userdb). netplan calling `getgrnam` in a generator therefore violates the generator contract, so the upstream-systemd position is that this belongs in netplan, not systemd. The deadlock-prone design is also structurally unchanged in current systemd (verified `v258` vs `main` / 262~devel), and upstream systemd carries no targeted fix. A systemd workaround would be a non-upstreamable local divergence, whereas netplan already fixes it upstream.
The upstream netplan fix
netplan upstream split the generate and configure stages so the boot generator no longer writes networkd files (and makes no NSS call); the file writing + `chown` (the `getgrnam`) moved into a new `netplan-configure.service` ordered after boot, when PID 1's event loop is free.
Key change: PR #552 "Split generate/configure stages for sd-generator compliance" (canonical/netplan), merged 2025-12-16, first released in v1.2, present in 1.2.1.
- `42db0158`: "configure: Add new binary to produce network service configs" (anchor)
- `8233cf9d`: adds the `netplan-configure.service` unit
- `6ad42dec`: generator becomes validation-only
- `8622557d`: related
The generator / configure / util sources are byte-identical between netplan 1.2.1 and current `main`, so 1.2.1 carries the complete deadlock-avoiding refactor.
Why upgrade rather than patch 1.1.2
Backporting the refactor onto 1.1.2 is impractical:
- The refactor spans new files (`configure.c`, `gen-networkd.c`, `gen-openvswitch.c`, `gen-sriov.c`), heavy rewrites of `generate.c` / `networkd.c` / `openvswitch.c` / `sriov.c`, a new systemd unit, plus meson / spec / CLI changes. Not a clean cherry-pick onto 1.1.2.
- The `1.1.2 -> main` delta is 83 files / +3141 -1655 (~7,625 patch lines).
- A local workaround (read `/etc/group` via `fgetgrent` instead of `getgrnam`, ~15-30 lines) is possible but diverges from upstream and we would own it indefinitely.
Moving the Fedora import pointer to f44 gets the released, upstream-maintained 1.2.1 with no vendored divergence.
What this PR changes
- Adds a dedicated `base/comps/netplan/netplan.comp.toml` pinning the Fedora f44 import (`upstream-distro = fedora 44`, `upstream-commit = 66c31bcd3e9aeb8d15a5b4184009e57d799b0158`).
- Removes the inline `[components.netplan]` entry from `base/comps/components.toml`.
- Regenerates `locks/netplan.lock` and the rendered `specs/n/netplan/` (now 1.2.1). The f43-era Fedora `status_fail_cleanly.patch` drops out (not present in f44); `netplan-fallback-renderer.patch` (Fedora `Patch1001`) is retained.
Adoption note (important for image consumers)
The refactor defers virtual-device creation (dummy / bridge / bond / vlan / SR-IOV) into `netplan-configure.service`, which Fedora ships preset: disabled. Images that rely on netplan applying such config at boot must enable `netplan-configure.service` (e.g. imagecustomizer `services.enable`), otherwise netplan config silently never applies at boot (empty `/run/systemd/network`).
Validation
netplan 1.2.1 RPMs were built locally with `azldev` from this branch (`netplan-1.2.1-4.azl4` plus subpackages). Verified the package ships `/usr/libexec/netplan/configure`, `/usr/lib/systemd/system/netplan-configure.service`, and a validation-only `/usr/libexec/netplan/generate`.
Local: Injected the locally built 1.2.1 RPMs into a Trident image build on stock systemd 258.4-4 (no systemd change), with `netplan-configure.service` enabled, and ran the Trident rollback update tests. Serial logs confirm netplan ran: the 1.2.1-only `netplan-configure.service` started and was successfully tested.
Pipeline: The same locally built 1.2.1 RPMs were validated through the work-in-progress AZL4 Trident pipelines.