feat(gpu): recognize aks-gpu-cuda in LoadConfig alongside aks-gpu-cuda-lts#8822
Open
ganeshkumarashok wants to merge 1 commit into
Open
feat(gpu): recognize aks-gpu-cuda in LoadConfig alongside aks-gpu-cuda-lts#8822ganeshkumarashok wants to merge 1 commit into
ganeshkumarashok wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR keeps the legacy aks-gpu-cuda GPU driver image recognized (but not selected by default and not VHD-cached) alongside aks-gpu-cuda-lts, to support old-VHD / render-skew transition scenarios while keeping the legacy image pinned to the R580 line for V100/Volta compatibility.
Changes:
- Add a pinned
aks-gpu-cudaentry toGPUContainerImagesinparts/common/components.json(R580 line). - Extend
LoadConfigto parseaks-gpu-cudainto dedicated legacy globals and addLegacyGPUCudaImage()for assembling the legacy image ref. - Add tests covering legacy config recognition + image ref assembly + R580 pin, and constrain Renovate updates for
aks/aks-gpu-cudato580.*.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/agent/datamodel/gpu_components.go | Loads legacy aks-gpu-cuda version/suffix into new globals and exposes LegacyGPUCudaImage() to build the legacy ref. |
| pkg/agent/datamodel/gpu_components_test.go | Adds coverage ensuring the legacy entry is loaded, correctly assembled, and pinned to the R580 line. |
| parts/common/components.json | Adds aks-gpu-cuda:* GPUContainerImages entry pinned to 580.126.09-... for legacy recognition. |
| .github/renovate.json | Adds a packageRule restricting aks/aks-gpu-cuda updates to /^580\./ so Renovate won’t bump to an R595 line. |
8a55101 to
c158ed8
Compare
…a-lts #8811 moved the managed CUDA driver from aks-gpu-cuda to aks-gpu-cuda-lts, reusing the NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix globals for the LTS image. This restores a first-class `case "aks-gpu-cuda"` in LoadConfig so the pre-LTS image's version is loaded and available if a SKU is ever routed to the "cuda" image in CSE again -- without disturbing today's render. - components.json: add an aks-gpu-cuda entry pinned to the R580 line (580.126.09), NOT the R595 line that drops Volta/V100. - gpu_components.go: aks-gpu-cuda reclaims NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix (its pre-#8811 names); aks-gpu-cuda-lts moves to NvidiaCudaLTSDriverVersion / AKSGPUCudaLTSVersionSuffix. Mirrors the existing base-vs-variant naming (NvidiaGridDriverVersion vs NvidiaGridV20DriverVersion) and avoids clobbering a shared global. - baker.go: GetGPUDriverVersion / GetAKSGPUImageSHA render the LTS globals for modern CUDA SKUs, so rendered output is byte-identical (verified: zero testdata drift). aks-gpu-cuda is loaded but not the default render target. - renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to R595. Still not baked into the VHD (install-dependencies.sh only pre-pulls aks-gpu-cuda-lts). Old-VHD / skewed nodes that target aks-gpu-cuda resolve it at boot via the hardened pull (#8821), served by required-MCR or the wildcard network-isolated ACR cache. Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
c158ed8 to
408c88e
Compare
Comment on lines
1522
to
+1525
| if useGridDrivers(size) { | ||
| return datamodel.AKSGPUGridVersionSuffix | ||
| } | ||
| return datamodel.AKSGPUCudaVersionSuffix | ||
| return datamodel.AKSGPUCudaLTSVersionSuffix |
Comment on lines
+998
to
1000
| It("should use newest AKSGPUCudaLTSVersionSuffix with non grid SKU", func() { | ||
| Expect(GetAKSGPUImageSHA("standard_nc6_v3")).To(Equal(datamodel.AKSGPUCudaLTSVersionSuffix)) | ||
| }) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Restores a first-class
case "aks-gpu-cuda"inLoadConfig(removed in effect by #8811, which repurposed the shared CUDA globals foraks-gpu-cuda-lts). Now the pre-LTSaks-gpu-cudaversion is loaded and available if a SKU is ever routed to the"cuda"image in CSE again — without changing today's render.components.json: add anaks-gpu-cudaentry pinned to the R580 line (580.126.09), not the R595 line that drops Volta/V100.gpu_components.go:aks-gpu-cudareclaimsNvidiaCudaDriverVersion/AKSGPUCudaVersionSuffix(its pre-feat(gpu): use aks-gpu-cuda-lts (R580 LTS) for the managed CUDA driver #8811 names);aks-gpu-cuda-ltsmoves toNvidiaCudaLTSDriverVersion/AKSGPUCudaLTSVersionSuffix. This mirrors the existing base-vs-variant naming (NvidiaGridDriverVersionvsNvidiaGridV20DriverVersion) and avoids clobbering a shared global.baker.go:GetGPUDriverVersion/GetAKSGPUImageSHArender the LTS globals for modern CUDA SKUs.renovate.json: constrainaks-gpu-cudato/^580\./so it never bumps to the V100-dropping R595 line.Why the rename (and why it's safe)
You can't have both cases write the same globals — the
LoadConfigloop would clobber them (last-one-wins), making modern SKUs render a non-existentaks-gpu-cuda-lts:580.126.09-…tag → 404. Soaks-gpu-cudatakes back the baseNvidiaCuda*names (as before #8811) and the LTS image gets an explicit…LTS…name; the render is repointed to the LTS globals.Render output is byte-identical — verified:
GENERATE_TEST_DATA=true go test ./pkg/agent/...produces zero golden drift. Modern CUDA SKUs still resolveaks-gpu-cuda-lts:580.159.04-…exactly as before. The globals are internal; nothing outsidebaker.go/gpu_components.goreferences them.What it deliberately does NOT do
install-dependencies.shunchanged (only pre-pullsaks-gpu-cuda-lts), zero size cost.GetGPUDriverTypestill returns"cuda-lts";aks-gpu-cuda's version is loaded but currently unrendered (forward-plumbing).Old-VHD / version-skewed nodes that target
aks-gpu-cudaalready resolve it at boot via the hardened registry pull (#8821), served by required-MCR egress or the wildcard network-isolated ACR cache.Testing
go build ./pkg/agent/...+aks-node-controller,go vet,go test ./pkg/agent/...— pass.GENERATE_TEST_DATA=true go test ./pkg/agent/...— zero golden drift (render unchanged).make validate-components(cue) — pass.