feat(gpu): recognize aks-gpu-cuda in LoadConfig alongside aks-gpu-cuda-lts#8822

Open
ganeshkumarashok wants to merge 1 commit into main from ganesh/gpu-recognize-cuda-lts-compat

@ganeshkumarashok ganeshkumarashok commented Jul 2, 2026

What

Restores a first-class case "aks-gpu-cuda" in LoadConfig (removed in effect by #8811, which repurposed the shared CUDA globals for aks-gpu-cuda-lts). Now the pre-LTS aks-gpu-cuda version is loaded and available if a SKU is ever routed to the "cuda" image in CSE again — without changing today's render.

  • components.json: add an aks-gpu-cuda entry pinned to the R580 line (580.126.09), not the R595 line that drops Volta/V100.
  • gpu_components.go: aks-gpu-cuda reclaims NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix (its pre-#8811 names); aks-gpu-cuda-lts moves to NvidiaCudaLTSDriverVersion / AKSGPUCudaLTSVersionSuffix. This mirrors the existing base-vs-variant naming (NvidiaGridDriverVersion vs NvidiaGridV20DriverVersion) and avoids clobbering a shared global.
  • baker.go: GetGPUDriverVersion / GetAKSGPUImageSHA render the LTS globals for modern CUDA SKUs.
  • renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to the V100-dropping R595 line.
  • Tests updated for the new names + LTS-global coverage.
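
The R580 pin can be pictured as a simple prefix check. The sketch below is illustrative only: it applies the same ^580\. pattern that the renovate.json rule uses to a few version strings. The helper name isR580 and the sample version 595.0.0 are hypothetical and do not come from the repository.

```go
package main

import (
	"fmt"
	"regexp"
)

// r580 mirrors the /^580\./ constraint described for renovate.json.
var r580 = regexp.MustCompile(`^580\.`)

// isR580 is a hypothetical helper: it reports whether a driver version
// string belongs to the R580 line.
func isR580(version string) bool {
	return r580.MatchString(version)
}

func main() {
	// "595.0.0" is a made-up version used only to show a rejected value.
	for _, v := range []string{"580.126.09", "580.159.04", "595.0.0"} {
		fmt.Printf("%s allowed=%v\n", v, isR580(v))
	}
}
```

A check of this shape rejects any version outside the R580 line, which is the property the pin is meant to guarantee for V100/Volta nodes.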

Why the rename (and why it's safe)

You can't have both cases write the same globals — the LoadConfig loop would clobber them (last-one-wins), making modern SKUs render a non-existent aks-gpu-cuda-lts:580.126.09-… tag → 404. So aks-gpu-cuda takes back the base NvidiaCuda* names (as before #8811) and the LTS image gets an explicit …LTS… name; the render is repointed to the LTS globals.
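
The last-one-wins problem can be sketched in a few lines. This is a simplified model, not the actual LoadConfig code: the image type and the loadShared / loadSeparate functions are hypothetical stand-ins, and only the image names and version numbers come from the description above.

```go
package main

import "fmt"

// image is a simplified stand-in for a GPUContainerImages entry.
type image struct {
	name    string
	version string
}

// loadShared models the problem: both cases write one shared variable,
// so whichever entry is iterated last overwrites the other.
func loadShared(images []image) (cuda string) {
	for _, img := range images {
		switch img.name {
		case "aks-gpu-cuda", "aks-gpu-cuda-lts":
			cuda = img.version
		}
	}
	return cuda
}

// loadSeparate models the fix: each image writes its own variable.
func loadSeparate(images []image) (cuda, cudaLTS string) {
	for _, img := range images {
		switch img.name {
		case "aks-gpu-cuda":
			cuda = img.version
		case "aks-gpu-cuda-lts":
			cudaLTS = img.version
		}
	}
	return cuda, cudaLTS
}

func main() {
	images := []image{
		{name: "aks-gpu-cuda-lts", version: "580.159.04"},
		{name: "aks-gpu-cuda", version: "580.126.09"},
	}
	fmt.Println("shared:", loadShared(images))
	cuda, lts := loadSeparate(images)
	fmt.Println("separate:", cuda, lts)
}
```

With the shared variable, the LTS version is lost as soon as the aks-gpu-cuda entry is processed after it; with separate variables, both versions survive regardless of iteration order.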

Render output is byte-identical — verified: GENERATE_TEST_DATA=true go test ./pkg/agent/... produces zero golden drift. Modern CUDA SKUs still resolve aks-gpu-cuda-lts:580.159.04-… exactly as before. The globals are internal; nothing outside baker.go/gpu_components.go references them.

What it deliberately does NOT do

  • Not VHD-cached: install-dependencies.sh unchanged (only pre-pulls aks-gpu-cuda-lts), zero size cost.
  • Not the default render target: GetGPUDriverType still returns "cuda-lts"; aks-gpu-cuda's version is loaded but currently unrendered (forward-plumbing).
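
A rough sketch of what "loaded but unrendered" means on the render path follows. It is an assumption-laden model rather than the baker.go implementation: the variable names echo the globals named above, but gridSKUs, the SKU name example_grid_sku, the grid version placeholder, and the lower-case function names are hypothetical.

```go
package main

import "fmt"

// Stand-ins for the datamodel globals named in the description.
var (
	NvidiaCudaDriverVersion    = "580.126.09"       // aks-gpu-cuda: loaded, not rendered
	NvidiaCudaLTSDriverVersion = "580.159.04"       // aks-gpu-cuda-lts: rendered today
	NvidiaGridDriverVersion    = "grid-placeholder" // hypothetical value
)

// gridSKUs is a hypothetical lookup table; the real SKU list is not shown here.
var gridSKUs = map[string]bool{"example_grid_sku": true}

// useGridDrivers is a simplified stand-in for the real predicate.
func useGridDrivers(size string) bool {
	return gridSKUs[size]
}

// getGPUDriverVersion picks the grid version for grid SKUs and the LTS
// version for every other SKU. NvidiaCudaDriverVersion is never returned.
func getGPUDriverVersion(size string) string {
	if useGridDrivers(size) {
		return NvidiaGridDriverVersion
	}
	return NvidiaCudaLTSDriverVersion
}

func main() {
	fmt.Println(getGPUDriverVersion("standard_nc6_v3"))
	fmt.Println(getGPUDriverVersion("example_grid_sku"))
}
```

In this model the pre-LTS version sits in its own variable and no code path returns it, which is why the rendered output cannot change until a SKU is explicitly routed to it.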

Old-VHD / version-skewed nodes that target aks-gpu-cuda already resolve it at boot via the hardened registry pull (#8821), served by required-MCR egress or the wildcard network-isolated ACR cache.

Testing

  • go build ./pkg/agent/... + aks-node-controller, go vet, go test ./pkg/agent/... — pass.
  • GENERATE_TEST_DATA=true go test ./pkg/agent/... : zero golden drift (render unchanged).
  • make validate-components (cue) — pass.

Copilot AI review requested due to automatic review settings July 2, 2026 22:24
@github-actions github-actions Bot added the components label (This pull request updates cached components on Linux or Windows VHDs) Jul 2, 2026

Copilot AI left a comment

Pull request overview

This PR keeps the legacy aks-gpu-cuda GPU driver image recognized (but not selected by default and not VHD-cached) alongside aks-gpu-cuda-lts, to support old-VHD / render-skew transition scenarios while keeping the legacy image pinned to the R580 line for V100/Volta compatibility.

Changes:

  • Add a pinned aks-gpu-cuda entry to GPUContainerImages in parts/common/components.json (R580 line).
  • Extend LoadConfig to parse aks-gpu-cuda into dedicated legacy globals and add LegacyGPUCudaImage() for assembling the legacy image ref.
  • Add tests covering legacy config recognition + image ref assembly + R580 pin, and constrain Renovate updates for aks/aks-gpu-cuda to 580.*.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/agent/datamodel/gpu_components.go | Loads legacy aks-gpu-cuda version/suffix into new globals and exposes LegacyGPUCudaImage() to build the legacy ref. |
| pkg/agent/datamodel/gpu_components_test.go | Adds coverage ensuring the legacy entry is loaded, correctly assembled, and pinned to the R580 line. |
| parts/common/components.json | Adds aks-gpu-cuda:* GPUContainerImages entry pinned to 580.126.09-... for legacy recognition. |
| .github/renovate.json | Adds a packageRule restricting aks/aks-gpu-cuda updates to /^580\./ so Renovate won’t bump to an R595 line. |

@ganeshkumarashok ganeshkumarashok changed the title from "feat(gpu): recognize legacy aks-gpu-cuda image alongside aks-gpu-cuda-lts (transition, inert-by-design)" to "feat(gpu): keep aks-gpu-cuda recognized in the component manifest (transition, inert-by-design)" Jul 2, 2026
@ganeshkumarashok ganeshkumarashok force-pushed the ganesh/gpu-recognize-cuda-lts-compat branch from 8a55101 to c158ed8 Compare July 2, 2026 22:43
…a-lts

#8811 moved the managed CUDA driver from aks-gpu-cuda to aks-gpu-cuda-lts,
reusing the NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix globals for the
LTS image. This restores a first-class `case "aks-gpu-cuda"` in LoadConfig so
the pre-LTS image's version is loaded and available if a SKU is ever routed to
the "cuda" image in CSE again -- without disturbing today's render.

- components.json: add an aks-gpu-cuda entry pinned to the R580 line
  (580.126.09), NOT the R595 line that drops Volta/V100.
- gpu_components.go: aks-gpu-cuda reclaims NvidiaCudaDriverVersion /
  AKSGPUCudaVersionSuffix (its pre-#8811 names); aks-gpu-cuda-lts moves to
  NvidiaCudaLTSDriverVersion / AKSGPUCudaLTSVersionSuffix. Mirrors the existing
  base-vs-variant naming (NvidiaGridDriverVersion vs NvidiaGridV20DriverVersion)
  and avoids clobbering a shared global.
- baker.go: GetGPUDriverVersion / GetAKSGPUImageSHA render the LTS globals for
  modern CUDA SKUs, so rendered output is byte-identical (verified: zero
  testdata drift). aks-gpu-cuda is loaded but not the default render target.
- renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to R595.

Still not baked into the VHD (install-dependencies.sh only pre-pulls
aks-gpu-cuda-lts). Old-VHD / skewed nodes that target aks-gpu-cuda resolve it at
boot via the hardened pull (#8821), served by required-MCR or the wildcard
network-isolated ACR cache.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
Copilot AI review requested due to automatic review settings July 3, 2026 00:10
@ganeshkumarashok ganeshkumarashok force-pushed the ganesh/gpu-recognize-cuda-lts-compat branch from c158ed8 to 408c88e Compare July 3, 2026 00:10
@ganeshkumarashok ganeshkumarashok changed the title from "feat(gpu): keep aks-gpu-cuda recognized in the component manifest (transition, inert-by-design)" to "feat(gpu): recognize aks-gpu-cuda in LoadConfig alongside aks-gpu-cuda-lts" Jul 3, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread pkg/agent/baker.go
Comment on lines 1522 to +1525
  if useGridDrivers(size) {
      return datamodel.AKSGPUGridVersionSuffix
  }
- return datamodel.AKSGPUCudaVersionSuffix
+ return datamodel.AKSGPUCudaLTSVersionSuffix
Comment thread pkg/agent/baker_test.go
Comment on lines +998 to 1000
  It("should use newest AKSGPUCudaLTSVersionSuffix with non grid SKU", func() {
      Expect(GetAKSGPUImageSHA("standard_nc6_v3")).To(Equal(datamodel.AKSGPUCudaLTSVersionSuffix))
  })