feat(vm): add GPU DRA resource claim support #2520

Draft: danilrwx wants to merge 10 commits into main from feat/gpu/add-base-gpu-support

danilrwx (Contributor) commented Jun 22, 2026

Description

Add support for attaching physical GPU devices to virtual machines via Kubernetes DRA (Dynamic Resource Allocation).

A new spec.gpuDevices field lets a user request a GPU by product model. The virtualization controller generates a DRA ResourceClaimTemplate per device with a CEL selector matching the requested productName, a physical device type, and the absence of a sharing strategy. The kvbuilder renders the corresponding DRA resource claims and GPU devices into the KubeVirt VirtualMachine.
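For orientation, a ResourceClaimTemplate generated for a device named gpu0 on a VM named my-vm might look roughly like the sketch below. This is an illustration, not the controller's actual output: the VM name, the resource.k8s.io/v1 API version, the gpu.deckhouse.io attribute domain, and the type and sharingStrategy attribute names are all assumptions. Only productName, the gpu.deckhouse.io DeviceClass, and the naming scheme come from this PR.

```yaml
# Hypothetical sketch of a generated template; attribute names are assumed.
apiVersion: resource.k8s.io/v1      # assumption: depends on the cluster's DRA API version
kind: ResourceClaimTemplate
metadata:
  name: my-vm-gpu0                  # <vm>-<name>
  namespace: default
spec:
  spec:
    devices:
      requests:
        - name: gpu-gpu0            # gpu-<name>
          exactly:
            deviceClassName: gpu.deckhouse.io
            selectors:
              - cel:
                  # Requested model, physical device, no sharing strategy set.
                  expression: >-
                    device.attributes["gpu.deckhouse.io"].productName == "NVIDIA H100" &&
                    device.attributes["gpu.deckhouse.io"].type == "physical" &&
                    !("sharingStrategy" in device.attributes["gpu.deckhouse.io"])
```

With a single request and the default allocation mode, the claim asks for exactly one device that satisfies the whole expression.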

Key pieces:

  • VirtualMachine.spec.gpuDevices[] (name, model), MaxItems: 16, +listType=map keyed by name.
  • GPU feature gate (alpha, locked off in CE).
  • GPUResourceClaimHandler creates/updates/deletes owned ResourceClaimTemplates and cleans up orphans; a ResourceClaimTemplate watcher enqueues the owning VM.
  • GPUDevicesValidator rejects GPU devices unless the GPU feature gate is enabled and the gpu.deckhouse.io DeviceClass exists.
  • vmchange comparator marks GPU changes as requiring restart (AwaitingRestartToApplyConfiguration).
  • Generated names: claim/GPU name = gpu-<name>, device request name = gpu-<name>, ResourceClaimTemplate = <vm>-<name>.
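To make the naming scheme concrete, the fragment below sketches how the generated names could line up in the rendered KubeVirt VirtualMachine for a device named gpu0 on a VM named my-vm. The field names (resourceClaims, claimName, requestName) assume upstream KubeVirt's DRA GPU support; the Deckhouse fork may differ, and the VM name is hypothetical.

```yaml
# Hypothetical rendering; field layout assumes upstream KubeVirt DRA support.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-vm
spec:
  template:
    spec:
      resourceClaims:
        - name: gpu-gpu0                          # claim name: gpu-<name>
          resourceClaimTemplateName: my-vm-gpu0   # template: <vm>-<name>
      domain:
        devices:
          gpus:
            - name: gpu-gpu0                      # GPU name: gpu-<name>
              claimName: gpu-gpu0                 # refers to resourceClaims[].name
              requestName: gpu-gpu0               # device request: gpu-<name>
```

Because the claim, GPU, and device request all share gpu-<name>, a single user-facing name maps to every generated identifier.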

Depends on deckhouse/3p-kubevirt#130 for KubeVirt to recognize the Deckhouse GPU DRA attributes.

Why do we need it, and what problem does it solve?

Users running ML/rendering workloads need to attach a physical GPU to a VM. Today there is no way to do this through the VirtualMachine API.

A model-based request keeps VM manifests portable: the user asks for a GPU class (e.g. NVIDIA H100) and lets DRA and the scheduler pick a concrete, exclusive, passthrough-capable device on a suitable node, rather than pinning the VM to a specific node, PCI address, or GPU UUID.

What is the expected result?

  1. Enable the GPU feature gate in ModuleConfig virtualization and ensure a GPU DRA provider and the gpu.deckhouse.io DeviceClass are installed.
  2. Create a VM with spec.gpuDevices:
    spec:
      gpuDevices:
        - name: gpu0
          model: NVIDIA H100
  3. The controller creates a ResourceClaimTemplate and the VM schedules on a node with a matching exclusive physical GPU.
  4. Changing spec.gpuDevices on a running VM sets AwaitingRestartToApplyConfiguration and applies only after a restart.

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: core
type: feature
summary: "Attach physical GPU devices to virtual machines via DRA by requesting a GPU product model in spec.gpuDevices."

danilrwx added 8 commits June 23, 2026 16:43
Signed-off-by: Daniil Antoshin <daniil.antoshin@flant.com>
danilrwx force-pushed the feat/gpu/add-base-gpu-support branch from 22e85a0 to 3925492 June 23, 2026 14:43
danilrwx added 2 commits June 23, 2026 16:48
Signed-off-by: Daniil Antoshin <daniil.antoshin@flant.com>
Drop the req- prefix and the -template suffix from generated DRA names
and align the device request name with the resource claim name
(gpu-<device>). The template name becomes <vm>-<device>.

This raises the user-facing gpuDevices[].name MaxLength from 55 to 59.
The previous limit came from the 8-char req-gpu- prefix: 63 - 8 = 55
under the 63-char DNS label limit. With the 4-char gpu- prefix the
budget is 63 - 4 = 59.

Signed-off-by: Daniil Antoshin <daniil.antoshin@flant.com>