# [docs] Add vGPU setup guide for GPU sharing between VMs #467
## GPU Sharing for Virtual Machines (vGPU)

GPU passthrough assigns an entire physical GPU to a single VM. To share one GPU between multiple VMs, you can use **NVIDIA vGPU**, which creates virtual GPUs from a single physical GPU using mediated devices (mdev).
{{% alert color="info" %}}
**Why not MIG?** MIG (Multi-Instance GPU) partitions a GPU into isolated instances, but these are logical divisions within a single PCIe device. VFIO cannot pass them to VMs — MIG only works with containers. To use MIG with VMs, you need vGPU on top of MIG partitions (still requires a vGPU license).
{{% /alert %}}
### Prerequisites

- A GPU that supports vGPU (e.g., NVIDIA L40S, A100, A30, A16)
- An NVIDIA vGPU Software license (NVIDIA AI Enterprise or vGPU subscription)
- Access to the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) to download the vGPU Manager driver

{{% alert color="warning" %}}
The vGPU Manager driver is proprietary software distributed by NVIDIA under a commercial license. Cozystack does not include or redistribute this driver. You must obtain it directly from NVIDIA and build the container image yourself.
{{% /alert %}}
### 1. Build the vGPU Manager Image

The GPU Operator expects a pre-built driver container image — it does not install the driver from a raw `.run` file at runtime.

1. Download the vGPU Manager driver from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) (Software Downloads → NVIDIA AI Enterprise → Linux KVM).
2. Build the driver container image using NVIDIA's Makefile-based build system:
```bash
# Clone the NVIDIA driver container repository
git clone https://gitlab.com/nvidia/container-images/driver.git
cd driver

# Place the downloaded .run file in the appropriate directory
cp NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run vgpu/

# Build using the provided Makefile
make OS_TAG=ubuntu22.04 \
     VGPU_DRIVER_VERSION=550.90.05 \
     PRIVATE_REGISTRY=registry.example.com/nvidia

# Push to your private registry
docker push registry.example.com/nvidia/vgpu-manager:550.90.05
```
{{% alert color="info" %}}
The build process compiles kernel modules against the host kernel version. Refer to the [NVIDIA GPU Operator vGPU documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html) for the complete build procedure and supported OS/kernel combinations.
{{% /alert %}}

{{% alert color="warning" %}}
Uploading the vGPU driver to a publicly available registry is a violation of the NVIDIA vGPU EULA. Always use a private registry.
{{% /alert %}}
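The `VGPU_DRIVER_VERSION` passed to `make` must match the version embedded in the `.run` filename. A small sketch (filename taken from the example above) that derives one from the other, to avoid a mismatch:

```shell
#!/usr/bin/env bash
# Derive the driver version from the .run filename (example filename from above)
run_file="NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run"
version="${run_file#NVIDIA-Linux-x86_64-}"   # strip the fixed prefix
version="${version%-vgpu-kvm.run}"           # strip the fixed suffix
echo "$version"   # prints 550.90.05
```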
### 2. Install the GPU Operator with vGPU Variant

The GPU Operator provides a `vgpu` variant that enables the vGPU Manager and vGPU Device Manager instead of the VFIO Manager used in passthrough mode.

1. Label the worker node for vGPU workloads:

```bash
kubectl label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu
```
2. Create the GPU Operator Package with the `vgpu` variant, providing your vGPU Manager image coordinates:

```yaml
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.gpu-operator
spec:
  variant: vgpu
  components:
    gpu-operator:
      values:
        gpu-operator:
          vgpuManager:
            repository: registry.example.com/nvidia
            version: "550.90.05"
```

If your registry requires authentication, create an `imagePullSecret` in the `cozy-gpu-operator` namespace first, then reference it:

```yaml
gpu-operator:
  vgpuManager:
    repository: registry.example.com/nvidia
    version: "550.90.05"
    imagePullSecrets:
      - name: nvidia-registry-secret
```
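The referenced secret can be created with `kubectl create secret docker-registry` (registry host and credentials below are placeholders, not values from this guide):

```shell
# Placeholder credentials; replace with your private registry's values
kubectl create secret docker-registry nvidia-registry-secret \
  --namespace cozy-gpu-operator \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password>
```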
3. Verify all pods are running:

```bash
kubectl get pods -n cozy-gpu-operator
```

Example output:

```console
NAME                                  READY   STATUS    RESTARTS   AGE
...
nvidia-vgpu-manager-daemonset-xxxxx   1/1     Running   0          60s
nvidia-vgpu-device-manager-xxxxx      1/1     Running   0          45s
nvidia-sandbox-validator-xxxxx        1/1     Running   0          30s
```
### 3. Configure NVIDIA License Server (NLS)

vGPU requires a license to operate. Create a ConfigMap with the NLS client configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
data:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
    # ServerPort depends on your NLS deployment (commonly 443 for DLS or 7070 for legacy NLS)
    # FeatureType=1 licenses the NVIDIA vGPU (vPC/vWS) feature;
    # use FeatureType=2 for NVIDIA Virtual Compute Server (vCS)
```

Then reference it in the Package values:
```yaml
gpu-operator:
  vgpuManager:
    repository: registry.example.com/nvidia
    version: "550.90.05"
  driver:
    licensingConfig:
      configMapName: licensing-config
```
### 4. Update the KubeVirt Custom Resource

Configure KubeVirt to permit mediated devices. The `mediatedDeviceTypes` field specifies which vGPU profiles to use, and `permittedHostDevices` makes them available to VMs:

```bash
kubectl edit kubevirt -n cozy-kubevirt
```

```yaml
spec:
  configuration:
    mediatedDevicesConfiguration:
      mediatedDeviceTypes:
        - nvidia-592   # Example: NVIDIA L40S-24Q
    permittedHostDevices:
      mediatedDevices:
        - mdevNameSelector: NVIDIA L40S-24Q
          resourceName: nvidia.com/NVIDIA_L40S-24Q
```

To find the correct type ID and profile name for your GPU, consult the [NVIDIA vGPU User Guide](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/).
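As the example above shows, the `resourceName` is derived from the mdev display name by prefixing `nvidia.com/` and replacing spaces with underscores. A quick sketch of that mapping (the helper function is ours, for illustration only):

```shell
#!/usr/bin/env bash
# Sketch: derive the KubeVirt resourceName from an mdev display name,
# mirroring the example above (spaces become underscores under nvidia.com/)
to_resource_name() {
  local mdev_name="$1"
  printf 'nvidia.com/%s\n' "${mdev_name// /_}"
}

to_resource_name "NVIDIA L40S-24Q"   # prints nvidia.com/NVIDIA_L40S-24Q
```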
### 5. Create a Virtual Machine with vGPU

```yaml
apiVersion: apps.cozystack.io/v1alpha1
appVersion: '*'
kind: VirtualMachine
metadata:
  name: gpu-vgpu
  namespace: tenant-example
spec:
  running: true
  instanceProfile: ubuntu
  instanceType: u1.medium
  systemDisk:
    image: ubuntu
    storage: 5Gi
    storageClass: replicated
  gpus:
    - name: nvidia.com/NVIDIA_L40S-24Q
  cloudInit: |
    #cloud-config
    password: ubuntu
    chpasswd: { expire: False }
```

```bash
kubectl apply -f vmi-vgpu.yaml
```
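Before opening the console, wait until the VM reports ready (the `Ready` condition follows KubeVirt's VM status API; the name and namespace match the manifest above, and the 5-minute timeout is an arbitrary choice):

```shell
# Block until the VirtualMachine's Ready condition is true
kubectl -n tenant-example wait vm gpu-vgpu --for=condition=Ready --timeout=5m
```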
Once the VM is running, log in and verify the vGPU is available:

```bash
virtctl console virtual-machine-gpu-vgpu
```
```console
ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05       Driver Version: 550.90.05       CUDA Version: 12.4           |
|                                                                                         |
| GPU  Name                 ...                                                  MIG M.   |
|   0  NVIDIA L40S-24Q      ...                                                  N/A      |
+-----------------------------------------------------------------------------------------+
```
### vGPU Profiles

Each GPU model supports specific vGPU profiles that determine how the GPU is partitioned. Common profiles for NVIDIA L40S:

| Profile         | Frame Buffer | Max Instances | Use Case            |
| --------------- | ------------ | ------------- | ------------------- |
| NVIDIA L40S-1Q  | 1 GB         | 48            | Light 3D / VDI      |
| NVIDIA L40S-2Q  | 2 GB         | 24            | Medium 3D / VDI     |
| NVIDIA L40S-4Q  | 4 GB         | 12            | Heavy 3D / VDI      |
| NVIDIA L40S-6Q  | 6 GB         | 8             | Professional 3D     |
| NVIDIA L40S-8Q  | 8 GB         | 6             | AI/ML inference     |
| NVIDIA L40S-12Q | 12 GB        | 4             | AI/ML training      |
| NVIDIA L40S-24Q | 24 GB        | 2             | Large AI workloads  |
| NVIDIA L40S-48Q | 48 GB        | 1             | Full GPU equivalent |
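The Max Instances column follows directly from dividing the L40S's 48 GB frame buffer by each profile's frame buffer; a quick arithmetic check:

```shell
#!/usr/bin/env bash
# Max instances per profile = total frame buffer / profile frame buffer (L40S: 48 GB)
total_fb=48
for profile_fb in 1 2 4 6 8 12 24 48; do
  echo "L40S-${profile_fb}Q: $(( total_fb / profile_fb )) instances"
done
```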
### Open-Source vGPU (Experimental)

NVIDIA is developing open-source vGPU support for the Linux kernel. Once merged, this could enable GPU sharing without a commercial license.

- Status: RFC stage, not merged into mainline kernel
- Supports Ada Lovelace and newer (L4, L40, etc.)