
NVIDIA GPU Operator Version 2

TL;DR
# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default
# Deploy the NVIDIA GPU Operator Version 2
$ git clone git@github.com:smgglrs/nvidia-gpu-operator.git && cd nvidia-gpu-operator
$ make deploy
# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the selected nodes (a sketch follows these commands)
$ kubectl apply -f config/samples/gpu_v1alpha1_deviceconfig.yaml
# Wait until all NVIDIA components are healthy
$ kubectl get -n nvidia-gpu-operator all
# Run a sample GPU workload pod
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: OnFailure
  containers:
  - image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0-ubi8
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
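The DeviceConfig created above is the custom resource this operator reconciles. As a rough sketch only (the field names below are illustrative assumptions; the authoritative schema is in config/samples/gpu_v1alpha1_deviceconfig.yaml), a DeviceConfig targeting a single node group might look like:

  apiVersion: gpu.example.com/v1alpha1   # illustrative group/version, check the sample manifest
  kind: DeviceConfig
  metadata:
    name: deviceconfig-sample
    namespace: nvidia-gpu-operator
  spec:
    # hypothetical field names, shown only to convey the idea:
    # select the target GPU nodes and reference a driver image whose
    # tag matches the kernel version running on those nodes
    nodeSelector:
      nvidia.com/gpu.present: "true"
    driverImage: <registry>/nvidia-driver:<driver-version>-<kernel-version>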
Overview
Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand
adapters and other devices through the device plugin framework.
However, configuring and managing Kubernetes nodes with these hardware resources requires the
configuration of multiple software components, such as drivers, container runtimes, device plugins
and other libraries, which is time-consuming and error-prone.
The NVIDIA GPU Operator Version 2
is used in tandem with the Out-of-tree Operator to automate the management of all the NVIDIA
software components needed to provision GPUs within Kubernetes, from the drivers to the respective
monitoring metrics.
Dependencies
Components
The components managed by the operator are:
Design
For a detailed description of the design trade-offs of the NVIDIA GPU Operator v2 check this
doc.
Changelog
- Removed the NVIDIA driver and device plugin management; this is now offloaded to the Out-of-tree Operator
- Deprecated the dependency on Node Feature Discovery, making it optional
- Added support for deploying different GPU configurations per node group, by refactoring the NVIDIA GPU
  Operator v1 singleton ClusterPolicy CR (also renamed to DeviceConfig)
- Added a GitHub Actions workflow that builds the operator, the OLM bundle and the OLM catalog images for
  the main branch
- Added the NVIDIA GPU Prometheus Exporter, which exports additional metrics, e.g. gpu_info, following the
  respective practices, so as not to rely only on Kubernetes node labels (via the NVIDIA GPU Feature
  Discovery); see the example after this list
- Added 80%+ test coverage
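For illustration only (the label set below is an assumption, not the exporter's documented output), an info-style metric exposes static metadata as labels on a constant series with value 1:

  gpu_info{uuid="GPU-<uuid>", name="<gpu-model>", driver_version="<driver-version>"} 1

Such a series can then be joined with utilization metrics on a shared label, instead of relying only on node labels published by the NVIDIA GPU Feature Discovery.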
OpenShift
Use the NVIDIA GPU Operator Version 2, along with the Out-of-tree Operator, in your OpenShift cluster to
automatically provision and manage different GPU configurations per node group.
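To make the per-node-group idea concrete (again with illustrative, not authoritative, field names, building on the sketch in the TL;DR section), two node groups running different kernels could each get their own DeviceConfig, differing only in the node selector and the driver image tag:

  apiVersion: gpu.example.com/v1alpha1   # illustrative; see config/samples for the real group/version
  kind: DeviceConfig
  metadata:
    name: deviceconfig-nodegroup-a
    namespace: nvidia-gpu-operator
  spec:
    nodeSelector:
      node-group: a          # hypothetical label identifying the node group
    driverImage: <registry>/nvidia-driver:<driver-version>-<kernel-version-of-group-a>
  ---
  apiVersion: gpu.example.com/v1alpha1
  kind: DeviceConfig
  metadata:
    name: deviceconfig-nodegroup-b
    namespace: nvidia-gpu-operator
  spec:
    nodeSelector:
      node-group: b
    driverImage: <registry>/nvidia-driver:<driver-version>-<kernel-version-of-group-b>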
The following guide leverages the container images (operator, OLM bundle and OLM catalog) that are
automatically built from the main branch:
# Given an OpenShift cluster with GPU powered nodes
$ kubectl get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.11   True        False         30h     Cluster version is 4.10.11
# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default
# Deploy the NVIDIA GPU Operator v2 via OLM
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/deploy.yaml
# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the selected nodes
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/deviceconfig.yaml
# Wait until all NVIDIA GPU components are healthy
$ kubectl get -n nvidia-gpu-operator all
# Verify the setup by running a sample GPU workload pod
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/sample-gpu-workload.yaml
# Check the GPU workload logs
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
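If the GPU stack is healthy, the vector addition sample runs to completion; its logs typically end with a line such as the one below (the exact output can differ between versions of the CUDA sample image):

  [Vector addition of 50000 elements]
  ...
  Test PASSED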