README
NVIDIA GPU Operator Version 2
TL;DR
# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default
# Deploy the NVIDIA GPU Operator Version 2
$ git clone git@github.com:smgglrs/nvidia-gpu-operator.git && cd nvidia-gpu-operator
$ make deploy
# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the nodes selected
$ kubectl apply -f config/samples/gpu_v1alpha1_deviceconfig.yaml
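# For reference, a minimal DeviceConfig might look like the sketch below.
# The gpu.nvidia.com/v1alpha1 group and the DeviceConfig kind come from this
# repo's api/v1alpha1 package; `driverImage` is the field referenced in the
# note above, while `selector` and the NFD label are assumptions made purely
# for illustration; consult config/samples for the real schema.
$ cat <<EOF
apiVersion: gpu.nvidia.com/v1alpha1
kind: DeviceConfig
metadata:
  name: deviceconfig-sample
spec:
  driverImage: <registry>/<driver-image>:<driver-version>-<kernel-version>
  selector:
    feature.node.kubernetes.io/pci-10de.present: "true"
EOF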
# Wait until all NVIDIA components are healthy
$ kubectl get -n nvidia-gpu-operator all
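# A hedged alternative to polling: block until every pod in the namespace
# reports Ready (the timeout value is arbitrary)
$ kubectl wait --for=condition=Ready pods --all -n nvidia-gpu-operator --timeout=300s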
# Run a sample GPU workload pod
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: OnFailure
  containers:
  - image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0-ubi8
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
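# If the GPU was scheduled correctly, the logs should resemble the
# vectoradd sample's canonical success output:
[Vector addition of 50000 elements]
Copying input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done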
Overview
Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters and other devices through the device plugin framework.
However, configuring and managing Kubernetes nodes with these hardware resources requires the configuration of multiple software components, such as drivers, container runtimes, device plugins and other libraries, which is time-consuming and error-prone.
The NVIDIA GPU Operator Version 2 is used in tandem with the Out-of-tree Operator to automate the management of all the NVIDIA software components needed to provision GPUs within Kubernetes, from the drivers to the respective monitoring metrics.
Dependencies
- Out-of-tree Operator
- (optional) Node Feature Discovery
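Node Feature Discovery is only needed if you want GPU nodes labelled automatically (e.g. `feature.node.kubernetes.io/pci-10de.present=true`, derived from NVIDIA's PCI vendor ID). If you opt in, the upstream quick-start kustomization can be applied directly; pin `ref` to a release, `v0.11.0` here being just an example:

$ kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.0"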
Components
The components managed by the operator are:
- Out-of-tree Operator Module, which manages:
  - NVIDIA drivers (to enable CUDA)
  - Kubernetes device plugin for GPUs
- NVIDIA GPU Prometheus Exporter
- NVIDIA Container Toolkit
- NVIDIA DCGM
- NVIDIA DCGM Prometheus Exporter
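To spot-check the metrics path end to end, the DCGM exporter can be scraped by hand. The upstream dcgm-exporter listens on port 9400 by default, and `DCGM_FI_DEV_GPU_UTIL` is one of its standard gauges; the Service name below is a guess, so adjust it to whatever `kubectl get svc -n nvidia-gpu-operator` reports:

$ kubectl port-forward -n nvidia-gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL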
Design
For a detailed description of the design trade-offs of the NVIDIA GPU Operator v2, see this doc.
Changelog
- Removed the NVIDIA driver and device plugin management; this is offloaded to the Out-of-tree Operator
- Deprecated the hard dependency on Node Feature Discovery, making it optional
- Added support for deploying different GPU configurations per node group, by refactoring the NVIDIA GPU Operator v1 singleton `ClusterPolicy` CR (also renamed it to `DeviceConfig`)
- Added a GitHub Actions workflow that builds the operator, the OLM bundle and the OLM catalog images for the `main` branch
- Added the NVIDIA GPU Prometheus Exporter, which exports additional metrics, e.g. `gpu_info`, following the respective Prometheus practices, in order not to rely only on Kubernetes node labels (via the NVIDIA GPU Feature Discovery); see the sketch after this list
- Added 80%+ test coverage
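For context, `gpu_info` follows the Prometheus "info metric" convention: a gauge pinned at the value 1 whose labels carry the metadata, so that other series can join against it in queries. A hypothetical exposition is sketched below; the label names and values are illustrative, not the exporter's actual schema:

# HELP gpu_info Metadata about the GPUs present on the node
# TYPE gpu_info gauge
gpu_info{uuid="GPU-...",device_name="Tesla T4",driver_version="510.47.03"} 1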
OpenShift
Use the NVIDIA GPU Operator Version 2, along with the Out-of-tree Operator, in your OpenShift cluster to automatically provision and manage different GPU configurations per node group.
The following guide leverages the automatically generated container images of:
- the NVIDIA GPU v2 operator
- the operator OLM bundle
- the OLM CatalogSource
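For orientation, a CatalogSource pointing OLM at a catalog image follows the standard `operators.coreos.com/v1alpha1` shape. The sketch below is illustrative, with a placeholder image reference rather than this repo's actual registry path; the `deploy.yaml` applied further down presumably carries the real one:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: nvidia-gpu-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: <registry>/nvidia-gpu-operator-catalog:main
  displayName: NVIDIA GPU Operator v2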
# Given an OpenShift cluster with GPU powered nodes
$ kubectl get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.11 True False 30h Cluster version is 4.10.11
# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default
# Deploy the NVIDIA GPU Operator v2 via OLM
$ kubectl apply -f https://raw.githubusercontent.com/smgglrs/nvidia-gpu-operator/main/hack/openshift/deploy.yaml
# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the selected nodes
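#
# To find the kernel version that the `driverImage` tag must match,
# query the nodes' nodeInfo with plain kubectl
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion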
$ kubectl apply -f https://raw.githubusercontent.com/smgglrs/nvidia-gpu-operator/main/hack/openshift/deviceconfig.yaml
# Wait until all NVIDIA GPU components are healthy
$ kubectl get -n nvidia-gpu-operator all
# Verify the setup by running a sample GPU workload pod
$ kubectl apply -f https://raw.githubusercontent.com/smgglrs/nvidia-gpu-operator/main/hack/openshift/sample-gpu-workload.yaml
# Check the GPU workload logs
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
Directories

Path | Synopsis
---|---
api/v1alpha1 | Package v1alpha1 contains API Schema definitions for the gpu v1alpha1 API group. +kubebuilder:object:generate=true +groupName=gpu.nvidia.com