
NVIDIA GPU Operator Version 2

TL;DR
# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default
# Deploy the NVIDIA GPU Operator Version 2
$ git clone git@github.com:smgglrs/nvidia-gpu-operator.git && cd nvidia-gpu-operator
$ make deploy
# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the selected nodes (a sketch follows these commands)
$ kubectl apply -f config/samples/gpu_v1alpha1_deviceconfig.yaml
# Wait until all NVIDIA components are healthy
$ kubectl get -n nvidia-gpu-operator all
# Run a sample GPU workload pod
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: OnFailure
  containers:
  - image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0-ubi8
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
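The DeviceConfig created above is the custom resource this operator reconciles. As a rough sketch only (the field names below are illustrative assumptions; the authoritative schema is in config/samples/gpu_v1alpha1_deviceconfig.yaml), a DeviceConfig targeting a single node group might look like:

  apiVersion: gpu.example.com/v1alpha1   # illustrative group/version, check the sample manifest
  kind: DeviceConfig
  metadata:
    name: deviceconfig-sample
    namespace: nvidia-gpu-operator
  spec:
    # hypothetical field names, shown only to convey the idea:
    # select the target GPU nodes and reference a driver image whose
    # tag matches the kernel version running on those nodes
    nodeSelector:
      nvidia.com/gpu.present: "true"
    driverImage: <registry>/nvidia-driver:<driver-version>-<kernel-version>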
Overview
Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand
adapters and other devices through the device plugin framework.
However, configuring and managing Kubernetes nodes with these hardware resources requires the
configuration of multiple software components, such as drivers, container runtimes, device plugins
and other libraries, which is time-consuming and error-prone.
The NVIDIA GPU Operator Version 2
is used in tandem with the Out-of-tree Operator to automate the management of all the NVIDIA
software components needed to provision GPUs within Kubernetes, from the drivers to the respective
monitoring metrics.
Dependencies
Components
The components managed by the operator are:
Design
For a detailed description of the design trade-offs of the NVIDIA GPU Operator v2 check this
doc.
Changelog
- Removed the NVIDIA driver and device plugin management; this is now offloaded to the Out-of-tree Operator
- Deprecated the dependency on Node Feature Discovery, making it optional
- Added support for deploying different GPU configurations per node group, by refactoring the NVIDIA GPU
  Operator v1 singleton ClusterPolicy CR (also renamed to DeviceConfig)
- Added a GitHub Actions workflow that builds the operator, the OLM bundle and the OLM catalog images for
  the main branch
- Added the NVIDIA GPU Prometheus Exporter, which exports additional metrics, e.g. gpu_info, following the
  respective practices, so as not to rely only on Kubernetes node labels (via the NVIDIA GPU Feature
  Discovery); see the example after this list
- Added 80%+ test coverage
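For illustration only (the label set below is an assumption, not the exporter's documented output), an info-style metric exposes static metadata as labels on a constant series with value 1:

  gpu_info{uuid="GPU-<uuid>", name="<gpu-model>", driver_version="<driver-version>"} 1

Such a series can then be joined with utilization metrics on a shared label, instead of relying only on node labels published by the NVIDIA GPU Feature Discovery.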
OpenShift
Use the NVIDIA GPU Operator Version 2, along with the Out-of-tree Operator, in your OpenShift cluster to
automatically provision and manage different GPU configurations per node group.
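To make the per-node-group idea concrete (again with illustrative, not authoritative, field names, building on the sketch in the TL;DR section), two node groups running different kernels could each get their own DeviceConfig, differing only in the node selector and the driver image tag:

  apiVersion: gpu.example.com/v1alpha1   # illustrative; see config/samples for the real group/version
  kind: DeviceConfig
  metadata:
    name: deviceconfig-nodegroup-a
    namespace: nvidia-gpu-operator
  spec:
    nodeSelector:
      node-group: a          # hypothetical label identifying the node group
    driverImage: <registry>/nvidia-driver:<driver-version>-<kernel-version-of-group-a>
  ---
  apiVersion: gpu.example.com/v1alpha1
  kind: DeviceConfig
  metadata:
    name: deviceconfig-nodegroup-b
    namespace: nvidia-gpu-operator
  spec:
    nodeSelector:
      node-group: b
    driverImage: <registry>/nvidia-driver:<driver-version>-<kernel-version-of-group-b>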
The following guide leverages the container images (operator, OLM bundle and OLM catalog) that are
automatically built from the main branch:
# Given an OpenShift cluster with GPU powered nodes
$ kubectl get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.11   True        False         30h     Cluster version is 4.10.11
# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default
# Deploy the NVIDIA GPU Operator v2 via OLM
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/deploy.yaml
# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the selected nodes
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/deviceconfig.yaml
# Wait until all NVIDIA GPU components are healthy
$ kubectl get -n nvidia-gpu-operator all
# Verify the setup by running a sample GPU workload pod
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/sample-gpu-workload.yaml
# Check the GPU workload logs
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
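If the GPU stack is healthy, the vector addition sample runs to completion; its logs typically end with a line such as the one below (the exact output can differ between versions of the CUDA sample image):

  [Vector addition of 50000 elements]
  ...
  Test PASSED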