Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

This DRA resource driver is currently under active development and is not yet designed for production use. We will continually force-push over main until we have something more stable. Use at your own risk.

A document and demo of the DRA support for GPUs provided by this repo can be found below:

  - Document: Dynamic Resource Allocation (DRA) for GPUs in Kubernetes
  - Demo: Demo of Dynamic Resource Allocation (DRA) for GPUs in Kubernetes

Demo

This section describes how to use kind to demo the functionality of the NVIDIA GPU DRA Driver.

First, since we'll launch kind with GPU support, ensure that the following prerequisites are met:

  1. kind is installed. See the official documentation here.
  2. Ensure that the NVIDIA Container Toolkit is installed on your system. This can be done by following the instructions here.
  3. Configure the NVIDIA Container Runtime as the default Docker runtime:
    sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
    
  4. Restart Docker to apply the changes:
    sudo systemctl restart docker
    
  5. Set the accept-nvidia-visible-devices-as-volume-mounts option to true in the /etc/nvidia-container-runtime/config.toml file. This configures the NVIDIA Container Runtime to use volume mounts to select the devices to inject into a container, as shown below.
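    For example (a sketch only): check whether the key is already present,
    grep accept-nvidia-visible-devices-as-volume-mounts /etc/nvidia-container-runtime/config.toml
    
    and if it is absent, append it (if it is present but set to false, edit the value in place instead of appending a duplicate key):
    echo 'accept-nvidia-visible-devices-as-volume-mounts = true' | sudo tee -a /etc/nvidia-container-runtime/config.toml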

We start by cloning this repository and changing into its directory. All of the scripts and example Pod specs used in this demo are in the demo subdirectory, so take a moment to browse through the various files and see what's available:

git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver
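
One quick way to get an overview of what's in there (a sketch; the exact layout may change as the project evolves):

find demo -maxdepth 3 -type d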
Setting up the infrastructure

First, create a kind cluster to run the demo:

./demo/clusters/kind/create-cluster.sh
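
You can sanity-check that the cluster came up (the cluster name is set by the script, so yours may differ):

kind get clusters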

From here we will build the image for the example resource driver:

./demo/clusters/kind/build-dra-driver.sh

This also makes the built images available to the kind cluster.
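
Making locally built images visible to kind nodes is typically done with kind's image-loading command. A sketch with placeholder image and cluster names (both are placeholders; the build script handles this for you):

kind load docker-image <dra-driver-image>:<tag> --name <cluster-name>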

We now install the NVIDIA GPU DRA driver:

./demo/clusters/kind/install-dra-driver.sh

This should show two pods running in the nvidia-dra-driver namespace:

$ kubectl get pods -n nvidia-dra-driver
NAME                                     READY   STATUS    RESTARTS   AGE
nvidia-dra-controller-6bdf8f88cc-psb4r   1/1     Running   0          34s
nvidia-dra-plugin-lt7qh                  1/1     Running   0          32s
Run the examples by following the steps in the demo script

Finally, you can run the various examples contained in the demo/specs/quickstart folder. The README in that directory shows the full script of the demo you can walk through.

cat demo/specs/quickstart/README.md
...
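
For orientation, the quickstart specs typically pair a ResourceClaimTemplate with Pods that reference it. The following is a minimal sketch only: the API version (resource.k8s.io/v1alpha2), the resource class name (gpu.nvidia.com), the image, and all object names are assumptions, so treat the real files in demo/specs/quickstart as authoritative.

apiVersion: resource.k8s.io/v1alpha2   # assumed DRA API version
kind: ResourceClaimTemplate
metadata:
  name: gpu-template                   # hypothetical name
  namespace: gpu-test1
spec:
  spec:
    resourceClassName: gpu.nvidia.com  # assumed resource class name
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  namespace: gpu-test1
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04                # illustrative image
    command: ["bash", "-c", "nvidia-smi -L; sleep 9999"]
    resources:
      claims:
      - name: gpu                      # consumes the claim below
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-template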

Running the first three examples should produce output similar to the following. In the logs, note that gpu-test1's two pods each get their own GPU (distinct UUIDs), while gpu-test2 shares one GPU between two containers of a single pod and gpu-test3 shares one GPU between two pods (matching UUIDs):

$ kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
...

$ kubectl get pod -A
NAMESPACE           NAME                                       READY   STATUS    RESTARTS   AGE
gpu-test1           pod1                                       1/1     Running   0          34s
gpu-test1           pod2                                       1/1     Running   0          34s
gpu-test2           pod                                        2/2     Running   0          34s
gpu-test3           pod1                                       1/1     Running   0          34s
gpu-test3           pod2                                       1/1     Running   0          34s
...

$ kubectl logs -n gpu-test1 -l app=pod
GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)

$ kubectl logs -n gpu-test2 pod --all-containers
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)

$ kubectl logs -n gpu-test3 -l app=pod
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
Cleaning up the environment

Running

$ ./demo/clusters/kind/delete-cluster.sh

will remove the cluster created in the preceding steps.

Directories

Path Synopsis
api
cmd
internal
pkg
nvidia.com/resource/clientset/versioned/fake
    This package has the automatically generated fake clientset.
nvidia.com/resource/clientset/versioned/scheme
    This package contains the scheme of the automatically generated clientset.
nvidia.com/resource/clientset/versioned/typed/gpu/v1alpha1
    This package has the automatically generated typed clients.
nvidia.com/resource/clientset/versioned/typed/gpu/v1alpha1/fake
    Package fake has the automatically generated clients.
nvidia.com/resource/clientset/versioned/typed/nas/v1alpha1
    This package has the automatically generated typed clients.
nvidia.com/resource/clientset/versioned/typed/nas/v1alpha1/fake
    Package fake has the automatically generated clients.
