nvidia-gpu-scheduler

module

Version: v0.2.0 (latest) | Published: Feb 21, 2022 | License: Apache-2.0

Repository

github.com/caden2016/nvidia-gpu-scheduler


README

NVIDIA device scheduler extender for Kubernetes


Introduction

With the NVIDIA device plugin for Kubernetes and the kubelet device plugin manager, pods can be scheduled by GPU count. In some clusters, however, a node carries GPU devices of several different models, and we want Kubernetes to schedule a pod (for example, one that needs 2 GPUs of model x) only onto nodes that can satisfy the request. nvidia-gpu-scheduler makes this possible. It also helps monitor which GPUs each pod uses and the GPU information of each node.

Features and Components

Features

Real-time data acquisition. (Data is published promptly even if the gpuserver or the gpuserver-ds on any node is restarted.)
Timely health checks. (The gpunode-lifecycle-controller in gpuserver checks the health of each node using the lease that gpuserver-ds keeps refreshing.)
Scheduler extension points: Filter, Score, Preempt. (Nodes are filtered by the nvidia-gpu-scheduler/gpu.model annotation of the requesting pod and scored by the number of GPUs of the requested model on each node.)
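The Filter and Score behaviour described above can be sketched in Go as follows. This is a minimal illustration, not the project's implementation: the nodeGPUs inventory type, the function names, the model strings, and the scaling of scores to a 0..10 range are all assumptions made for the example. Only the annotation key comes from this README.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuModelAnnotation is the pod annotation named in the feature list.
const gpuModelAnnotation = "nvidia-gpu-scheduler/gpu.model"

// nodeGPUs is a hypothetical per-node inventory: GPU model name -> device count.
type nodeGPUs map[string]int

// filterNodes keeps the nodes that hold at least `want` GPUs of the requested model.
func filterNodes(inventory map[string]nodeGPUs, model string, want int) []string {
	var fit []string
	for name, gpus := range inventory {
		if gpus[model] >= want {
			fit = append(fit, name)
		}
	}
	sort.Strings(fit) // map iteration order is random; sort for stable output
	return fit
}

// scoreNodes ranks candidates by their count of GPUs of the requested model.
// Scaling to 0..10 is an assumption of this sketch.
func scoreNodes(inventory map[string]nodeGPUs, candidates []string, model string) map[string]int {
	best := 0
	for _, n := range candidates {
		if c := inventory[n][model]; c > best {
			best = c
		}
	}
	scores := make(map[string]int, len(candidates))
	for _, n := range candidates {
		scores[n] = 0
		if best > 0 {
			scores[n] = inventory[n][model] * 10 / best
		}
	}
	return scores
}

func main() {
	// Hypothetical cluster inventory and pod annotations.
	inventory := map[string]nodeGPUs{
		"node-a": {"Tesla-T4": 4},
		"node-b": {"Tesla-T4": 1, "Tesla-V100": 2},
		"node-c": {"Tesla-V100": 8},
	}
	podAnnotations := map[string]string{gpuModelAnnotation: "Tesla-T4"}

	model := podAnnotations[gpuModelAnnotation]
	candidates := filterNodes(inventory, model, 1)
	fmt.Println("candidates:", candidates)
	fmt.Println("scores:", scoreNodes(inventory, candidates, model))
}
```

Whether a node with more matching GPUs should score higher or lower is a policy choice. The sketch favours the node with the most devices of the requested model.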

Components

The NVIDIA device scheduler extender for Kubernetes contains a StatefulSet (gpuserver) and a DaemonSet (gpuserver-ds):

gpuserver

Provides the following APIs to help monitor GPU pod and GPU node information:

GET /apis/nvidia-gpu-scheduler/v1/gpupods?watch=true
GET /apis/nvidia-gpu-scheduler/v1/gpunodes?watch=true

Provides the following APIs so that gpuserver can extend kube-scheduler as an HTTPExtender:

POST /apis/nvidia-gpu-scheduler/v1/schedule/filter
POST /apis/nvidia-gpu-scheduler/v1/schedule/prioritize
POST /apis/nvidia-gpu-scheduler/v1/schedule/preempt

gpuserver-ds

Reports each node's GPU device information to gpuserver.

It gets the GPU devices used by pods from the kubelet gRPC server PodResourcesServer.
It gets GPU device information from NVML.

Please note: the following changes are unnecessary if every node in your cluster has only one GPU model. If a node has more than one GPU model, the following two changes are also needed so that a pod scheduled to that node's kubelet gets GPUs of the model it requested.

The stock Kubernetes kubelet does not support scheduling pods by GPU model, so it needs to be changed.
The stock NVIDIA device plugin for Kubernetes needs to be changed to pass GPU model information to the kubelet through a modified kubelet device plugin API.

Prerequisites

The prerequisites for running the NVIDIA device scheduler extender are:

NVIDIA device plugin for Kubernetes.
Kubernetes >= v1.13 (gpuserver-ds gets pod GPU information from the kubelet podresources API.)

Quick Start

Build with docker.

$ make all REGISTRY=docker.io/<yourname>

Add an extender configuration to kubernetes kube-scheduler config file.

$ cat kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
...
extenders:
 - urlPrefix: 'https://<kube-apiserver>:6443/apis/nvidia-gpu-scheduler/v1/schedule'
   filterVerb: filter
   prioritizeVerb: prioritize
   preemptVerb: preempt
   weight: 1
   enableHttps: true
   nodeCacheCapable: true
   ignorable: true
   TLSConfig:
     CAFile: /etc/kubernetes/ssl/ca.pem
     CertFile: /etc/kubernetes/ssl/admin.pem
     KeyFile: /etc/kubernetes/ssl/admin-key.pem
profiles:
- schedulerName: default-scheduler

Deploy with helm

Current version of nvidia-gpu-scheduler is v0.2.0. The preferred way to deploy it is using helm.

Instructions for installing helm can be found here. The simple guide for helm with nvidia-gpu-scheduler repo can be found here

Add and Update chart repo

# helm repo add ngs https://caden2016.github.io/nvidia-gpu-scheduler
# helm repo update

Install from the chart repo. xxx is the release name; nodeinfo=gpu is the label of the GPU nodes on which gpuserver-ds is deployed.

# helm install xxx ngs/nvidia-gpu-scheduler --version 0.2.0 --namespace kube-system  --set nodeSelectorDaemonSet.nodeinfo=gpu
# helm  list --namespace kube-system

Building and Running Locally

Versioning

Versioning follows SemVer. The first version following this scheme has been tagged v0.0.0.

Going forward, the major version of the nvidia-gpu-scheduler will only change following a change in the kubelet podresources API itself. For example, version v1alpha1 of kubelet podresources API corresponds to version v0.x.x of nvidia-gpu-scheduler. If a new v2beta2 version of kubelet podresources API comes out, then nvidia-gpu-scheduler will increase its major version to 1.x.x.

As of now, the podresources API for Kubernetes >= v1.13 is v1alpha1, with v1 added compatibly. If you run Kubernetes >= v1.13, you can deploy any nvidia-gpu-scheduler version > v0.0.0.

Directories

Path	Synopsis
api
gpunode/v1	Package v1 contains API Schema definitions for the resources.scheduler v1 API group +kubebuilder:object:generate=true +groupName=resources.scheduler.caden2016.github.io
gpupod/v1	Package v1 contains API Schema definitions for the resources.scheduler v1 API group +kubebuilder:object:generate=true +groupName=resources.scheduler.caden2016.github.io
jsonstruct	Package jsonstruct describe how data is serialized to json in communication between gpuserver and gpuserver-ds.
cmd
gpuserver
gpuserver-ds
gpuserver-ds/app
gpuserver-ds/app/options
gpuserver/app
gpuserver/app/options
pkg
generated/gpunode/clientset/versioned	This package has the automatically generated clientset.
generated/gpunode/clientset/versioned/fake	This package has the automatically generated fake clientset.
generated/gpunode/clientset/versioned/scheme	This package contains the scheme of the automatically generated clientset.
generated/gpunode/clientset/versioned/typed/gpunode/v1	This package has the automatically generated typed clients.
generated/gpunode/clientset/versioned/typed/gpunode/v1/fake	Package fake has the automatically generated clients.
generated/gpunode/informers/externalversions
generated/gpunode/informers/externalversions/gpunode
generated/gpunode/informers/externalversions/gpunode/v1
generated/gpunode/informers/externalversions/internalinterfaces
generated/gpunode/listers/gpunode/v1
generated/gpupod/clientset/versioned	This package has the automatically generated clientset.
generated/gpupod/clientset/versioned/fake	This package has the automatically generated fake clientset.
generated/gpupod/clientset/versioned/scheme	This package contains the scheme of the automatically generated clientset.
generated/gpupod/clientset/versioned/typed/gpupod/v1	This package has the automatically generated typed clients.
generated/gpupod/clientset/versioned/typed/gpupod/v1/fake	Package fake has the automatically generated clients.
generated/gpupod/informers/externalversions
generated/gpupod/informers/externalversions/gpupod
generated/gpupod/informers/externalversions/gpupod/v1
generated/gpupod/informers/externalversions/internalinterfaces
generated/gpupod/listers/gpupod/v1
gpuserver-ds/controller
gpuserver-ds/podresources
gpuserver-ds/podresources/util
gpuserver/certs/util
gpuserver/certs/util/generator
gpuserver/certs/util/generator/fake
gpuserver/certs/util/writer
gpuserver/certs/util/writer/atomic
gpuserver/controller
gpuserver/router
gpuserver/router/init
gpuserver/router/metricserver
gpuserver/router/schedulerserver
gpuserver/scheduler/framework
gpuserver/scheduler/framework/plugins
gpuserver/scheduler/framework/plugins/names
gpuserver/scheduler/framework/plugins/noderesources
gpuserver/scheduler/framework/runtime
nameflag
util
util/info/metadata
util/server
util/server/cache
util/server/watcher	Package watcher notify the change of gpupod and gpunode to watchers from rest api in metricserver.
util/serverds
util/signal
