vodascheduler

Version: v0.2.2 (latest) | Published: Nov 6, 2022 | License: Apache-2.0

Repository

github.com/heyfey/vodascheduler

README

Voda Scheduler

Note that everything is experimental and may change significantly at any time.

Voda Scheduler is a GPU scheduler for elastic deep learning workloads, built on Kubernetes, Kubeflow Training Operator and Horovod.

Voda Scheduler is designed to be easily deployed in any Kubernetes cluster. For more architectural details, see design.

Contents

Why Elastic Training?
Why Voda Scheduler?
Get Started
Scheduling Algorithms
Docker Images
Prometheus Metrics Exposed
Related Projects

Why Elastic Training?

Elastic training enables distributed training jobs to be scaled up and down dynamically at runtime, without interrupting the training process. With elastic training, the scheduler can let training jobs use idle resources when there are any, and can make the most efficient resource allocations when the cluster is heavily loaded, which increases cluster throughput and reduces overall training time.

For more information about elastic training, see Elastic Horovod and Torch Distributed Elastic.

Why Voda Scheduler?

Voda Scheduler provides the following features for elastic deep learning workloads.

Rich Scheduling Algorithms (with Resource Elasticity)

To take full advantage of resource elasticity, Voda Scheduler implements several well-known algorithms for scheduling elastic deep learning workloads.

See the list of algorithms provided. System administrators can choose any of them.

You can also implement your own scheduling algorithms. Voda Scheduler collects run-time metrics of training jobs that can be useful for scheduling.
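As a rough illustration of what a custom elastic algorithm might look like, the sketch below defines a hypothetical `Algorithm` interface and an elastic FIFO-style policy. The `Job` type, its fields, and the `Schedule` signature are assumptions made for this example and are not taken from the Voda Scheduler code base; see the repository's `pkg/algorithm` directory for the actual interface.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical view of an elastic training job.
// The field names are illustrative, not Voda Scheduler's own types.
type Job struct {
	Name      string
	Submitted int // arrival order; smaller means earlier
	MinGPU    int // fewest GPUs the job can run with
	MaxGPU    int // most GPUs the job can make use of
}

// Algorithm is a hypothetical plug-in point: given the pending and running
// jobs and the cluster's GPU count, return the GPUs assigned to each job.
type Algorithm interface {
	Schedule(jobs []Job, totalGPU int) map[string]int
}

// ElasticFIFO admits jobs in arrival order at their minimum size, then
// hands any idle GPUs to the admitted jobs, again in arrival order.
type ElasticFIFO struct{}

func (ElasticFIFO) Schedule(jobs []Job, totalGPU int) map[string]int {
	// Copy before sorting so the caller's slice is left untouched.
	queue := append([]Job(nil), jobs...)
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].Submitted < queue[j].Submitted
	})

	alloc := make(map[string]int, len(queue))
	free := totalGPU

	// Pass 1: admit jobs in arrival order with their minimum GPU count.
	for _, j := range queue {
		if j.MinGPU > free {
			break // strict FIFO: later jobs wait behind the blocked one
		}
		alloc[j.Name] = j.MinGPU
		free -= j.MinGPU
	}

	// Pass 2: scale admitted jobs up with whatever is still idle.
	for _, j := range queue {
		n, admitted := alloc[j.Name]
		if !admitted {
			continue
		}
		extra := j.MaxGPU - n
		if extra < 0 {
			extra = 0
		}
		if extra > free {
			extra = free
		}
		alloc[j.Name] = n + extra
		free -= extra
	}
	return alloc
}

func main() {
	jobs := []Job{
		{Name: "a", Submitted: 1, MinGPU: 2, MaxGPU: 4},
		{Name: "b", Submitted: 2, MinGPU: 1, MaxGPU: 8},
		{Name: "c", Submitted: 3, MinGPU: 4, MaxGPU: 4},
	}
	var algo Algorithm = ElasticFIFO{}
	fmt.Println(algo.Schedule(jobs, 8))  // heavily loaded: jobs stay near their minimum
	fmt.Println(algo.Schedule(jobs, 16)) // idle capacity: jobs scale up to their maximum
}
```

Running the same policy against a larger cluster shows the elastic behaviour: the set of admitted jobs stays the same, but each job's allocation grows toward its maximum.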

Topology-Aware Scheduling

Job placement is critical to the performance of distributed computing jobs on GPU clusters. Voda Scheduler offers topology-aware scheduling and worker migration to consolidate resources, which minimizes communication overhead and maximizes cluster throughput.
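To make the idea of consolidation concrete, here is a minimal sketch of a greedy placement heuristic that packs a job's workers onto as few nodes as possible. It is not Voda Scheduler's placement algorithm (which, per the related projects, uses the Hungarian algorithm); the `Place` function and the node names are hypothetical and only illustrate why fewer nodes per job means less cross-node communication.

```go
package main

import (
	"fmt"
	"sort"
)

// Place decides how many workers of one job go on each node, given the free
// GPUs per node. It returns nil if the cluster cannot hold the job.
// This is an illustrative heuristic, not Voda Scheduler's implementation.
func Place(freeGPU map[string]int, workers int) map[string]int {
	type node struct {
		name string
		free int
	}
	nodes := make([]node, 0, len(freeGPU))
	total := 0
	for name, free := range freeGPU {
		if free > 0 {
			nodes = append(nodes, node{name, free})
			total += free
		}
	}
	if workers <= 0 || total < workers {
		return nil
	}

	// Sort by free GPUs ascending; break ties by name for determinism.
	sort.Slice(nodes, func(i, j int) bool {
		if nodes[i].free != nodes[j].free {
			return nodes[i].free < nodes[j].free
		}
		return nodes[i].name < nodes[j].name
	})

	// Best fit: the smallest single node that can hold the whole job,
	// so that all workers communicate within one machine.
	for _, n := range nodes {
		if n.free >= workers {
			return map[string]int{n.name: workers}
		}
	}

	// Otherwise fill the largest nodes first, which uses the fewest nodes.
	placement := make(map[string]int)
	for i := len(nodes) - 1; i >= 0 && workers > 0; i-- {
		take := nodes[i].free
		if take > workers {
			take = workers
		}
		placement[nodes[i].name] = take
		workers -= take
	}
	return placement
}

func main() {
	free := map[string]int{"node-1": 4, "node-2": 2, "node-3": 8}
	fmt.Println(Place(free, 3))  // fits on a single node
	fmt.Println(Place(free, 10)) // must span two nodes
}
```

A real placement step also has to weigh the cost of migrating workers that are already running, which this sketch ignores.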

Node Autoscaling and Fault-Tolerance

Voda Scheduler is aware of the addition and removal of computing nodes and adjusts its scheduling decisions accordingly. As a result, it:

- works with existing autoscalers.
- makes the best use of spot instances, which may come and go with little warning.
- tolerates failing nodes.

Fault-Tolerance of the Scheduler

Voda Scheduler adopts a microservice architecture.
- The training service has no single point of failure.
- The scheduler restarts on failure and restores its previous state.

No training job is interrupted when any Voda Scheduler component fails.

Prerequisite

A Kubernetes cluster, on-cloud or on-premise, that can schedule GPUs. Voda Scheduler is tested with v1.20.

Get Started

Scheduling Algorithms

Algorithm	Elastic	Reference
FIFO (default)
Elastic-FIFO	✔
SRJF
Elastic-SRJF	✔
Tiresias		Gu, Juncheng, et al. "Tiresias: A GPU cluster manager for distributed deep learning." 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 2019. https://www.usenix.org/conference/nsdi19/presentation/gu
Elastic-Tiresias	✔	Wu, Yidi, et al. "Elastic Deep Learning in Multi-Tenant GPU Clusters." IEEE Transactions on Parallel and Distributed Systems (2021). https://ieeexplore.ieee.org/abstract/document/9373916
FfDL Optimizer	✔	Saxena, Vaibhav, et al. "Effective elastic scaling of deep learning workloads." 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2020. https://ieeexplore.ieee.org/abstract/document/9285954
AFS-L	✔	Hwang, Changho, et al. "Elastic Resource Sharing for Distributed Deep Learning." 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 2021. https://www.usenix.org/system/files/nsdi21-hwang.pdf
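For readers unfamiliar with the non-elastic baselines in the table, the following sketch shows the core of a shortest-remaining-job-first (SRJF) policy: jobs are ordered by estimated remaining time and started while GPUs are available. The `Job` type, the `SRJF` function, and the choice to skip (rather than wait for) jobs that do not fit are assumptions made for illustration, not Voda Scheduler's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical non-elastic job with a fixed GPU request.
type Job struct {
	Name             string
	RemainingSeconds float64 // estimated remaining run time (assumed known)
	GPUs             int     // fixed number of GPUs requested
}

// SRJF returns the names of the jobs to run, shortest remaining time first.
// Jobs that do not fit in the remaining capacity are skipped in this sketch.
func SRJF(jobs []Job, totalGPU int) []string {
	// Copy before sorting so the caller's slice is left untouched.
	queue := append([]Job(nil), jobs...)
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].RemainingSeconds < queue[j].RemainingSeconds
	})

	var running []string
	free := totalGPU
	for _, j := range queue {
		if j.GPUs <= free {
			running = append(running, j.Name)
			free -= j.GPUs
		}
	}
	return running
}

func main() {
	jobs := []Job{
		{Name: "long", RemainingSeconds: 3600, GPUs: 4},
		{Name: "short", RemainingSeconds: 600, GPUs: 2},
		{Name: "mid", RemainingSeconds: 1800, GPUs: 4},
	}
	fmt.Println(SRJF(jobs, 6))
}
```

An elastic variant would additionally resize the selected jobs instead of treating each GPU request as fixed.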

Docker Images

Voda Scheduler Docker Images

Prometheus Metrics Exposed

Prometheus Metrics Exposed

Related Projects

kubeflow/training-operator: Training operators on Kubernetes.
kubeflow/mpi-operator: The MPIJob controller.
horovod/horovod: Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
heyfey/munkres: Hungarian algorithm used in the placement algorithm.
heyfey/nvidia_smi_exporter: For monitoring GPUs in the cluster.

Directories

Path	Synopsis
cmd
config
pkg
algorithm
allocator
allocator/allocator
common/mongo
common/rabbitmq
common/trainingjob
common/types
common/util
placement
scheduler
scheduler/scheduler
service
service/service