kill-gpu-zombie-pod

command module
v0.0.0-...-60f371e Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 13, 2023 License: MIT Imports: 21 Imported by: 0

README

kill-gpu-zombie-pod

Detect and delete GPU zombie pod in Kubernetes cluster.

In a GPU cluster, if a GPU is scheduled to a pod but has zero utilization continuously, the pod may be hanging or deadlocked, which harms the cluster utilization.

This repository provides a tool to automatically detect and delete these "GPU zombie pods" after a timeout period.

Quick Start

1. Deploy nvidia_smi_exporter

We scratch GPU metrics from heyfey/nvidia_smi_exporter

git clone https://github.com/heyfey/nvidia_smi_exporter.git
kubectl apply -f nvidia_smi_exporter/nvidia_smi_exporter.yaml 
2. Deploy kill-gpu-zombie-pod
git clone https://github.com/heyfey/kill-gpu-zombie-pod.git
cd kill-gpu-zombie-pod
kubectl apply -f kill-gpu-zombie-pod.yaml

args:

-check_period_seconds float
        Check for zombie every # seconds (default 10)
-idle_timeout_seconds float
        Kill the pod after idle timeout of # seconds (default 90)
-namespace string
        Detect and kill GPU zombie pod in the namespace (default "default")

You can specify args in the YAML

3. Done!

Zombie Example

kubectl apply -f gpu-zombie-pod.yaml

Build Image

docker build  .

Documentation

The Go Gopher

There is no documentation for this package.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL