MPI-Operator

Version: v0.0.0-...-deab0c8. Published: Aug 25, 2022. License: Apache-2.0.

MPI Operator

Much of this project is based on the MPI Operator in Kubeflow. It is a stripped-down version, written with kubebuilder according to my own understanding.

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. Please check out this blog post for an introduction to MPI Operator and its industry adoption.

Installation

You need a Kubernetes cluster to run against: either a local cluster created with KIND for testing, or a remote cluster. You also need kustomize installed. Note: the controller automatically uses the current context in your kubeconfig file (i.e. whichever cluster kubectl cluster-info shows).

You can deploy the operator by running the following commands. By default, this creates the namespace 'sw-mpi-operator' and deploys everything into it.

git clone https://github.com/FFFFFaraway/MPI-Operator
cd MPI-Operator
make deploy

You can check whether the MPI Job custom resource is installed via:

kubectl get crd

The output should include mpijobs.batch.test.bdap.com like the following:

NAME                                       AGE
...
mpijobs.batch.test.bdap.com                4d
...

You can check whether the MPI Job Operator is running via:

kubectl get pod -n sw-mpi-operator
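
If you would rather block until the operator is up than poll by hand, kubectl wait can do that. The following is a minimal sketch, assuming the default 'sw-mpi-operator' namespace; the function name wait_for_operator and the 120-second default timeout are illustrative choices, not part of this project.

```shell
# Hypothetical helper: block until every pod in the operator namespace is Ready.
# Arguments: namespace (default: sw-mpi-operator), timeout (default: 120s).
wait_for_operator() {
  kubectl wait --for=condition=Ready pod --all \
    -n "${1:-sw-mpi-operator}" --timeout="${2:-120s}"
}
```

Call it as wait_for_operator, or as wait_for_operator sw-mpi-operator 300s to allow more time for the image pull.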

Creating an MPI Job

You can create an MPI job by defining an MPIJob config file. For example:

apiVersion: batch.test.bdap.com/v1
kind: MPIJob
metadata:
  name: simple-train-cpu
  namespace: sw-mpi-operator
spec:
  numWorkers: 5
  launcherTemplate:
    spec:
      containers:
        - args:
            - mkdir sample-python-train &&
              cd sample-python-train &&
              horovodrun -np 2 --hostfile $OMPI_MCA_orte_default_hostfile python generate_data.py &&
              horovodrun -np 2 --hostfile $OMPI_MCA_orte_default_hostfile python main.py
          command:
            - /bin/sh
            - -c
          image: farawaya/horovod-torch-cpu
          name: horovod-master
      restartPolicy: Never
  workerTemplate:
    spec:
      containers:
        - args:
            - git clone https://github.com/FFFFFaraway/sample-python-train.git &&
              cd sample-python-train &&
              pip install -r requirements.txt &&
              touch /ready.txt &&
              sleep infinity
          command:
            - /bin/sh
            - -c
          image: farawaya/horovod-torch-cpu
          name: horovod-worker
          readinessProbe:
            exec:
              command:
                - cat
                - /ready.txt
            initialDelaySeconds: 30
            periodSeconds: 5

Deploy the MPIJob resource:

kubectl apply -f config/samples/training_job_cpu.yaml

Note that the launcher pod uses all workers (numWorkers in the spec); the -np parameter passed to horovodrun does not seem to take effect.

Monitoring an MPI Job

You can inspect the logs to see the training progress. When the job starts, access the logs from the launcher pod:

kubectl logs simple-train-cpu-launcher -n sw-mpi-operator
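
To keep watching a long-running job, the same command can stream the logs with the -f flag. This sketch assumes the launcher pod is always named "<job name>-launcher", which is inferred from the example above rather than documented; follow_launcher_logs is a hypothetical helper name.

```shell
# Hypothetical helper: stream the launcher pod's logs as training progresses.
# Arguments: job name, namespace (default: sw-mpi-operator).
# Assumes the launcher pod is named "<job name>-launcher".
follow_launcher_logs() {
  kubectl logs -f "${1}-launcher" -n "${2:-sw-mpi-operator}"
}
```

For the sample job, this would be invoked as follow_launcher_logs simple-train-cpu.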

Editing MPI Job

Modify the MPIJob YAML file and apply it again.

  • If the Launcher is modified, you need to manually delete the existing Launcher Pod to trigger the update.
  • If the Worker is modified, there is no need to delete the Worker Pods manually; they are updated automatically.
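
The launcher case can be sketched as a single step: re-apply the manifest, then delete the old launcher pod. This assumes the launcher pod is named "<job name>-launcher", as in the sample job; update_mpijob is a hypothetical helper name, not something shipped with the operator.

```shell
# Hypothetical helper: re-apply an edited MPIJob, then delete its launcher pod
# so that the launcher change takes effect.
# Arguments: manifest path, job name, namespace (default: sw-mpi-operator).
# Assumes the launcher pod is named "<job name>-launcher".
update_mpijob() {
  kubectl apply -f "$1" &&
    kubectl delete pod "${2}-launcher" -n "${3:-sw-mpi-operator}"
}
```

For the sample job: update_mpijob config/samples/training_job_cpu.yaml simple-train-cpu. If only the Worker changed, kubectl apply alone is enough.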

Deleting MPI Job

Delete the MPIJob resource defined by the YAML file. All of its pods, configmaps, and RBAC resources are then deleted automatically.

You need to delete the MPIJob manually when it is finished; otherwise it keeps occupying GPU resources.
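
Deleting here means removing the resource from the cluster, not just the file on disk. A minimal sketch follows, using the fully qualified resource name from the CRD listing so that it cannot be confused with another MPIJob CRD installed in the same cluster; delete_mpijob is a hypothetical helper name.

```shell
# Hypothetical helper: delete an MPIJob by name.
# Arguments: job name, namespace (default: sw-mpi-operator).
delete_mpijob() {
  kubectl delete mpijobs.batch.test.bdap.com "$1" -n "${2:-sw-mpi-operator}"
}
```

For the sample job: delete_mpijob simple-train-cpu. Running kubectl delete -f config/samples/training_job_cpu.yaml against the original manifest should have the same effect.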

Uninstall

make undeploy

TODO List

  • Add MPIJob Status
  • Add Defaulter and Validator Webhook
  • Add scheduler

Docker Images

Directories

  • api/batch.test.bdap.com/v1: Package v1 is the v1alpha1 version of the API.
  • api/v1: Package v1 contains API Schema definitions for the batch v1 API group (+kubebuilder:object:generate=true, +groupName=batch.test.bdap.com).
  • client/clientset/versioned: This package has the automatically generated clientset.
  • client/clientset/versioned/fake: This package has the automatically generated fake clientset.
  • client/clientset/versioned/scheme: This package contains the scheme of the automatically generated clientset.
  • client/clientset/versioned/typed/batch.test.bdap.com/v1: This package has the automatically generated typed clients.
  • client/clientset/versioned/typed/batch.test.bdap.com/v1/fake: Package fake has the automatically generated clients.
