gke-tpu-env-injector

command module
v0.5.0 Latest
Published: Sep 5, 2023 License: MIT Imports: 14 Imported by: 0


Badges: CI, Go Report Card, MIT licensed, GitHub release (latest SemVer), Kubernetes 1.24, 1.25, 1.26, 1.27

Automatically inject the environment variables used by libtpu when running TPUs on GKE.

On August 31, 2023, Google officially released support for running their TPU VMs (v4 and v5e) on Google Kubernetes Engine.

The tpu_driver application, together with the accompanying TPU device driver that gets installed on GKE clusters with TPU support enabled, requires two environment variables to be available to your applications when you run them: TPU_WORKER_ID and TPU_WORKER_HOSTNAMES.

Taken directly from the GCP documentation:

TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique
worker-id in the TPU slice. The supported values for this field range from zero
to the number of Pods minus one.

TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses
that need to communicate with each other within the slice. There should be a
hostname or IP address for each TPU VM in the slice. The list of IP addresses or
hostnames are ordered and zero indexed by the TPU_WORKER_ID.

Supporting these two environment variables requires two things: TPU_WORKER_ID must be injected dynamically into each application, and TPU_WORKER_HOSTNAMES must contain an individually addressable DNS name for each worker. Together, the workers represent the pieces of a TPU PodSlice.

Conveniently, GKE will automatically inject these environment variables into pods for you, but it only does so under very specific conditions:

GKE automatically injects these environment variables by using a mutating
webhook when a Job is created with the completionMode: Indexed, subdomain,
parallelism > 1, and requesting google.com/tpu properties.

However, what if you're not launching a Kubernetes Job at all? What if you have your own applications to launch that still need these environment variables? That's what gke-tpu-env-injector is for.

gke-tpu-env-injector will do this same environment variable injection for Kubernetes StatefulSets, which can also leverage a Kubernetes headless service, in order to get predictable, addressable individual pod DNS addresses.
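To illustrate why a StatefulSet with a headless service fits this requirement, here is a minimal sketch of how the two values could be derived. It relies on two standard Kubernetes behaviors: StatefulSet pods are named `<statefulset-name>-<ordinal>`, and each pod behind a headless service is resolvable as `<pod>.<service>` from within the same namespace. The names `workerID`, `workerHostnames`, `tpu-worker`, and `tpu-svc` are hypothetical and chosen for illustration; this is not necessarily how gke-tpu-env-injector computes the values internally.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// workerID extracts the ordinal suffix from a StatefulSet pod name such as
// "tpu-worker-2". The ordinal is a natural candidate for TPU_WORKER_ID because
// it runs from zero to the number of pods minus one.
func workerID(podName string) (int, error) {
	i := strings.LastIndex(podName, "-")
	if i < 0 || i == len(podName)-1 {
		return 0, fmt.Errorf("pod name %q has no ordinal suffix", podName)
	}
	id, err := strconv.Atoi(podName[i+1:])
	if err != nil {
		return 0, fmt.Errorf("pod name %q has a non-numeric ordinal: %w", podName, err)
	}
	return id, nil
}

// workerHostnames builds the ordered, comma-separated host list, one entry per
// replica, indexed the same way as the ordinals returned by workerID.
func workerHostnames(statefulSet, service string, replicas int) string {
	var hosts []string
	for i := 0; i < replicas; i++ {
		hosts = append(hosts, fmt.Sprintf("%s-%d.%s", statefulSet, i, service))
	}
	return strings.Join(hosts, ",")
}

func main() {
	// Hypothetical StatefulSet "tpu-worker" with 4 replicas behind the
	// headless service "tpu-svc".
	id, err := workerID("tpu-worker-2")
	if err != nil {
		panic(err)
	}
	fmt.Printf("TPU_WORKER_ID=%d\n", id)
	fmt.Printf("TPU_WORKER_HOSTNAMES=%s\n", workerHostnames("tpu-worker", "tpu-svc", 4))
}
```

Because the ordinal and the DNS names are both derived from the StatefulSet name, entry N in the host list always belongs to the pod whose worker ID is N.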

gke-tpu-env-injector does this through the Kubernetes-native MutatingAdmissionWebhook functionality, which intercepts StatefulSets and Pods at admission time when they are annotated with gke-tpu-env-injector.aaronbatilo.dev/inject: enabled.

Getting started

To install gke-tpu-env-injector, we've provided a Helm chart that's hosted on the GitHub Container Registry as an OCI artifact:

helm upgrade --install gke-tpu-env-injector oci://ghcr.io/abatilo/gke-tpu-env-injector --set cert-manager.enabled=true

Setting cert-manager.enabled=true will create a certificate authority for self-signed certificates, request the required TLS certificates from cert-manager, and mount them so that gke-tpu-env-injector can receive encrypted webhooks from the Kubernetes control plane.

Configuration

| CLI flag | Environment variable | Description | Default |
| --- | --- | --- | --- |
| --tls-cert-file | GTEI_TLS_CERT_FILE | The path to the file containing the default x509 certificate for HTTPS. | /etc/tls/tls.crt |
| --tls-key-file | GTEI_TLS_KEY_FILE | The path to the file containing the default x509 private key matching --tls-cert-file. | /etc/tls/tls.key |
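As an illustration of how each flag in the table above pairs with an environment variable, the sketch below wires the two settings with Go's standard `flag` package. The precedence shown (flag, then environment variable, then default) is an assumption, as are the helper names `envOr` and `parseConfig`; the README does not say which flag library or precedence the project actually uses.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the environment variable key, or fallback when
// the variable is unset or empty. The lookup function is injected so the
// behavior can be exercised without touching the real environment.
func envOr(lookup func(string) (string, bool), key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

// config holds the two TLS settings documented in the table.
type config struct {
	certFile string
	keyFile  string
}

// parseConfig resolves each setting with an assumed precedence of
// CLI flag > environment variable > documented default.
func parseConfig(args []string, lookup func(string) (string, bool)) (config, error) {
	var c config
	fs := flag.NewFlagSet("gke-tpu-env-injector", flag.ContinueOnError)
	fs.StringVar(&c.certFile, "tls-cert-file",
		envOr(lookup, "GTEI_TLS_CERT_FILE", "/etc/tls/tls.crt"),
		"path to the file containing the default x509 certificate for HTTPS")
	fs.StringVar(&c.keyFile, "tls-key-file",
		envOr(lookup, "GTEI_TLS_KEY_FILE", "/etc/tls/tls.key"),
		"path to the file containing the x509 private key matching --tls-cert-file")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := parseConfig(os.Args[1:], os.LookupEnv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("cert=%s key=%s\n", c.certFile, c.keyFile)
}
```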
