azure-pipelines-k8s-agent-scaler

A Kubernetes operator that provisions ephemeral Pods that run Azure DevOps Pipelines agents, as well as other sidecar containers.

This operator is written in Go, based on controller-runtime. We use kubebuilder for bootstrapping. This solution is completely unrelated to KEDA.

Background: why create yet another solution?

As of 2023, Azure Pipelines has the following methods for self-hosting elastically-scalable agents:

  1. Azure Virtual Machine Scale Set agents, which auto-scale slowly (several minutes of provisioning time) and make poor use of resources, since only one agent runs per VM
  2. KEDA (a general-purpose Kubernetes operator) using its Azure Pipelines scaler, or meta-level solutions based on it, e.g. Azure Pipelines Agent
  3. Kubernetes operators built specifically for Azure Pipelines agents, such as https://github.com/ogmaresca/azp-agent-autoscaler/ or https://github.com/microsoft/azure-pipelines-orchestrator, which have all been discontinued

While the KEDA-based solution is the most economical one, it has many technical drawbacks:

  • With KEDA, running agents whose pods have multiple containers (e.g. because you need the tools contained in the respective images) is cumbersome. Instead of using Azure Pipelines' demands / capabilities feature, you have to create a dedicated agent pool for each set of containers, and maintain a correspondingly large set of KEDA-specific YAML manifests
  • It is not easily possible to run agents with containers that are dynamically defined in your pipeline YAML file. Example: job #1 builds and pushes a Docker image (with a tag that depends on an Azure Pipelines variable) that you want to run on a KEDA-based agent in job #2, which starts after job #1 finishes. The only workaround is to start the dynamic container as an ephemeral container (in an already-running agent Pod), which has many other drawbacks: an ephemeral container cannot be protected from termination via a preStop lifecycle hook, it is invisible in most tools, and its resource usage is not accounted for via requests/limits
  • For every agent pool for which you configure KEDA, you need to define at least one agent Job/Deployment with a minReplicaCount larger than 0, as otherwise your jobs would not even start. This prevents you from using the "scale to zero" approach, unless you manually register a fake/dummy agent for each pool/demand yourself
  • If you use long-running agent pods (i.e., you do not provide the --once flag to the Azure Pipelines agent container), KEDA may prematurely kill your agent pods, resulting in aborted pipelines and many 'offline' agents in your agent pool. Why? Because KEDA scales your Deployments/Jobs only based on the number of pending jobs. Suppose two jobs are pending and KEDA scales your Deployment to two replicas. One job terminates quickly, the other one takes longer. The pending job count drops from 2 to 1, KEDA scales the Deployment back down, and Kubernetes may (arbitrarily) kill the pod that is still running the active job.
    • While you could just use ephemeral pods (with the --once flag), e.g. as Kubernetes Jobs, as done in https://github.com/clemlesne/azure-pipelines-agent, their disadvantage is that they lack support for cache volumes: Kubernetes has no mechanism to ensure that a cache volume is used by only one Job at a time (the "Once" in the ReadWriteOnce access mode is highly misleading, as it only restricts mounting to one node, not one pod)

For these reasons, this Kubernetes operator provides a better solution that solves all of the above problems at once.

Description

azure-pipelines-k8s-agent-scaler (this project) manages Kubernetes Pods that run the Azure Pipelines (AZP) agent Docker image. The pods are ephemeral: the agent container is started with the --once flag, so that it terminates after completing one Azure Pipelines job.

Features of azure-pipelines-k8s-agent-scaler:

  • Ability to specify multiple pod configurations, each one for a different set of Azure Pipelines capabilities. For each configuration, you can set a min/max pod count and define several sidecar containers, e.g. BuildKit, or any other tools you need in your pipeline
    • Because sidecar containers are regular, statically-defined containers (not ephemeral containers), you can kubectl exec into them. This is useful when you build pipelines and run into problems, e.g. a crashing container. Using kubectl exec, you can invoke debugging tools (like top or ps). You can also temporarily add a sleep N statement to the problematic bash: ... step in your pipeline YAML, and then interactively run different commands in the container until you figure out the correct one (see the sketch after this list)
  • Automatic termination of agent pods: once the AZP agent container has terminated, azure-pipelines-k8s-agent-scaler will automatically stop all other sidecar containers, to transition the pod into a terminated state
  • Automatic deletion of terminated pods (with the configurable ability to keep the N most recently terminated pods, for debugging purposes)
  • Careful termination of superfluous agent pods (which only happens under rare circumstances anyway): only those pods are killed that are currently not running any AZP job
  • Configurable definition of cache volumes that are mounted into the defined pods (e.g. to speed up BuildKit via a local cache). azure-pipelines-k8s-agent-scaler provisions new volumes if necessary, and re-mounts existing volumes to new pods, ensuring that each volume is mounted to only one pod at a time
  • Ability to specify extra containers (including their CPU and memory requests/limits) right in the AZP pipeline YAML file via demands (example: ExtraAgentContainers -equals name=containername,image=someImage:someTag,cpu=250m,memory=64Mi||name=otherContainerName,image=someOtherImage:someTag,cpu=500m,memory=128Mi). Note that the values can also be dynamic, e.g. by populating the demand with AZP variables (see the pipeline example after this list)
  • Automatic registration of (offline) dummy/fake AZP agents that have the demands you defined in your configuration. This is necessary because the AZP platform would otherwise abort jobs whose demands match no registered agent. This platform behavior conflicts with the dynamic registration of agents, as done by the azure-pipelines-k8s-agent-scaler operator, but it can be worked around via the automatic pre-registration of agents. Note: if you define ExtraAgentContainers, you can register new agents fully dynamically using the companion CLI tool azure-pipelines-agent-registrator
  • Operator can be installed into your Kubernetes cluster via Kustomize or Helm
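To illustrate the debugging workflow from the sidecar bullet above, here is a minimal sketch (the pod, container and namespace names are placeholders, not the operator's actual naming scheme):

# In the pipeline YAML: keep the problematic step alive long enough to inspect it
- bash: |
    sleep 600  # temporary: gives you time to exec into the container
  displayName: Debug pause

# On your machine: open a shell in the sidecar container of the agent pod
kubectl exec -it <agent-pod-name> -c <sidecar-container-name> -n <your-namespace> -- sh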
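And the ExtraAgentContainers demand could be used in a pipeline like this (the pool name is a placeholder; the demand value reuses the example format shown above):

pool:
  name: my-agent-pool  # placeholder: your AZP agent pool
  demands:
    - ExtraAgentContainers -equals name=containername,image=someImage:someTag,cpu=250m,memory=64Mi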

Getting Started

You need a Kubernetes cluster to run against, e.g. one provided by Docker Desktop, k3d, or KIND, or a remote cluster.

Note: Your controller automatically uses the current context in your kubeconfig file (i.e. whatever cluster kubectl cluster-info shows).

Running on the cluster

For local development, it is recommended to use the Kustomize manifests stored in the config/default folder.

Since this project is based on kubebuilder, it comes with various Make targets (check the Makefile for details).

Examples:

  • make install installs only the CRD (make uninstall removes the CRD again)
  • make docker-build builds the local Docker image that contains the controller-manager (by default tagged as controller:latest)
  • make deploy installs the Docker-based version into your cluster (installing the CRD, controller-manager Deployment and various RBAC-related manifests). You need to manually run make docker-build beforehand!
  • make undeploy reverts the effects of make deploy
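A typical local iteration could thus look like this (assuming a local cluster such as Docker Desktop's, where no image push is needed):

make docker-build   # builds the controller:latest image
make deploy         # installs CRD, controller-manager Deployment and RBAC manifests
# ... test your changes ...
make undeploy       # cleans up again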

For a production deployment, it is recommended to use the Helm chart instead. There is also a tutorial available here.

Next, create a dedicated Kubernetes namespace that hosts your AZP agent Pods (e.g. via kubectl create namespace <your-namespace>). Inside it, create a Secret that contains your AZP Personal Access Token:

kubectl create secret generic azure-pat --from-literal=pat=YOUR-PAT-HERE --namespace <your-namespace>

Finally, create your desired CustomResource (see the sample) and apply it to the cluster (to <your-namespace>), or use the demo-agent Helm chart.
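Applying the CustomResource could look like this (the file name is a placeholder; consult the sample for the actual schema):

kubectl apply -f my-custom-resource.yaml --namespace <your-namespace>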

Debugging

Check the logs of the controller Pod/container to identify problems.
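For example (the deployment and namespace names are placeholders; use whatever make deploy or the Helm chart created in your cluster):

kubectl logs -f deployment/<controller-manager-deployment> -n <operator-namespace>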

If your AZP jobs are pending (and you think that the operator should create Pods, but nothing happens and there is no log output either), you can temporarily enable additional debug prints:

  • Exec into the controller's container, which comes with a minimal shell that only supports creating and deleting files (see the sketch after this list)
  • Run touch /home/nonroot/debug to enable debug printing
  • Run rm /home/nonroot/debug to disable debug printing again
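As a minimal sketch (the deployment and namespace names are placeholders; kubebuilder-based projects typically name the deployment <project>-controller-manager):

kubectl exec -it deployment/<controller-manager-deployment> -n <operator-namespace> -- sh
touch /home/nonroot/debug   # enable debug printing
rm /home/nonroot/debug      # disable it again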

Development

How it works

This project aims to follow the Kubernetes Operator pattern.

It uses Controllers, which provide a reconcile function responsible for synchronizing resources until the desired state is reached on the cluster.
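As a minimal sketch of what a controller-runtime reconcile function looks like (the type name and the commented steps are illustrative assumptions; the operator's real API types live in api/v1):

package controller

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// ExampleReconciler is a placeholder for the operator's actual reconciler.
type ExampleReconciler struct {
    client.Client
}

// Reconcile is called whenever a watched resource changes. It compares the
// desired state (from the custom resource) with the observed state (the agent
// Pods in the cluster) and creates or deletes Pods until the two match.
func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the custom resource identified by req.NamespacedName
    // 2. List the agent Pods owned by it
    // 3. Create missing Pods / terminate superfluous, idle ones
    // 4. Return a Result that requeues the request for periodic re-evaluation
    return ctrl.Result{}, nil
}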

Test It Out
  1. Install the CRDs into the cluster:
make install
  2. Run your controller (this will run in the foreground, so switch to a new terminal if you want to leave it running):
make run

NOTE: You can also run this in one step by running: make install run

Modifying the API definitions

If you are editing the API definitions, generate the manifests such as CRs or CRDs using:

make manifests

NOTE: Run make --help for more information on all potential make targets

More information can be found via the Kubebuilder Documentation
