= SQS queue scaler query tool
Copyright 2021 (c) Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.
ifdef::env-github[]
:imagesdir:
https://raw.githubusercontent.com/cognizantcodehub/LEAF-ManyMinima/main/docs/artwork
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

ifndef::env-github[]
:imagesdir: ./
endif::[]

:source-highlighter: pygments
:source-language: go


This tool is motivated by the need to discover outstanding work in SQS queues and to provision, using Kubernetes, the compute capacity that work requires.

This software component is a part of the LEAF MLOps offering.

:toc:

== Usage

This tool can be used to observe and respond to SQS queues in one of two ways.  The first is to create a summary JSON report of the queue state; the second is to generate output suitable for ingestion by the kubectl tool, creating the Kubernetes Job runners needed to address the outstanding work in the queues.

To print a JSON formatted list of queues on AWS, run the command with the --queue-report-only option.  Any AWS credentials defined for your account will by default be used to list the queues being used by StudioML along with their current requirements, if known.

....
queue-scaler
usage:  queue-scaler [arguments]      SQS Queue Scaler tool

Arguments:

  -aws-access-key-id string
        credentials for accessing SQS queues
  -aws-queue string
        A regular expression for selecting the queues to be queried (default "^sqs_.*$")
  -aws-region string
        The region in which this command will query for queues
  -aws-secret-access-key string
        credentials for accessing SQS queues
  -debug
        leave debugging artifacts in place, print internal execution information
  -dry-run
        output the new kubernetes resources on stdout without taking any actions
  -eks-cluster-name string
        cluster name for EKS scaling support, when used the cluster will be scaled out using Jobs
  -in-cluster
        used to indicate if this component is running inside a cluster
  -job-template string
        file containing a Kubernetes Job YAML template sent to the cluster to add runners
  -kubeconfig string
        filepath for the Kubernetes configuration file (default "/home/kmutch/.kube/config")
  -max-cost string
        The maximum permitted cost for any individual requested machine, in USD (default "10.00")
  -namespace string
        the namespace being used by jobs being tracked against queues (default "default")
  -queue-name string
        A regular expression for selecting the queues to be queried (default ".*")
  -queue-report-only
        list queue details only then exit

Environment Variables:

Options can be read from environment variables by changing dashes '-' to underscores
and using upper case letters.

To control log levels the LOGXI env variables can be used, these are documented at https://github.com/mgutz/logxi
All logging output goes to stderr, stdout contains command output only.
....
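
For example, following the convention described above, the --max-cost option can instead be supplied through the environment; the value shown here is illustrative:

....
export MAX_COST="5.00"
queue-scaler --queue-report-only
....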

Example output from the reporting function:

....
$ ./queue-scaler --queue-report-only
{
    "sqs_StudioML_kmutch": {
        "Ready": 1,
        "NotVisible": 0,
        "Running": 0,
        "Resource": {
            "cpus": 4,
            "gpus": 1,
            "hdd": "10gb",
            "ram": "2gb",
            "gpuMem": "8G"
        },
        "AWSInstances": [
            {
                "region": "us-west-2",
                "type": "p2.xlarge",
                "price": 0.9
            },
            {
                "region": "us-west-2",
                "type": "p3.2xlarge",
                "price": 3.06
            },
            {
                "region": "us-west-2",
                "type": "p2.8xlarge",
                "price": 7.2
            }
        ]
    },
    "sqs_asd_zues3": {
        "Ready": 0,
        "NotVisible": 0,
        "Running": 0
    }
}
....

In JSON reporting mode the AWSInstances array lists the AWS EC2 machine instance types that could process the work, sorted in price order.

=== Scaling Clusters using queue-scaler

It is envisioned that, when using this tool to assist with scaling operations in a Kubernetes cluster, there are two administration roles at work.

First is the role of a machine learning (ML) engineer tasked with creating working ML applications.  In general the focus in this role is to dispatch work to a remote system and to handle any results.

The second role is an operational one, Machine Learning engineering and Operations (MLOps).  In this role the concern is addressing queued work requests using a Kubernetes cluster with Machine Learning capabilities.

This tool is designed to assist MLOps in scaling clusters.  Initial cluster creation using EKS would be performed as an MLOps function and would consist of a cluster with a variety of auto-scaling node groups (ASGs), each configured with one or two AWS machine instance types, starting with 0 active nodes, and with maximum node counts set as a safeguard.
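
As a sketch only, such a node group might be described in an eksctl cluster configuration along the following lines; the cluster name, region, instance type and node counts are illustrative and not part of this repository:

....
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-eks
  region: us-west-2
nodeGroups:
  # GPU node group kept at 0 nodes until work arrives, capped as a safeguard
  - name: p2-xlarge-runners
    instanceType: p2.xlarge
    desiredCapacity: 0
    minSize: 0
    maxSize: 4
....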

The role of the queue-scaler tool is to take the idle cluster and generate YAML that will cause the cluster to scale.  In this document we limit the discussion to the tool itself.  The tool can be run within a container inside the cluster, or as a scheduled cron-like process outside of the cluster.  The output of queue-scaler is available as YAML on the standard output, with errors and logging on standard error.  The standard output can be piped directly into the kubectl command.
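
For example, a cron entry on an administration host outside of the cluster might look like the following; the schedule, file locations and cluster name are illustrative only.  The AWS credentials and KUBECONFIG settings described later in this document would also need to be present in the cron environment.

....
# every 5 minutes, render Job resources for outstanding queue work and apply them
*/5 * * * * queue-scaler --eks-cluster-name test-eks --job-template sqs_job.yaml | kubectl apply -f - >>/var/log/queue-scaler.log 2>&1
....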

The queue-scaler tool provides Horizontal Pod Autoscaler (HPA) style behaviour, using the outstanding requests in queues to scale up the number of pods present and working within a cluster.  It operates like an HPA with the exception that the load being experienced by a pod does not drive the decision to scale out.

Adding pods in this way does not, however, increase the number or capacity of the Kubernetes hosts available to service the pods.  To do this the stock Kubernetes Cluster Autoscaler (CA) needs to be added to the cluster.

The AWS EKS installation instructions, https://github.com/leaf-ai/studio-go-runner/blob/main/docs/aws_k8s.md, describe how to deploy a cluster and auto-scaler to meet the CA requirement.  An example YAML file is also provided that you should inspect before applying, https://github.com/leaf-ai/studio-go-runner/blob/main/examples/aws/autoscaler.yaml.
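
For example, once the autoscaler file has been reviewed it can be fetched and applied as follows; the URL is the raw form of the example linked above:

....
curl -fsSL -o autoscaler.yaml https://raw.githubusercontent.com/leaf-ai/studio-go-runner/main/examples/aws/autoscaler.yaml
kubectl apply -f autoscaler.yaml
....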

If you wish to know more about auto-scaling and Kubernetes the following article might be a good place to start, https://learnk8s.io/kubernetes-autoscaling-strategies.

=== Job Templates

The tool generates its output from a template file containing Go template directives, https://pkg.go.dev/text/template?utm_source=godoc.  The templating is extended to support additional functionality using the Masterminds Sprig library, https://pkg.go.dev/github.com/Masterminds/sprig/v3.

When using a template the standard 100+ Sprig functions are available, along with variables derived from the job characteristics obtained from the queue.  Combining these items with the template results in a set of Kubernetes resources customized for the queue.
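
The rendered resources can be previewed on stdout, without anything being submitted to the cluster, by adding the --dry-run option, for example:

....
queue-scaler --eks-cluster-name test-eks --job-template sqs_job.yaml --dry-run
....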

Variables from the queue statistics can be incorporated into the template; for example, the RAM required to run the task in the report above can be substituted as '{{ .Resource.ram }}', as can other items from the report.  To check which variable names are available use the --queue-report-only option.  Other variables that are available include the following, with a short illustrative fragment after the list:

* QueueName The SQS queue name.  Would generally be referenced in the ConfigMap.
* NodeGroup The EKS node group that this work should have affinity to.  Generally referred to within the Job spec.
* Ready The count of StudioML tasks that are waiting on runners.
* NotVisible The count of StudioML tasks that are being worked on by runners.
* Running The number of StudioML Go runners that are actively running.
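
As an illustration only, and not part of the supplied templates, a small fragment that surfaces these values into a ConfigMap might look like the following:

....
apiVersion: v1
kind: ConfigMap
metadata:
 name: queue-stats-example
data:
 QUEUE: "{{ .QueueName }}"
 READY: "{{ .Ready }}"
 NOT_VISIBLE: "{{ .NotVisible }}"
 RUNNING: "{{ .Running }}"
....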

The portion of your Kubernetes configuration which remains static can be placed into a separate file and applied to your cluster.  An example of the static configuration is provided in the sqs_static.yaml file that is located in the same directory as this README.md file.

The job definition that will be pushed to the cluster to add new processing capacity for jobs can be found in the sqs_job.yaml example, again in the current directory.

The following is a walk-through of various template features and how they function when interacting with the cluster.

The file starts with the generation of a UUID v4 ID for our job.  Jobs within Kubernetes are uniquely named; applying a template a second time to a job that has already completed will not cause the job to be restarted, and so a unique name is applied every time.  The Sprig functions include a UUID generator.

....
# Copyright (c) 2021 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 License.

{{ $uuid := uuidv4 }}

....

With the inclusion of the UUID in the configuration map name we can have a configuration map per job.  The .QueueName variable is supplied by the queue-scaler tooling when submitting the job.

The LIMIT_IDLE_DURATION parameter causes processing to exit after the specified period if there is no new work.  Using this parameter we can support scale-down operations.
....
---
apiVersion: v1
kind: ConfigMap
metadata:
 name: studioml-env-{{$uuid}}
data:
 LOGXI_FORMAT: "happy,maxcol=1024"
 LOGXI: "*=DBG"
 QUEUE_MATCH: "^{{.QueueName}}$"
 SQS_CERTS: "./certs/aws-sqs"
 MESSAGE_CRYPT: "./certs/message"
 CACHE_SIZE: "10Gib"
 CACHE_DIR: "/tmp/cache"
 CLEAR_TEXT_MESSAGES: "true"
 LIMIT_IDLE_DURATION: "15m"
....

The main job template uses the UUID to generate unique job names and also incorporates the local environment's AWS variables into the resource.

The {{ .NodeGroup }} variable will be substituted with the node group to which the queue-scaler tool wishes the work to be assigned.

Two parameters from the queue, .Resource.ram and .Resource.cpus, are also substituted into the job so that it is correctly sized within the cluster.

....
---
apiVersion: batch/v1
kind: Job
metadata:
 name: studioml-go-runner-{{$uuid}}
 labels:
     queue: {{.QueueName}}
spec:
 parallelism: 1
 backoffLimit: 2
 template:
   metadata:
     labels:
       queue: {{.QueueName}}
   spec:
      restartPolicy: Never
      serviceAccountName: studioml-account
      automountServiceAccountToken: true
      imagePullSecrets:
        - name: studioml-go-docker-key
      nodeSelector:
        alpha.eksctl.io/nodegroup-name: {{ .NodeGroup }}
        beta.kubernetes.io/os: linux
      containers:
      - name: studioml-go-runner
        envFrom:
        - configMapRef:
            name: studioml-env-{{$uuid}}
        # An immutable SHA256 image digest should be used here to prevent version drift
        image: quay.io/leafai/studio-go-runner:0.14.1-main-aaaagrhimez
        imagePullPolicy: Always
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: {{ .Resource.ram }}
            cpu: {{ .Resource.cpus }}
        volumeMounts:
        - name: aws-sqs
          mountPath: "/runner/certs/aws-sqs/default"
          readOnly: true
        - name: message-encryption
          mountPath: "/runner/certs/message/encryption"
          readOnly: true
        - name: encryption-passphrase
          mountPath: "/runner/certs/message/passphrase"
          readOnly: true
        - name: queue-signing
          mountPath: "/runner/certs/queues/signing"
          readOnly: true
        - name: response-queue-signing
          mountPath: "/runner/certs/queues/response-encrypt"
          readOnly: true
        - name: tmp-volume
          mountPath: /tmp
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
        - name: aws-sqs
          secret:
            optional: true
            secretName: studioml-runner-aws-sqs
            items:
            - key: credentials
              path: credentials
            - key: config
              path: config
        - name: message-encryption
          secret:
            optional: true
            secretName: studioml-runner-key-secret
            items:
            - key: ssh-privatekey
              path: ssh-privatekey
            - key: ssh-publickey
              path: ssh-publickey
        - name: encryption-passphrase
          secret:
            optional: true
            secretName: studioml-runner-passphrase-secret
            items:
            - key: ssh-passphrase
              path: ssh-passphrase
        - name: queue-signing
          secret:
            optional: false
            secretName: studioml-signing
        - name: response-queue-signing
          secret:
            optional: false
            secretName: studioml-report-keys
        - name: tmp-volume
          emptyDir:
            sizeLimit: 200Gi
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia
....

Running the tool and directly applying the results to your cluster can be done as follows:

....
export AWS_PROFILE=...
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=...
export CLUSTER_NAME=...
export KUBECONFIG=...
kubectl apply -f sqs_static.yaml
queue-scaler  --eks-cluster-name test-eks --job-template sqs_job.yaml --debug | kubectl apply -f -
....

You will see the names of the config map and the job shown as output, allowing you to capture logs or examine the status of the running work.
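
For example, assuming the job name echoed back was studioml-go-runner-<uuid>, and using the queue name from the earlier report example, progress and logs can be examined with standard kubectl commands:

....
kubectl get jobs -l queue=sqs_StudioML_kmutch
kubectl describe job studioml-go-runner-<uuid>
kubectl logs job/studioml-go-runner-<uuid>
....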

The environment variables supplied are used for accessing the SQS queues and obtaining information about the number and scale of work waiting in the queues.

In order to perform scaling operations you will need to configure your KUBECONFIG environment variable to point at the appropriate Kubernetes credentials needed to interact with the cluster.
