= SQS queue scaler query tool
Copyright 2021 (c) Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.
ifdef::env-github[]
:imagesdir: https://raw.githubusercontent.com/cognizantcodehub/LEAF-ManyMinima/main/docs/artwork
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
ifndef::env-github[]
:imagesdir: ./
endif::[]
:source-highlighter: pygments
:source-language: go

This tool is motivated by the need to discover outstanding work in SQS queues and to provision, using Kubernetes, the compute resources those queues require. This software component is a part of the LEAF MLOps offering.

:toc:

== Usage

This tool can be used to observe and respond to SQS queues in one of two ways. The first is to create a summary JSON report of the queue state. The second is to generate output suitable for ingestion by the kubectl tool, creating the Kubernetes job runners needed to service outstanding work in the queue.

To print a JSON formatted list of queues on AWS, run the command with the --queue-report-only option. By default, any AWS access defined for your account will be used to print a list of the queues being used by StudioML and their current requirements, if known.

....
queue-scaler
usage:  queue-scaler [arguments]      SQS Queue Scaler tool

Arguments:

  -aws-access-key-id string
        credentials for accessing SQS queues
  -aws-queue string
        A regular expression for selecting the queues to be queries (default "^sqs_.*$")
  -aws-region string
        The region in which this command will query for queues
  -aws-secret-access-key string
        credentials for accessing SQS queues
  -debug
        leave debugging artifacts in place, print internal execution information
  -dry-run
        output the new kubernetes resources on stdout without taking any actions
  -eks-cluster-name string
        cluster name for EKS scaling support, when used the cluster will be scaled out using Jobs
  -in-cluster
        used to indicate if this component is running inside a cluster
  -job-template string
        file containing a Kubernetes Job YAML template sent to the cluster to add runners
  -kubeconfig string
        filepath for the Kubernetes configuration file (default "/home/kmutch/.kube/config")
  -max-cost string
        The maximum permitted cost for any individual requested machine, in USD (default "10.00")
  -namespace string
        the namespace being used by jobs being tracked against queues (default "default")
  -queue-name string
        A regular expression for selecting the queues to be queried (default ".*")
  -queue-report-only
        list queue details only then exit

Environment Variables:

options can be read for environment variables by changing dashes '-' to underscores
and using upper case letters.

To control log levels the LOGXI env variables can be used, these are documented at https://github.com/mgutz/logxi

All logging output goes to stderr, stdout contains command output only.
....

Example output from the reporting function:

....
$ ./queue-scaler --queue-report-only
{
  "sqs_StudioML_kmutch": {
    "Ready": 1,
    "NotVisible": 0,
    "Running": 0,
    "Resource": {
      "cpus": 4,
      "gpus": 1,
      "hdd": "10gb",
      "ram": "2gb",
      "gpuMem": "8G"
    },
    "AWSInstances": [
      {
        "region": "us-west-2",
        "type": "p2.xlarge",
        "price": 0.9
      },
      {
        "region": "us-west-2",
        "type": "p3.2xlarge",
        "price": 3.06
      },
      {
        "region": "us-west-2",
        "type": "p2.8xlarge",
        "price": 7.2
      }
    ]
  },
  "sqs_asd_zues3": {
    "Ready": 0,
    "NotVisible": 0,
    "Running": 0
  }
}
....

In JSON reporting mode the AWSInstances array lists the AWS EC2 machine instance types that could process the work, sorted in price order.

=== Scaling Clusters using queue-scaler

It is envisioned that when this tool is used to assist with scaling operations in a Kubernetes cluster there are two administration roles at work.

The first is the role of a machine learning (ML) engineer tasked with creating working ML applications. In general the focus in this role is to dispatch work to a remote system and to handle any results.

The second role is an operational one, Machine Learning engineering and Operations (MLOps). In this role the concern is addressing queued work requests using a Kubernetes cluster with Machine Learning capabilities. This tool is designed to assist MLOps in scaling clusters.

Initial cluster creation using EKS would be performed as a function of MLOps and would consist of a cluster with a variety of auto scaling node groups (ASG), each with one or two AWS machine instance types configured, starting with 0 active nodes, and with maximum node counts set as a safeguard.

The role of the queue-scaler tool is to take the idle cluster and to generate yaml that will cause the cluster to scale. In this document we limit the discussion to the tool itself. The tool can be run within a container inside the cluster, or as a scheduled cronjob-like process outside of the cluster. The output of queue-scaler is available as yaml on the standard output, with errors and logging on standard error.
The standard output can be piped directly into the kubectl command.

The queue-scaler tool allows for Horizontal Pod Autoscaling (HPA) using the outstanding requests in queues to scale up the number of pods present and working within a cluster. This tool operates like an HPA, with the exception that the load being experienced by a pod does not drive the decision to scale out.

Adding pods using an HPA does not, however, increase the number or capacity of the Kubernetes hosts available to service the pods. To do this the stock Kubernetes Cluster Auto-Scaler (CA) needs to be added to the cluster. The AWS EKS installation instructions, https://github.com/leaf-ai/studio-go-runner/blob/main/docs/aws_k8s.md, describe how to deploy a cluster and auto-scaler to meet the CA requirement. An example yaml file is also provided that you should inspect before applying, https://github.com/leaf-ai/studio-go-runner/blob/main/examples/aws/autoscaler.yaml.

If you wish to know more about auto-scaling and Kubernetes the following article might be a good place to start, https://learnk8s.io/kubernetes-autoscaling-strategies.

=== Job Templates

The tool can generate its output from a template file written using Go templates, https://pkg.go.dev/text/template?utm_source=godoc. The templating is extended with additional functionality from the Masterminds Sprig library, https://pkg.go.dev/github.com/Masterminds/sprig/v3.

When using a template the standard 100+ sprig functions are available, and variables are supplied that are derived from the job characteristics obtained from the queue. Combining these items with the template results in a set of Kubernetes resources customized for the queue.

Variables from the queue statistics can be incorporated into the template. For example, the ram required to run the task in the report example above can be substituted as '{{ .Resource.ram }}', and other items in the report can be substituted in the same way.
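
The substitution described above can be tried in isolation. The following is a minimal sketch rather than the tool's own implementation: it uses only the standard text/template package (no sprig functions are registered), and the `render` helper and the nested map used to carry the variables are assumptions made for illustration. The values are taken from the report example earlier in this document.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render is a hypothetical helper: it parses tmplText as a Go template and
// executes it against vars. A variable missing from vars is reported as an
// error instead of being rendered as "<no value>".
func render(tmplText string, vars map[string]interface{}) (string, error) {
	tmpl, err := template.New("job").Option("missingkey=error").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, vars); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// Values copied from the report example; the map layout is an assumption.
	vars := map[string]interface{}{
		"QueueName": "sqs_StudioML_kmutch",
		"Resource": map[string]interface{}{
			"ram":  "2gb",
			"cpus": 4,
		},
	}
	fragment := "queue: {{ .QueueName }}\nmemory: {{ .Resource.ram }}\ncpu: {{ .Resource.cpus }}\n"

	out, err := render(fragment, vars)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Treating a missing variable as an error is a design choice for this sketch; it surfaces a misspelled variable name before any yaml reaches kubectl.
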
To check which variable names are available, use the --queue-report-only option. Other variables that are available include:

* QueueName: The SQS queue name. Would generally be referenced in the ConfigMap.
* NodeGroup: The EKS node group that this work should have affinity to. Generally referred to within the Job spec.
* Ready: The count of StudioML tasks that are waiting on runners.
* NotVisible: The count of StudioML tasks that are being worked on by runners.
* Running: The number of StudioML go runners that are actively running.

The portion of your Kubernetes configuration which remains static can be placed into a separate file and applied to your cluster. An example of the static configuration is provided in the sqs_static.yaml file, located in the same directory as this README.md file.

The job definition that will be pushed to the cluster to add new processing capacity for jobs can be found in the sqs_job.yaml example, again in the current directory. The following is a walk through explaining various template features and how they function when they interact with the cluster.

The file starts with the generation of a UUID V4 ID for our job. Jobs within Kubernetes are uniquely named. Applying a new template a second time to a job that has already completed will not cause the job to be restarted, so a unique name is applied every time. The sprig functions include a UUID generator.

....
# Copyright (c) 2021 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 License.

{{ $uuid := uuidv4 }}
....

With the inclusion of the UUID in the configuration map name we can have a per job queue. The .QueueName is supplied by the queue-scaler tooling when submitting the job. The LIMIT_IDLE_DURATION parameter causes the runner to exit once no new work has arrived for the time period given as its value. Using this parameter we can support scale down operations.

....
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: studioml-env-{{$uuid}}
data:
  LOGXI_FORMAT: "happy,maxcol=1024"
  LOGXI: "*=DBG"
  QUEUE_MATCH: "^{{.QueueName}}$"
  SQS_CERTS: "./certs/aws-sqs"
  MESSAGE_CRYPT: "./certs/message"
  CACHE_SIZE: "10Gib"
  CACHE_DIR: "/tmp/cache"
  CLEAR_TEXT_MESSAGES: "true"
  LIMIT_IDLE_DURATION: "15m"
....

The main job template uses the uuid to generate unique job names and also incorporates the local environment's AWS variables into the resource. The {{ .NodeGroup }} variable will be substituted with the node group to which the queue tool wishes work to be assigned. Two parameters from the queue, .Resource.ram and .Resource.cpus, are also substituted into the job to allow it to be correctly sized within the cluster.

....
---
apiVersion: batch/v1
kind: Job
metadata:
  name: studioml-go-runner-{{$uuid}}
  labels:
    queue: {{.QueueName}}
spec:
  parallelism: 1
  backoffLimit: 2
  template:
    metadata:
      labels:
        queue: {{.QueueName}}
    spec:
      restartPolicy: Never
      serviceAccountName: studioml-account
      automountServiceAccountToken: true
      imagePullSecrets:
        - name: studioml-go-docker-key
      nodeSelector:
        alpha.eksctl.io/nodegroup-name: {{ .NodeGroup }}
      containers:
      - name: studioml-go-runner
        envFrom:
        - configMapRef:
            name: studioml-env-{{$uuid}}
        # Digest should be used to prevent version drift, prevented using idempotent SHA256 digest
        image: quay.io/leafai/studio-go-runner:0.14.1-main-aaaagrhimez
        imagePullPolicy: Always
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: {{ .Resource.ram }}
            cpu: {{ .Resource.cpus }}
        volumeMounts:
        - name: aws-sqs
          mountPath: "/runner/certs/aws-sqs/default"
          readOnly: true
        - name: message-encryption
          mountPath: "/runner/certs/message/encryption"
          readOnly: true
        - name: encryption-passphrase
          mountPath: "/runner/certs/message/passphrase"
          readOnly: true
        - name: queue-signing
          mountPath: "/runner/certs/queues/signing"
          readOnly: true
        - name: response-queue-signing
          mountPath: "/runner/certs/queues/response-encrypt"
          readOnly: true
        - name: tmp-volume
          mountPath: /tmp
        - name: nvidia
          mountPath: /usr/local/nvidia
      nodeSelector:
        beta.kubernetes.io/os: linux
      volumes:
        - name: aws-sqs
          secret:
            optional: true
            secretName: studioml-runner-aws-sqs
            items:
            - key: credentials
              path: credentials
            - key: config
              path: config
        - name: message-encryption
          secret:
            optional: true
            secretName: studioml-runner-key-secret
            items:
            - key: ssh-privatekey
              path: ssh-privatekey
            - key: ssh-publickey
              path: ssh-publickey
        - name: encryption-passphrase
          secret:
            optional: true
            secretName: studioml-runner-passphrase-secret
            items:
            - key: ssh-passphrase
              path: ssh-passphrase
        - name: queue-signing
          secret:
            optional: false
            secretName: studioml-signing
        - name: response-queue-signing
          secret:
            optional: false
            secretName: studioml-report-keys
        - name: tmp-volume
          emptyDir:
            sizeLimit: 200Gi
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia
....

Running the tool and directly applying the results to your cluster can be done as follows:

....
export AWS_PROFILE=...
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=...
export CLUSTER_NAME=...
export KUBECONFIG=...
kubectl apply -f sqs_static.yaml
queue-scaler --eks-cluster-name test-eks --job-template sqs_job.yaml --debug | kubectl apply -f -
....

You will see the names of the config map and the job shown as output, allowing you to capture logs or examine the status of the running work.

The environment variables supplied are used for accessing the SQS queues and obtaining information about the number and scale of work waiting in the queue. In order to perform scaling operations you will need to set your KUBECONFIG environment variable to point at the Kubernetes credentials needed to interact with the cluster.
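
Before applying jobs it can be useful to check which queues actually have work waiting. The following is a minimal sketch, assuming the JSON layout shown in the --queue-report-only example earlier in this document. The `queueStatus` type and the `queuesWithWork` function are hypothetical names introduced for illustration and are not part of the tool. Only the Ready, NotVisible and Running counters are decoded; other fields in the report are ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// queueStatus is a hypothetical type covering only the counters that appear
// for every queue in the example report.
type queueStatus struct {
	Ready      int `json:"Ready"`
	NotVisible int `json:"NotVisible"`
	Running    int `json:"Running"`
}

// queuesWithWork decodes a report keyed by queue name and returns, in sorted
// order, the names of the queues that have at least one Ready task.
func queuesWithWork(report []byte) ([]string, error) {
	var queues map[string]queueStatus
	if err := json.Unmarshal(report, &queues); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(queues))
	for name, status := range queues {
		if status.Ready > 0 {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	// A trimmed copy of the example report from this document.
	report := []byte(`{
  "sqs_StudioML_kmutch": {"Ready": 1, "NotVisible": 0, "Running": 0},
  "sqs_asd_zues3": {"Ready": 0, "NotVisible": 0, "Running": 0}
}`)

	names, err := queuesWithWork(report)
	if err != nil {
		panic(err)
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```

Because the report is written to standard output and logging goes to standard error, a program like this could read the report from a pipe without filtering out log lines.
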