= SQS queue scaler query tool
Copyright 2021 (c) Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.
ifdef::env-github[]
:imagesdir:
https://raw.githubusercontent.com/cognizantcodehub/LEAF-ManyMinima/main/docs/artwork
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

ifndef::env-github[]
:imagesdir: ./
endif::[]

:source-highlighter: pygments
:source-language: go


This tool is motivated by the need to discover outstanding work in SQS queues and to provision, using Kubernetes, the compute capacity that work requires.

This software component is a part of the LEAF MLOps offering.

:toc:

== Usage

This tool can be used to observe and respond to SQS queues in one of two ways.  The first is to create a summary JSON report of the queue state; the second is to generate output suitable for ingestion by the kubectl tool, creating the Kubernetes Job runners needed to address the outstanding work in the queues.

To print a JSON formatted list of queues on AWS, run the command with the --queue-report-only option.  Any AWS credentials defined for your account will by default be used to list the queues being used by StudioML along with their current requirements, if known.

....
queue-scaler
usage:  queue-scaler [arguments]      SQS Queue Scaler tool

Arguments:

  -aws-access-key-id string
        credentials for accessing SQS queues
  -aws-queue string
        A regular expression for selecting the queues to be queried (default "^sqs_.*$")
  -aws-region string
        The region in which this command will query for queues
  -aws-secret-access-key string
        credentials for accessing SQS queues
  -debug
        leave debugging artifacts in place, print internal execution information
  -dry-run
        output the new kubernetes resources on stdout without taking any actions
  -eks-cluster-name string
        cluster name for EKS scaling support, when used the cluster will be scaled out using Jobs
  -in-cluster
        used to indicate if this component is running inside a cluster
  -job-template string
        file containing a Kubernetes Job YAML template sent to the cluster to add runners
  -kubeconfig string
        filepath for the Kubernetes configuration file (default "/home/kmutch/.kube/config")
  -max-cost string
        The maximum permitted cost for any individual requested machine, in USD (default "10.00")
  -namespace string
        the namespace being used by jobs being tracked against queues (default "default")
  -queue-name string
        A regular expression for selecting the queues to be queried (default ".*")
  -queue-report-only
        list queue details only then exit

Environment Variables:

Options can be read from environment variables by changing dashes '-' to underscores
and using upper case letters.

To control log levels the LOGXI env variables can be used, these are documented at https://github.com/mgutz/logxi
All logging output goes to stderr, stdout contains command output only.
....
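
For example, following the convention described above, the --max-cost option can instead be supplied through the environment; the value shown here is illustrative:

....
export MAX_COST="5.00"
queue-scaler --queue-report-only
....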

Example output from the reporting function:

....
$ ./queue-scaler --queue-report-only
{
    "sqs_StudioML_kmutch": {
        "Ready": 1,
        "NotVisible": 0,
        "Running": 0,
        "Resource": {
            "cpus": 4,
            "gpus": 1,
            "hdd": "10gb",
            "ram": "2gb",
            "gpuMem": "8G"
        },
        "AWSInstances": [
            {
                "region": "us-west-2",
                "type": "p2.xlarge",
                "price": 0.9
            },
            {
                "region": "us-west-2",
                "type": "p3.2xlarge",
                "price": 3.06
            },
            {
                "region": "us-west-2",
                "type": "p2.8xlarge",
                "price": 7.2
            }
        ]
    },
    "sqs_asd_zues3": {
        "Ready": 0,
        "NotVisible": 0,
        "Running": 0
    }
}
....

In JSON reporting mode the AWSInstances array lists the AWS EC2 machine instance types that could process the work, sorted in price order.

=== Scaling Clusters using queue-scaler

It is envisioned that, when using this tool to assist with scaling operations in a Kubernetes cluster, there are two administration roles at work.

First is the role of a machine learning (ML) engineer tasked with creating working ML applications.  In general the focus in this role is to dispatch work to a remote system and to handle any results.

The second role is an operational one, Machine Learning engineering and Operations (MLOps).  In this role the concern is addressing queued work requests using a Kubernetes cluster with Machine Learning capabilities.

This tool is designed to assist MLOps in scaling clusters.  Initial cluster creation using EKS would be performed as an MLOps function and would consist of a cluster with a variety of auto-scaling node groups (ASGs), each configured with one or two AWS machine instance types, starting with 0 active nodes, and with maximum node counts set as a safeguard.
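
As a sketch only, such a node group might be described in an eksctl cluster configuration along the following lines; the cluster name, region, instance type and node counts are illustrative and not part of this repository:

....
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-eks
  region: us-west-2
nodeGroups:
  # GPU node group kept at 0 nodes until work arrives, capped as a safeguard
  - name: p2-xlarge-runners
    instanceType: p2.xlarge
    desiredCapacity: 0
    minSize: 0
    maxSize: 4
....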

The role of the queue-scaler tool is to take the idle cluster and generate YAML that will cause the cluster to scale.  In this document we limit the discussion to the tool itself.  The tool can be run within a container inside the cluster, or as a scheduled cron-like process outside of the cluster.  The output of queue-scaler is available as YAML on the standard output, with errors and logging on standard error.  The standard output can be piped directly into the kubectl command.
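
For example, a cron entry on an administration host outside of the cluster might look like the following; the schedule, file locations and cluster name are illustrative only.  The AWS credentials and KUBECONFIG settings described later in this document would also need to be present in the cron environment.

....
# every 5 minutes, render Job resources for outstanding queue work and apply them
*/5 * * * * queue-scaler --eks-cluster-name test-eks --job-template sqs_job.yaml | kubectl apply -f - >>/var/log/queue-scaler.log 2>&1
....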

The queue-scaler tool provides Horizontal Pod Autoscaler (HPA) style behaviour, using the outstanding requests in queues to scale up the number of pods present and working within a cluster.  It operates like an HPA with the exception that the load being experienced by a pod does not drive the decision to scale out.

Adding pods in this way does not, however, increase the number or capacity of the Kubernetes hosts available to service the pods.  To do this the stock Kubernetes Cluster Autoscaler (CA) needs to be added to the cluster.

The AWS EKS installation instructions, https://github.com/leaf-ai/studio-go-runner/blob/main/docs/aws_k8s.md, describe how to deploy a cluster and auto-scaler to meet the CA requirement.  An example YAML file is also provided that you should inspect before applying, https://github.com/leaf-ai/studio-go-runner/blob/main/examples/aws/autoscaler.yaml.
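
For example, once the autoscaler file has been reviewed it can be fetched and applied as follows; the URL is the raw form of the example linked above:

....
curl -fsSL -o autoscaler.yaml https://raw.githubusercontent.com/leaf-ai/studio-go-runner/main/examples/aws/autoscaler.yaml
kubectl apply -f autoscaler.yaml
....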

If you wish to know more about auto-scaling and Kubernetes the following article might be a good place to start, https://learnk8s.io/kubernetes-autoscaling-strategies.

=== Job Templates

The tool generates its output from a template file containing Go template directives, https://pkg.go.dev/text/template?utm_source=godoc.  The templating is extended to support additional functionality using the Masterminds Sprig library, https://pkg.go.dev/github.com/Masterminds/sprig/v3.

When using a template the standard 100+ Sprig functions are available, along with variables derived from the job characteristics obtained from the queue.  Combining these items with the template results in a set of Kubernetes resources customized for the queue.
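
The rendered resources can be previewed on stdout, without anything being submitted to the cluster, by adding the --dry-run option, for example:

....
queue-scaler --eks-cluster-name test-eks --job-template sqs_job.yaml --dry-run
....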

Variables from the queue statistics can be incorporated into the template; for example, the RAM required to run the task in the report above can be substituted as '{{ .Resource.ram }}', as can other items from the report.  To check which variable names are available use the --queue-report-only option.  Other variables that are available include the following, with a short illustrative fragment after the list:

* QueueName The SQS queue name.  Would generally be referenced in the ConfigMap.
* NodeGroup The EKS node group that this work should have affinity to.  Generally referred to within the Job spec.
* Ready The count of StudioML tasks that are waiting on runners.
* NotVisible The count of StudioML tasks that are being worked on by runners.
* Running The number of StudioML Go runners that are actively running.
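
As an illustration only, and not part of the supplied templates, a small fragment that surfaces these values into a ConfigMap might look like the following:

....
apiVersion: v1
kind: ConfigMap
metadata:
 name: queue-stats-example
data:
 QUEUE: "{{ .QueueName }}"
 READY: "{{ .Ready }}"
 NOT_VISIBLE: "{{ .NotVisible }}"
 RUNNING: "{{ .Running }}"
....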

The portion of your Kubernetes configuration which remains static can be placed into a separate file and applied to your cluster.  An example of the static configuration is provided in the sqs_static.yaml file that is located in the same directory as this README.md file.

The job definition that will be pushed to the cluster to add new processing capacity for jobs can be found in the sqs_job.yaml example, again in the current directory.

The following is a walk-through of various template features and how they function when interacting with the cluster.

The file starts with the generation of a UUID v4 ID for our job.  Jobs within Kubernetes are uniquely named; applying a template a second time to a job that has already completed will not cause the job to be restarted, and so a unique name is applied every time.  The Sprig functions include a UUID generator.

....
# Copyright (c) 2021 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 License.

{{ $uuid := uuidv4 }}

....

With the inclusion of the UUID in the configuration map name we can have a configuration map per job.  The .QueueName variable is supplied by the queue-scaler tooling when submitting the job.

The LIMIT_IDLE_DURATION parameter causes processing to exit after the specified period if there is no new work.  Using this parameter we can support scale-down operations.
....
---
apiVersion: v1
kind: ConfigMap
metadata:
 name: studioml-env-{{$uuid}}
data:
 LOGXI_FORMAT: "happy,maxcol=1024"
 LOGXI: "*=DBG"
 QUEUE_MATCH: "^{{.QueueName}}$"
 SQS_CERTS: "./certs/aws-sqs"
 MESSAGE_CRYPT: "./certs/message"
 CACHE_SIZE: "10Gib"
 CACHE_DIR: "/tmp/cache"
 CLEAR_TEXT_MESSAGES: "true"
 LIMIT_IDLE_DURATION: "15m"
....

The main job template uses the UUID to generate unique job names and also incorporates the local environment's AWS variables into the resource.

The {{ .NodeGroup }} variable will be substituted with the node group to which the queue-scaler tool wishes the work to be assigned.

Two parameters from the queue, .Resource.ram and .Resource.cpus, are also substituted into the job so that it is correctly sized within the cluster.

....
---
apiVersion: batch/v1
kind: Job
metadata:
 name: studioml-go-runner-{{$uuid}}
 labels:
     queue: {{.QueueName}}
spec:
 parallelism: 1
 backoffLimit: 2
 template:
   metadata:
     labels:
       queue: {{.QueueName}}
   spec:
      restartPolicy: Never
      serviceAccountName: studioml-account
      automountServiceAccountToken: true
      imagePullSecrets:
        - name: studioml-go-docker-key
      nodeSelector:
        alpha.eksctl.io/nodegroup-name: {{ .NodeGroup }}
        beta.kubernetes.io/os: linux
      containers:
      - name: studioml-go-runner
        envFrom:
        - configMapRef:
            name: studioml-env-{{$uuid}}
        # An immutable SHA256 image digest should be used here to prevent version drift
        image: quay.io/leafai/studio-go-runner:0.14.1-main-aaaagrhimez
        imagePullPolicy: Always
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: {{ .Resource.ram }}
            cpu: {{ .Resource.cpus }}
        volumeMounts:
        - name: aws-sqs
          mountPath: "/runner/certs/aws-sqs/default"
          readOnly: true
        - name: message-encryption
          mountPath: "/runner/certs/message/encryption"
          readOnly: true
        - name: encryption-passphrase
          mountPath: "/runner/certs/message/passphrase"
          readOnly: true
        - name: queue-signing
          mountPath: "/runner/certs/queues/signing"
          readOnly: true
        - name: response-queue-signing
          mountPath: "/runner/certs/queues/response-encrypt"
          readOnly: true
        - name: tmp-volume
          mountPath: /tmp
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
        - name: aws-sqs
          secret:
            optional: true
            secretName: studioml-runner-aws-sqs
            items:
            - key: credentials
              path: credentials
            - key: config
              path: config
        - name: message-encryption
          secret:
            optional: true
            secretName: studioml-runner-key-secret
            items:
            - key: ssh-privatekey
              path: ssh-privatekey
            - key: ssh-publickey
              path: ssh-publickey
        - name: encryption-passphrase
          secret:
            optional: true
            secretName: studioml-runner-passphrase-secret
            items:
            - key: ssh-passphrase
              path: ssh-passphrase
        - name: queue-signing
          secret:
            optional: false
            secretName: studioml-signing
        - name: response-queue-signing
          secret:
            optional: false
            secretName: studioml-report-keys
        - name: tmp-volume
          emptyDir:
            sizeLimit: 200Gi
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia
....

Running the tool and directly applying the results to your cluster can be done as follows:

....
export AWS_PROFILE=...
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=...
export CLUSTER_NAME=...
export KUBECONFIG=...
kubectl apply -f sqs_static.yaml
queue-scaler  --eks-cluster-name test-eks --job-template sqs_job.yaml --debug | kubectl apply -f -
....

You will see the names of the config map and the job shown as output, allowing you to capture logs or examine the status of the running work.
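
For example, assuming the job name echoed back was studioml-go-runner-<uuid>, and using the queue name from the earlier report example, progress and logs can be examined with standard kubectl commands:

....
kubectl get jobs -l queue=sqs_StudioML_kmutch
kubectl describe job studioml-go-runner-<uuid>
kubectl logs job/studioml-go-runner-<uuid>
....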

The environment variables supplied are used for accessing the SQS queues and obtaining information about the number and scale of work waiting in the queues.

In order to perform scaling operations you will need to configure your KUBECONFIG environment variable to point at the appropriate Kubernetes credentials needed to interact with the cluster.
