templates

package
v0.0.0-...-de45acf
Published: Jul 13, 2023 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

This package contains the implementation of various SLO templates.

When creating SLOs for services, once the key SLIs have been defined, it's necessary to formulate an expression that represents how well the service is performing in accordance with its objectives.

Finding an expression that accounts for many key performance objectives of the system while remaining understandable can be difficult. This package provides pre-configured templates that map to different categories of system and cover the properties people commonly care about, so developers can apply sensible SLOs without diving deep into SLO theory and Prometheus details.

Each template registers itself with a global registry, at which point it's possible to use the template in a definition file provided to the build command. Pipelines then construct a rule group in the order required to power each different template, while feeding into a common set of alerting windows that apply to all SLOs.

Constants

This section is empty.

Variables

var (
	// Templates stores a mapping of template name to registered template. This is used to
	// unmarshal template definitions from their yaml source and to provide users with
	// feedback about what templates this tool supports.
	Templates = map[string]SLO{}

	// TemplateRules implement the translation from the rules produced by each instance of
	// SLO templates into the generic SLO error:ratio<I> format, which then power alerts.
	TemplateRules = []rulefmt.Rule{}

	// AlertWindows are common interval windows we want to precompute
	AlertWindows = []string{"1m", "5m", "30m", "1h", "2h", "6h", "1d", "3d", "7d", "28d"}

	// AlertRules are shared by all templates. Every SLO type produces rules that
	// terminate in the job:slo_error:ratio<I> and job:slo_error_budget:ratio series.
	// Together, these series power generic multi-window SLO error budget burn alerts.
	// The alert rules run as the final part of the Pipeline-generated RuleGroup.
	AlertRules = []rulefmt.Rule{
		rulefmt.Rule{
			Alert: "SLOErrorBudgetFastBurn",
			For:   model.Duration(time.Minute),
			Labels: map[string]string{
				"severity": "ticket",
			},
			Expr: `
((
  job:slo_error:ratio1h > on(name) group_left() (14.4 * job:slo_error_budget:ratio)
and
  job:slo_error:ratio5m > on(name) group_left() (14.4 * job:slo_error_budget:ratio)
)
or
(
  job:slo_error:ratio6h > on(name) group_left() (6.0 * job:slo_error_budget:ratio)
and
  job:slo_error:ratio30m > on(name) group_left() (6.0 * job:slo_error_budget:ratio)
)) * on(name) group_left(channel) job:slo_labels_info
			`,
		},
		rulefmt.Rule{
			Alert: "SLOErrorBudgetSlowBurn",
			For:   model.Duration(time.Hour),
			Labels: map[string]string{
				"severity": "ticket",
			},
			Expr: `
((
  job:slo_error:ratio1d > on(name) group_left() (3.0 * job:slo_error_budget:ratio)
and
  job:slo_error:ratio2h > on(name) group_left() (3.0 * job:slo_error_budget:ratio)
)
or
(
  job:slo_error:ratio3d > on(name) group_left() (1.0 * job:slo_error_budget:ratio)
and
  job:slo_error:ratio6h > on(name) group_left() (1.0 * job:slo_error_budget:ratio)
)) * on(name) group_left(channel) job:slo_labels_info
			`,
		},
	}
)
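The alert expressions above pair a long and a short window and fire only when both exceed a multiple of the error budget. The following is a minimal sketch of that arithmetic in plain Go, not the package's implementation. The `burnWindow` type, the `firing` helper, the sample error ratios, and the assumption that `job:slo_error_budget:ratio` equals one minus the SLO target are all illustrative.

```go
package main

import "fmt"

// burnWindow pairs a long and a short lookback window with a burn-rate
// factor, mirroring one parenthesised clause of the alert expressions.
// This type is hypothetical and does not exist in the package.
type burnWindow struct {
	Long, Short string
	Factor      float64
}

// firing reports whether both the long- and short-window error ratios
// exceed factor * budget, which is the condition each clause encodes.
func firing(w burnWindow, ratios map[string]float64, budget float64) bool {
	threshold := w.Factor * budget
	return ratios[w.Long] > threshold && ratios[w.Short] > threshold
}

func main() {
	// Assumed: a 99.9% target gives an error budget ratio of 0.001.
	budget := 0.001

	// The two clauses of SLOErrorBudgetFastBurn.
	fast := []burnWindow{{"1h", "5m", 14.4}, {"6h", "30m", 6.0}}

	// Hypothetical observed error ratios per window.
	ratios := map[string]float64{"5m": 0.02, "30m": 0.004, "1h": 0.015, "6h": 0.003}

	for _, w := range fast {
		fmt.Printf("%s/%s factor %.1f firing=%v\n",
			w.Long, w.Short, w.Factor, firing(w, ratios, budget))
	}
}
```

Requiring the short window to agree with the long one means an alert stops firing soon after the error rate recovers, rather than lingering until the long window drains.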
var (
	// BatchProcessingTemplateRules map from the job:slo_batch_* time series to
	// the SLO-compliant job:slo_error:ratio<I> series that are used to power
	// alerts.
	BatchProcessingTemplateRules = flattenRules(

		rulefmt.Rule{
			Record: "job:slo_batch_error:interval",
			Expr: `
1.0 - clamp_max(
  job:slo_batch_throughput:interval / job:slo_batch_throughput_target:max,
  1.0
)
			`,
		},

		forIntervals(AlertWindows,
			rulefmt.Rule{
				Record: "job:slo_error:ratio%s",
				Expr:   `avg_over_time(job:slo_batch_error:interval[%s])`,
			},
		),
	)
)
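The `job:slo_batch_error:interval` expression above can be read as a plain function of throughput and target. This is a small sketch of the same arithmetic in Go. The `batchError` name and the sample numbers are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// batchError mirrors 1.0 - clamp_max(throughput / target, 1.0):
// the fraction of the target throughput that was not met, never negative.
func batchError(throughput, target float64) float64 {
	return 1.0 - math.Min(throughput/target, 1.0)
}

func main() {
	fmt.Println(batchError(50, 100))  // running at half the target
	fmt.Println(batchError(150, 100)) // exceeding the target is clamped
}
```

The clamp is what keeps throughput above the target from producing a negative error value.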
var (
	// ErrorRateTemplateRules map from the job:slo_error_rate_total and
	// job:slo_error_rate_errors time series to the SLO-compliant
	// job:slo_error:ratio<I> series that are used to power alerts.
	ErrorRateTemplateRules = flattenRules(

		forIntervals(AlertWindows, rulefmt.Rule{
			Record: "job:slo_error:ratio%s",
			Expr:   `((job:slo_error_rate_errors:rate%[1]s) or (0 * job:slo_error_rate_total:rate%[1]s)) / job:slo_error_rate_total:rate%[1]s`,
		}),
	)
)
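The `%s` and `%[1]s` verbs in the rule above are filled in once per alert window. The helper `forIntervals` is unexported and its body is not shown here, so the sketch below uses a hypothetical `expand` function and a local `rule` type standing in for `rulefmt.Rule`, assuming the substitution is done with `fmt.Sprintf`.

```go
package main

import "fmt"

// rule is a stand-in for rulefmt.Rule with only the fields used here.
type rule struct {
	Record string
	Expr   string
}

// expand is a hypothetical stand-in for the unexported forIntervals helper.
// It substitutes each window into the Record and Expr format strings.
// The explicit index %[1]s lets one argument be reused several times.
func expand(windows []string, tmpl rule) []rule {
	out := make([]rule, 0, len(windows))
	for _, w := range windows {
		out = append(out, rule{
			Record: fmt.Sprintf(tmpl.Record, w),
			Expr:   fmt.Sprintf(tmpl.Expr, w),
		})
	}
	return out
}

func main() {
	tmpl := rule{
		Record: "job:slo_error:ratio%s",
		Expr:   `((job:slo_error_rate_errors:rate%[1]s) or (0 * job:slo_error_rate_total:rate%[1]s)) / job:slo_error_rate_total:rate%[1]s`,
	}
	for _, r := range expand([]string{"5m", "1h"}, tmpl) {
		fmt.Printf("%s = %s\n", r.Record, r.Expr)
	}
}
```

In the expression itself, the `or (0 * total)` branch supplies a zero numerator for any series that has a total but no matching errors series, so the ratio evaluates to 0 instead of being absent.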
var (
	// LatencyTemplateRules map from the job:slo_latency_* time series to the
	// SLO-compliant job:slo_error:ratio<I> series that are used to power
	// alerts.
	LatencyTemplateRules = flattenRules(

		forIntervals(AlertWindows, rulefmt.Rule{
			Record: "job:slo_error:ratio%s",
			Expr:   `(job:slo_latency_total:rate%[1]s - job:slo_latency_observation:rate%[1]s) / job:slo_latency_total:rate%[1]s`,
		}),
	)
)
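The latency rule above treats every request that falls outside the target histogram bucket as an error. A minimal sketch of that ratio in Go follows. The `latencyError` name, the sample rates, and the 90% objective are illustrative assumptions.

```go
package main

import "fmt"

// latencyError mirrors (total - observation) / total: the fraction of
// requests that did not land in the target histogram bucket.
func latencyError(totalRate, observationRate float64) float64 {
	return (totalRate - observationRate) / totalRate
}

func main() {
	// Hypothetical rates: 200 req/s overall, 190 req/s within the latency target.
	errRatio := latencyError(200, 190)

	// Assumed error budget for an objective of "90% of requests under the target".
	budget := 1 - 0.90

	fmt.Printf("error ratio %.3f, within budget: %v\n", errRatio, errRatio <= budget)
}
```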

Functions

func MustRegisterTemplate

func MustRegisterTemplate(slo SLO, rules ...rulefmt.Rule)

MustRegisterTemplate installs the rules that map template-specific SLO intermediate calculations to the job:slo_error:ratio<I> series that power alerts. Each template calls this from the file in which it is implemented.
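To illustrate the registration pattern, here is a rough sketch using local stand-in types. It is not the package's code: the explicit `name` parameter, the lower-case identifiers, and the panic on duplicate names are assumptions chosen to match the usual Go convention for `Must` functions.

```go
package main

import "fmt"

// rule is a stand-in for rulefmt.Rule.
type rule struct{ Record, Expr string }

// slo is a stand-in for the package's SLO interface.
type slo interface {
	GetName() string
	Rules() []rule
}

// exampleSLO is a hypothetical template used only for this sketch.
type exampleSLO struct{ name string }

func (e exampleSLO) GetName() string { return e.name }
func (e exampleSLO) Rules() []rule   { return nil }

var (
	templates     = map[string]slo{} // template name -> registered template
	templateRules = []rule{}         // rules translating into job:slo_error:ratio<I>
)

// mustRegisterTemplate records the template and appends its translation
// rules. Panicking on a duplicate name is an assumption of this sketch.
func mustRegisterTemplate(name string, s slo, rules ...rule) {
	if _, dup := templates[name]; dup {
		panic(fmt.Sprintf("template %q already registered", name))
	}
	templates[name] = s
	templateRules = append(templateRules, rules...)
}

func main() {
	mustRegisterTemplate("ExampleSLO", exampleSLO{},
		rule{Record: "job:slo_error:ratio5m", Expr: "job:slo_example_error:ratio5m"})
	fmt.Println(len(templates), len(templateRules))
}
```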

Types

type BatchProcessingSLO

type BatchProcessingSLO struct {
	Deadline   serializeableDuration // time after starting the batch that it must finish
	Volume     string                // expected maximum volume to be processed by a single batch run
	Throughput string                // measure of batch throughput
	// contains filtered or unexported fields
}

BatchProcessingSLO is used to construct SLOs around large batch processes that the business requires to finish within a given deadline.

To use this template, you provide a measure of throughput for the batch process which is only present when the job is underway. The SLO then uses an estimated measure of maximum expected volume and the business deadline to compute a target throughput, then measures SLO compliance against how well the batch process meets the target.

It can be a good idea to compute the volume measurement by taking a record of previous historic maximums and applying a growth multiplier that is appropriate for the business context. If you're processing a number of payments, and your peak volume comes once a month, expecting 1.5x the maximum volume processed by the batch job in the last 60 days might be a good starting point.

The important characteristics of this SLO are:

- Error budget is consumed at a rate proportional to unmet target performance
- Error budget is consumed even by batches that process less-than-maximum volume

One thing to note is that throughput exceeding the target threshold is considered 0% error, rather than some negative error value. This is a deliberate choice to avoid encouraging spiky throughput values, but may be toggled in future.

func (BatchProcessingSLO) GetName

func (b BatchProcessingSLO) GetName() string

func (BatchProcessingSLO) Rules

func (b BatchProcessingSLO) Rules() []rulefmt.Rule

type ErrorRateSLO

type ErrorRateSLO struct {
	Errors string
	Total  string
	// contains filtered or unexported fields
}

To use this template, you provide a parameterised rate of requests and errors that are sliced across multiple time windows.

func (ErrorRateSLO) GetName

func (b ErrorRateSLO) GetName() string

func (ErrorRateSLO) Rules

func (e ErrorRateSLO) Rules() []rulefmt.Rule

type LatencySLO

type LatencySLO struct {
	RequestClass string // request class references a latency target
	Total        string // parameterized rate of total requests
	Observation  string // parameterized rate of histogram bucket
	// contains filtered or unexported fields
}

LatencySLO is used to construct SLOs based on latency.

To use this template, you provide a parameterized rate of total requests, a parameterized counter that tracks the number of observations (histogram bucket), and a request class that references a latency target.

This template allows defining SLOs as follows:

90% requests < 300ms
99% requests < 1000ms

func (LatencySLO) GetName

func (b LatencySLO) GetName() string

func (LatencySLO) Rules

func (l LatencySLO) Rules() []rulefmt.Rule

type Pipeline

type Pipeline struct {
	// Name defines the RuleGroup name in Prometheus
	Name string

	// SLORules is where each SLO should place the appropriate rules that power the
	// post-processing and alert trailers.
	SLORules []rulefmt.Rule
}

Pipeline can build a RuleGroup that powers the generation of SLO time series. The RuleGroup generated by the Pipeline includes the rules installed by templates and the global alerting windows, with each SLO registered on a Pipeline instance via the MustRegister() method.

func NewPipeline

func NewPipeline(name string) *Pipeline

func (*Pipeline) Build

func (p *Pipeline) Build() rulefmt.RuleGroups

func (*Pipeline) MustRegister

func (p *Pipeline) MustRegister(slos ...SLO)
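The ordering that a Pipeline needs can be sketched as a simple concatenation: rules from each registered SLO first, then the template translation rules, then the alert rules as the final part. The code below uses local stand-in types and a hypothetical `build` method. It illustrates the ordering only and is not the body of `Build`.

```go
package main

import "fmt"

// rule is a stand-in for rulefmt.Rule, reduced to a name for illustration.
type rule struct{ Name string }

// pipeline is a stand-in for the package's Pipeline type.
type pipeline struct {
	name     string
	sloRules []rule
}

// build concatenates rules in evaluation order: per-SLO rules, then
// template translation rules, then alert rules last.
func (p *pipeline) build(templateRules, alertRules []rule) []rule {
	out := make([]rule, 0, len(p.sloRules)+len(templateRules)+len(alertRules))
	out = append(out, p.sloRules...)
	out = append(out, templateRules...)
	out = append(out, alertRules...)
	return out
}

func main() {
	p := &pipeline{
		name:     "slo",
		sloRules: []rule{{"job:slo_batch_throughput:interval"}},
	}
	group := p.build(
		[]rule{{"job:slo_error:ratio5m"}},
		[]rule{{"SLOErrorBudgetFastBurn"}},
	)
	for i, r := range group {
		fmt.Println(i, r.Name)
	}
}
```

Order matters because Prometheus evaluates the rules in a group sequentially, so each stage can read the series recorded by the stage before it.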

type SLO

type SLO interface {
	// GetName returns a globally unique name for the SLO
	GetName() string
	// Rules generates Prometheus recording rules that implement the SLO definition
	Rules() []rulefmt.Rule
}

SLO is the base interface type for all SLOs.

func ParseDefinitions

func ParseDefinitions(payload []byte) ([]SLO, error)

ParseDefinitions loads a YAML file of configured templates that looks like this:

---
definitions:
  - template: BatchProcessingSLO
    definition:
      name: MarkPaymentsAsPaidMeetsDeadline
      ...

and produces a list of SLOs. This is the file format users are expected to provide to the slo-builder.
