Documentation ¶
Overview ¶
This package contains the implementation of various SLO templates.
When creating SLOs for services, once the key SLIs have been defined it's necessary to formulate an expression that represents how well the service is performing in accordance with its objectives.
Finding an expression that accounts for the key performance objectives of the system while remaining understandable can be difficult. The value of this package is in providing pre-configured templates, mapped to different categories of system, that capture the common properties people care about, enabling developers to apply sensible SLOs without having to dive deep into SLO theory and Prometheus details.
Each template registers itself with a global registry, at which point it's possible to use the template in a definition file provided to the build command. Pipelines then construct a rule group in the order required to power each different template, while feeding into a common set of alerting windows that apply to all SLOs.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	// Templates stores a mapping of template name to registered template. This is
	// used to unmarshal template definitions from their yaml source and to provide
	// users with feedback about what templates this tool supports.
	Templates = map[string]SLO{}

	// TemplateRules implement the translation from the rules produced by each
	// instance of SLO templates into the generic job:slo_error:ratio<I> format,
	// which then powers alerts.
	TemplateRules = []rulefmt.Rule{}

	// AlertWindows are the common interval windows we want to precompute.
	AlertWindows = []string{"1m", "5m", "30m", "1h", "2h", "6h", "1d", "3d", "7d", "28d"}

	// AlertRules: every SLO type produces rules that terminate in
	// job:slo_error:ratio<I> and job:slo_error_budget:ratio. Together, these rules
	// power generic multi-window SLO error budget burn alerts, which run as the
	// final part of the Pipeline-generated RuleGroup.
	AlertRules = []rulefmt.Rule{
		rulefmt.Rule{
			Alert: "SLOErrorBudgetFastBurn",
			For:   model.Duration(time.Minute),
			Labels: map[string]string{
				"severity": "ticket",
			},
			Expr: `
(
  (
    job:slo_error:ratio1h > on(name) group_left() (14.4 * job:slo_error_budget:ratio)
    and
    job:slo_error:ratio5m > on(name) group_left() (14.4 * job:slo_error_budget:ratio)
  )
  or
  (
    job:slo_error:ratio6h > on(name) group_left() (6.0 * job:slo_error_budget:ratio)
    and
    job:slo_error:ratio30m > on(name) group_left() (6.0 * job:slo_error_budget:ratio)
  )
)
* on(name) group_left(channel) job:slo_labels_info
`,
		},
		rulefmt.Rule{
			Alert: "SLOErrorBudgetSlowBurn",
			For:   model.Duration(time.Hour),
			Labels: map[string]string{
				"severity": "ticket",
			},
			Expr: `
(
  (
    job:slo_error:ratio1d > on(name) group_left() (3.0 * job:slo_error_budget:ratio)
    and
    job:slo_error:ratio2h > on(name) group_left() (3.0 * job:slo_error_budget:ratio)
  )
  or
  (
    job:slo_error:ratio3d > on(name) group_left() (1.0 * job:slo_error_budget:ratio)
    and
    job:slo_error:ratio6h > on(name) group_left() (1.0 * job:slo_error_budget:ratio)
  )
)
* on(name) group_left(channel) job:slo_labels_info
`,
		},
	}
)
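The multipliers in these alert expressions follow the multi-window burn-rate pattern: each condition pairs a long window with a short confirmation window, and the multiplier controls how much error budget a sustained burn at that rate consumes before firing. A quick check of the arithmetic, assuming the conventional 30-day budget period these standard multipliers are usually derived from (the package itself precomputes windows up to 28d, so the exact fractions there differ slightly):

```go
package main

import "fmt"

// budgetConsumed returns the fraction of an error budget spent when errors
// burn at `multiplier` times the sustainable rate for `hours` hours, assuming
// a 30-day budget period.
func budgetConsumed(multiplier, hours float64) float64 {
	return multiplier * hours / (30 * 24)
}

func main() {
	fmt.Printf("14.4x for 1h: %.0f%% of budget\n", 100*budgetConsumed(14.4, 1)) // fast burn: 2%
	fmt.Printf("6x for 6h:    %.0f%% of budget\n", 100*budgetConsumed(6, 6))    // fast burn: 5%
	fmt.Printf("3x for 1d:    %.0f%% of budget\n", 100*budgetConsumed(3, 24))   // slow burn: 10%
	fmt.Printf("1x for 3d:    %.0f%% of budget\n", 100*budgetConsumed(1, 72))   // slow burn: 10%
}
```

The short window in each pair stops the alert from staying red long after the burn has ended: both windows must exceed the threshold simultaneously for the alert to fire.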
var (
	// BatchProcessingTemplateRules map from the job:slo_batch_* time series to
	// the SLO-compliant job:slo_error:ratio<I> series that are used to power
	// alerts.
	BatchProcessingTemplateRules = flattenRules(
		rulefmt.Rule{
			Record: "job:slo_batch_error:interval",
			Expr: `
1.0 - clamp_max(
  job:slo_batch_throughput:interval / job:slo_batch_throughput_target:max,
  1.0
)
`,
		},
		forIntervals(AlertWindows,
			rulefmt.Rule{
				Record: "job:slo_error:ratio%s",
				Expr:   `avg_over_time(job:slo_batch_error:interval[%s])`,
			},
		),
	)
)
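flattenRules and forIntervals are unexported, so their implementations aren't shown in this documentation. A minimal sketch of the interval-expansion pattern the `%s` and `%[1]s` placeholders imply, using a local stand-in `Rule` type in place of rulefmt.Rule:

```go
package main

import "fmt"

// Rule is a local stand-in for rulefmt.Rule, carrying only the fields this
// sketch needs.
type Rule struct {
	Record string
	Expr   string
}

// forIntervals renders a rule template once per alert window, substituting
// the window into both the Record and Expr format strings.
func forIntervals(intervals []string, template Rule) []Rule {
	rules := make([]Rule, 0, len(intervals))
	for _, interval := range intervals {
		rules = append(rules, Rule{
			Record: fmt.Sprintf(template.Record, interval),
			Expr:   fmt.Sprintf(template.Expr, interval),
		})
	}
	return rules
}

func main() {
	rules := forIntervals([]string{"5m", "1h"}, Rule{
		Record: "job:slo_error:ratio%s",
		Expr:   `avg_over_time(job:slo_batch_error:interval[%s])`,
	})
	for _, r := range rules {
		fmt.Printf("%s = %s\n", r.Record, r.Expr)
	}
}
```

With the full AlertWindows list, each template therefore produces one job:slo_error:ratio<I> recording rule per window, which is what the generic alert rules consume.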
var (
	// ErrorRateTemplateRules map from the job:slo_error_rate_total and
	// job:slo_error_rate_errors time series to the SLO-compliant
	// job:slo_error:ratio<I> series that are used to power alerts.
	ErrorRateTemplateRules = flattenRules(
		forIntervals(AlertWindows,
			rulefmt.Rule{
				Record: "job:slo_error:ratio%s",
				Expr:   `((job:slo_error_rate_errors:rate%[1]s) or (0 * job:slo_error_rate_total:rate%[1]s)) / job:slo_error_rate_total:rate%[1]s`,
			},
		),
	)
)
var (
	// LatencyTemplateRules map from the job:slo_latency_* time series to the
	// SLO-compliant job:slo_error:ratio<I> series that are used to power
	// alerts.
	LatencyTemplateRules = flattenRules(
		forIntervals(AlertWindows,
			rulefmt.Rule{
				Record: "job:slo_error:ratio%s",
				Expr:   `(job:slo_latency_total:rate%[1]s - job:slo_latency_observation:rate%[1]s) / job:slo_latency_total:rate%[1]s`,
			},
		),
	)
)
Functions ¶
func MustRegisterTemplate ¶
MustRegisterTemplate installs the rules that map template specific SLO intermediate calculations to the job:slo_error:ratio<I> series that power alerts. This is called from the place a template is implemented.
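The exact signature of MustRegisterTemplate isn't reproduced in this documentation. One plausible sketch of the registration pattern it describes, using local stand-ins for SLO, rulefmt.Rule, and the package-level registries (the real function may be shaped differently):

```go
package main

import "fmt"

// SLO is a local stand-in for the package's SLO interface.
type SLO interface {
	GetName() string
}

// Rule is a local stand-in for rulefmt.Rule.
type Rule struct{ Record string }

// Templates and TemplateRules mirror the package-level registries.
var (
	Templates     = map[string]SLO{}
	TemplateRules = []Rule{}
)

// MustRegisterTemplate records the template under its name and installs the
// rules that translate its intermediate series into job:slo_error:ratio<I>,
// panicking on a duplicate so conflicts surface at startup.
func MustRegisterTemplate(slo SLO, rules ...Rule) {
	name := slo.GetName()
	if _, ok := Templates[name]; ok {
		panic(fmt.Sprintf("template %q registered twice", name))
	}
	Templates[name] = slo
	TemplateRules = append(TemplateRules, rules...)
}

type exampleSLO struct{ name string }

func (e exampleSLO) GetName() string { return e.name }

func main() {
	// Template implementations would call this from the file where the
	// template is defined, typically from an init() function.
	MustRegisterTemplate(exampleSLO{name: "BatchProcessingSLO"},
		Rule{Record: "job:slo_error:ratio1h"})
	fmt.Println(len(Templates), len(TemplateRules)) // prints "1 1"
}
```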
Types ¶
type BatchProcessingSLO ¶
type BatchProcessingSLO struct {
	Deadline   serializeableDuration // time after starting the batch that it must finish
	Volume     string                // expected maximum volume to be processed by a single batch run
	Throughput string                // measure of batch throughput
	// contains filtered or unexported fields
}
BatchProcessingSLO is used to construct SLOs around large batch processes that the business demands finish within a given deadline.
To use this template, you provide a measure of throughput for the batch process which is only present when the job is underway. The SLO then uses an estimated measure of maximum expected volume and the business deadline to compute a target throughput, then measures SLO compliance against how well the batch process meets the target.
It can be a good idea to compute the volume measurement by taking a record of previous historic maximums and applying a growth multiplier that is appropriate for the business context. If you're processing a number of payments, and your peak volume comes once a month, expecting 1.5x the maximum volume processed by the batch job in the last 60 days might be a good starting point.
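To make the target computation concrete, here is a small sketch. The numbers are hypothetical (a 40k historic maximum, 1.5x growth multiplier, and 2-hour deadline are illustrative, not from the package), but batchError mirrors the recorded job:slo_batch_error:interval expression: `1.0 - clamp_max(throughput / target, 1.0)`.

```go
package main

import "fmt"

// batchError mirrors the job:slo_batch_error:interval expression:
// 1.0 - clamp_max(throughput / target, 1.0).
func batchError(throughput, target float64) float64 {
	ratio := throughput / target
	if ratio > 1 {
		ratio = 1 // exceeding the target is 0% error, never negative
	}
	return 1 - ratio
}

func main() {
	// Hypothetical inputs: 60-day historic max of 40k payments, 1.5x growth
	// multiplier, and a 2-hour business deadline.
	volume := 40000.0 * 1.5       // expected maximum volume
	target := volume / (2 * 3600) // required records/second to meet the deadline
	fmt.Printf("target throughput: %.2f records/s\n", target)

	fmt.Printf("on-target batch:  %.0f%% error\n", 100*batchError(target, target))
	fmt.Printf("half-speed batch: %.0f%% error\n", 100*batchError(target/2, target))
	fmt.Printf("fast batch:       %.0f%% error\n", 100*batchError(target*2, target))
}
```

Note that error accrues against the estimated maximum volume, not the actual volume of the run: a batch processing a small backlog at half the target rate still burns budget, which is the second characteristic listed above.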
The important characteristics of this SLO are:
- Error budget is consumed at a rate proportional to unmet target performance
- Error budget is consumed even by batches that process less-than-maximum volume
One thing to note is that throughput exceeding the target threshold is considered 0% error, rather than some negative error value. This is a deliberate choice to avoid encouraging spiky throughput values, but may be toggled in future.
func (BatchProcessingSLO) Rules ¶
func (b BatchProcessingSLO) Rules() []rulefmt.Rule
type ErrorRateSLO ¶
To use this template, you provide a parameterised rate of requests and errors that are sliced across multiple time windows.
func (ErrorRateSLO) Rules ¶
func (e ErrorRateSLO) Rules() []rulefmt.Rule
type LatencySLO ¶
type LatencySLO struct {
	RequestClass string // request class references a latency target
	Total        string // parameterized rate of total requests
	Observation  string // parameterized rate of histogram bucket
	// contains filtered or unexported fields
}
LatencySLO is used to construct SLOs based on latency.
To use this template, you provide a parameterized rate of total requests, a parameterized counter that tracks the number of observations (the histogram bucket), and a request class that references a latency target.
This template allows defining SLOs as follows:
90% requests < 300ms
99% requests < 1000ms
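A worked example of the error ratio behind a target like "90% requests < 300ms" (the request counts are hypothetical); the calculation mirrors the `(total - observation) / total` expression in LatencyTemplateRules:

```go
package main

import "fmt"

// latencyError mirrors the job:slo_error:ratio<I> expression for the latency
// template: (total - observation) / total, where observation counts requests
// that completed under the latency threshold (the histogram bucket rate).
func latencyError(total, underThreshold float64) float64 {
	return (total - underThreshold) / total
}

func main() {
	// Hypothetical interval: 1000 requests, 920 of which finished in <300ms.
	err := latencyError(1000, 920)
	budget := 0.10 // "90% requests < 300ms" leaves a 10% error budget
	fmt.Printf("error ratio %.2f within budget %.2f: %v\n", err, budget, err <= budget)
}
```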
func (LatencySLO) Rules ¶
func (l LatencySLO) Rules() []rulefmt.Rule
type Pipeline ¶
type Pipeline struct {
	// Name defines the RuleGroup name in Prometheus
	Name string

	// SLORules is where each SLO should place the appropriate rules that power the
	// post-processing and alert trailers.
	SLORules []rulefmt.Rule
}
Pipeline can build a RuleGroup that powers the generation of SLO time series. The RuleGroup generated by the Pipeline includes the rules installed by templates and the global alerting windows, along with each SLO registered on the Pipeline instance via the MustRegister() method.
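The NewPipeline and MustRegister signatures aren't reproduced in this documentation, so the following is only a sketch of the ordering Build implies, with local stand-ins for rulefmt.Rule and the package-level rule sets: the SLO-specific rules come first, then the template translation rules, then the alert rules that consume them.

```go
package main

import "fmt"

// Rule is a local stand-in for rulefmt.Rule.
type Rule struct{ Record, Alert string }

// templateRules and alertRules stand in for the package-level TemplateRules
// and AlertRules variables.
var (
	templateRules = []Rule{{Record: "job:slo_error:ratio1h"}}
	alertRules    = []Rule{{Alert: "SLOErrorBudgetFastBurn"}}
)

// Pipeline is a sketch of the package's Pipeline type: it collects each SLO's
// rules and emits them ahead of the stages that depend on them.
type Pipeline struct {
	Name     string
	SLORules []Rule
}

// Build orders the rules so every stage's inputs are recorded before the
// stage that consumes them runs: SLO rules, then template translations, then
// the burn-rate alerts.
func (p *Pipeline) Build() []Rule {
	out := append([]Rule{}, p.SLORules...)
	out = append(out, templateRules...)
	out = append(out, alertRules...)
	return out
}

func main() {
	p := &Pipeline{
		Name:     "slo-rules",
		SLORules: []Rule{{Record: "job:slo_batch_throughput:interval"}},
	}
	for _, r := range p.Build() {
		fmt.Printf("%s%s\n", r.Record, r.Alert)
	}
}
```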
func NewPipeline ¶
func (*Pipeline) Build ¶
func (p *Pipeline) Build() rulefmt.RuleGroups
func (*Pipeline) MustRegister ¶
type SLO ¶
type SLO interface {
	// GetName returns a globally unique name for the SLO
	GetName() string

	// Rules generates Prometheus recording rules that implement the SLO definition
	Rules() []rulefmt.Rule
}
SLO is the base interface type for all SLOs.
func ParseDefinitions ¶
ParseDefinitions loads a YAML file of configured templates that looks like this:
---
definitions:
  - template: BatchProcessingSLO
    definition:
      name: MarkPaymentsAsPaidMeetsDeadline
      ...
and produces a list of SLOs. This is the file format we expect users to be providing to the slo-builder.