alertmanager-status


Quick start

alertmanager-status is an Alertmanager webhook that serves a status page indicating whether it has received an alert recently. This allows you to use a normal website monitoring service to alert you if Alertmanager stops publishing alerts.

To do this, you'll need an alert that is always firing. Many people have this already, but here's one that I found lying around on the Internet:

- name: Watchdog
  rules:
      - alert: AlwaysFiring
        expr: vector(1)
        for: 1s
        labels:
            severity: none
        annotations:
            summary: "AlwaysFiring"
            description: |
                This is an alert meant to ensure that the entire alerting pipeline is functional.
                This alert is always firing, therefore it should always be firing in Alertmanager.
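Once the rule is loaded, you can check that the alert actually reaches Alertmanager. A quick sketch, assuming Alertmanager listens on its default port, 9093:

# List the active alerts and look for the watchdog alert.
curl -s http://localhost:9093/api/v2/alerts | grep AlwaysFiring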

Next, you'll need to install alertmanager-status and tell Alertmanager to send just this alert to its webhook endpoint. Your alertmanager.yml file should look something like mine:

global:
    resolve_timeout: 5m
route:
    group_by: ["job"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
        - match:
              alertname: AlwaysFiring
          group_wait: 0s
          repeat_interval: 5s
          group_interval: 5s
          receiver: "status"
        - receiver: "discord"
    receiver: "null"
receivers:
    - name: "null"
    - name: "status"
      webhook_configs:
          - url: "http://alertmanager-status.monitoring.svc.cluster.local.:8081/webhook"
            send_resolved: false
    - name: "discord"
      webhook_configs:
          - url: "http://alertmanager-discord.monitoring.svc.cluster.local.:8080/"

What's going on here is that we set up a status receiver, whose webhook URL is your alertmanager-status instance's /webhook endpoint (it's served from the "debug" server), and a route that matches the AlwaysFiring alert. In that route, we send the alert to status instead of discord, which is where I normally receive alerts. We also set a ridiculously short repeat interval, so that alertmanager-status can be aggressive about marking the alerting system unhealthy (the default is 1 minute: no sign of that alert for 1 minute and we declare the system unhealthy).

Now when you visit the public part of alertmanager-status, you'll see either "alertmanager ok" if Alertmanager is sending that alert, or "alertmanager unhealthy" if it hasn't sent the alert for one minute. From there, you can expose that status page to the Internet and tell your favorite website monitoring service (I use Oh Dear!) to alert you when it "goes down". Then you'll know if your Alertmanager setup stops sending alerts!
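For example, probing the public port from inside the cluster (a sketch; it assumes the status text is served at the root path of the "public" port, 8080):

$ curl http://alertmanager-status.monitoring.svc.cluster.local.:8080/
alertmanager ok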

Installation

If you're using Kubernetes, I have prepared a manifest. It uses kustomize, so you can just write your site-specific configuration in a kustomization.yaml file that looks like:

namespace: monitoring
bases:
    - github.com/jrockway/alertmanager-status/deploy?ref=v0.0.7

Then run kubectl apply -k . in the directory where you put that file. The release names on GitHub are tags that you can use in the ?ref=... directive.

You will need to add your own Ingress configuration if you want one. Create the manifest and add it to your kustomization.yaml by adding a resources section that refers to it.
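Here's a minimal sketch of that; the hostname is a placeholder, and the Service name and "public" port follow from the manifest (see the webhook URL above and the ports described under Operation):

# kustomization.yaml
namespace: monitoring
bases:
    - github.com/jrockway/alertmanager-status/deploy?ref=v0.0.7
resources:
    - ingress.yaml

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
    name: alertmanager-status
spec:
    rules:
        - host: status.example.com # placeholder hostname
          http:
              paths:
                  - path: /
                    pathType: Prefix
                    backend:
                        service:
                            name: alertmanager-status # the Service from the manifest
                            port:
                                number: 8080 # the "public" port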

If you're not using Kubernetes, it's just a Go program that takes configuration from command-line flags or the environment and serves HTTP requests from the Internet. You can run it with --help for details.
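For example, a sketch of installing it with the Go toolchain (Go 1.16+; this assumes the main package lives at the module root and that your Go bin directory is on your PATH):

go install github.com/jrockway/alertmanager-status@v0.0.7
alertmanager-status --help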

Operation

alertmanager-status logs JSON-formatted structured logs at level=debug by default. The debug logs are very verbose. The provided manifests change the level to info, which includes only logs relevant to operators and so is much less verbose. State changes of the monitored Alertmanager are logged, and any time that "unhealthy" is served to your site-check service, a log message is generated. That should be informative while remaining manageable in volume.

alertmanager-status binds two ports by default: "public" (8080) and "debug" (8081). "public" serves the status page, and is intended to be exposed to the Internet for an external service to probe. "debug" serves a readiness check (/healthz), a liveness check (/livez), the Alertmanager webhook (/webhook), a page of metrics for Prometheus to scrape (/metrics), and the usual assortment of net/http/pprof endpoints.
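If you deploy it yourself on Kubernetes, those endpoints map naturally onto container probes. A sketch of the relevant container-spec fragment (the provided manifest may already wire these up):

livenessProbe:
    httpGet:
        path: /livez # liveness check, on the debug port
        port: 8081
readinessProbe:
    httpGet:
        path: /healthz # readiness check, on the debug port
        port: 8081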

A trace of every HTTP request is sent to Jaeger, if available. alertmanager-status is a Go app, and you can configure Jaeger to your exact needs by following their documented environment variables.

Metrics are made available to Prometheus via a page on the debug server, /metrics. We export a variety of standard metrics, and some app-specific ones:

  • alertmanager_status_alertmanager_health - 1 if we consider Alertmanager healthy, 0 otherwise.

  • alertmanager_status_alertmanager_last_healthy - When Alertmanager was last confirmed to be healthy, in seconds since the Unix epoch.

  • alertmanager_status_last_health_checked - When the health endpoint was last polled by your external health check service, in seconds since the Unix epoch. This lets you set up an alert to detect that your health checking service is down. Hopefully that doesn't happen at the same time your Alertmanager goes down!
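For example, here's a sketch of such an alerting rule; the HealthCheckerDown name and the 5-minute threshold are assumptions, so tune them to your checker's polling interval:

- name: alertmanager-status
  rules:
      - alert: HealthCheckerDown
        # Fire if the external checker hasn't polled the status page
        # in the last 5 minutes.
        expr: time() - alertmanager_status_last_health_checked > 300
        for: 5m
        labels:
            severity: warning
        annotations:
            summary: "External health checker stopped polling alertmanager-status"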
