gotel

package module
v0.0.0-...-312d9a2 Latest
Published: Feb 25, 2019 License: BSD-2-Clause Imports: 21 Imported by: 1

README

GoTel

Who monitors the monitors?

CrowdStrike Cloud Engineering is releasing GoTel, an internal monitoring service that aims to ensure scheduled jobs, cron jobs, batch-oriented work, and other scheduled tasks complete successfully and within a set SLA time period.

  • Provides coordinator/worker pattern to ensure one gotel is always operational
  • HTTP based API to enable easy integration
  • Ability to pause checkins during scheduled maintenance periods

Overview

GoTel is for monitoring scheduled operations.

Most companies have scheduled reports, cron jobs, backup jobs, random data processing tasks, and similar work that is expected to run perfectly but sometimes fails silently. GoTel lets each job make a "reservation", which means it has to check in during its allotted time frame or alerts will be sent out to the world. When a job runs, it can "checkin" to GoTel, which updates its last check-in time.

This was born from years of experience with various "cron" type jobs that suddenly stop working: a firewall config change blocks a port the job needs, data sets grow and something that took 10 minutes now takes 2 hours, something in the environment changes, or any of a myriad of other failure conditions occurs. GoTel is for when things need to run and you need an independent monitor in your network that is not locked in to a specific vendor. We've also seen alerting failures from 3rd party vendors that are supposed to give us the warm fuzzy feeling that they'll alert us when things stop working.

For a toy example, take the case where you have a nightly job that removes old data from a data store. It’s a simple one-liner that runs every night. One day you hit the inflection point, the indexes turn useless, and your script now does a full table scan. Your 20-minute data cleanup job now takes 5 hours, and you didn’t find out about it as soon as it happened.

GoTel is for when you don't need the overhead of an "enterprise" grade schedule monitor. It's for the microservice world where you have apps running in various languages, platforms and locations. It's expected that you have two GoTel instances up for redundancy (across datacenters). The coordinator monitors the worker and the worker monitors the coordinator to ensure GoTel is always operational; if it is not, alerts are sent out to avoid silent failures.

GoTel has been running inside CrowdStrike and has already caught cases in production of scheduled operations silently failing, reducing customer impact.

[Figure: GoTel architecture]

Requirements

Version 0.1 expects a MySQL backend for storing jobs and for leader election. Future plugins will allow direct integration with ZooKeeper for leader election, but v1 keeps the requirements to a minimum for easier outside adoption.

Web UI

The admin UI is what you see when you browse to http://localhost:8080/status.

[Screenshot: Gotel Web UI]

Getting Started

In MySQL, create a "gotel" database:

mysql> create database gotel;

Grab the code:

mkdir -p gotel_github/src

export GOPATH=~/gotel_github

cd gotel_github

go get github.com/CrowdStrike/gotel

go get github.com/go-sql-driver/mysql

go get github.com/ParsePlatform/go.flagenv

cd $GOPATH/src/github.com/CrowdStrike/gotel/cmd/gotelweb

./run.sh

Navigate to: http://127.0.0.1:8080/status

Version

0.1

Terminology

GoTel uses a number of hotel related concepts. It's imagined that a small hotel owner becomes fond of her regular guests and gets sad when they don't checkin when they say they're supposed to.

  • [App] - A general application that's under watch
  • [Component] - A sub piece of your app, e.g. job1, job2, etc...
  • [Reservation] - When your app starts up or is created you create a placeholder and tell GoTel how often it will checkin
  • [Checkin] - Your app completed its work properly and is telling GoTel everything is A-OK
  • [Checkout] - If you want to power down an app you can checkout and GoTel will stop alerting on it
  • [Snooze] - If your app is down for maintenance you can "snooze" the job checker to avoid alerts getting fired
  • [Alerters] - GoTel allows plugins to be created that can output to various notification systems: SMTP, PagerDuty, etc.

Alerters

GoTel allows configurable alerters to be set, so when an application doesn't checkin within its SLA, GoTel fires off to one or more alert systems.

Currently configured alerts:

SMTP

  • sends emails to the "notify" parameter of a reservation

To enable: edit gotel.gcfg and set enable=true under [smtp].

Pass in the flag -GOTEL_SMTP_HOST=10.10.1.1 (or whatever your SMTP server address is).

PagerDuty

  • creates a PagerDuty incident that will alert via SMS when an app/component fails to checkin
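
Alert text comes from a reservation's alert_msg, whose placeholders ({app}, {component}, and so on) are listed in the API section. The following is a minimal sketch of how such substitution could work, not GoTel's actual implementation. The alertFields type and renderAlert function are hypothetical names introduced for illustration, and only four of the documented placeholders are covered.

```go
package main

import (
	"fmt"
	"strings"
)

// alertFields is a hypothetical holder for a subset of the values that
// the README says can be substituted into alert_msg.
type alertFields struct {
	App, Component, Owner, Srv string
}

// renderAlert replaces the {app}, {component}, {owner} and {srv}
// placeholders in tmpl. Unknown placeholders are left untouched.
func renderAlert(tmpl string, f alertFields) string {
	r := strings.NewReplacer(
		"{app}", f.App,
		"{component}", f.Component,
		"{owner}", f.Owner,
		"{srv}", f.Srv,
	)
	return r.Replace(tmpl)
}

func main() {
	tmpl := "App: [{app}] Component: [{component}] failed checkin on IP [{srv}]. Contact owner [{owner}]"
	fmt.Println(renderAlert(tmpl, alertFields{
		App:       "testapp",
		Component: "requests",
		Owner:     "jim@foo.com",
		Srv:       "10.10.1.1",
	}))
}
```

A single strings.Replacer handles all placeholders in one pass, so a substituted value that happens to contain braces is not expanded a second time.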

API

// make a reservation that tells GoTel testapp/requests will complete work every 5 minutes or alert me
// supported time_units currently are seconds,minutes,hours
// notify parameter supports a comma-separated list of recipients that will receive an alert when a job fails to checkin
// alert_msg will replace the following fields with their corresponding values:
// {jobid}, {app}, {component}, {owner}, {notify}, {frequency}, {last}, {since}, {checkins}, {srv}
// where {last} is the timestamp of the last checkin, {since} is how long ago the last checkin was, {checkins} is the
// total number of checkins so far, and {srv} is the IP address of the server sending the notification
curl -XPOST 'http://127.0.0.1:8080/reservation' -i -H "Content-type: application/json" -d '
{
  "app": "testapp",
  "component": "requests",
  "notify": "jim@foo.com",
  "alert_msg": "App: [{app}] Component: [{component}] failed checkin on IP [{srv}]. Contact owner [{owner}]",
  "frequency": 5,
  "time_units": "minutes",
  "owner": "jim@foo.com"
}
'

// checkin for a reservation to avoid having alerts sent
curl -XPOST 'http://127.0.0.1:8080/checkin' -i -H "Content-type: application/json" -d '
{
  "app": "testapp",
  "component": "requests",
  "notes": "all is well"
}
'

// pause (snooze your wakeup call) a job if you're going down for maintenance or testing
curl -XPOST 'http://127.0.0.1:8080/snooze' -i -H "Content-type: application/json" -d '
{
  "app": "testapp",
  "component": "requests",
  "duration": 10,
  "time_units": "hours"
}
'

// checkout/delete reservation
curl -XPOST 'http://127.0.0.1:8080/checkout' -i -H "Content-type: application/json" -d '
{
  "app": "testapp",
  "component": "requests"
}
'

// view all reservations
curl 'http://127.0.0.1:8080/reservation'

// view the status in your browser
http://127.0.0.1:8080/status
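
The curl calls above translate directly to any HTTP client. As a rough sketch, a Go job might check in only after its work succeeds, along the following lines. The checkin struct mirrors the JSON body of the /checkin example; encodeCheckin, sendCheckin and runJob are hypothetical names, and treating any 2xx status as success is an assumption, since the README does not document response codes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// checkin mirrors the JSON body used in the /checkin curl example.
type checkin struct {
	App       string `json:"app"`
	Component string `json:"component"`
	Notes     string `json:"notes"`
}

// encodeCheckin serializes a check-in request body.
func encodeCheckin(app, component, notes string) ([]byte, error) {
	return json.Marshal(checkin{App: app, Component: component, Notes: notes})
}

// sendCheckin posts the check-in to baseURL + "/checkin".
// Assumption: any 2xx status means the check-in was accepted.
func sendCheckin(client *http.Client, baseURL, app, component, notes string) error {
	body, err := encodeCheckin(app, component, notes)
	if err != nil {
		return err
	}
	resp, err := client.Post(baseURL+"/checkin", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("checkin failed: %s", resp.Status)
	}
	return nil
}

// runJob is a placeholder for the real scheduled work.
func runJob() error { return nil }

func main() {
	if err := runJob(); err != nil {
		// No check-in on failure, so GoTel alerts once the SLA lapses.
		fmt.Println("job failed, skipping checkin:", err)
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	if err := sendCheckin(client, "http://127.0.0.1:8080", "testapp", "requests", "all is well"); err != nil {
		fmt.Println("checkin error:", err)
	}
}
```

Skipping the check-in when the job fails is the point of the design: the missed reservation is what triggers the alert.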
Configure the config file. Instructions are in the following file:
  • cmd/gotelweb/gotel.gcfg

Future ToDos

  • ZooKeeper option for leadership election
  • Additional Alerter integrations
  • Auth/TLS support for the SMTP alerter
  • Better coordinator/worker monitoring: make sure jobs are fully processed
  • Ability to specify on a reservation which alerters you want to include (potentially useful for debugging)
  • Web interface to make reservations through a web UI and view stats
  • Ability to set up an escalation level, e.g. if this reservation fails then it's a "wake me up" type alert

Documentation

Index

Constants

const (
	// Minute is the number of seconds in a minute
	Minute = 60
	// Hour is the number of seconds in an hour
	Hour = 60 * Minute
	// Day is the number of seconds in a day
	Day = 24 * Hour
	// Week is the number of seconds in a week
	Week = 7 * Day
	// Month is the number of seconds in a 30 day month
	Month = 30 * Day
	// Year is the number of seconds in almost a year
	Year = 12 * Month
	// LongTime is the number of seconds in a while
	LongTime = 37 * Year
)
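
All of these constants are expressed in seconds, so Year works out to 12 thirty-day months, or 360 days. As an illustration of how second-based constants can be used, the sketch below converts a reservation's frequency and time_units into seconds. The toSeconds helper is hypothetical and not part of the package; the constants are redeclared locally so the example stands alone.

```go
package main

import "fmt"

// Local copies of two of the package constants, in seconds.
const (
	Minute = 60
	Hour   = 60 * Minute
)

// toSeconds converts a frequency and one of the supported time_units
// ("seconds", "minutes", "hours") into a number of seconds.
// Hypothetical helper for illustration only.
func toSeconds(frequency int, timeUnits string) (int, error) {
	switch timeUnits {
	case "seconds":
		return frequency, nil
	case "minutes":
		return frequency * Minute, nil
	case "hours":
		return frequency * Hour, nil
	default:
		return 0, fmt.Errorf("unsupported time_units %q", timeUnits)
	}
}

func main() {
	for _, u := range []string{"seconds", "minutes", "hours"} {
		s, err := toSeconds(5, u)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("5 %s = %d seconds\n", u, s)
	}
}
```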

Variables

This section is empty.

Functions

func FailsSLA

func FailsSLA(res reservation) bool

FailsSLA monitors the reservations and determines if any jobs haven't checked in within their allotted timeframe
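
The reservation type is unexported, so its fields are not visible here. The sketch below shows one way such a check could be written, under stated assumptions: times are Unix seconds, a snoozed reservation never fails, and a job is late once more than its frequency has elapsed since the last check-in. The sketchReservation type and failsSLA function are hypothetical and may differ from the real logic.

```go
package main

import (
	"fmt"
	"time"
)

// sketchReservation is a hypothetical stand-in for the package's
// unexported reservation type. All times are Unix seconds.
type sketchReservation struct {
	LastCheckinUnix  int64
	FrequencySec     int64
	SnoozedUntilUnix int64 // zero means not snoozed
}

// failsSLA reports whether the reservation is overdue at nowUnix.
// Assumption: a snoozed reservation is never considered late.
func failsSLA(r sketchReservation, nowUnix int64) bool {
	if nowUnix < r.SnoozedUntilUnix {
		return false
	}
	return nowUnix-r.LastCheckinUnix > r.FrequencySec
}

func main() {
	now := time.Now().Unix()
	r := sketchReservation{
		LastCheckinUnix: now - 10*60, // last check-in 10 minutes ago
		FrequencySec:    5 * 60,      // expected every 5 minutes
	}
	fmt.Println("late:", failsSLA(r, now))
}
```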

func InitDb

func InitDb(host, user, pass string, conf Config) *sql.DB

InitDb initializes and then bootstraps the database

func InitializeMonitoring

func InitializeMonitoring(c Config, db *sql.DB)

InitializeMonitoring sets up alerters based on configuration

func Monitor

func Monitor(db *sql.DB)

Monitor checks existing reservations for late arrivals

func RelTime

func RelTime(a, b time.Time, albl, blbl string) string

RelTime returns the duration between two times, formatted as a string

Types

type Config

type Config struct {
	Main struct {
		GotelOwnerEmail    string
		HoursBetweenAlerts int64
		DaysToStoreLogs    int
	}
	SMTP struct {
		Enabled     bool
		FromAddress string
		ReplyTO     string
	}
	PagerDuty struct {
		Enabled    bool
		ServiceKey string
	}
}

Config is the service configuration

func NewConfig

func NewConfig(confPath string, sysLogEnabled bool) Config

NewConfig returns a gotel config with confPath and sysLogEnabled set. As part of initialization it also parses the provided config file.

type Endpoint

type Endpoint struct {
	Db *sql.DB
}

Endpoint holds the reference to our DB connection

func (*Endpoint) InitAPI

func (ge *Endpoint) InitAPI(port int, htmlPath string)

InitAPI initializes the webservice on the specified port

type Response

type Response map[string]interface{}

Response will hold a response sent back to the caller

Directories

Path Synopsis
cmd
