staple

package module
v0.0.0-...-1b3333d Latest Latest
Published: Feb 27, 2024 License: MIT Imports: 11 Imported by: 0

README

Staple

import "andy.dev/staple"

Staple provides Erlang-ish supervisor trees for Go.

It is intended to deal gracefully with the real failure cases that can occur with supervision trees (such as burning all your CPU time endlessly restarting dead services), while making no unnecessary demands on the "service" code and providing hooks for adequate logging in a production environment.

A blog post describing the design decisions is available.

This module is fairly fully covered with godoc including an example, usage, and everything else you might expect from a README.md on GitHub. (DRY.)

Special Thanks

Special thanks to thejerf for a great package. When in doubt, use that instead, as this package primarily exists so I can tweak it ever so slightly to suit my needs.

Changelog

staple uses semantic versioning and go modules.

  • 1.0.0:
    • Initial fork from suture
    • Minor lint-based cleanup, nothing to write home about.
    • Add slog as the default logger and allow all event types to conform to slog.LogValuer so that they decide their own logging levels.
  • suture-4.0.2:
    • Add the ability to specify a handler for non-string panics to format them.
    • Fixed an issue where trying to close a currently-panicked service was having problems. (This may have leaked goroutines in other ways too.)
    • Merged a PR that addresses race conditions in the test suite. (These seem to have been isolated to the test suite and not have affected the core code.)
  • suture-4.0.1:
    • Add a channel returned from ServeBackground that can be used to examine any error coming out of the supervisor once it is stopped.
    • Tweak up the docs to try to make it more clear suture's special error returns are checked via errors.Is when possible, addressing issue #51.
  • suture-4.0:
    • Switched the entire API to be context based.
    • Switched how logging works to take a single closure that will be presented with a defined set of structs, rather than a set of closures for each event.
    • Consequently, "Stop" removed from the Service interface. A wrapper for old-style code is provided.
    • Services can now return errors. Errors will be included in the log message. Two special errors control restarting behavior:
      • ErrDoNotRestart indicates the service should not be restarted, but other services should be unaffected.
      • ErrTerminateTree indicates the parent service tree should be terminated. Supervisor trees can be configured to either continue terminating upwards, or terminate themselves but not continue propagating the termination upwards.
    • UnstoppedServiceReport calling semantics modified to allow correctly retrieving reports from entire trees. (Prior to 4.0, a report was only on the supervisor it was called on.)

Documentation

Overview

Package staple provides Erlang-like supervisor trees.

This implements Erlang-esque supervisor trees, as adapted for Go. This is an industrial-strength, tested library deployed into hostile environments, not just a proof of concept or a toy.

Why use Staple?

  • You want to write bullet-resistant services that will remain available despite unforeseen failure.
  • You need the code to be smart enough not to consume 100% of the CPU restarting things.
  • You want to easily compose multiple such services in one program.
  • You want the Erlang programmers to stop lording their supervision trees over you.

Staple has 100% test coverage, and is golint clean.

A blog post describing the design decisions of the initial project is available at http://www.jerf.org/iri/post/2930 .

Using Staple

To idiomatically use Staple, create a Supervisor which is your top level "application" supervisor. This will often occur in your program's "main" function.

Create "Service"s, which implement the Service interface. .Add() them to your Supervisor. Supervisors are also services, so you can create a tree structure here, depending on the exact combination of restarts you want to create.

As a special case, when adding Supervisors to Supervisors, the "sub" supervisor will have the "super" supervisor's EventHook function copied. This allows you to set one EventHook on the "top" supervisor and have it propagate down to all the sub-supervisors. This also allows libraries or modules to provide Supervisors without having to commit their users to a particular logging method.

Finally, as what is probably the last line of your main() function, call .Serve() on your top level supervisor. This will start all the services you've defined.

See the Example for an example, using a simple service that serves out incrementing integers.

Constants

const (
	DefaultFailureDecay     = 30
	DefaultFailureThreshold = 5
	DefaultFailureBackoff   = 15 * time.Second
	DefaultTimeout          = 10 * time.Second
)

Variables

var ErrDoNotRestart = errors.New("service should not be restarted")

ErrDoNotRestart can be returned by a service to voluntarily not be restarted. Any error that will compare with errors.Is as being this error will count as an ErrDoNotRestart.

var ErrSupervisorNotRunning = errors.New("supervisor not running")

ErrSupervisorNotRunning is returned by some methods if the supervisor is not running, either because it has not been started or because it has been terminated.

var ErrSupervisorNotStarted = errors.New("supervisor not started yet")

ErrSupervisorNotStarted is returned if you try to send control messages to a supervisor that has not started yet. See note on Supervisor struct about the legal ways to start a supervisor.

var ErrSupervisorNotTerminated = errors.New("supervisor not terminated")

ErrSupervisorNotTerminated is returned when asking for a stopped service report before the supervisor has been terminated.

var ErrTerminateSupervisorTree = errors.New("tree should be terminated")

ErrTerminateSupervisorTree can be returned by a service to terminate the entire supervision tree above it as well. Any error that compares with errors.Is as ErrTerminateSupervisorTree will count as an ErrTerminateSupervisorTree.

var ErrTimeout = errors.New("waiting for service to stop has timed out")

ErrTimeout is returned when an attempt to RemoveAndWait for a service to stop has timed out.

var ErrWrongSupervisor = errors.New("wrong supervisor for this service token, no service removed")

ErrWrongSupervisor is returned by the (*Supervisor).Remove method if you pass a ServiceToken from the wrong Supervisor.

Functions

This section is empty.

Types

type DefaultJitter

type DefaultJitter struct {
	// contains filtered or unexported fields
}

DefaultJitter is the jitter function that is applied when spec.BackoffJitter is set to nil.

func (*DefaultJitter) Jitter

func (dj *DefaultJitter) Jitter(d time.Duration) time.Duration

Jitter will jitter the backoff time by uniformly distributing it into the range [FailureBackoff, 1.5 * FailureBackoff).

type Event

type Event interface {
	fmt.Stringer
	slog.Leveler
	Type() EventType
	Map() map[string]interface{}
}

Event defines the interface implemented by all events Staple will generate.

Map will return a map with the details of the event serialized into a map[string]interface{}, with only the values suitable for serialization.

type EventBackoff

type EventBackoff struct {
	Supervisor     *Supervisor `json:"-"`
	SupervisorName string      `json:"supervisor_name"`
}

func (EventBackoff) Level

func (EventBackoff) Level() slog.Level

func (EventBackoff) Map

func (e EventBackoff) Map() map[string]interface{}

func (EventBackoff) String

func (e EventBackoff) String() string

func (EventBackoff) Type

func (EventBackoff) Type() EventType

type EventHook

type EventHook func(Event)

type EventResume

type EventResume struct {
	Supervisor     *Supervisor `json:"-"`
	SupervisorName string      `json:"supervisor_name"`
}

func (EventResume) Level

func (EventResume) Level() slog.Level

func (EventResume) Map

func (e EventResume) Map() map[string]interface{}

func (EventResume) String

func (e EventResume) String() string

func (EventResume) Type

func (EventResume) Type() EventType

type EventServicePanic

type EventServicePanic struct {
	Supervisor       *Supervisor `json:"-"`
	SupervisorName   string      `json:"supervisor_name"`
	Service          Service     `json:"-"`
	ServiceName      string      `json:"service_name"`
	CurrentFailures  float64     `json:"current_failures"`
	FailureThreshold float64     `json:"failure_threshold"`
	Restarting       bool        `json:"restarting"`
	PanicMsg         string      `json:"panic_msg"`
	Stacktrace       string      `json:"stacktrace"`
}

func (EventServicePanic) Level

func (EventServicePanic) Level() slog.Level

func (EventServicePanic) Map

func (e EventServicePanic) Map() map[string]interface{}

func (EventServicePanic) String

func (e EventServicePanic) String() string

func (EventServicePanic) Type

func (EventServicePanic) Type() EventType

type EventServiceTerminate

type EventServiceTerminate struct {
	Supervisor       *Supervisor `json:"-"`
	SupervisorName   string      `json:"supervisor_name"`
	Service          Service     `json:"-"`
	ServiceName      string      `json:"service_name"`
	CurrentFailures  float64     `json:"current_failures"`
	FailureThreshold float64     `json:"failure_threshold"`
	Restarting       bool        `json:"restarting"`
	Err              interface{} `json:"error_msg"`
}

func (EventServiceTerminate) Level

func (EventServiceTerminate) Level() slog.Level

func (EventServiceTerminate) Map

func (e EventServiceTerminate) Map() map[string]interface{}

func (EventServiceTerminate) String

func (e EventServiceTerminate) String() string

func (EventServiceTerminate) Type

func (EventServiceTerminate) Type() EventType

type EventStopTimeout

type EventStopTimeout struct {
	Supervisor     *Supervisor `json:"-"`
	SupervisorName string      `json:"supervisor_name"`
	Service        Service     `json:"-"`
	ServiceName    string      `json:"service"`
}

func (EventStopTimeout) Level

func (EventStopTimeout) Level() slog.Level

func (EventStopTimeout) Map

func (e EventStopTimeout) Map() map[string]interface{}

func (EventStopTimeout) String

func (e EventStopTimeout) String() string

func (EventStopTimeout) Type

func (EventStopTimeout) Type() EventType

type EventType

type EventType int

const (
	EventTypeStopTimeout EventType = iota
	EventTypeServicePanic
	EventTypeServiceTerminate
	EventTypeBackoff
	EventTypeResume
)

type HasSupervisor

type HasSupervisor interface {
	GetSupervisor() *Supervisor
}

HasSupervisor is an interface that indicates the given struct contains a supervisor. If the struct is either already a *Supervisor, or it embeds a *Supervisor, this will already be implemented for you. Otherwise, a struct containing a supervisor will need to implement this in order to participate in the EventHook propagation and the recursive UnstoppedServiceReport.

It is legal for GetSupervisor to return nil, in which case the supervisor-specific behaviors will simply be ignored.

type Jitter

type Jitter interface {
	Jitter(time.Duration) time.Duration
}

Jitter returns the sum of the input duration and a random jitter. It is compatible with the jitter functions in github.com/lthibault/jitterbug.

type NoJitter

type NoJitter struct{}

NoJitter does not apply any jitter to the input duration.

func (NoJitter) Jitter

func (NoJitter) Jitter(d time.Duration) time.Duration

Jitter leaves the input duration d unchanged.

type Service

type Service interface {
	Serve(ctx context.Context) error
}

Service is the interface that describes a service to a Supervisor.

Serve Method

The Serve method is called by a Supervisor to start the service. The service should execute within the goroutine that this is called in, that is, it should not spawn a "worker" goroutine. If this function either returns an error or panics, the Supervisor will call it again.

A Serve method SHOULD do as much cleanup of the state as possible, to prevent any corruption in the previous state from crashing the service again. The beginning of a service with persistent state should generally be a few lines to initialize and clean up that state.

The error returned by the service, if any, will be part of the log message generated for it. There are two distinguished errors a Service can return:

ErrDoNotRestart indicates that the service should not be restarted and should be removed from the supervisor entirely.

ErrTerminateSupervisorTree indicates that the Supervisor the service is running in should be terminated. If that Supervisor in turn returns it, its parent supervisor will also be terminated. (This can be controlled with configuration in the Supervisor.)

In Go 1.13 and greater, this is checked via errors.Is, so the error can be further wrapped with whatever additional info you like. Prior to Go 1.13, it will be checked via a direct equality check, so the distinguished errors cannot be wrapped.

Once the service has been instructed to stop, the Service SHOULD NOT be reused in any other supervisor! Because of the impossibility of guaranteeing that the service has fully stopped in Go, you can't prove that you won't be starting two goroutines using the exact same memory to store state, causing completely unpredictable behavior.

Serve should not return until the service has actually stopped. "Stopped" here is defined as "the service will stop servicing any further requests in the future". Any mandatory cleanup related to the Service should also have been performed.

If a service does not stop within the supervisor's timeout duration, the supervisor will log an entry to that effect. This does not guarantee that the service is hung; it may still get around to being properly stopped in the future. Until the service is fully stopped, both the service and the spawned goroutine trying to stop it will be "leaked".

Stringer Interface

When a Service is added to a Supervisor, the Supervisor will create a string representation of that service used for logging.

If you implement the fmt.Stringer interface, that will be used.

If you do not implement the fmt.Stringer interface, a default fmt.Sprintf("%#v") will be used.
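The naming rule can be sketched in a few lines. The serviceName function below is an illustration of the rule as described, not the package's code, and the two service types are hypothetical.

```go
package main

import "fmt"

// named is a hypothetical service that implements fmt.Stringer.
type named struct{}

func (named) String() string { return "named-service" }

// plain is a hypothetical service without a String method.
type plain struct{ Port int }

// serviceName prefers fmt.Stringer and otherwise falls back to the
// Go-syntax representation produced by %#v.
func serviceName(s interface{}) string {
	if str, ok := s.(fmt.Stringer); ok {
		return str.String()
	}
	return fmt.Sprintf("%#v", s)
}

func main() {
	fmt.Println(serviceName(named{}))          // prints "named-service"
	fmt.Println(serviceName(plain{Port: 8080})) // prints the %#v form of the struct
}
```

Because the %#v fallback includes every field, implementing fmt.Stringer is usually worth it for services that carry large or sensitive state.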

type ServiceToken

type ServiceToken struct {
	// contains filtered or unexported fields
}

ServiceToken is an opaque identifier that can be used to terminate a service that has been Add()ed to a Supervisor.

type Spec

type Spec struct {
	EventHook                EventHook
	Sprint                   SprintFunc
	FailureDecay             float64
	FailureThreshold         float64
	FailureBackoff           time.Duration
	BackoffJitter            Jitter
	Timeout                  time.Duration
	PassThroughPanics        bool
	DontPropagateTermination bool
}

Spec is used to pass arguments to the New function to create a supervisor. See the New function for full documentation.

type SprintFunc

type SprintFunc func(interface{}) string

SprintFunc formats an arbitrary Go value into a string. It is used by the supervisor to format the value of a call to recover().

type Supervisor

type Supervisor struct {
	Name string
	// contains filtered or unexported fields
}

Supervisor is the core type of the module that represents a Supervisor.

Supervisors should be constructed either by New or NewSimple.

Once constructed, a Supervisor should be started in one of three ways:

  1. Calling .Serve(ctx).
  2. Calling .ServeBackground(ctx).
  3. Adding it to an existing Supervisor.

Calling Serve will cause the supervisor to run until the passed-in context is cancelled. Often one of the last lines of the "main" func for a program will be to call one of the Serve methods.

Calling ServeBackground will CORRECTLY start the supervisor running in a new goroutine. It is risky to directly run

go supervisor.Serve(ctx)

because that will briefly create a race condition as it starts up, if you try to .Add() services immediately afterward.

func New

func New(name string, spec Spec) *Supervisor

New is the full constructor function for a supervisor.

The name is a friendly human name for the supervisor, used in logging. Staple does not care if this is unique, but it is good for your sanity if it is.

If not set, the following values are used:

  • EventHook: A function is created that handles structured logging of an event.
  • FailureDecay: 30 seconds
  • FailureThreshold: 5 failures
  • FailureBackoff: 15 seconds
  • Timeout: 10 seconds
  • BackoffJitter: DefaultJitter

The EventHook function will be called when errors occur. Staple will log the following:

  • When a service has failed, with a descriptive message about the current backoff status, and whether it was immediately restarted
  • When the supervisor has gone into its backoff mode, and when it exits it
  • When a service fails to stop

FailureDecay, FailureThreshold, and FailureBackoff control how failures are handled, in order to avoid the supervisor failure case where the program does nothing but restart failed services. If you do not care how failures behave, the default values should be fine for the vast majority of services, but if you want the details:

The supervisor tracks the number of failures that have occurred, with an exponential decay on the count. Every FailureDecay seconds, the number of failures that have occurred is cut in half. (This is done smoothly with an exponential function.) When a failure occurs, the number of failures is incremented by one. When the number of failures passes the FailureThreshold, the supervisor waits for the FailureBackoff duration before attempting any further restarts, at which point it resets its failure count to zero.

Timeout is how long Staple will wait for a service to properly terminate.

The PassThroughPanics options can be set to let panics in services propagate and crash the program, should this be desirable.

DontPropagateTermination indicates whether this supervisor tree will propagate an ErrTerminateSupervisorTree if a child service returns it. If false, this supervisor will itself return an error that will terminate its parent. If true, it will merely return ErrDoNotRestart. It is false by default.

Example (Simple)
package main

import (
	"context"
	"fmt"

	"andy.dev/staple"
)

// Incrementor is a simple service that serves out incrementing integers.
type Incrementor struct {
	current int
	next    chan int
	stop    chan bool
}

// Serve hands out integers on i.next until the context is cancelled.
func (i *Incrementor) Serve(ctx context.Context) error {
	for {
		select {
		case i.next <- i.current:
			i.current++
		case <-ctx.Done():
			// This message on i.stop is just to synchronize
			// this test with the example code so the output is
			// consistent for the test code; most services
			// would just "return nil" here.
			fmt.Println("Stopping the service")
			i.stop <- true
			return nil
		}
	}
}

func main() {
	supervisor := staple.NewSimple("Supervisor")
	service := &Incrementor{0, make(chan int), make(chan bool)}
	supervisor.Add(service)

	ctx, cancel := context.WithCancel(context.Background())
	supervisor.ServeBackground(ctx)

	fmt.Println("Got:", <-service.next)
	fmt.Println("Got:", <-service.next)
	cancel()

	// We sync here just to guarantee the output of "Stopping the service"
	<-service.stop
}
Output:

Got: 0
Got: 1
Stopping the service

func NewSimple

func NewSimple(name string) *Supervisor

NewSimple is a convenience function to create a supervisor with just a name and the sensible defaults.

func (*Supervisor) Add

func (s *Supervisor) Add(service Service) ServiceToken

Add adds a service to this supervisor.

If the supervisor is currently running, the service will be started immediately. If the supervisor has not been started yet, the service will be started when the supervisor is. If the supervisor was already stopped, this is a no-op returning an empty service-token.

The returned ServiceToken may be passed to the Remove method of the Supervisor to terminate the service.

As a special behavior, if the service added is itself a supervisor, the supervisor being added will copy the EventHook function from the Supervisor it is being added to. This allows factoring out providing a Supervisor from its logging. This unconditionally overwrites the child Supervisor's logging functions.

func (*Supervisor) GetSupervisor

func (s *Supervisor) GetSupervisor() *Supervisor

func (*Supervisor) Remove

func (s *Supervisor) Remove(id ServiceToken) error

Remove will remove the given service from the Supervisor and attempt to stop it. The ServiceToken comes from the Add() call. This returns without waiting for the service to stop.

func (*Supervisor) RemoveAndWait

func (s *Supervisor) RemoveAndWait(id ServiceToken, timeout time.Duration) error

RemoveAndWait will remove the given service from the Supervisor and attempt to stop it. It will wait up to the given timeout value for the service to terminate. A timeout value of 0 means to wait forever.

If a nil error is returned from this function, then the service was terminated normally. If either the supervisor terminates or the timeout passes, ErrTimeout is returned. (If this isn't even the right supervisor, ErrWrongSupervisor is returned.)

func (*Supervisor) Serve

func (s *Supervisor) Serve(ctx context.Context) error

Serve starts the supervisor. You should call this on the top-level supervisor, but nothing else.

func (*Supervisor) ServeBackground

func (s *Supervisor) ServeBackground(ctx context.Context) <-chan error

ServeBackground starts running a supervisor in its own goroutine. When this method returns, the supervisor is guaranteed to be in a running state. The returned one-buffered channel receives the error returned by .Serve.

func (*Supervisor) Services

func (s *Supervisor) Services() []Service

Services returns a []Service containing a snapshot of the services this Supervisor is managing.

func (*Supervisor) String

func (s *Supervisor) String() string

String implements the fmt.Stringer interface.

func (*Supervisor) UnstoppedServiceReport

func (s *Supervisor) UnstoppedServiceReport() (UnstoppedServiceReport, error)

UnstoppedServiceReport will return a report of what services failed to stop when the supervisor was stopped. This call will return when the supervisor is done shutting down. It will hang on a supervisor that has not been stopped, because it will not be "done shutting down".

Calling this on a supervisor will return a report for the whole supervisor tree under it.

WARNING: Technically, any use of the returned data structure is a TOCTOU violation: https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use Since the data structure was generated and returned to you, any of these services may have stopped since then.

However, this can still be useful information at program teardown time. For instance, logging that a service failed to stop as expected is still useful, as even if it shuts down later, it was still later than you expected.

But if you cast the Service objects back to their underlying objects and start trying to manipulate them ("shut down harder!"), be sure to account for the possibility they are in fact shut down before you get them.

If there are no services to report, the UnstoppedServiceReport will be nil. A zero-length constructed slice is never returned.

type UnstoppedService

type UnstoppedService struct {
	SupervisorPath []*Supervisor
	Service        Service
	Name           string
	ServiceToken   ServiceToken
}

An UnstoppedService is the component member of an UnstoppedServiceReport.

The SupervisorPath is the path down the supervisor tree to the given service.

type UnstoppedServiceReport

type UnstoppedServiceReport []UnstoppedService

An UnstoppedServiceReport is returned by (*Supervisor).UnstoppedServiceReport, reporting which services failed to stop.
