suture

v4.0.5+incompatible

This package is not in the latest version of its module.

Published: Mar 8, 2024 License: MIT Imports: 8 Imported by: 67

README

Suture

import "github.com/thejerf/suture/v4"

Suture provides Erlang-ish supervisor trees for Go. "Supervisor trees" -> "sutree" -> "suture" -> holds your code together when it's trying to die.

If you are reading this on pkg.go.dev, you should visit the v4 docs.

It is intended to deal gracefully with the real failure cases that can occur with supervision trees (such as burning all your CPU time endlessly restarting dead services), while also making no unnecessary demands on the "service" code, and providing hooks to perform adequate logging within a production environment.

A blog post describing the design decisions is available.

This module is fairly fully covered with godoc including an example, usage, and everything else you might expect from a README.md on GitHub. (DRY.)

v3 and before (which existed before go module support) documentation is also available.

A default slog-based logger is provided in github.com/thejerf/sutureslog. This is a separate Go module in order to avoid "infecting" the main suture/v4 with a new requirement to be on at least Go 1.21. Using this will require an additional go get github.com/thejerf/sutureslog.

Special Thanks

Special thanks to the Syncthing team, who have been fantastic about working with me to push fixes upstream of them.

Major Versions

v4 is a rewrite to make Suture function with contexts. If you are using suture for the first time, I recommend it. It also changes how logging works, to get a single function from the user that is presented with a defined set of structs, rather than requiring a number of closures from the consumer.

suture v3 is the latest version that does not feature contexts. It is still supported and getting backported fixes as of now.

Code Signing

Starting with the commit after ac7cf8591b, I will be signing this repository with the "jerf" keybase account. If you are viewing this repository through GitHub, you should see the commits as showing as "verified" in the commit view.

(Bear in mind that due to the nature of how git commit signing works, there may be runs of unverified commits; what matters is that the top one is signed.)

Aspiration

One of the big wins the Erlang community has with their pervasive OTP support is that it makes it easy for them to distribute libraries that easily fit into the OTP paradigm. It ought to someday be considered a good idea to distribute libraries that provide some sort of supervisor tree functionality out of the box. It is possible to provide this functionality without explicitly depending on the Suture library.

Changelog

suture uses semantic versioning and go modules.

  • 4.0.4 and 4.0.5:
    • Apparently there is no way to have this be its own module and live in this directory. Moved sutureslog into its own repo.
  • 4.0.3:
  • 4.0.2:
    • Add the ability to specify a handler for non-string panics to format them.
    • Fixed an issue where trying to close a currently-panicked service was having problems. (This may have leaked goroutines in other ways too.)
    • Merged a PR that addresses race conditions in the test suite. (These seem to have been isolated to the test suite and not have affected the core code.)
  • 4.0.1:
    • Add a channel returned from ServeBackground that can be used to examine any error coming out of the supervisor once it is stopped.
    • Tweak up the docs to try to make it more clear suture's special error returns are checked via errors.Is when possible, addressing issue #51.
  • 4.0:
    • Switched the entire API to be context based.
    • Switched how logging works to take a single closure that will be presented with a defined set of structs, rather than a set of closures for each event.
    • Consequently, "Stop" removed from the Service interface. A wrapper for old-style code is provided.
    • Services can now return errors. Errors will be included in the log message. Two special errors control restarting behavior:
      • ErrDoNotRestart indicates the service should not be restarted, but other services should be unaffected.
      • ErrTerminateTree indicates the parent service tree should be terminated. Supervisor trees can be configured to either continue terminating upwards, or terminate themselves but not continue propagating the termination upwards.
    • UnstoppedServiceReport calling semantics modified to allow correctly retrieving reports from entire trees. (Prior to 4.0, a report was only on the supervisor it was called on.)
  • 3.0.4:
    • Fix a problem with adding services to a stopped supervisor.
  • 3.0.3:
    • Implemented the request in issue #37, creating a new method StopWithReport on supervisors that reports which services failed to stop. It is a bit tricky to use (see the warning about TOCTOU issues in the godoc), but it can be useful at program tear-down time.
  • 3.0.2:
    • Fixed issue #35 caused by the 3.0.1 change to panic when calling .Stop on an unServe()d supervisor. It needs to correctly notice that .Stop has been called, and not start up instead, which is the contract of the Service interface.
  • 3.0.1:
    • Fixed issue #34: Calling supervisor.Stop() while something is trying to shut down a service could incorrectly report the service failed to shut down.
    • Calling ".Stop()" on an unstarted supervisor now panics. This is superior to its previous behavior, which is hanging forever. This is justified by the fact that the Supervisor can't provide its guarantees about how services are started and stopped if it is not itself started and stopped correctly. Further pushing me in this direction is that it's fairly easy to use the Supervisor correctly.
  • 3.0:
    • Added a default jitter of up to 50% on the restart intervals. While this is a backwards-compatible change from a source perspective, this does represent a non-trivial behavior change. It should generally be a good thing, but this is released as a major version as a warning.
  • 2.0.4
    • Added option PassThroughPanics, to allow panics to propagate up through the supervisor.
  • 2.0.3
    • Accepted PR #23, making the logging functions in the supervisor public.
    • Added a new Supervisor method RemoveAndWait, allowing you to make a best effort way to wait for a service to terminate.
    • Accepted PR #24, adding an optional IsCompletable interface that Services can implement that indicates they do not need to be restarted upon a normal return.
  • 2.0.2
    • Fixed issue #21. gccgo doesn't like case (<-c), with the parentheses. Of course the parens aren't doing anything useful anyhow. No behavior changes.
  • 2.0.1
    • Test code change only. Addresses the possibility that one of the tests can spuriously fail if they run in a certain order.
  • 2.0.0
    • Major version due to change to the signature of the logging methods:

      A race condition could occur when the Supervisor rendered the service name via fmt.Sprintf("%#v"), because fmt examines the entire object regardless of locks through reflection. 2.0.0 changes the supervisors to snapshot the Service's name once, when it is added, and to pass it to the logging methods.

    • Removal of use of sync/atomic due to possible brokenness in the Debian architecture.

  • 1.1.2
    • TravisCI showed that the fix for 1.1.1 induced a deadlock in Go 1.4 and before.
    • If the supervisor is terminated before a service, the service goroutine could be orphaned while trying to send the shutdown notification to the supervisor. This should no longer occur.
  • 1.1.1
    • Per #14, the fix in 1.1.0 did not actually wait for the Supervisor to stop.
  • 1.1.0
    • Per #12, Supervisor.stop now tries to wait for its children before returning. A careful reading of the original .Stop() contract says this is the correct behavior.
  • 1.0.1
    • Fixed data race on the .state variable.
  • 1.0.0
    • Initial release.

Documentation

Overview

Package suture provides Erlang-like supervisor trees.

This implements Erlang-esque supervisor trees, as adapted for Go. This is an industrial-strength, tested library deployed into hostile environments, not just a proof of concept or a toy.

If you are reading this, you are reading the documentation for the v3 version, which is not the latest. If you want the latest v4, be sure to be using github.com/thejerf/suture/v4. This rewrites the API to be in terms of contexts.

Supervisor Tree -> SuTree -> suture -> holds your code together when it's trying to fall apart.

Why use Suture?

  • You want to write bullet-resistant services that will remain available despite unforeseen failure.
  • You need the code to be smart enough not to consume 100% of the CPU restarting things.
  • You want to easily compose multiple such services in one program.
  • You want the Erlang programmers to stop lording their supervision trees over you.

Suture has 100% test coverage, and is golint clean. This doesn't prove it free of bugs, but it shows I care.

A blog post describing the design decisions is available at http://www.jerf.org/iri/post/2930 .

Using Suture

To idiomatically use Suture, create a Supervisor which is your top level "application" supervisor. This will often occur in your program's "main" function.

Create "Service"s, which implement the Service interface. .Add() them to your Supervisor. Supervisors are also services, so you can create a tree structure here, depending on the exact combination of restarts you want to create.

As a special case, when adding Supervisors to Supervisors, the "sub" supervisor will have the "super" supervisor's Log function copied. This allows you to set one log function on the "top" supervisor, and have it propagate down to all the sub-supervisors. This also allows libraries or modules to provide Supervisors without having to commit their users to a particular logging method.

Finally, as what is probably the last line of your main() function, call .Serve() on your top level supervisor. This will start all the services you've defined.

See the Example for an example, using a simple service that serves out incrementing integers.

Variables

var ErrTimeout = errors.New("waiting for service to stop has timed out")

ErrTimeout is returned when an attempt to RemoveAndWait for a service to stop has timed out.

var ErrWrongSupervisor = errors.New("wrong supervisor for this service token, no service removed")

ErrWrongSupervisor is returned by the (*Supervisor).Remove method if you pass a ServiceToken from the wrong Supervisor.

Types

type BackoffLogger

type BackoffLogger func(s *Supervisor, entering bool)

BackoffLogger is called when the supervisor enters or exits backoff mode.

type BadStopLogger

type BadStopLogger func(*Supervisor, Service, string)

BadStopLogger is called when a service fails to properly stop.

type DefaultJitter

type DefaultJitter struct {
	// contains filtered or unexported fields
}

DefaultJitter is the jitter function that is applied when spec.BackoffJitter is set to nil.

func (*DefaultJitter) Jitter

func (dj *DefaultJitter) Jitter(d time.Duration) time.Duration

Jitter will jitter the backoff time by uniformly distributing it into the range [FailureBackoff, 1.5 * FailureBackoff).

type FailureLogger

type FailureLogger func(
	supervisor *Supervisor,
	service Service,
	serviceName string,
	currentFailures float64,
	failureThreshold float64,
	restarting bool,
	error interface{},
	stacktrace []byte,
)

FailureLogger is called when a service fails.

type IsCompletable

type IsCompletable interface {
	Complete() bool
}

IsCompletable is an optionally-implementable interface that allows a service to report to a supervisor that it does not need to be restarted because it has terminated normally. When a Service is going to be restarted, the supervisor will check for this method, and if Complete returns true, the service is removed from the supervisor instead of restarted.

This is only executed when the service is not running because it has terminated, and has not yet been restarted.

type Jitter

type Jitter interface {
	Jitter(time.Duration) time.Duration
}

Jitter returns the sum of the input duration and a random jitter. It is compatible with the jitter functions in github.com/lthibault/jitterbug.

type NoJitter

type NoJitter struct{}

NoJitter does not apply any jitter to the input duration.

func (NoJitter) Jitter

func (NoJitter) Jitter(d time.Duration) time.Duration

Jitter leaves the input duration d unchanged.

type Service

type Service interface {
	Serve()
	Stop()
}

Service is the interface that describes a service to a Supervisor.

Serve Method

The Serve method is called by a Supervisor to start the service. The service should execute within the goroutine that this is called in. If this function either returns or panics, the Supervisor will call it again.

A Serve method SHOULD do as much cleanup of the state as possible, to prevent any corruption in the previous state from crashing the service again.

Stop Method

This method is used by the supervisor to stop the service. Calling this directly on a Service given to a Supervisor will simply result in the Service being restarted; use the Supervisor's .Remove(ServiceToken) method to stop a service. A supervisor will call .Stop() only once. Thus, it may be as destructive as it likes to get the service to stop.

Once Stop has been called on a Service, the Service SHOULD NOT be reused in any other supervisor! Because of the impossibility of guaranteeing that the service has actually stopped in Go, you can't prove that you won't be starting two goroutines using the exact same memory to store state, causing completely unpredictable behavior.

Stop should not return until the service has actually stopped. "Stopped" here is defined as "the service will stop servicing any further requests in the future". For instance, a common implementation is to receive a message on a dedicated "stop" channel and immediately return. Once the stop command has been processed, the service is stopped.

Another common Stop implementation is to forcibly close an open socket or other resource, which will cause detectable errors to manifest in the service code. Bear in mind that to perfectly correctly use this approach requires a bit more work to handle the chance of a Stop command coming in before the resource has been created.

If a service does not Stop within the supervisor's timeout duration, a log entry will be made with a descriptive string to that effect. This does not guarantee that the service is hung; it may still get around to being properly stopped in the future. Until the service is fully stopped, both the service and the spawned goroutine trying to stop it will be "leaked".

Stringer Interface

When a Service is added to a Supervisor, the Supervisor will create a string representation of that service used for logging.

If you implement the fmt.Stringer interface, that will be used.

If you do not implement the fmt.Stringer interface, a default fmt.Sprintf("%#v") will be used.

Optional Interface

Services may optionally implement IsCompletable, which allows a service to indicate to a supervisor that it does not need to be restarted if it has terminated.

type ServiceToken

type ServiceToken struct {
	// contains filtered or unexported fields
}

ServiceToken is an opaque identifier that can be used to terminate a service that has been Add()ed to a Supervisor.

type Spec

type Spec struct {
	Log               func(string)
	FailureDecay      float64
	FailureThreshold  float64
	FailureBackoff    time.Duration
	BackoffJitter     Jitter
	Timeout           time.Duration
	LogBadStop        BadStopLogger
	LogFailure        FailureLogger
	LogBackoff        BackoffLogger
	PassThroughPanics bool
}

Spec is used to pass arguments to the New function to create a supervisor. See the New function for full documentation.

type Supervisor

type Supervisor struct {
	Name string

	LogBadStop BadStopLogger
	LogFailure FailureLogger
	LogBackoff BackoffLogger

	sync.Mutex
	// contains filtered or unexported fields
}

Supervisor is the core type of the module that represents a Supervisor.

Supervisors should be constructed either by New or NewSimple.

Once constructed, a Supervisor should be started in one of three ways:

  1. Calling .Serve().
  2. Calling .ServeBackground().
  3. Adding it to an existing Supervisor.

Calling Serve will cause the supervisor to run until it is shut down by an external user calling Stop() on it. If that never happens, it simply runs forever. I suggest creating your services in Supervisors, then making a Serve() call on your top-level Supervisor be the last line of your main func.

Calling ServeBackground will CORRECTLY start the supervisor running in a new goroutine. You do not want to just:

go supervisor.Serve()

because that will briefly create a race condition as it starts up, if you try to .Add() services immediately afterward.

The various Log functions should only be modified while the Supervisor is not running, to prevent race conditions.

func New

func New(name string, spec Spec) (s *Supervisor)

New is the full constructor function for a supervisor.

The name is a friendly human name for the supervisor, used in logging. Suture does not care if this is unique, but it is good for your sanity if it is.

If not set, the following values are used:

  • Log: A function is created that uses log.Print.
  • FailureDecay: 30 seconds
  • FailureThreshold: 5 failures
  • FailureBackoff: 15 seconds
  • Timeout: 10 seconds
  • BackoffJitter: DefaultJitter

The Log function will be called when errors occur. Suture will log the following:

  • When a service has failed, with a descriptive message about the current backoff status, and whether it was immediately restarted
  • When the supervisor has gone into its backoff mode, and when it exits it
  • When a service fails to stop

FailureDecay, FailureThreshold, and FailureBackoff control how failures are handled, in order to avoid the supervisor failure case where the program does nothing but restart failed services. If you do not care how failures behave, the default values should be fine for the vast majority of services, but if you want the details:

The supervisor tracks the number of failures that have occurred, with an exponential decay on the count. Every FailureDecay seconds, the number of failures that have occurred is cut in half. (This is done smoothly with an exponential function.) When a failure occurs, the number of failures is incremented by one. When the number of failures passes the FailureThreshold, the entire service waits for FailureBackoff seconds before attempting any further restarts, at which point it resets its failure count to zero.

Timeout is how long Suture will wait for a service to properly terminate.

The PassThroughPanics option can be set to let panics in services propagate and crash the program, should this be desirable.

Example (Simple)
package main

import (
	"fmt"

	// Import path for the version documented on this page.
	"github.com/thejerf/suture"
)

// Incrementor is a simple service that serves out incrementing integers.
type Incrementor struct {
	current int
	next    chan int
	stop    chan struct{}
}

// Stop signals Serve to return by closing the stop channel.
func (i *Incrementor) Stop() {
	fmt.Println("Stopping the service")
	close(i.stop)
}

// Serve hands out the next integer until the service is stopped.
func (i *Incrementor) Serve() {
	for {
		select {
		case i.next <- i.current:
			i.current++
		case <-i.stop:
			return
		}
	}
}

func main() {
	supervisor := suture.NewSimple("Supervisor")
	service := &Incrementor{0, make(chan int), make(chan struct{})}
	supervisor.Add(service)

	supervisor.ServeBackground()

	fmt.Println("Got:", <-service.next)
	fmt.Println("Got:", <-service.next)
	supervisor.Stop()

	// We sync here just to guarantee the output of "Stopping the service"
	<-service.stop
}
Output:

Got: 0
Got: 1
Stopping the service

func NewSimple

func NewSimple(name string) *Supervisor

NewSimple is a convenience function to create a supervisor with just a name and the sensible defaults.

func (*Supervisor) Add

func (s *Supervisor) Add(service Service) ServiceToken

Add adds a service to this supervisor.

If the supervisor is currently running, the service will be started immediately. If the supervisor is not currently running, the service will be started when the supervisor is. If the supervisor was already stopped, this is a no-op returning an empty ServiceToken.

The returned ServiceToken may be passed to the Remove method of the Supervisor to terminate the service.

As a special behavior, if the service added is itself a supervisor, the supervisor being added will copy the Log function from the Supervisor it is being added to. This allows factoring out providing a Supervisor from its logging. This unconditionally overwrites the child Supervisor's logging functions.

func (*Supervisor) Remove

func (s *Supervisor) Remove(id ServiceToken) error

Remove will remove the given service from the Supervisor, and attempt to Stop() it. The ServiceToken comes from the Add() call. This returns without waiting for the service to stop.

func (*Supervisor) RemoveAndWait

func (s *Supervisor) RemoveAndWait(id ServiceToken, timeout time.Duration) error

RemoveAndWait will remove the given service from the Supervisor and attempt to Stop() it. It will wait up to the given timeout value for the service to terminate. A timeout value of 0 means to wait forever.

If a nil error is returned from this function, then the service was terminated normally. If either the supervisor terminates or the timeout passes, ErrTimeout is returned. (If this isn't even the right supervisor ErrWrongSupervisor is returned.)

func (*Supervisor) Serve

func (s *Supervisor) Serve()

Serve starts the supervisor. You should call this on the top-level supervisor, but nothing else.

func (*Supervisor) ServeBackground

func (s *Supervisor) ServeBackground()

ServeBackground starts running a supervisor in its own goroutine. When this method returns, the supervisor is guaranteed to be in a running state.

func (*Supervisor) Services

func (s *Supervisor) Services() []Service

Services returns a []Service containing a snapshot of the services this Supervisor is managing.

func (*Supervisor) Stop

func (s *Supervisor) Stop()

Stop stops the Supervisor.

This function will not return until either all Services have stopped, or they time out after the timeout value given to the Supervisor at creation.

func (*Supervisor) StopWithReport

func (s *Supervisor) StopWithReport() UnstoppedServiceReport

StopWithReport will stop the supervisor like calling Stop, but will also return a struct reporting what services failed to stop. This fully encompasses calling Stop, so do not call Stop and StopWithReport any more than you should call Stop twice.

WARNING: Technically, any use of the returned data structure is a TOCTOU violation (https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use). Since the data structure was generated and returned to you, any of these services may have stopped.

However, this can still be useful information at program teardown time. For instance, logging that a service failed to stop as expected is still useful, as even if it shuts down later, it was still later than you expected.

But if you cast the Service objects back to their underlying objects and start trying to manipulate them ("shut down harder!"), be sure to account for the possibility they are in fact shut down before you get them.

If there are no services to report, the UnstoppedServiceReport will be nil. A zero-length constructed slice is never returned.

Calling this on an already-stopped supervisor is invalid, but will safely return nil anyhow.

func (*Supervisor) String

func (s *Supervisor) String() string

String implements the fmt.Stringer interface.

type UnstoppedService

type UnstoppedService struct {
	Service      Service
	Name         string
	ServiceToken ServiceToken
}

type UnstoppedServiceReport

type UnstoppedServiceReport []UnstoppedService

An UnstoppedServiceReport will be returned by StopWithReport, reporting which services failed to stop.
