stats

package v0.0.0-...-f32f910

Published: Apr 2, 2022 License: MIT Imports: 3 Imported by: 4

README

Stats

Online computation of descriptive statistics

This package provides a struct for computing online descriptive statistics without saving all values in an array or to disk. Install the package:

$ go get github.com/bbengfort/x/stats

Usage is as follows:

stats := new(stats.Statistics)

for i := 0; i < 1000; i++ {
    stats.Update(rand.Float64())
}

mu := stats.Mean()
sigma := stats.StdDev()

Basically, as samples come in, you can pass them to the Update method, collecting summary statistics as you go. You can then dump the data out as a map (e.g. for JSON serialization) as follows:

data := stats.Serialize()
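
The serialized map can then be written out as JSON, for example (a minimal sketch mirroring the package example below; error handling omitted):

out, _ := json.MarshalIndent(data, "", "  ")
fmt.Println(string(out))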

NOTE: The Statistics object is thread-safe by virtue of a sync.RWMutex that locks and unlocks the data structure on every call.

Bulk Loading

It is possible to bulk-load the statistics object by passing multiple float64 values using variadic arguments:

stats.Update(1.2, 3.1, 4.2, 1.2)

Or by passing in a slice of float64 values:

var data []float64
stats.Update(data...)

This is much faster than loading values individually in a for loop as demonstrated by the following benchmarks:

BenchmarkStatistics_Update-8        20000000          79.2 ns/op
BenchmarkStatistics_Sequential-8          30      56017960 ns/op
BenchmarkStatistics_BulkLoad-8           500       2514950 ns/op

The first benchmark, BenchmarkStatistics_Update-8, measures the time it takes to update a single value into the statistics. The second, BenchmarkStatistics_Sequential-8, uses a for loop to Update one value at a time from a slice of 1,000,000 values. The third, BenchmarkStatistics_BulkLoad-8, passes the entire slice of 1,000,000 values directly to Update and as a result is roughly 22x faster.
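
For reference, the sequential and bulk-load cases could be benchmarked roughly as follows (a sketch, not the package's actual benchmark code; the 1,000,000-value slice is an assumption based on the description above):

// bench_test.go (sketch)
package stats

import (
	"math/rand"
	"testing"
)

func BenchmarkStatistics_Sequential(b *testing.B) {
	values := make([]float64, 1000000)
	for i := range values {
		values[i] = rand.Float64()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		s := new(Statistics)
		for _, v := range values {
			s.Update(v) // one locked call per value
		}
	}
}

func BenchmarkStatistics_BulkLoad(b *testing.B) {
	values := make([]float64, 1000000)
	for i := range values {
		values[i] = rand.Float64()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		s := new(Statistics)
		s.Update(values...) // a single locked, variadic call
	}
}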

Blocking vs. Non-Blocking

I received a surprising result when I tried to implement a non-blocking version of the Statistics struct using a buffered channel. A write-up of that can be found here: Online Distribution.

The benchmarks are as follows:

BenchmarkBlocking-8      20000000        81.1 ns/op
BenchmarkNonBlocking-8   10000000         140 ns/op

As such, the current implementation simply uses thread-safe locks rather than a channel.
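
For context, one way to build such a buffered-channel version is to feed a single updater goroutine (an illustrative sketch, not necessarily the code from the write-up):

type nonBlocking struct {
	values chan float64
	done   chan struct{}
	stats  Statistics
}

func newNonBlocking() *nonBlocking {
	nb := &nonBlocking{
		values: make(chan float64, 1024), // buffered so callers rarely block
		done:   make(chan struct{}),
	}
	go func() {
		for v := range nb.values {
			nb.stats.Update(v)
		}
		close(nb.done)
	}()
	return nb
}

func (nb *nonBlocking) Update(v float64) { nb.values <- v }

func (nb *nonBlocking) Close() {
	close(nb.values) // stop accepting values
	<-nb.done        // wait for the updater goroutine to drain the channel
}

Even with the buffer, the channel send and goroutine handoff cost more per sample than a mutex acquisition, which is consistent with the benchmark above.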

Benchmarks

This package also includes a specialized data structure, called Benchmark, for computing the online statistical distribution of time.Duration values. Similar to the Statistics object you can update it, but with time.Duration values, which are converted into float64 seconds using the time.Duration.Seconds() method. This reduces the granularity from int64 nanoseconds, but should still be accurate to roughly microsecond granularity.

The reason for the conversion is that computation with int64 quickly overflows, especially when computing the sum of squares. By converting to a float, the domain of the online distribution is similar to the domain of the Statistics object.
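
To illustrate why (back-of-the-envelope numbers, not from the package): squaring even modest durations expressed in int64 nanoseconds exceeds the int64 range, while float64 seconds stay comfortably representable.

d := 10 * time.Second
ns := d.Nanoseconds()    // 10,000,000,000
fmt.Println(ns * ns)     // 1e20 overflows int64 (max ~9.22e18) and wraps silently

secs := d.Seconds()      // 10.0
fmt.Println(secs * secs) // 100.0, safely representable as a float64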

Documentation

Overview

Package stats implements an online computation of summary statistics.

The primary idea of this package is that samples arrive in real time, and online computations of the shape of the distribution (the mean, variance, and range) need to be available on demand. Rather than keeping an array of values, online algorithms update the internal state of the descriptive statistics at runtime, saving memory.

To track statistics in an online fashion, you need to keep track of the various aggregates that are used to compute the final descriptive statistics of the distribution. For simple statistics such as the minimum, maximum, standard deviation, and mean you need to track the number of samples, the sum of samples, and the sum of the squares of all samples (along with the minimum and maximum values seen).
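
As a sketch of the arithmetic (the variable names are illustrative, not the package's internal fields), the descriptive statistics fall out of those aggregates directly:

mean := sum / float64(n)
variance := (sumsq - float64(n)*mean*mean) / float64(n-1) // sample variance
stddev := math.Sqrt(variance)
rng := maximum - minimum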

The primary entry point into this package is the Update method, to which you can pass sample values; all other methods simply compute summary values from the tracked aggregates.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Benchmark

type Benchmark struct {
	sync.RWMutex
	Statistics
	// contains filtered or unexported fields
}

Benchmark keeps track of a distribution of durations, e.g. to benchmark the performance or timing of an operation. It returns descriptive statistics as durations so that they can be read as timings. Benchmark works in an online fashion similar to the Statistics object, but works on time.Duration samples instead of floats. Instead of minimum and maximum values it returns the fastest and slowest times.

The primary entry point to the object is via the Update method, where one or more time.Durations can be passed. This object has unexported fields because it is thread-safe (via a sync.RWMutex). All properties must be accessed from read-locked access methods.

Example
stats := new(Benchmark)
samples, _ := loadBenchData()

for _, sample := range samples {
	stats.Update(sample)
}

data, _ := json.MarshalIndent(stats.Serialize(), "", "  ")
fmt.Println(string(data))
Output:

{
  "duration": "0s",
  "fastest": "41.219436ms",
  "mean": "120.993689ms",
  "range": "167.175236ms",
  "samples": 1000000,
  "slowest": "208.394672ms",
  "stddev": "17.283562ms",
  "throughput": 8.264893850648656,
  "timeouts": 0,
  "total": "33h36m33.689461785s",
  "variance": "298.721µs"
}

func (*Benchmark) Append

func (s *Benchmark) Append(o *Benchmark)

Append another benchmark object to the current benchmark object, incrementing the distribution from the other object.

func (*Benchmark) Fastest

func (s *Benchmark) Fastest() time.Duration

Fastest returns the minimum value of durations seen. If no durations have been added to the dataset, then this function returns a zero duration.

func (*Benchmark) Mean

func (s *Benchmark) Mean() time.Duration

Mean computes the average over all durations as float64 seconds and returns it as a time.Duration, which is expressed in int64 nanoseconds. This can mean some loss of precision in the mean value, but also allows the caller to convert the mean into varying timescales. Since microseconds is a pretty fine granularity for timings, truncating the fractional nanoseconds seems acceptable.

If no durations have been recorded, a zero valued duration is returned.
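
A plausible way to perform the conversion from the float64 seconds mean back to a time.Duration (illustrative, not necessarily the package's exact code):

meanSeconds := total / float64(n)                         // mean in float64 seconds
mean := time.Duration(meanSeconds * float64(time.Second)) // back to int64 nanoseconds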

func (*Benchmark) Range

func (s *Benchmark) Range() time.Duration

Range returns the difference between the slowest and fastest durations. If no samples have been added to the dataset, this function returns a zero duration. It will also return zero if the fastest and slowest durations are equal, e.g. when only one duration has been recorded or all durations have the same value.

func (*Benchmark) Serialize

func (s *Benchmark) Serialize() map[string]interface{}

Serialize returns a map of summary statistics. This map is useful for dumping statistics to disk (using JSON for example) or for reporting the statistics elsewhere. The values in the maps are string representations of the time.Duration objects, which are reported in a human readable form. They can be converted back to durations with time.ParseDuration.

TODO: Create Dump and Load functions to get statistical data to and from offline sources.

func (*Benchmark) SetDuration

func (s *Benchmark) SetDuration(duration time.Duration)

SetDuration allows an external setting of the duration. This is especially useful in the case where multiple threads are updating the benchmark and the internal measurement of total time might double count concurrent accesses. In fact, it is strongly recommended that this method be called by the external measurer after all updating is complete.

func (*Benchmark) Slowest

func (s *Benchmark) Slowest() time.Duration

Slowest returns the maximum value of durations seen. If no durations have been added to the dataset, then this function returns a zero duration.

func (*Benchmark) StdDev

func (s *Benchmark) StdDev() time.Duration

StdDev returns the standard deviation of samples, the square root of the variance. This function returns a time.Duration, which entails some loss of precision because the computation is performed on float64 seconds rather than int64 nanoseconds.

If one or fewer durations were recorded, a zero valued duration is returned.

func (*Benchmark) Throughput

func (s *Benchmark) Throughput() float64

Throughput returns the number of samples per second, measured as the inverse mean: number of samples divided by the total duration in seconds. The duration is computed in two ways:

  • if SetDuration is called, that duration is used
  • otherwise, the total observed duration in seconds is used

This metric does not express a duration, so a float64 value is returned instead. If the duration or number of accesses is zero, 0.0 is returned.
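
For instance, the Benchmark example above reports 1,000,000 samples and a total of 33h36m33.69s (about 120,994 seconds), giving a throughput of 1,000,000 / 120,994 ≈ 8.26 samples per second, matching the serialized output.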

func (*Benchmark) Timeouts

func (s *Benchmark) Timeouts() uint64

Timeouts returns the number of timeouts recorded across all samples.

func (*Benchmark) Total

func (s *Benchmark) Total() time.Duration

Total returns the total duration recorded across all samples.

func (*Benchmark) Update

func (s *Benchmark) Update(durations ...time.Duration)

Update the benchmark with a duration or durations (thread-safe). If a duration of 0 is passed, it is interpreted as a timeout, e.g. a maximal duration bound has been reached. Timeouts are recorded in a separate counter and can be used to express failure measures.
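
For example, a caller timing an operation against a deadline might record a timeout like this (a sketch; doRequest and bench are hypothetical names):

start := time.Now()
err := doRequest() // hypothetical operation being timed
if errors.Is(err, context.DeadlineExceeded) {
	bench.Update(0) // a zero duration records a timeout
} else {
	bench.Update(time.Since(start))
}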

func (*Benchmark) Variance

func (s *Benchmark) Variance() time.Duration

Variance computes the variability of samples and describes the distance of the distribution from the mean. This function returns a time.Duration, which can mean a loss of precision below the microsecond level. This is usually acceptable for most applications.

If one or fewer durations were recorded, a zero valued duration is returned.

type Statistics

type Statistics struct {
	sync.RWMutex
	// contains filtered or unexported fields
}

Statistics keeps track of descriptive statistics in an online fashion at runtime without saving each individual sample in an array. It does this by updating the internal state of summary aggregates including the number of samples seen, the sum of values, and the sum of the value squared. It also tracks the minimum and maximum values seen.

The primary entry point to the object is via the Update method, where one or more samples can be passed. This object has unexported fields because it is thread-safe (via a sync.RWMutex). All properties must be accessed from read-locked access methods.

Example
stats := new(Statistics)
samples, _ := loadTestData()

for _, sample := range samples {
	stats.Update(sample)
}

data, _ := json.MarshalIndent(stats.Serialize(), "", "  ")
fmt.Println(string(data))
Output:

{
  "maximum": 5.30507026071,
  "mean": 0.00041124313405184064,
  "minimum": -4.72206033824,
  "range": 10.02713059895,
  "samples": 1000000,
  "stddev": 0.9988808397330513,
  "total": 411.2431340518406,
  "variance": 0.9977629319858057
}

func (*Statistics) Append

func (s *Statistics) Append(o *Statistics)

Append another statistics object to the current statistics object, incrementing the distribution from the other object.
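
Conceptually, appending two online distributions amounts to combining their aggregates (a sketch with illustrative variable names, not the package's internal fields):

n := a.n + b.n
sum := a.sum + b.sum
sumsq := a.sumsq + b.sumsq
minimum := math.Min(a.minimum, b.minimum) // assuming both sides are non-empty
maximum := math.Max(a.maximum, b.maximum)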

func (*Statistics) Maximum

func (s *Statistics) Maximum() float64

Maximum returns the maximum value of samples seen. If no samples have been added to the dataset, then this function returns 0.0.

func (*Statistics) Mean

func (s *Statistics) Mean() float64

Mean returns the average for all samples, computed as the sum of values divided by the total number of samples seen. If no samples have been added then this function returns 0.0. Note that 0.0 is a valid mean and does not necessarily mean that no samples have been tracked.

func (*Statistics) Minimum

func (s *Statistics) Minimum() float64

Minimum returns the minimum value of samples seen. If no samples have been added to the dataset, then this function returns 0.0.

func (*Statistics) N

func (s *Statistics) N() uint64

N returns the number of samples observed.

func (*Statistics) Range

func (s *Statistics) Range() float64

Range returns the difference between the maximum and minimum of samples. If no samples have been added to the dataset, this function returns 0.0. This function will also return zero if the maximum value equals the minimum value, e.g. when only one sample has been added or all of the samples have the same value.

func (*Statistics) Serialize

func (s *Statistics) Serialize() map[string]float64

Serialize returns a map of summary statistics. This map is useful for dumping statistics to disk (using JSON for example) or for reporting the statistics elsewhere.

TODO: Create Dump and Load functions to get statistical data to and from offline sources.

func (*Statistics) StdDev

func (s *Statistics) StdDev() float64

StdDev returns the standard deviation of samples, the square root of the variance. Two or more values are required to compute the standard deviation; if one or no samples have been added to the dataset, this function returns 0.0.

func (*Statistics) Total

func (s *Statistics) Total() float64

Total returns the sum of the samples.

func (*Statistics) Update

func (s *Statistics) Update(samples ...float64)

Update the statistics with a sample or samples (thread-safe). Note that this object expects float64 values. While statistical computations for integer values are possible, it is simpler to convert the values to floats ahead of time.
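
For example, integer samples can be converted before updating (a minimal sketch):

counts := []int{4, 8, 15, 16, 23, 42}
samples := make([]float64, len(counts))
for i, c := range counts {
	samples[i] = float64(c)
}
stats.Update(samples...)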

func (*Statistics) Variance

func (s *Statistics) Variance() float64

Variance computes the variability of samples and describes the distance of the distribution from the mean. If one or no samples have been added to the dataset, this function returns 0.0 (two or more values are required to compute the variance).
