tdigest

package module
v2.1.0+incompatible
Published: Oct 8, 2017 License: MIT Imports: 6 Imported by: 9

README

tdigest


This is a Go implementation of Ted Dunning's t-digest, which is a clever data structure/algorithm for computing approximate quantiles of a stream of data.

You should use this if you want to efficiently compute extreme rank statistics of a large stream of data, like the 99.9th percentile.

Usage

An example showing the API is available in the GoDoc:

func ExampleTDigest() {
	rand.Seed(5678)
	values := make(chan float64)

	// Generate 100k uniform random data between 0 and 100
	var (
		n        int     = 100000
		min, max float64 = 0, 100
	)
	go func() {
		for i := 0; i < n; i++ {
			values <- min + rand.Float64()*(max-min)
		}
		close(values)
	}()

	// Pass the values through a TDigest.
	td := New()

	for val := range values {
		// Add the value with weight 1
		td.Add(val, 1)
	}

	// Print the 50th, 90th, 99th, 99.9th, and 99.99th percentiles
	fmt.Printf("50th: %.5f\n", td.Quantile(0.5))
	fmt.Printf("90th: %.5f\n", td.Quantile(0.9))
	fmt.Printf("99th: %.5f\n", td.Quantile(0.99))
	fmt.Printf("99.9th: %.5f\n", td.Quantile(0.999))
	fmt.Printf("99.99th: %.5f\n", td.Quantile(0.9999))
	// Output:
	// 50th: 48.74854
	// 90th: 89.79825
	// 99th: 98.92954
	// 99.9th: 99.90189
	// 99.99th: 99.98740
}

Algorithm

In the real world, the stream of data might be service timings, measuring how long a server takes to respond to clients. You can feed this stream through a t-digest and get out approximations of any quantile you like: the 50th, 95th, 99th, 99.99th, or 28.31st percentiles are all computable.

Exact quantiles would require holding all the data in memory, but the t-digest holds only a small fraction of it - often just a few kilobytes to represent many millions of datapoints. Measurements of the compression ratio show that compression improves super-linearly as more datapoints are fed into the t-digest.

How good are the approximations? Well, it depends, but they tend to be quite good, especially out towards extreme percentiles like the 99th or 99.9th; Ted Dunning found errors of just a few parts per million at the 99.9th and 0.1th percentiles.

Error will be largest in the middle - the median is the least accurate point in the t-digest.

The actual precision can be controlled with the compression parameter passed to the constructor function NewWithCompression in this package. Lower compression parameters will result in poorer compression, but will improve performance when estimating quantiles. If you care deeply about tuning such things, experiment with the compression parameter.

Benchmarks

Data compresses well, with compression ratios of around 20 for small datasets (1k datapoints) and 500 for largeish ones (1M datapoints). The precise compression ratio depends a bit on your data's distribution - exponential data does well, while ordered data does poorly:

compression benchmark

In general, adding a datapoint takes about 1 to 4 microseconds on my 2014 Macbook Pro. This is fast enough for many purposes, but if you have any concerns, you should run the benchmarks on your target system. You can do that with go test -bench . ./...

Quantiles are very, very quick to calculate, and typically take tens of nanoseconds. They might take up to a few hundred nanoseconds for large, poorly compressed (read: ordered) datasets, but in general, you don't have to worry about the speed of calls to Quantile.

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type TDigest

type TDigest struct {
	// contains filtered or unexported fields
}

A TDigest is an efficient data structure for computing streaming approximate quantiles of a dataset.


func New

func New() *TDigest

New produces a new TDigest using the default compression level of 100.

func NewWithCompression

func NewWithCompression(compression float64) *TDigest

NewWithCompression produces a new TDigest with a specific compression level. The input compression value, which should be >= 1.0, will control how aggressively the TDigest compresses data together.

The original t-digest paper suggests using a value of 100 for a good balance between precision and efficiency. That yields very small errors (on the order of 1e-6 percentile points) at extreme points in the distribution, and compression ratios of around 500 for large datasets (around 1 million datapoints).

func (*TDigest) Add

func (d *TDigest) Add(val float64, weight int)

Add will add a value to the TDigest, updating all quantiles. A weight can be specified; use a weight of 1 if you don't care about weighting your dataset.

Add will ignore input values of NaN or Inf.

func (*TDigest) MarshalBinary

func (d *TDigest) MarshalBinary() ([]byte, error)

MarshalBinary serializes d as a sequence of bytes, suitable to be deserialized later with UnmarshalBinary.

func (*TDigest) MergeInto

func (d *TDigest) MergeInto(other *TDigest)

MergeInto(other) will add all of the data within a TDigest into other, combining them into one larger TDigest.

func (*TDigest) Quantile

func (d *TDigest) Quantile(q float64) float64

Quantile(q) will estimate the qth quantile value of the dataset. The input value of q should be in the range [0.0, 1.0]; if it is outside that range, it will be clipped into it automatically.

Calling Quantile on a TDigest with no data will return NaN.

func (*TDigest) UnmarshalBinary

func (d *TDigest) UnmarshalBinary(p []byte) error

UnmarshalBinary populates d with the parsed contents of p, which should have been created with a call to MarshalBinary.
