streamstats

package module
v0.0.0-...-15a9614 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 10, 2017 License: MIT Imports: 4 Imported by: 0

README

streamstats

GoDoc Build Status Coverage Status

Streaming stats data structures and algorithms in golang that are O(1) time and space in the number of elements processed.

Moment-based Statistics

Single variable moments up to fourth order and first-order covariance use the methods of:

"Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments." Philippe P. Pébay, Technical Report SAND2008-6212, Sandia National Laboratories, September 2008.

which extend the results of:

"Note on a method for calculating corrected sums of squares and products". B. P. Welford (1962). Technometrics 4(3):419–420 (popularized by Donald Knuth in "The Art of Computer Programming")

to arbitrary moments and combinations of arbitrary sized populations allowing parallel aggregation. These moments are also extended to two dependent variables with a covariance Sxy

This also an includes exponentially-weighted moving average with damping factor, 0 < lambda < 1, using update formula m = (1-lambda)*m + lambda*x

Order Statistics

Quantiles and Histograms are based on the P2-algorithm:

"The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations." Raj Jain and Imrich Chlamtac, Communications of the ACM Volume 28 Issue 10, October 1985 Pages 1076-1085

Count Distinct

Count distinct is provided by an implementation of the HyperLogLog data structure based on:

"Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" Philippe Flajolet and Éric Fusy and Olivier Gandouet and et al. in AOFA ’07: PROCEEDINGS OF THE 2007 INTERNATIONAL CONFERENCE ON ANALYSIS OF ALGORITHMS

This implementation includes some of the HyperLogLog++ enhancements such as the 64-bit hash function which eliminates the large cardinality correction for hash collisions and an empirical bias correction for small cardinalities The implementation is space in-efficient since bits are used to store the counts which could be at most 60 < 2^6

An additional LinearCounting implementation that is backed by a BitVector is available as well. If the maximum possible cardinality is known, this structure uses only 12.5% of the memory as the HyperLogLog and runs much faster for both Add and Distinct. However, the data structure saturates at the maximum value while HyperLogLog can count to virtually unlimited cardinalities.

Set Membership

Approximate set membership is provided by a BloomFilter implementation based on:

"Space/time trade-offs in hash coding with allowable errors" Burton H. Bloom Communications of the ACM Volume 13 Issue 7, July 1970 Pages 422-426

the size m of the filter is rounded up to the nearest power of two for speed of addition and membership check, which could result in a larger filter depending on the cardinality and false positive target you supply. the k different hash functions are derived from top (h1) and bottom (h2) 32-bits of a 64-bit hash function using h[i] = h1 + i* h2 mod m for i in 0...m-1 based on

"Less hashing, same performance: Building a better Bloom filter" Adam Kirsch, Michael Mitzenmacher Random Structures & Algorithms Volume 33 Issue 2, September 2008 Pages 187-218

Benchmarks

Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
go version go1.7.3 linux/amd64
BenchmarkBloomFilterAdd-4              	20000000	        75.5 ns/op
BenchmarkBloomFilterCheck-4            	20000000	        70.0 ns/op
BenchmarkEWMAAdd-4                     	200000000	         8.28 ns/op
BenchmarkHyperLogLogP10Add-4           	30000000	        56.3 ns/op
BenchmarkHyperLogLogP10Distinct-4      	 1000000	      2178 ns/op
BenchmarkLinearCountingP10Add-4        	50000000	        36.1 ns/op
BenchmarkLinearCountingP10Distinct-4   	10000000	       175 ns/op
BenchmarkMomentStatsAdd-4              	100000000	        19.7 ns/op
BenchmarkP2Histogram8Add-4             	10000000	       165 ns/op
BenchmarkP2Histogram16Add-4            	 5000000	       349 ns/op
BenchmarkP2Histogram32Add-4            	 2000000	       702 ns/op
BenchmarkP2Histogram64Add-4            	 1000000	      1365 ns/op
BenchmarkP2Histogram128Add-4           	  500000	      2673 ns/op
BenchmarkP2QuantileAdd-4               	20000000	        66.4 ns/op

Documentation

Overview

Package streamstats includes data structures and algorithms that are O(1) time and space in the number of elements processed.

Moment-Based Statistics

Single variable moments up to fourth order and first-order covariance use the methods of:

"Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments." Philippe P. Pébay, Technical Report SAND2008-6212, Sandia National Laboratories, September 2008.

which extend the results of:

"Note on a method for calculating corrected sums of squares and products".

B. P. Welford (1962).
Technometrics 4(3):419–420

(popularized by Donald Knuth in "The Art of Computer Programming")

to arbitrary moments and combinations of arbitrary sized populations allowing parallel aggregation. These moments are also extended to two dependent variables with a covariance Sxy

This also an includes exponentially-weighted moving average with damping factor, 0 < *lambda* < 1, using update formula m = (1-lambda)*m + lambda*x

Order Statistics

Quantiles and Histograms are based on the P2-algorithm:

"The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations." Raj Jain and Imrich Chlamtac, Communications of the ACM Volume 28 Issue 10, October 1985 Pages 1076-1085

Count Distinct

Count distinct is provided by an implementation of the HyperLogLog data structure based on:

"Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" Philippe Flajolet and Éric Fusy and Olivier Gandouet and et al. in AOFA ’07: PROCEEDINGS OF THE 2007 INTERNATIONAL CONFERENCE ON ANALYSIS OF ALGORITHMS

This implementation includes some of the HyperLogLog++ enhancements such as the 64-bit hash function which eliminates the large cardinality correction for hash collisions and an empirical bias correction for small cardinalities The implementation is space in-efficient since bits are used to store the counts which could be at most 60 < 2^6

An additional LinearCounting implementation that is backed by a BitVector is available as well. If the maximum possible cardinality is known, this structure uses only 12.5% of the memory as the HyperLogLog and runs much faster for both `Add` and `Distinct`. However, the data structure saturates at the maximum value while HyperLogLog can count to virtually unlimited cardinalities.

Set Membership

Approximate set membership is provided by a BloomFilter implementation based on:

"Space/time trade-offs in hash coding with allowable errors" Burton H. Bloom Communications of the ACM Volume 13 Issue 7, July 1970 Pages 422-426

the size m of the filter is rounded up to the nearest power of two for speed of addition and membership check, which could result in a larger filter depending on the cardinality and false positive target you supply. the k different hash functions are derived from top (`h1`) and bottom (`h2`) 32-bits of a 64-bit hash function using h[i] = h1 + i* h2 mod m for i in 0...m-1 based on

"Less hashing, same performance: Building a better Bloom filter" Adam Kirsch, Michael Mitzenmacher Random Structures & Algorithms Volume 33 Issue 2, September 2008 Pages 187-218

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BitVector

type BitVector []uint64

BitVector represents an arbitrary length vector of bits backed by 64-bit words it is used as the data structure backing the Bloom Filter and Linear Counting implementations

func NewBitVector

func NewBitVector(L uint64) BitVector

NewBitVector returns a new BitVector of length L

func (BitVector) Clear

func (b BitVector) Clear(N uint64)

Clear clears the bit at position N for N >= L this will access memory out of bounds of the backing array and panic

func (BitVector) Get

func (b BitVector) Get(N uint64) uint64

Get returns the bit at position N as a uint64 for N >= L this will access memory out of bounds of the backing array and panic

func (BitVector) PopCount

func (b BitVector) PopCount() uint64

PopCount returns the nubmer of set bits in the bit vector the algorithm for PopCount on a single 64-bit word is from 1957 due to Donald B. Gillies and Jeffrey C. P. Miller and referenced by Donald Knuth

func (BitVector) Set

func (b BitVector) Set(N uint64)

Set sets the bit at position N for N >= L this will access memory out of bounds of the backing array and panic

func (BitVector) String

func (b BitVector) String() string

String outputs a string representation of the binary string with the first bit at the left note that any padding zeros are present on the right hand side

type BloomFilter

type BloomFilter struct {
	// contains filtered or unexported fields
}

BloomFilter is a datastructure for approximate set membership with no false negatives and limited false positives

func NewBloomFilter

func NewBloomFilter(Nitems uint64, FalsePositiveRate float64, hash hash.Hash64) *BloomFilter

NewBloomFilter returns a pointer to a new BloomFilter that has been sized in m with k hash functions to target the given false positive rate at the given number of items using the given hash function

func (*BloomFilter) Add

func (bf *BloomFilter) Add(item []byte)

Add puts an item in the set represented by the BloomFilter

func (BloomFilter) Check

func (bf BloomFilter) Check(item []byte) bool

Check returns false if an item in is definitely not in the set represented by the BloomFilter

func (BloomFilter) Distinct

func (bf BloomFilter) Distinct() uint64

Distinct estimates the number of elements in the filter by using the LinearCounting estimate accounting for the k hash functions calculated for each element

func (BloomFilter) FalsePositiveRate

func (bf BloomFilter) FalsePositiveRate() float64

FalsePositiveRate returns the expected false positive rate based on the ratio of filled buckets in the BloomFilter

func (BloomFilter) Intersect

func (bf BloomFilter) Intersect(bfB *BloomFilter) (*BloomFilter, error)

Intersect combines two BloomFilters producing one that contains only of the elements in both BloomFilters the BloomFilters must be the same size m and k as well as use the same hash function

func (BloomFilter) Occupancy

func (bf BloomFilter) Occupancy() float64

Occupancy returns the ratio of filled buckets in the BloomFilter

func (BloomFilter) Union

func (bf BloomFilter) Union(bfB *BloomFilter) (*BloomFilter, error)

Union combines two BloomFilters producing one that contains all of the elements in either BloomFilter the BloomFilters must be the same size m and k as well as use the same hash function

type BoxPlot

type BoxPlot struct {
	P2Quantile
}

BoxPlot represents a BoxPlot with interquartile range and whiskers backed by a P2-Quantile tracking the median, P=0.5

func NewBoxPlot

func NewBoxPlot() BoxPlot

NewBoxPlot returns a new BoxPlot

func (BoxPlot) InterQuartileRange

func (bp BoxPlot) InterQuartileRange() float64

InterQuartileRange returns the estimated interquartile range

func (BoxPlot) IsOutlier

func (bp BoxPlot) IsOutlier(x float64) bool

IsOutlier returns true if the data is outside the whiskers

func (BoxPlot) LowerQuartile

func (bp BoxPlot) LowerQuartile() float64

LowerQuartile returns the estimated lower quartile

func (BoxPlot) LowerWhisker

func (bp BoxPlot) LowerWhisker() float64

LowerWhisker returns the estimated lower whisker, Q1 - 1.5 * IQR

func (BoxPlot) Median

func (bp BoxPlot) Median() float64

Median returns the estimated median

func (BoxPlot) MidHinge

func (bp BoxPlot) MidHinge() float64

MidHinge returns the MidHinge of the data, average of upper and lower quantiles

func (BoxPlot) MidRange

func (bp BoxPlot) MidRange() float64

MidRange returns the MidRange of the data, average of max and min

func (BoxPlot) String

func (bp BoxPlot) String() string

func (BoxPlot) TriMean

func (bp BoxPlot) TriMean() float64

TriMean returns the TriMean of the data, average of Median and MidHinge

func (BoxPlot) UpperQuartile

func (bp BoxPlot) UpperQuartile() float64

UpperQuartile returns the estimated upper quartile

func (BoxPlot) UpperWhisker

func (bp BoxPlot) UpperWhisker() float64

UpperWhisker returns the estimated upper whisker, Q3 + 1.5 * IQR

type CovarStats

type CovarStats struct {
	// contains filtered or unexported fields
}

CovarStats is a data structure for computing stats on two related variables x,y from a stream

func NewCovarStats

func NewCovarStats() *CovarStats

NewCovarStats returns an empty CovarStats structure with no values

func (*CovarStats) Add

func (c *CovarStats) Add(x, y float64)

Add adds a sample of the two variables to the CovarStats data structure

func (*CovarStats) Combine

func (c *CovarStats) Combine(b *CovarStats) CovarStats

Combine returns the combination of two CovarStats datastructures

func (*CovarStats) Correlation

func (c *CovarStats) Correlation() float64

Correlation returns the Pearson product-moment correlation coefficient of the x and y samples seen so far

func (*CovarStats) Intercept

func (c *CovarStats) Intercept() float64

Intercept returns the intercept of the correlation between x and y samples seen so far

func (*CovarStats) N

func (c *CovarStats) N() uint64

N returns the number of samples seen so far

func (*CovarStats) Slope

func (c *CovarStats) Slope() float64

Slope returns the slope of the correlation between x and y samples seen so far

func (*CovarStats) XKurtosis

func (c *CovarStats) XKurtosis() float64

XKurtosis returns the kurtorsis of the x values seen so far

func (*CovarStats) XMean

func (c *CovarStats) XMean() float64

XMean returns the mean of the x values seen so far

func (*CovarStats) XSkewness

func (c *CovarStats) XSkewness() float64

XSkewness returns the skewness of the x values seen so far

func (*CovarStats) XStdDev

func (c *CovarStats) XStdDev() float64

XStdDev returns the standard deviation of the x values seen so far

func (*CovarStats) XVariance

func (c *CovarStats) XVariance() float64

XVariance returns the variance of the x values seen so far

func (*CovarStats) YKurtosis

func (c *CovarStats) YKurtosis() float64

YKurtosis returns the kurtosis of the y values seen so far

func (*CovarStats) YMean

func (c *CovarStats) YMean() float64

YMean returns the mean of the y values seen so far

func (*CovarStats) YSkewness

func (c *CovarStats) YSkewness() float64

YSkewness returns the skewness of the y values seen so far

func (*CovarStats) YStdDev

func (c *CovarStats) YStdDev() float64

YStdDev returns the standard deviation of the y values seen so far

func (*CovarStats) YVariance

func (c *CovarStats) YVariance() float64

YVariance returns the variance of the y values seen so far

type CumulativeDensity

type CumulativeDensity struct {
	X float64
	P float64
}

CumulativeDensity represents the probability P of observing a value less than or equal to X

type EWMA

type EWMA struct {
	// contains filtered or unexported fields
}

EWMA data structure for exponentially weighted moving average

func NewEWMA

func NewEWMA(initialValue float64, lambda float64) EWMA

NewEWMA initializes an EWMA with weighting lambda and given initial value

func (*EWMA) Add

func (e *EWMA) Add(x float64)

Add updates the average value with the stored weight

func (*EWMA) Mean

func (e *EWMA) Mean() float64

Mean returns the exponentially weighted average value

type HyperLogLog

type HyperLogLog struct {
	// contains filtered or unexported fields
}

HyperLogLog a data structure for computing count distinct on arbitrary sized data

func NewHyperLogLog

func NewHyperLogLog(p byte, hash hash.Hash64) *HyperLogLog

NewHyperLogLog returns a new HyperLogLog data structure with 2^p buckets based on Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm Philippe Flajolet and Éric Fusy and Olivier Gandouet and et al. IN AOFA ’07: PROCEEDINGS OF THE 2007 INTERNATIONAL CONFERENCE ON ANALYSIS OF ALGORITHMS This implementation does not include any of the HyperLogLog++ enhancments except for the 64-bit hash function which eliminates the large cardinality correction for hash collisions this is also space in-efficient since bytes are used to store the counts which could be at most 60 < 2^6

func (*HyperLogLog) Add

func (hll *HyperLogLog) Add(item []byte)

Add adds an item to the multiset represented by the HyperLogLog

func (*HyperLogLog) BiasCorrected

func (hll *HyperLogLog) BiasCorrected() uint64

BiasCorrected returns the bias corrected estimated number of distinct items in the multiset

func (*HyperLogLog) Compress

func (hll *HyperLogLog) Compress(factor byte) *HyperLogLog

Compress produces a new HyperLogLog with reduced size by 2^factor with reduced precision if new p < minimumHyperLogLogP, p=minimumHyperLogLogP , if factor=0 it just produces a copy

func (*HyperLogLog) Distinct

func (hll *HyperLogLog) Distinct() uint64

Distinct returns the estimated number of distinct items in the multiset

func (*HyperLogLog) ExpectedError

func (hll *HyperLogLog) ExpectedError() float64

ExpectedError returns the estimated error in the number of distinct items in the multiset

func (*HyperLogLog) Intersect

func (hll *HyperLogLog) Intersect(hllB *HyperLogLog) (*HyperLogLog, error)

Intersect the estimate of two HyperLogLog reducing the precision to the minimum of the two sets the function will return nil and an error if the hash functions mismatch Intersect will always overestimate the size of the intersection

func (*HyperLogLog) LinearCounting

func (hll *HyperLogLog) LinearCounting() uint64

LinearCounting returns the linear counting estimated number of distinct items in the multiset

func (*HyperLogLog) RawEstimate

func (hll *HyperLogLog) RawEstimate() uint64

RawEstimate returns the raw estimated number of distinct items in the multiset

func (*HyperLogLog) Reset

func (hll *HyperLogLog) Reset()

Reset zeros out the estimated number of distinct items in the multiset

func (*HyperLogLog) String

func (hll *HyperLogLog) String() string

func (*HyperLogLog) Union

func (hll *HyperLogLog) Union(hllB *HyperLogLog) (*HyperLogLog, error)

Union the estimate of two HyperLogLog reducing the precision to the minimum of the two sets the function will return nil and an error if the hash functions mismatch

type LinearCounting

type LinearCounting struct {
	// contains filtered or unexported fields
}

LinearCounting is a space efficient data structure for count distinct with hard upper bound

func NewLinearCounting

func NewLinearCounting(p byte, hash hash.Hash64) *LinearCounting

NewLinearCounting initializes a LinearCounting structure with size m=2^p and the given hash function

func (*LinearCounting) Add

func (lc *LinearCounting) Add(item []byte)

Add adds an item to the multiset represented by the LinearCounting structure

func (*LinearCounting) Compress

func (lc *LinearCounting) Compress(factor byte) *LinearCounting

Compress produces a new LinearCouting with reduced size by 2^factor with reduced precision if new p < minLinearCountingP, p=minLinearCountingP , if factor=0 it just produces a copy

func (LinearCounting) Distinct

func (lc LinearCounting) Distinct() uint64

Distinct returns the estimate of the number of distinct elements seen if the backing BitVector is full it returns m, the size of the BitVector

func (LinearCounting) ExpectedError

func (lc LinearCounting) ExpectedError() float64

ExpectedError returns the expected error at the current filling in the LinearCounting

func (*LinearCounting) Intersect

func (lc *LinearCounting) Intersect(lcB *LinearCounting) (*LinearCounting, error)

Intersect the estimate of two LinearCounting reducing the precision to the minimum of the two sets the function will return nil and an error if the hash functions mismatch

func (LinearCounting) Occupancy

func (lc LinearCounting) Occupancy() float64

Occupancy returns the ratio of filled buckets in the LinearCounting bitvector

func (LinearCounting) String

func (lc LinearCounting) String() string

func (*LinearCounting) Union

func (lc *LinearCounting) Union(lcB *LinearCounting) (*LinearCounting, error)

Union the estimate of two LinearCounting reducing the precision to the minimum of the two sets the function will return nil and an error if the hash functions mismatch

type MomentStats

type MomentStats struct {
	// contains filtered or unexported fields
}

MomentStats is a datastructure for computing the first four moments of a stream

func NewMomentStats

func NewMomentStats() *MomentStats

NewMomentStats returns an empty MomentStats structure with no values

func (*MomentStats) Add

func (m *MomentStats) Add(x float64)

Add updates the moment stats

func (*MomentStats) Combine

func (m *MomentStats) Combine(b *MomentStats) MomentStats

Combine combines the stats from two MomentStats structures

func (*MomentStats) Kurtosis

func (m *MomentStats) Kurtosis() float64

Kurtosis returns the excess kurtosis of the samples seen so far

func (*MomentStats) Mean

func (m *MomentStats) Mean() float64

Mean returns the mean of the observations seen so far

func (*MomentStats) N

func (m *MomentStats) N() uint64

N returns the observations stored so far

func (*MomentStats) Skewness

func (m *MomentStats) Skewness() float64

Skewness returns the skewness of the samples seen so far

func (*MomentStats) StdDev

func (m *MomentStats) StdDev() float64

StdDev returns the standard deviation of the samples seen so far

func (*MomentStats) String

func (m *MomentStats) String() string

String returns the standard string representation of the samples seen so far

func (*MomentStats) Variance

func (m *MomentStats) Variance() float64

Variance returns the variance of the observations seen so far

type P2Histogram

type P2Histogram struct {
	// contains filtered or unexported fields
}

P2Histogram is an O(1) time and space data structure for estimating the evenly spaced histogram bins of a series of N data points based on the "The P2 Algorithm for Dynamic Computing Calculation of Quantiles and Histograms Without Storing Observations" by RAJ JAIN and IIMRICH CHLAMTAC Communications of the ACM Volume 28 Issue 10, Oct. 1985 Pages 1076-1085

func NewP2Histogram

func NewP2Histogram(b uint64) P2Histogram

NewP2Histogram intializes the data structure to track b bins

func (*P2Histogram) Add

func (h *P2Histogram) Add(x float64)

Add updates the data structure with a given x value

func (*P2Histogram) CDF

func (h *P2Histogram) CDF(x float64) float64

CDF returns the linear approximation to the CDF at x based on the histogram data

func (*P2Histogram) Histogram

func (h *P2Histogram) Histogram() []CumulativeDensity

Histogram returns the histogram of observations seen so far

func (*P2Histogram) Max

func (h *P2Histogram) Max() float64

Max returns the maximum of observations seen so far

func (*P2Histogram) Min

func (h *P2Histogram) Min() float64

Min returns the minimum of observations seen so far

func (*P2Histogram) N

func (h *P2Histogram) N() uint64

N returns the number of observations seen so far

func (*P2Histogram) Quantile

func (h *P2Histogram) Quantile(p float64) float64

Quantile returns the linear approximation to the given quantile based on the histogram data

type P2Quantile

type P2Quantile struct {
	// contains filtered or unexported fields
}

P2Quantile is an O(1) time and space data structure for estimating the p-quantile of a series of N data points based on the "The P2 Algorithm for Dynamic Computing Calculation of Quantiles and Histograms Without Storing Observations" by RAJ JAIN and IIMRICH CHLAMTAC Communications of the ACM Volume 28 Issue 10, Oct. 1985 Pages 1076-1085

func NewP2Quantile

func NewP2Quantile(p float64) P2Quantile

NewP2Quantile intializes the data structure to track the p-quantile

func (*P2Quantile) Add

func (p *P2Quantile) Add(x float64)

Add updates the data structure with a given x value

func (*P2Quantile) LowerQuantile

func (p *P2Quantile) LowerQuantile() float64

LowerQuantile returns the estimate for the lower quantile, p/2

func (*P2Quantile) Max

func (p *P2Quantile) Max() float64

Max returns the exact maximum value seen so far

func (*P2Quantile) Min

func (p *P2Quantile) Min() float64

Min returns the exact minimum value seen so far

func (*P2Quantile) N

func (p *P2Quantile) N() uint64

N returns the number of observations seen so far

func (*P2Quantile) P

func (p *P2Quantile) P() float64

P returns the quantile being tracked

func (*P2Quantile) Quantile

func (p *P2Quantile) Quantile() float64

Quantile returns the estimated value for the p-quantile

func (*P2Quantile) UpperQuantile

func (p *P2Quantile) UpperQuantile() float64

UpperQuantile returns the estimate for the upper quantile, (1+p/2)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL