perf: golang.org/x/perf/internal/stats

package stats

import "golang.org/x/perf/internal/stats"

Package stats implements several statistical distributions, hypothesis tests, and functions for descriptive statistics.

Currently stats is fairly small, but for what it does implement, it focuses on high quality, fast implementations with good, idiomatic Go APIs.

This is a trimmed fork of github.com/aclements/go-moremath/stats.

Index

Package Files

alg.go beta.go deltadist.go dist.go mathx.go normaldist.go package.go sample.go tdist.go ttest.go udist.go utest.go

Variables

var (
    ErrSampleSize        = errors.New("sample is too small")
    ErrZeroVariance      = errors.New("sample has zero variance")
    ErrMismatchedSamples = errors.New("samples have different lengths")
)
var (
    ErrSamplesEqual = errors.New("all samples are equal")
)
var MannWhitneyExactLimit = 50

MannWhitneyExactLimit gives the largest sample size for which the exact U distribution will be used for the Mann-Whitney U-test.

Using the exact distribution is necessary for small sample sizes because the distribution is highly irregular. However, computing the distribution for large sample sizes is both computationally expensive and unnecessary because it quickly approaches a normal approximation. Computing the distribution for two 50-value samples takes a few milliseconds on a 2014 laptop.

var MannWhitneyTiesExactLimit = 25

MannWhitneyTiesExactLimit gives the largest sample size for which the exact U distribution will be used for the Mann-Whitney U-test in the presence of ties.

Computing this distribution is more expensive than computing the distribution without ties, so this is set lower. Computing this distribution for two 25-value samples takes about ten milliseconds on a 2014 laptop.

var StdNormal = NormalDist{0, 1}

StdNormal is the standard normal distribution (Mu = 0, Sigma = 1).

func Bounds

func Bounds(xs []float64) (min float64, max float64)

Bounds returns the minimum and maximum values of xs.

func GeoMean

func GeoMean(xs []float64) float64

GeoMean returns the geometric mean of xs. xs must be positive.

func InvCDF

func InvCDF(dist DistCommon) func(y float64) (x float64)

InvCDF returns the inverse CDF function of the given distribution (also known as the quantile function or the percent point function). This is a function f such that f(dist.CDF(x)) == x. If dist.CDF is only weakly monotonic (that is, there are intervals over which it is constant) and y > 0, f returns the smallest x that satisfies this condition. In general, the inverse CDF is not well-defined for y==0, but for convenience if y==0, f returns the largest x that satisfies this condition. For distributions with infinite support both the largest and smallest x are -Inf; however, for distributions with finite support, this is the lower bound of the support.

If y < 0 or y > 1, f returns NaN.

If dist implements InvCDF(float64) float64, this returns that method. Otherwise, it returns a function that uses a generic numerical method to construct the inverse CDF at y by finding x such that dist.CDF(x) == y. This may have poor precision around points of discontinuity, including f(0) and f(1).

func Mean

func Mean(xs []float64) float64

Mean returns the arithmetic mean of xs.

func Rand

func Rand(dist DistCommon) func(*rand.Rand) float64

Rand returns a random number generator that draws from the given distribution. The returned generator takes an optional source of randomness; if this is nil, it uses the default global source.

If dist implements Rand(*rand.Rand) float64, Rand returns that method. Otherwise, it returns a generic generator based on dist's inverse CDF (which may in turn use an efficient implementation or a generic numerical implementation; see InvCDF).

func StdDev

func StdDev(xs []float64) float64

StdDev returns the sample standard deviation of xs.

func Variance

func Variance(xs []float64) float64

Variance returns the sample variance of xs.

type DeltaDist

type DeltaDist struct {
    T float64
}

DeltaDist is the Dirac delta function, centered at T, with total area 1.

The CDF of the Dirac delta function is the Heaviside step function, centered at T. Specifically, CDF(T) == 1.

func (DeltaDist) Bounds

func (d DeltaDist) Bounds() (float64, float64)

func (DeltaDist) CDF

func (d DeltaDist) CDF(x float64) float64

func (DeltaDist) InvCDF

func (d DeltaDist) InvCDF(y float64) float64

func (DeltaDist) PDF

func (d DeltaDist) PDF(x float64) float64

type DiscreteDist

type DiscreteDist interface {
    DistCommon

    // PMF returns the value of the probability mass function
    // Pr[X = x'], where x' is x rounded down to the nearest
    // defined point on the distribution.
    //
    // Note for implementers: for integer-valued distributions,
    // round x using int(math.Floor(x)). Do not use int(x), since
    // that truncates toward zero (unless all x <= 0 are handled
    // the same).
    PMF(x float64) float64

    // Step returns s, where the distribution is defined for sℕ.
    Step() float64
}

A DiscreteDist is a discrete statistical distribution.

Most discrete distributions are defined only at integral values of the random variable. However, some are defined at other intervals, so this interface takes a float64 value for the random variable. The probability mass function rounds down to the nearest defined point. Note that float64 values can exactly represent integer values between ±2**53, so this generally shouldn't be an issue for integer-valued distributions (likewise, for half-integer-valued distributions, float64 can exactly represent all values between ±2**52).
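
The rounding note for implementers matters only for negative, non-integral x, where flooring and truncation disagree. A small sketch (the function names are hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// floorPoint rounds x down to the nearest integer, as the interface
// comment recommends for integer-valued distributions.
func floorPoint(x float64) int { return int(math.Floor(x)) }

// truncPoint converts with int(x), which truncates toward zero.
func truncPoint(x float64) int { return int(x) }

func main() {
	// 2.7  -> floor 2,  trunc 2
	// -0.5 -> floor -1, trunc 0
	// -2.3 -> floor -3, trunc -2
	for _, x := range []float64{2.7, -0.5, -2.3} {
		fmt.Println(x, floorPoint(x), truncPoint(x))
	}
}
```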

type Dist

type Dist interface {
    DistCommon

    // PDF returns the value of the probability density function
    // of this distribution at x.
    PDF(x float64) float64
}

A Dist is a continuous statistical distribution.

type DistCommon

type DistCommon interface {
    // CDF returns the cumulative probability Pr[X <= x].
    //
    // For continuous distributions, the CDF is the integral of
    // the PDF from -inf to x.
    //
    // For discrete distributions, the CDF is the sum of the PMF
    // at all defined points from -inf to x, inclusive. Note that
    // the CDF of a discrete distribution is defined for the whole
    // real line (unlike the PMF) but has discontinuities where
    // the PMF is non-zero.
    //
    // The CDF is a monotonically increasing function and has a
    // domain of all real numbers. If the distribution has bounded
    // support, it has a range of [0, 1]; otherwise it has a range
    // of (0, 1). Finally, CDF(-inf)==0 and CDF(inf)==1.
    CDF(x float64) float64

    // Bounds returns reasonable bounds for this distribution's
    // PDF/PMF and CDF. The total weight outside of these bounds
    // should be approximately 0.
    //
    // For a discrete distribution, both bounds are integer
    // multiples of Step().
    //
    // If this distribution has finite support, it returns exact
    // bounds l, h such that CDF(l')=0 for all l' < l and
    // CDF(h')=1 for all h' >= h.
    Bounds() (float64, float64)
}

A DistCommon is a statistical distribution. DistCommon is a base interface provided by both continuous and discrete distributions.

type LocationHypothesis

type LocationHypothesis int

A LocationHypothesis specifies the alternative hypothesis of a location test such as a t-test or a Mann-Whitney U-test. The default (zero) value is to test against the alternative hypothesis that they differ.

const (
    // LocationLess specifies the alternative hypothesis that the
    // location of the first sample is less than the second. This
    // is a one-tailed test.
    LocationLess LocationHypothesis = -1

    // LocationDiffers specifies the alternative hypothesis that
    // the locations of the two samples are not equal. This is a
    // two-tailed test.
    LocationDiffers LocationHypothesis = 0

    // LocationGreater specifies the alternative hypothesis that
    // the location of the first sample is greater than the
    // second. This is a one-tailed test.
    LocationGreater LocationHypothesis = 1
)

type MannWhitneyUTestResult

type MannWhitneyUTestResult struct {
    // N1 and N2 are the sizes of the input samples.
    N1, N2 int

    // U is the value of the Mann-Whitney U statistic for this
    // test, generalized by counting ties as 0.5.
    //
    // Given the Cartesian product of the two samples, this is the
    // number of pairs in which the value from the first sample is
    // greater than the value of the second, plus 0.5 times the
    // number of pairs where the values from the two samples are
    // equal. Hence, U is always an integer multiple of 0.5 (it is
    // a whole integer if there are no ties) in the range [0, N1*N2].
    //
    // U statistics always come in pairs, depending on which
    // sample is "first". The mirror U for the other sample can be
    // calculated as N1*N2 - U.
    //
    // There are many equivalent statistics with slightly
    // different definitions. The Wilcoxon (1945) W statistic
    // (generalized for ties) is U + (N1(N1+1))/2. It is also
    // common to use 2U to eliminate the half steps and Smid
    // (1956) uses N1*N2 - 2U to additionally center the
    // distribution.
    U   float64

    // AltHypothesis specifies the alternative hypothesis tested
    // by this test against the null hypothesis that there is no
    // difference in the locations of the samples.
    AltHypothesis LocationHypothesis

    // P is the p-value of the Mann-Whitney test for the given
    // null hypothesis.
    P   float64
}

A MannWhitneyUTestResult is the result of a Mann-Whitney U-test.
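
The definition of U given in the struct comment can be written out directly as a count over all pairs. This brute-force sketch takes O(N1*N2) time and is meant only to make the definition concrete; practical implementations usually work from ranks instead. The name uStatistic is hypothetical.

```go
package main

import "fmt"

// uStatistic counts pairs (a, b) with a from x1 and b from x2 where
// a > b, plus 0.5 for every tied pair.
func uStatistic(x1, x2 []float64) float64 {
	u := 0.0
	for _, a := range x1 {
		for _, b := range x2 {
			switch {
			case a > b:
				u += 1
			case a == b:
				u += 0.5
			}
		}
	}
	return u
}

func main() {
	x1 := []float64{1, 2, 3}
	x2 := []float64{2, 4}
	u := uStatistic(x1, x2)
	mirror := uStatistic(x2, x1)
	// The two statistics always sum to N1*N2, here 6.
	fmt.Println(u, mirror, u+mirror)
}
```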

func MannWhitneyUTest

func MannWhitneyUTest(x1, x2 []float64, alt LocationHypothesis) (*MannWhitneyUTestResult, error)

MannWhitneyUTest performs a Mann-Whitney U-test [1,2] of the null hypothesis that two samples come from the same population against the alternative hypothesis that one sample tends to have larger or smaller values than the other.

This is similar to a t-test, but unlike the t-test, the Mann-Whitney U-test is non-parametric (it does not assume a normal distribution). It has very slightly lower efficiency than the t-test on normal distributions.

Computing the exact U distribution is expensive for large sample sizes, so this uses a normal approximation for sample sizes larger than MannWhitneyExactLimit if there are no ties or MannWhitneyTiesExactLimit if there are ties. This normal approximation uses both the tie correction and the continuity correction.
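
For orientation, the textbook normal approximation with tie and continuity corrections looks like the sketch below. Under the null hypothesis U has mean N1*N2/2 and variance (N1*N2/12) * ((n+1) - Σ(t³-t)/(n(n-1))), where n = N1+N2 and t ranges over the tie counts. Whether the package applies the corrections in exactly this form is an assumption; the function name and the two-tailed-only result are choices made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// mannWhitneyNormalP returns an approximate two-tailed p-value for the
// U statistic u from samples of sizes n1 and n2. ties[i] is the number
// of observations sharing the i-th distinct value; nil means no ties.
// If every observation is tied the variance is zero and the result is
// NaN.
func mannWhitneyNormalP(u float64, n1, n2 int, ties []int) float64 {
	n := float64(n1 + n2)
	prod := float64(n1) * float64(n2)
	mean := prod / 2

	// Tie correction: each group of t tied values removes t^3 - t.
	tieSum := 0.0
	for _, t := range ties {
		ft := float64(t)
		tieSum += ft*ft*ft - ft
	}
	sigma := math.Sqrt(prod / 12 * ((n + 1) - tieSum/(n*(n-1))))

	// Continuity correction: move |U - mean| half a step toward zero.
	d := math.Abs(u-mean) - 0.5
	if d < 0 {
		d = 0
	}
	z := d / sigma

	// Two-tailed area of the standard normal beyond z.
	return math.Erfc(z / math.Sqrt2)
}

func main() {
	fmt.Println(mannWhitneyNormalP(1200, 60, 60, nil))
}
```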

This can fail with ErrSampleSize if either sample is empty or ErrSamplesEqual if all sample values are equal.

This is also known as a Mann-Whitney-Wilcoxon test and is equivalent to the Wilcoxon rank-sum test, though the Wilcoxon rank-sum test differs in nomenclature.

[1] Mann, Henry B.; Whitney, Donald R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other". Annals of Mathematical Statistics 18 (1): 50–60.

[2] Klotz, J. H. (1966). "The Wilcoxon, Ties, and the Computer". Journal of the American Statistical Association 61 (315): 772-787.

type NormalDist

type NormalDist struct {
    Mu, Sigma float64
}

NormalDist is a normal (Gaussian) distribution with mean Mu and standard deviation Sigma.

func (NormalDist) Bounds

func (n NormalDist) Bounds() (float64, float64)

func (NormalDist) CDF

func (n NormalDist) CDF(x float64) float64

func (NormalDist) InvCDF

func (n NormalDist) InvCDF(p float64) (x float64)

func (NormalDist) PDF

func (n NormalDist) PDF(x float64) float64

func (NormalDist) Rand

func (n NormalDist) Rand(r *rand.Rand) float64

type Sample

type Sample struct {
    // Xs is the slice of sample values.
    Xs  []float64

    // Weights[i] is the weight of sample Xs[i].  If Weights is
    // nil, all Xs have weight 1.  Weights must have the same
    // length as Xs and all values must be non-negative.
    Weights []float64

    // Sorted indicates that Xs is sorted in ascending order.
    Sorted bool
}

Sample is a collection of possibly weighted data points.
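
To illustrate what "possibly weighted" means for a summary statistic, here is a minimal weighted arithmetic mean that treats nil weights as all ones, following the field comments above. The function is a hypothetical stand-in, not a method of this package.

```go
package main

import "fmt"

// weightedMean returns sum(w[i]*xs[i]) / sum(w[i]). A nil weights slice
// means every value has weight 1. With zero total weight the result is
// NaN.
func weightedMean(xs, weights []float64) float64 {
	sum, total := 0.0, 0.0
	for i, x := range xs {
		w := 1.0
		if weights != nil {
			w = weights[i]
		}
		sum += w * x
		total += w
	}
	return sum / total
}

func main() {
	fmt.Println(weightedMean([]float64{1, 2, 3}, nil))
	fmt.Println(weightedMean([]float64{1, 3}, []float64{3, 1}))
}
```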

func (Sample) Bounds

func (s Sample) Bounds() (min float64, max float64)

Bounds returns the minimum and maximum values of the Sample.

If the Sample is weighted, this ignores samples with zero weight.

This is constant time if s.Sorted and there are no zero-weighted values.

func (Sample) Copy

func (s Sample) Copy() *Sample

Copy returns a copy of the Sample.

The returned Sample shares no data with the original, so they can be modified (for example, sorted) independently.

func (Sample) GeoMean

func (s Sample) GeoMean() float64

GeoMean returns the geometric mean of the Sample. All sample values must be positive.

func (Sample) IQR

func (s Sample) IQR() float64

IQR returns the interquartile range of the Sample.

This is constant time if s.Sorted and s.Weights == nil.

func (Sample) Mean

func (s Sample) Mean() float64

Mean returns the arithmetic mean of the Sample.

func (Sample) Percentile

func (s Sample) Percentile(pctile float64) float64

Percentile returns the pctileth value from the Sample. This uses interpolation method R8 from Hyndman and Fan (1996).

pctile will be capped to the range [0, 1]. If len(s.Xs) == 0 or all weights are 0, this returns NaN.

Percentile(0.5) is the median. Percentile(0.25) and Percentile(0.75) are the first and third quartiles, respectively.

This is constant time if s.Sorted and s.Weights == nil.

func (*Sample) Sort

func (s *Sample) Sort() *Sample

Sort sorts the samples in place in s and returns s.

A sorted sample improves the performance of some algorithms.

func (Sample) StdDev

func (s Sample) StdDev() float64

StdDev returns the sample standard deviation of the Sample.

func (Sample) Sum

func (s Sample) Sum() float64

Sum returns the (possibly weighted) sum of the Sample.

func (Sample) Variance

func (s Sample) Variance() float64

func (Sample) Weight

func (s Sample) Weight() float64

Weight returns the total weight of the Sample.

type TDist

type TDist struct {
    V float64
}

A TDist is a Student's t-distribution with V degrees of freedom.

func (TDist) Bounds

func (t TDist) Bounds() (float64, float64)

func (TDist) CDF

func (t TDist) CDF(x float64) float64

func (TDist) PDF

func (t TDist) PDF(x float64) float64

type TTestResult

type TTestResult struct {
    // N1 and N2 are the sizes of the input samples. For a
    // one-sample t-test, N2 is 0.
    N1, N2 int

    // T is the value of the t-statistic for this t-test.
    T   float64

    // DoF is the degrees of freedom for this t-test.
    DoF float64

    // AltHypothesis specifies the alternative hypothesis tested
    // by this test against the null hypothesis that there is no
    // difference in the means of the samples.
    AltHypothesis LocationHypothesis

    // P is the p-value for this t-test for the given null hypothesis.
    P   float64
}

A TTestResult is the result of a t-test.

func OneSampleTTest

func OneSampleTTest(x TTestSample, μ0 float64, alt LocationHypothesis) (*TTestResult, error)

OneSampleTTest performs a one-sample t-test on sample x. This tests the null hypothesis that the population mean is equal to μ0. This assumes the distribution of the population of sample means is normal.

func PairedTTest

func PairedTTest(x1, x2 []float64, μ0 float64, alt LocationHypothesis) (*TTestResult, error)

PairedTTest performs a two-sample paired t-test on samples x1 and x2. If μ0 is non-zero, this tests if the average of the difference is significantly different from μ0. If x1 and x2 are identical, this returns nil.

func TwoSampleTTest

func TwoSampleTTest(x1, x2 TTestSample, alt LocationHypothesis) (*TTestResult, error)

TwoSampleTTest performs a two-sample (unpaired) Student's t-test on samples x1 and x2. This is a test of the null hypothesis that x1 and x2 are drawn from populations with equal means. It assumes x1 and x2 are independent samples, that the distributions have equal variance, and that the populations are normally distributed.

func TwoSampleWelchTTest

func TwoSampleWelchTTest(x1, x2 TTestSample, alt LocationHypothesis) (*TTestResult, error)

TwoSampleWelchTTest performs a two-sample (unpaired) Welch's t-test on samples x1 and x2. This is like TwoSampleTTest, but does not assume the distributions have equal variance.

type TTestSample

type TTestSample interface {
    Weight() float64
    Mean() float64
    Variance() float64
}

A TTestSample is a sample that can be used for a one- or two-sample t-test.

type UDist

type UDist struct {
    N1, N2 int

    // T is the count of the number of ties at each rank in the
    // input distributions. T may be nil, in which case it is
    // assumed there are no ties (which is equivalent to an N1+N2
    // slice of 1s). It must be the case that Sum(T) == N1+N2.
    T   []int
}

A UDist is the discrete probability distribution of the Mann-Whitney U statistic for a pair of samples of sizes N1 and N2.

The details of computing this distribution with no ties can be found in Mann, Henry B.; Whitney, Donald R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other". Annals of Mathematical Statistics 18 (1): 50–60. Computing this distribution in the presence of ties is described in Klotz, J. H. (1966). "The Wilcoxon, Ties, and the Computer". Journal of the American Statistical Association 61 (315): 772-787 and Cheung, Ying Kuen; Klotz, Jerome H. (1997). "The Mann Whitney Wilcoxon Distribution Using Linked Lists". Statistica Sinica 7: 805-813 (the former paper contains details that are glossed over in the latter paper but has mathematical typesetting issues, so it's easiest to get the context from the former paper and the details from the latter).
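
For the no-ties case, the exact distribution can be built from a simple counting recurrence: look at the largest of the N1+N2 values. If it belongs to the first sample it beats all N2 values of the second and adds N2 to U; otherwise it adds nothing. The memoized sketch below follows that recurrence. It is an illustration with hypothetical names, not the package's algorithm, and it does not handle ties.

```go
package main

import "fmt"

// countU returns the number of orderings of n1+n2 distinct values that
// give a U statistic of exactly u, using
// count(n1, n2, u) = count(n1-1, n2, u-n2) + count(n1, n2-1, u).
func countU(n1, n2, u int, memo map[[3]int]float64) float64 {
	if u < 0 {
		return 0
	}
	if n1 == 0 || n2 == 0 {
		if u == 0 {
			return 1
		}
		return 0
	}
	key := [3]int{n1, n2, u}
	if v, ok := memo[key]; ok {
		return v
	}
	v := countU(n1-1, n2, u-n2, memo) + countU(n1, n2-1, u, memo)
	memo[key] = v
	return v
}

// binomial returns n choose k as a float64.
func binomial(n, k int) float64 {
	r := 1.0
	for i := 1; i <= k; i++ {
		r = r * float64(n-k+i) / float64(i)
	}
	return r
}

// uPMF returns Pr[U = u] for sample sizes n1 and n2 with no ties.
func uPMF(n1, n2, u int) float64 {
	return countU(n1, n2, u, map[[3]int]float64{}) / binomial(n1+n2, n1)
}

func main() {
	for u := 0; u <= 4; u++ {
		fmt.Printf("Pr[U = %d] = %.4f\n", u, uPMF(2, 2, u))
	}
}
```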

func (UDist) Bounds

func (d UDist) Bounds() (float64, float64)

func (UDist) CDF

func (d UDist) CDF(U float64) float64

func (UDist) PMF

func (d UDist) PMF(U float64) float64

func (UDist) Step

func (d UDist) Step() float64

Package stats imports 5 packages and is imported by 2 packages. Updated 2017-09-27.