counting

package
v0.0.0-...-2b7dcb4 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 17, 2024 License: BSD-2-Clause Imports: 4 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ProbabilisticCounter

type ProbabilisticCounter struct {
	// contains filtered or unexported fields
}

ProbabilisticCounter implements an a linear-time counting algorithm, also known as "linear counting".

From: "A Linear-Time Probabilistic Counting Algorithm for Database Applications" (http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf)

"Linear counting is a two-step process. In step 1, the algorithm allocates a bit map (hash table) of size m in main memory. All entries are initialized to "0"s. The algorithm then scans the relation and applied a hash function to each data value in the column of interest. The hash function generates a bit map address and the algorithm sets this addressed bit to "1". In step 2, the algorithm first counts the number of empty bit map entries (equivalently, the number of "0" entries). It then estimates the column cardinality by dividing this count by the bit map size m (thus obtaining the fraction of empty bit map entries V_n) and plugging the result into the following equation:

n^ = -m * ln V_n (The symbol ^ denotes an estimator)"

Therefore, ProbabilisticCounter is able to compute an approximate distinct count for a sufficiently large number of string values, with an error rate of less than 1.25% on average. It does so while using a very low amount of memory, for instance tracking up to 1 million string values would consume 128 kilobytes.

ProbabilisticCounter *does not* provide a query-based API which would let callers verify whether a given string value has been counted before or not. There are other algorithms which are more efficient at providing that kind of funcionality, for instance "min-count log sketch" and "hyperloglog".

Lastly, ProbabilisticCounter provides serialization/deserialization in case it is needed to use and persist this data structure in a database.

func NewProbabilisticCounter

func NewProbabilisticCounter(cardinality int) *ProbabilisticCounter

It creates a new ProbabilisticCounter given an *estimate* of the true cardinality of the set of values it should count. This estimate is needed to that an internal bit map can be initialized and sized properly.

func NewProbabilisticCounterFromBytes

func NewProbabilisticCounterFromBytes(cardinality int, bytes []byte) (*ProbabilisticCounter, error)

It deserializes a new ProbabilisticCounter which has probably been serialized before with probabiliscCounter.Bytes(). cardinality: an *estimate* of the true cardinality of the set of values it should count so that an internal bit map can be initialized and sized properly.

func (*ProbabilisticCounter) Add

func (pc *ProbabilisticCounter) Add(values ...string)

It adds one or more string values to this counter.

func (*ProbabilisticCounter) Bytes

func (pc *ProbabilisticCounter) Bytes() ([]byte, error)

It serializes this ProbabilisticCounter into a slice of bytes, which can be used to persist this data structure.

func (*ProbabilisticCounter) Count

func (pc *ProbabilisticCounter) Count() int

It provides an (approximate) distinct count of values counter so far, with an error rate less than 1.25% on average.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL