hll: github.com/erikdubbelboer/hll Index | Examples | Files

package hll

import "github.com/erikdubbelboer/hll"

This is a Go implementation of the HyperLogLog++ algorithm from "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" by Heule, Nunkesser and Hall of Google. This is a cardinality estimation algorithm: given a stream of input elements, it will estimate the number of unique items in the stream. The estimation error can be controlled by choosing how much memory to use. HyperLogLog++ improves on the basic HyperLogLog algorithm by using less space, improving accuracy, and correcting bias.

This code is a translation of the pseudocode contained in Figures 6 and 7 of the Google paper. Not all algorithms are provided in the paper, but we've tried our best to be true to the authors' intent when writing the omitted algorithms. We're not trying to be creative, we're just implementing the algorithm described in the paper as directly as possible.

The HyperLogLog++ paper is available at http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf

Package hll is a generated protocol buffer package.

It is generated from these files:
	hll.proto

It has these top-level messages:
	HllPb

Code:

const (
    p           = 14 // Max memory usage is 0.75 * 2^p bytes
    pPrime      = 25 // Setting this is a bit more complicated, Google recommends 25.
    numToInsert = 1000000
)

// You can use any good hash function, truncated to 8 bytes to fit in a uint64.
hashU64 := func(s string) uint64 {
    sha1Hash := sha1.Sum([]byte(s))
    return binary.LittleEndian.Uint64(sha1Hash[0:8])
}

hll := NewHll(p, pPrime)

// For this example, our inputs will just be strings, e.g. "1", "2"
for i := 0; i < numToInsert; i++ {
    inputString := strconv.Itoa(i)
    hash := hashU64(inputString)

    // To use HLL, you hash your item, convert the hash to uint64, and pass it to Add().
    hll.Add(hash)
}

// Duplicates do not affect the cardinality. The following loop has no effect.
for i := 0; i < 10000; i++ {
    hll.Add(hashU64("1"))
}

// We inserted 1M unique elements, the cardinality should be roughly 1M.
fmt.Printf("%d\n", hll.Cardinality())

Output:

1010201

Index

Examples

Package Files

bias_tables.go bigquery.go bitmanip.go doc.go hll.go hll.pb.go normal.go sparse.go sparseutil.go

Constants

const DefaultBigqueryP = 15
const DefaultBigqueryPPrime = 20
const K0 uint64 = 0xa5b85c5e198ed849
const K1 uint64 = 0x8d58ac26afe12e47
const K2 uint64 = 0xc47b6e9e3a970ed3
const K3 uint64 = 0xc6a4a7935bd1e995
const RHOW_BITS = 6
const RHOW_MASK = uint64(1<<RHOW_BITS) - 1

Variables

var (
    ErrInvalidLengthHll = fmt.Errorf("proto: negative length found during unmarshaling")
)

func BigQueryHash Uses

func BigQueryHash(s string) uint64

type Hll Uses

type Hll struct {
    // contains filtered or unexported fields
}

func NewHll Uses

func NewHll(p, pPrime uint) *Hll

Initialize a new hyper-log-log struct based on inputs p and p'. Google recommends that p be set to 14, and p' to equal either 20 or 25.

func NewHllFromBigquery Uses

func NewHllFromBigquery(data []byte) (*Hll, error)

func (*Hll) Add Uses

func (h *Hll) Add(x uint64)

Add takes a hash and updates the cardinality estimation data structures.

The input should be a hash of whatever type you're estimating of. For example, if you're estimating the cardinality of a stream of strings, you'd pass the hash of each string to this function.

func (*Hll) Cardinality Uses

func (h *Hll) Cardinality() uint64

Returns the estimated cardinality (the number of unique inputs seen so far).

func (*Hll) Combine Uses

func (h *Hll) Combine(other *Hll)

Combine() merges two HyperLogLog++ calculations. This allows you to parallelize cardinality estimation: each thread can process a shard of the input, then the results can be merged later to give the cardinality of the entire data set (the union of the shards).

WARNING: The "other" parameter may be mutated during this call! It may be converted from a sparse to dense representation, which may affect its space usage and precision. This is a deliberate design decision that helps to minimize memory consumption.

The inputs must have the same p and pPrime or this function will panic. The Google paper doesn't give an algorithm for this operation, but its existence is implied, and the ability to do this combine operation is one of the main benefits of using a HyperLogLog-type algorithm in the first place.

func (*Hll) Copy Uses

func (h *Hll) Copy() *Hll

func (*Hll) GobDecode Uses

func (h *Hll) GobDecode(data []byte) error

func (*Hll) GobEncode Uses

func (h *Hll) GobEncode() ([]byte, error)

custom gob encoder and decoders

func (*Hll) MarshalJSON Uses

func (h *Hll) MarshalJSON() ([]byte, error)

func (*Hll) MarshalPb Uses

func (h *Hll) MarshalPb() ([]byte, error)

func (*Hll) UnmarshalJSON Uses

func (h *Hll) UnmarshalJSON(buf []byte) error

func (*Hll) UnmarshalPb Uses

func (h *Hll) UnmarshalPb(buf []byte) error

type HllPb Uses

type HllPb struct {
    P                *int32       `protobuf:"varint,1,req,name=p" json:"p,omitempty"`
    Pp               *int32       `protobuf:"varint,2,req,name=pp" json:"pp,omitempty"`
    M                []byte       `protobuf:"bytes,3,opt" json:"M,omitempty"`
    S                *HllPbSparse `protobuf:"bytes,4,opt,name=s" json:"s,omitempty"`
    XXX_unrecognized []byte       `json:"-"`
}

func (*HllPb) GetM Uses

func (m *HllPb) GetM() []byte

func (*HllPb) GetP Uses

func (m *HllPb) GetP() int32

func (*HllPb) GetPp Uses

func (m *HllPb) GetPp() int32

func (*HllPb) GetS Uses

func (m *HllPb) GetS() *HllPbSparse

func (*HllPb) Marshal Uses

func (m *HllPb) Marshal() (data []byte, err error)

func (*HllPb) MarshalTo Uses

func (m *HllPb) MarshalTo(data []byte) (int, error)

func (*HllPb) ProtoMessage Uses

func (*HllPb) ProtoMessage()

func (*HllPb) Reset Uses

func (m *HllPb) Reset()

func (*HllPb) Size Uses

func (m *HllPb) Size() (n int)

func (*HllPb) String Uses

func (m *HllPb) String() string

func (*HllPb) Unmarshal Uses

func (m *HllPb) Unmarshal(data []byte) error

type HllPbSparse Uses

type HllPbSparse struct {
    Buf              []byte  `protobuf:"bytes,1,opt,name=buf" json:"buf,omitempty"`
    LastVal          *uint64 `protobuf:"varint,2,req,name=lastVal" json:"lastVal,omitempty"`
    NumElements      *uint64 `protobuf:"varint,3,req,name=numElements" json:"numElements,omitempty"`
    XXX_unrecognized []byte  `json:"-"`
}

func (*HllPbSparse) GetBuf Uses

func (m *HllPbSparse) GetBuf() []byte

func (*HllPbSparse) GetLastVal Uses

func (m *HllPbSparse) GetLastVal() uint64

func (*HllPbSparse) GetNumElements Uses

func (m *HllPbSparse) GetNumElements() uint64

func (*HllPbSparse) Marshal Uses

func (m *HllPbSparse) Marshal() (data []byte, err error)

func (*HllPbSparse) MarshalTo Uses

func (m *HllPbSparse) MarshalTo(data []byte) (int, error)

func (*HllPbSparse) ProtoMessage Uses

func (*HllPbSparse) ProtoMessage()

func (*HllPbSparse) Reset Uses

func (m *HllPbSparse) Reset()

func (*HllPbSparse) Size Uses

func (m *HllPbSparse) Size() (n int)

func (*HllPbSparse) String Uses

func (m *HllPbSparse) String() string

func (*HllPbSparse) Unmarshal Uses

func (m *HllPbSparse) Unmarshal(data []byte) error

Package hll imports 11 packages (graph). Updated 2020-12-03. Refresh now. Tools for package owners.