hll

package module
v1.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 24, 2023 License: Apache-2.0 Imports: 11 Imported by: 0

README

HyperLogLog++ for Go

This is a Go implementation of the HyperLogLog++ algorithm from "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" by Heule, Nunkesser and Hall of Google. This is a cardinality estimation algorithm: given a stream of input elements, it will estimate the number of unique items in the stream. The estimation error can be controlled by choosing how much memory to use. HyperLogLog++ improves on the basic HyperLogLog algorithm by using less space, improving accuracy, and correcting bias.

This code is a translation of the pseudocode contained in Figures 6 and 7 of the Google paper. Not all algorithms are provided in the paper, but we've tried our best to be true to the authors' intent when writing the omitted algorithms. We're not trying to be creative, we're just implementing the algorithm described in the paper as directly as possible. Our deviations are described here.

The HyperLogLog++ paper is available here

Instructions

See the docs.

Example of how to import data from Bigquery:

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
	"github.com/erikdubbelboer/hll"
	"google.golang.org/api/iterator"
)

func Import() (*hll.Hll, error) {
	bq, err := bigquery.NewClient(context.Background(), "project")
	if err != nil {
		return nil, fmt.Errorf("Can't connect to BigQuery: %w", err)
	}

	q := bq.Query(`
		SELECT
			HLL_COUNT.INIT(column) AS h,
		FROM project.dataset.table
		WHERE ...
	`)
	it, err := q.Read(context.Background())
	if err != nil {
		return nil, fmt.Errorf("failed to query: %w", err)
	}

	var h *hll.Hll

	for {
		var r struct {
			H []byte `bigquery:"h"`
		}

		if err := it.Next(&r); err == iterator.Done {
			break
		} else if err != nil {
			return nil, fmt.Errorf("failed to next: %w", err)
		}

		h, err = hll.NewHllFromBigquery(r.Users)
		if err != nil {
			return nil, err
		}
	}

	return h, nil
}

Adding a new value to a HLL imported from Bigquery:

func Add(h *hll.Hll, val string) {
	h.Add(hll.BigQueryHash(val))
}

Documentation

Overview

This is a Go implementation of the HyperLogLog++ algorithm from "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" by Heule, Nunkesser and Hall of Google. This is a cardinality estimation algorithm: given a stream of input elements, it will estimate the number of unique items in the stream. The estimation error can be controlled by choosing how much memory to use. HyperLogLog++ improves on the basic HyperLogLog algorithm by using less space, improving accuracy, and correcting bias.

This code is a translation of the pseudocode contained in Figures 6 and 7 of the Google paper. Not all algorithms are provided in the paper, but we've tried our best to be true to the authors' intent when writing the omitted algorithms. We're not trying to be creative, we're just implementing the algorithm described in the paper as directly as possible.

The HyperLogLog++ paper is available at http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf

Package hll is a generated protocol buffer package.

It is generated from these files:
	hll.proto

It has these top-level messages:
	HllPb
Example
const (
	p           = 14 // Max memory usage is 0.75 * 2^p bytes
	pPrime      = 25 // Setting this is a bit more complicated, Google recommends 25.
	numToInsert = 1000000
)

// You can use any good hash function, truncated to 8 bytes to fit in a uint64.
hashU64 := func(s string) uint64 {
	sha1Hash := sha1.Sum([]byte(s))
	return binary.LittleEndian.Uint64(sha1Hash[0:8])
}

hll := NewHll(p, pPrime)

// For this example, our inputs will just be strings, e.g. "1", "2"
for i := 0; i < numToInsert; i++ {
	inputString := strconv.Itoa(i)
	hash := hashU64(inputString)

	// To use HLL, you hash your item, convert the hash to uint64, and pass it to Add().
	hll.Add(hash)
}

// Duplicates do not affect the cardinality. The following loop has no effect.
for i := 0; i < 10000; i++ {
	hll.Add(hashU64("1"))
}

// We inserted 1M unique elements, the cardinality should be roughly 1M.
fmt.Printf("%d\n", hll.Cardinality())
Output:

1010201

Index

Examples

Constants

View Source
const DefaultBigqueryP = 15
View Source
const DefaultBigqueryPPrime = 20
View Source
const K0 uint64 = 0xa5b85c5e198ed849
View Source
const K1 uint64 = 0x8d58ac26afe12e47
View Source
const K2 uint64 = 0xc47b6e9e3a970ed3
View Source
const K3 uint64 = 0xc6a4a7935bd1e995
View Source
const RHOW_BITS = 6
View Source
const RHOW_MASK = uint64(1<<RHOW_BITS) - 1

Variables

View Source
var (
	ErrInvalidLengthHll = fmt.Errorf("proto: negative length found during unmarshaling")
)

Functions

func BigQueryHash

func BigQueryHash(s string) uint64

Types

type Hll

type Hll struct {
	// contains filtered or unexported fields
}

func NewHll

func NewHll(p, pPrime uint) *Hll

Initialize a new hyper-log-log struct based on inputs p and p'. Google recommends that p be set to 14, and p' to equal either 20 or 25.

func NewHllFromBigquery

func NewHllFromBigquery(data []byte) (*Hll, error)

func (*Hll) Add

func (h *Hll) Add(x uint64)

Add takes a hash and updates the cardinality estimation data structures.

The input should be a hash of whatever type you're estimating of. For example, if you're estimating the cardinality of a stream of strings, you'd pass the hash of each string to this function.

func (*Hll) Cardinality

func (h *Hll) Cardinality() uint64

Returns the estimated cardinality (the number of unique inputs seen so far).

func (*Hll) Combine

func (h *Hll) Combine(other *Hll)

Combine() merges two HyperLogLog++ calculations. This allows you to parallelize cardinality estimation: each thread can process a shard of the input, then the results can be merged later to give the cardinality of the entire data set (the union of the shards).

WARNING: The "other" parameter may be mutated during this call! It may be converted from a sparse to dense representation, which may affect its space usage and precision. This is a deliberate design decision that helps to minimize memory consumption.

The inputs must have the same p and pPrime or this function will panic. The Google paper doesn't give an algorithm for this operation, but its existence is implied, and the ability to do this combine operation is one of the main benefits of using a HyperLogLog-type algorithm in the first place.

func (*Hll) Copy

func (h *Hll) Copy() *Hll

func (*Hll) GobDecode

func (h *Hll) GobDecode(data []byte) error

func (*Hll) GobEncode

func (h *Hll) GobEncode() ([]byte, error)

custom gob encoder and decoders

func (*Hll) MarshalJSON

func (h *Hll) MarshalJSON() ([]byte, error)

func (*Hll) MarshalPb

func (h *Hll) MarshalPb() ([]byte, error)

func (*Hll) UnmarshalJSON

func (h *Hll) UnmarshalJSON(buf []byte) error

func (*Hll) UnmarshalPb

func (h *Hll) UnmarshalPb(buf []byte) error

type HllPb

type HllPb struct {
	P                *int32       `protobuf:"varint,1,req,name=p" json:"p,omitempty"`
	Pp               *int32       `protobuf:"varint,2,req,name=pp" json:"pp,omitempty"`
	M                []byte       `protobuf:"bytes,3,opt" json:"M,omitempty"`
	S                *HllPbSparse `protobuf:"bytes,4,opt,name=s" json:"s,omitempty"`
	XXX_unrecognized []byte       `json:"-"`
}

func (*HllPb) GetM

func (m *HllPb) GetM() []byte

func (*HllPb) GetP

func (m *HllPb) GetP() int32

func (*HllPb) GetPp

func (m *HllPb) GetPp() int32

func (*HllPb) GetS

func (m *HllPb) GetS() *HllPbSparse

func (*HllPb) Marshal

func (m *HllPb) Marshal() (data []byte, err error)

func (*HllPb) MarshalTo

func (m *HllPb) MarshalTo(data []byte) (int, error)

func (*HllPb) ProtoMessage

func (*HllPb) ProtoMessage()

func (*HllPb) Reset

func (m *HllPb) Reset()

func (*HllPb) Size

func (m *HllPb) Size() (n int)

func (*HllPb) String

func (m *HllPb) String() string

func (*HllPb) Unmarshal

func (m *HllPb) Unmarshal(data []byte) error

type HllPbSparse

type HllPbSparse struct {
	Buf              []byte  `protobuf:"bytes,1,opt,name=buf" json:"buf,omitempty"`
	LastVal          *uint64 `protobuf:"varint,2,req,name=lastVal" json:"lastVal,omitempty"`
	NumElements      *uint64 `protobuf:"varint,3,req,name=numElements" json:"numElements,omitempty"`
	XXX_unrecognized []byte  `json:"-"`
}

func (*HllPbSparse) GetBuf

func (m *HllPbSparse) GetBuf() []byte

func (*HllPbSparse) GetLastVal

func (m *HllPbSparse) GetLastVal() uint64

func (*HllPbSparse) GetNumElements

func (m *HllPbSparse) GetNumElements() uint64

func (*HllPbSparse) Marshal

func (m *HllPbSparse) Marshal() (data []byte, err error)

func (*HllPbSparse) MarshalTo

func (m *HllPbSparse) MarshalTo(data []byte) (int, error)

func (*HllPbSparse) ProtoMessage

func (*HllPbSparse) ProtoMessage()

func (*HllPbSparse) Reset

func (m *HllPbSparse) Reset()

func (*HllPbSparse) Size

func (m *HllPbSparse) Size() (n int)

func (*HllPbSparse) String

func (m *HllPbSparse) String() string

func (*HllPbSparse) Unmarshal

func (m *HllPbSparse) Unmarshal(data []byte) error

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL