hll

package module
v1.0.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 26, 2020 License: MIT Imports: 9 Imported by: 0

README

go-hll CircleCI Go Report Card GoDoc

A Go implementation of HyperLogLog that is storage-compatible with the Aggregate Knowledge HLL Storage Spec.

Overview

HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.

In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed.

Motivation

While there are a handful of existing HLL implementations in Go, none of them implement the AK Storage Spec.
The unified storage format is useful for reading and writing HLLs in a multi-lingual environment. At Segment, most of our runtime code is written in Go, but we frequently persist data to PostgreSQL for long-term storage. The Postgres HLL plugin is fairly ubiquitous--it's available for the standalone server, AWS RDS, AWS Aurora, and CitusDB.

An excellent description for the motivation behind the storage strategy can be found in the Java HLL library's README.

Hashing

A good hashing algorithm is crucial to achieving the pseudorandomness that HLL requires in order to perform its calculations. The 64 bit variant of MurmurHash3 is recommended. If using a seed, it must be constant for all inputs to a given HLL. Further, if that HLL is to be unioned, then the same seed must be used for all inputs to the other HLL.

See the Java HLL README for a discussion on why MurmurHash3 is a good choice.

Adaptations to Go

The API is intended to be as similar as possible to Java HLL and Postgresql HLL. There are a couple of features, though, that make it more friendly to Go Programmers.

Zero Value

If default settings are specified using the hll.Defaults function, the zero value can be used directly as an empty HLL.

Since its impossible to reason about an HLL without the settings, operations on a zero value in lieu of default settings will panic.

StrictUnion

The other HLL implementations allow for two HLLs to be union-ed even if their log2m or regwidth parameters differ. However, doing so can produce wildly inaccurate results. This library provides an additional StrictUnion operation that will return an error if attempting a union on HLLs with incompatible settings.

Building

Dependencies are managed with dep. Before building, ensure that dep is installed and on the path.

Download Dependencies

make dep

Test

make test

Usage

package main 

import (
	"fmt"
	
	"github.com/segmentio/go-hll"
)

func main() {
	
	// install default settings.
	hll.Defaults(hll.Settings{
		Log2m: 10,
		Regwidth: 4,
		ExplicitThreshold: hll.AutoExplicitThreshold,
		SparseEnabled: true,
	})
	
	// add elements.
	h := hll.Hll{}
	h.AddRaw(123456789)
	fmt.Print(h.Cardinality())  // prints "1"
	
	// union Hlls
	h2 := hll.Hll{}
	h2.AddRaw(123456789)
	h2.AddRaw(987654321)
	h2.Union(h)
	fmt.Print(h2.Cardinality()) // prints "2"
 
	// write to/read from bytes. 
	h3, _ := hll.FromBytes(h2.ToBytes())
	fmt.Print(h3.Cardinality()) // prints "2"
}

There is a compatibility battery run as part of the unit tests. The battery was produced by the Test Generator in the java-hll package.

Additional Resources

License

Released under the MIT license.

Documentation

Overview

Package hll provides an HLL data type and operations that can be de/serialized by any code implementing the Aggregate Knowledge Storage Spec.

Index

Constants

View Source
const (

	// AutoExplicitThreshold indicates that the threshold at which an Hll goes
	// from using an explicit to a probabalistic representation should be
	// calculated based on the configuration.  Using the calculated threshold is
	// generally preferable.  One exception would be working with a pre-existing
	// data set that uses a particular explicit threshold setting in which case
	// it may be desirable to use the same explicit threshold.
	AutoExplicitThreshold = -1
)

Variables

View Source
var ErrIncompatible = errors.New("cannot StrictUnion Hlls with different regwidth or log2m settings")

ErrIncompatible is returned by StrictUnion in cases where the two Hlls have incompatible settings that prevent the operation from occurring.

View Source
var ErrInsufficientBytes = errors.New("insufficient bytes to deserialize Hll")

ErrInsufficientBytes is returned by FromBytes in cases where the provided byte slice is truncated.

Functions

func Defaults

func Defaults(settings Settings) error

Defaults installs settings that will be used by the zero value Hll. It recommended to call this function once at initialization time and never again. It will return an error if the provided settings are invalid or if a different set of defaults has already been installed.

Types

type Hll

type Hll struct {
	// contains filtered or unexported fields
}

Hll is a probabilistic set of hashed elements. It supports add and union operations in addition to estimating the cardinality. The zero value is an empty set, provided that Defaults has been invoked with default settings. Otherwise, operations on the zero value will cause a panic as it would be a coding error to attempt operations without first configuring the library.

func FromBytes

func FromBytes(bytes []byte) (Hll, error)

FromBytes deserializes the provided byte slice into an Hll. It will return an error if the version is anything other than 1, if the leading bytes specify an invalid configuration, or if the byte slice is truncated.

func NewHll

func NewHll(s Settings) (Hll, error)

NewHll creates a new Hll with the provided settings. It will return an error if the settings are invalid. Since an application usually deals with homogeneous Hlls, it's preferable to install default settings and use the zero value. This function is provided in case an application must juggle different configurations.

func (*Hll) AddRaw

func (h *Hll) AddRaw(value uint64)

AddRaw adds the observed value into the Hll. The value is expected to have been hashed with a good hash function such as Murmur3 or xxHash. If the value does not have sufficient entropy, then the resulting cardinality estimations will not be accurate.

There is an edge case where the raw value of 0 is not added to the Hll. In the sparse or dense representation, a zero value would not affect the cardinality calculations because there are no set bits to observe. In order to be consistent, the explicit representation will therefore ignore a 0 value.

func (*Hll) Cardinality

func (h *Hll) Cardinality() uint64

Cardinality estimates the number of values that have been added to this Hll.

func (*Hll) Clear

func (h *Hll) Clear()

Clear resets this Hll. Unlike other implementations that leave the backing storage in place, this resets the Hll to the empty, zero value.

func (*Hll) Settings

func (h *Hll) Settings() Settings

Settings returns the Settings for this Hll.

func (*Hll) SizeInBytes

func (h *Hll) SizeInBytes() int

SizeInBytes returns the number of bytes needed to serialize this Hll

func (*Hll) StrictUnion

func (h *Hll) StrictUnion(other Hll) error

StrictUnion will calculate the union of this Hll and the other Hll and store the results into the receiver. It will return an error if the two Hlls are not compatible where compatibility is defined as having the same register width and log2m. explicit and sparse thresholds don't factor into compatibility.

func (*Hll) ToBytes

func (h *Hll) ToBytes() []byte

ToBytes returns a byte slice with the serialized Hll value per the storage spec https://github.com/aggregateknowledge/hll-storage-spec/blob/master/STORAGE.md.

func (*Hll) Union

func (h *Hll) Union(other Hll)

Union will calculate the union of this Hll and the other Hll and store the results into the receiver.

Unlike StrictUnion, it allows unions between Hlls with different settings to be combined, though doing so is not recommended because it will result in a loss of accuracy.

As long as your application uses a single group of settings, it is safe to use this function. If there is a possibility that you may union two Hlls with incompatible settings, then it's safer to use StrictUnion and check for errors.

type Settings

type Settings struct {
	// Log2m determines the number of registers in the Hll.  The minimum value
	// is 4 and the maximum value is 31.  The number of registers in the Hll
	// will be calculated as 2^Log2m.
	Log2m int

	// Regwidth is the number of bits dedicated to each register value.  The
	// minimum value is 1 and the maximum value is 8.
	Regwidth int

	// ExplicitThreshold is the cardinality at which the Hll will go from
	// storing explicit values to using a probabilistic model.  A value of 0
	// disables explicit storage entirely.  The value AutoExplicitThreshold can
	// be used to signal the library to calculate an appropriate threshold
	// (recommended).  The maximum allowed value is 131,072.
	ExplicitThreshold int

	// SparseEnabled controls whether the Hll will use the sparse
	// representation.  The thresholds for conversion are automatically
	// calculated by the library when this field is set to true (recommended).
	SparseEnabled bool
}

Settings are used to configure the Hll and how it transitions between the backing storage types.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL