hll

package module
v1.1.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 10, 2022 License: MIT Imports: 10 Imported by: 0

README

hll-go

GoDoc

An optimised and opinionated version of HyperLogLog, an algorithm for counting unique elements.

This is uses many techniques from "HyperLogLog in Practice", most notably bias correction. It also includes serialisation to protobuf, as well as methods to aid pre-calculation of a custom set of biases if the defaults do not improve accuracy enough for a particular use-case.

Differences

This implementation is currently used effectively in production against several billions of requests per day, with cardinalities ranging between 20,000 and >500,000,000. Significant testing shows accuracy has an upper-bound of +/-3%, but in practice it is almost always <+/-1%.

The chief design goals were:

  • Ability to accurately handle a large range of cardinalities, starting at ~20,000
  • Known up-front memory allocation
  • Fast insert
  • Fast merge
  • Protobuf serialisation

As such, it contains several changes from general purpose implementations (such as HyperLogLog). Namely:

  • Fixed precision of 14
  • 8 bit registers, no tailcuts
  • No sparse representation
  • Fixed use of 16.4kb of memory*
  • Built-in default bias correction
  • Protobuf []byte output
  • Optimised merges, including a rollup helper for merging several Sketches into one

(* Runtime dependent)

Usage

First initialise a Sketch, then add []byte via .Insert(...). Estimates can be obtained from a Sketch at any point via .Estimate().

A simple example follows:

package main

import (
	"crypto/rand"
	"fmt"
	"io"
	"log"

	"github.com/kixa/hll-go"
)

func main() {
	// Initialise sketch using default biases.
	sketch := hll.NewSketch()

	for i := 0; i < 100; i++ {
		bs, err := genPseudoRandomBytes()

		if err != nil {
			log.Fatalf("couldn't generate random bytes: %v", err)
		}

		// Insert bytes into the sketch.
		sketch.Insert(bs)
	}

	// Get an estimate.
	estimate := sketch.Estimate()

	fmt.Printf("inserted 100 elements, estimate: %d\n", estimate)
}

func genPseudoRandomBytes() ([]byte, error) {
	r := make([]byte, 16)
	_, err := io.ReadFull(rand.Reader, r)

	if err != nil {
		return nil, err
	}

	return r, nil
}

Custom Biases

As described in "HyperLogLog in Practice", interpolated bias correction can be applied at low cardinality estimates (<100,000) to improve accuracy.

This is handled transparently in hll-go since the default biases provided in biases.go work for almost all use-cases. However, if you wish to use your own custom biases, you can:

  • (Optionally) Create a set of biases using your own parameters and generation fn.
  • Register your bias map (map[int]float64) via RegisterBiases(...) with a custom key.
  • Create any new sketches via NewCustomSketch(...) with the same key.

(NOTE: Protobuf serialized sketches WILL NOT contain any custom biases. To re-use a custom set for estimates after de-serialisation from protobuf, initialise an empty Sketch with the custom biases via NewCustomSketch(...), then Merge in the de-serialized one)

License

Distributed under MIT License. See LICENSE.md for more information.

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrorMismatchedVersion is returned from a Merge when two sketch versions do not match.
	ErrorMismatchedVersion = errors.New("sketch version mismatch")

	// ErrorMalformedPrecision is returned from a Merge when a sketch is found to have differing precisions (or
	// is malformed/incorrectly deserialized).
	ErrorMalformedPrecision = errors.New("sketch precision mismatch")
)

Functions

func RegisterBiases

func RegisterBiases(key string, biases map[int]float64) error

RegisterBiases allows a custom set of biases to be registered for use in NewCustomSketch.

Types

type BiasEstimate

type BiasEstimate struct {
	TrueCardinality         uint64
	RawEstimatedCardinality uint64
	Bias                    float64
}

BiasEstimate holds an average cardinality estimate and a average resulting bias. This is produced by running a number of simulations on a set containing exactly TrueCardinality. (I.e. If a Sketch produces this RawEstimatedCardinality (without BiasCorrection or LinearCounting, we need to weight by Bias to get close to TrueCardinality).

func GenerateBiases

func GenerateBiases(fn func() []byte, options *GenerationOptions) ([]*BiasEstimate, error)

GenerateBiases can be used to create a list of BiasEstimate for arbitrary []byte generated by fn. It should be used if defaultGeneratedBiases do not produce accurate estimates for a specific use-case (or if you have the patience/compute to run more precise estimates). The options used to generate the default biases are returned from DefaultGenerationOptions() and these are used if options is nil.

Depending on fn and options this can take some time and use significant memory...

WARNING: If fn produces a set of unique values less than options.MaxCardinality, this will never return. NOTE: For periodic log.Printf output, set envvar HLL_BIAS_LOG=1.

type GenerationOptions

type GenerationOptions struct {
	MaxCardinality uint64

	Repeats int

	InitialStep int
	StepRate    float64
}

GenerationOptions contains parameters used for GenerateBiases.

func DefaultGenerationOptions

func DefaultGenerationOptions() *GenerationOptions

DefaultGenerationOptions returns a copy of the default GenerationOptions.

type Sketch

type Sketch interface {
	// Insert inserts element into the Sketch.
	Insert(element []byte)

	// Estimate returns an estimate (+/-3%) of the number of uniques (cardinality) of everything that
	//has been inserted.
	Estimate() uint64

	// Merge merges this Sketch with other, returning itself (now combined with other) and an non-nil
	// error if the Merge could not be completed.
	Merge(other Sketch) (Sketch, error)

	// ProtoSketch returns the protobuf serializable struct representing this Sketch and should only be used for
	// embedding Sketches into larger protobuf messages. For all other use-cases use ProtoSerialize.
	// The reference proto format can be found at: https://github.com/kixa/hll-protobuf
	ProtoSketch() *hllProto.Sketch

	// ProtoSerialize returns []byte representing this Sketch.
	// The reference proto format can be found at: https://github.com/kixa/hll-protobuf
	// This is equivalent to calling proto.Marshal on the result of ProtoSketch.
	ProtoSerialize() ([]byte, error)
	// contains filtered or unexported methods
}

Sketch is an interface that wraps a HyperLogLog implementation for counting unique elements.

func FromProtoSketch added in v1.1.1

func FromProtoSketch(sketch *hllProto.Sketch) (Sketch, error)

FromProtoSketch returns a Sketch from a generated protobuf type. The proto schema used can be found in the companion repository: https://github.com/kixa/hll-protobuf

func NewCustomSketch

func NewCustomSketch(biasKey string) (Sketch, error)

NewCustomSketch returns a new Sketch using the biases registered under biasKey. If these biases are not found (previously registered via RegisterBiases), an error will be returned.

func NewSketch

func NewSketch() Sketch

NewSketch returns a new Sketch using the default biases.

func ProtoDeserialize

func ProtoDeserialize(protoBs []byte) (Sketch, error)

ProtoDeserialize returns a Sketch from an encoded protobuf version. The proto schema used can be found in the companion repository: https://github.com/kixa/hll-protobuf

func Rollup

func Rollup(sketches []Sketch) (Sketch, error)

Rollup merges sketches into a single (new) Sketch that is slightly more efficient than successively merging each into a common base, one at a time.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL