gohll

package module
v0.0.0-...-19c80a1
Published: Jun 13, 2018 License: MIT Imports: 8 Imported by: 0

README

GoHLL


HLL++ for gophers

What is this?

Have you ever had a large set of data (or maybe even a never ending stream of data) and wanted to know how many unique items there were? Or maybe you had two sets of data, and you wanted to know how many unique items there were in the two sets combined? Or maybe how many items appeared in both datasets? Well, gohll is for you!

HLL is a probabilistic counting algorithm that can tell you how many unique items you have added to it. In addition, you can perform union and intersection operations between multiple HLL objects. It's easy! Let me show you:

// First we make an HLL with an error rate of ~0.1%
// (NewHLLByError also returns an error; handling omitted for brevity)
h, _ := NewHLLByError(0.001)

// Now it's time to start adding things to it!
for i := 0; i < 100000; i++ {
    h.Add(fmt.Sprintf("%d", rand.Uint32()))
}

uniqueItems := h.Cardinality()

Wow! That was so easy! But wait a second, you may be saying... What about those set operations you were talking about? Well, that can be done quite easily as well!

// let's make 2 HLLs... they must have the same error rate!
h1, _ := NewHLLByError(0.001)
h2, _ := NewHLLByError(0.001)

// now let's add different things to each one
for i := 0; i < 100000; i++ {
    h1.Add(fmt.Sprintf("%d", rand.Uint32()))
    h2.Add(fmt.Sprintf("%d", rand.Uint32()))
}

uniqueItemsH1 := h1.Cardinality() // |h1|
uniqueItemsH2 := h2.Cardinality() // |h2|

uniqueItemsEither, _ := h1.CardinalityUnion(h2)      // |h1 u h2|
uniqueItemsBoth, _ := h1.CardinalityIntersection(h2) // |h1 n h2|

// Union merges h2 into h1 in place and returns an error on mismatched precision
err := h1.Union(h2)

In this example, all the Cardinality* queries return a float64 with the size of the set under that operation. That is to say, the result of h1.CardinalityUnion(h2) is the number of unique items in either h1 or h2. So, if h1 and h2 both contain only the item "foo", then the cardinality of the union is 1: there is only one unique item between them. The intersection finds items that exist in both sets. Finally, the h1.Union(h2) call merges h2 into h1 in place, so that h1 afterwards represents both sets unioned together.

NOTE: Intersections are not natively supported by HLLs, so we simply use the inclusion–exclusion principle, which has completely different error bounds than any other operation on the HLL (generally much worse).
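The inclusion–exclusion estimate mentioned above is just arithmetic over three cardinality estimates. A minimal, standalone sketch (the function name is illustrative, not part of the library's API):

```go
package main

import "fmt"

// intersectionEstimate applies the inclusion-exclusion principle:
// |A n B| = |A| + |B| - |A u B|. Each input is itself a probabilistic
// estimate, so their errors compound; that is why the intersection
// carries much weaker guarantees than the other HLL operations.
func intersectionEstimate(cardA, cardB, cardUnion float64) float64 {
	return cardA + cardB - cardUnion
}

func main() {
	// Suppose h1 and h2 each report ~100000 items and their union ~150000:
	fmt.Println(intersectionEstimate(100000, 100000, 150000))
}
```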

Can't I just use a map[string]bool object to do that?

Sure, you could. But I could also try to keep track of the unique items in your set in my head, and that wouldn't work out too well either. The problem with using a map[string]bool is the amount of space it takes: if you had 1e10 unique items, each represented by a 10-digit string, you would be looking at 480GB of memory. On the other hand, an HLL++ with an error rate of 0.01% would have a memory footprint of 1.06GB and would stay at that size even as you keep adding more items!

This does come with some conditions, though: you no longer know what the original items are. You can only ask how many unique items there are and perform intersection and union operations with other sets. There is also an error bound on the results (the lower the error you set, the more memory the HLL uses), but the trade-offs are quite reasonable.

So basically the question you should be asking yourself is: what questions do I need to answer with my data? And is it worth using potentially vast amounts of memory to get exact answers? Typically the answer is no.

HLL vs HLL++

I've been throwing around the words HLL and HLL++ as if they were the same thing. Let's talk a bit about how they are different.

HLL++ is an extension to HLL (first described in this paper) that gives it better bias properties and much better error rates for small set sizes without increasing memory usage. The bias issue is addressed with empirical correction data: experiments quantified how HLL estimates are biased at different cardinalities, and with that knowledge we can adjust for the biasing effects (this is done in the EstimateBias function).

For small set sizes, on the other hand, HLL++ uses a smart way of encoding integers to create a miniature HLL with much higher precision. HLLs have the nice property of doing better as you give them more data, but this miniature HLL (the SparseList in our implementation) is designed to give very low errors in the small-cardinality regime (errors in the range of 0.018%). In addition, this list can be compressed easily, which lets us stay in this encoding much longer. Once enough items have been placed into the HLL, the integer encoding is reversed and the old data is inserted into a classic HLL structure.
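Both modes ultimately feed the same kind of register update. A simplified illustration of the classic HLL update (standard technique, not the library's exact code): the top p bits of the 64-bit hash select a register, and the 1-based position of the first set bit in the remaining bits is the candidate register value.

```go
package main

import (
	"fmt"
	"math/bits"
)

// registerAndRho splits a 64-bit hash the way a classic HLL does:
// the top p bits select a register index, and rho is the position of
// the first set bit in the remaining bits (1-based).
func registerAndRho(hash uint64, p uint8) (idx uint64, rho uint8) {
	idx = hash >> (64 - p)
	rest := hash << p // drop the index bits
	rho = uint8(bits.LeadingZeros64(rest)) + 1
	return idx, rho
}

func main() {
	// An all-ones hash maps to the last register with rho = 1.
	idx, rho := registerAndRho(^uint64(0), 20)
	fmt.Println(idx, rho)
}
```

Each register keeps the maximum rho it has seen; the cardinality estimate is then derived from the harmonic mean of the register values.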

Speed

This library is fast! With an error rate of 0.1% (i.e., p=20), while in the sparse regime (i.e., a small number of items relative to the chosen error rate), we can accomplish about 269,000 insertions per second on a 2011 MacBook Air. In the normal regime (i.e., a large number of items) we accomplish 1,880,000 insertions per second on the same hardware! Furthermore, cardinality queries (i.e., asking "how many unique elements are in this HLL?") are quite quick. While they depend on many subtleties of the HLL's state, they have an upper bound of 7ms per query (2ms on average) for the same setup.
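The correspondence between a 0.1% error rate and p=20 follows from the standard HyperLogLog error formula, err ≈ 1.04/√m with m = 2^p registers. A quick standalone check (generic HLL math, not library code):

```go
package main

import (
	"fmt"
	"math"
)

// expectedError returns the standard HyperLogLog error estimate
// 1.04 / sqrt(m), where m = 2^p is the number of registers.
func expectedError(p uint8) float64 {
	m := math.Exp2(float64(p))
	return 1.04 / math.Sqrt(m)
}

func main() {
	// p=20 corresponds to roughly the 0.1% error rate quoted above.
	fmt.Printf("%.4f%%\n", expectedError(20)*100)
}
```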

It may seem strange that it is slower to add items while the HLL is less full, but this makes sense: when there aren't many items, we go to extra lengths and use a different insertion model in order to fulfill the error guarantees.

If you care more about insertion speed than about having good error bounds while the HLL is still relatively empty, simply call h.ToNormal() on the HLL once you have instantiated it in order to skip the sparse phase. This is desirable if you are pre-loading the HLL with a lot of data and know that once the loading is done it will be out of the sparse phase anyway (so you may as well get 4x faster insertion speeds so that your loading procedure finishes faster!).

Benchmarks can be run with go test -bench=.

Hashing functions

"Isn't the entropy of your hashing function very important for the error estimates?" you may be asking. Why, yes it is! We use MurmurHash3's 64-bit hash function by default, but this can be changed: simply create a hashing function that takes a string and returns a uint64, and set your HLL object's Hasher field to it. This should only be done before you have inserted any items into the object!

This is quite useful if you know a priori some properties of the data you will insert and can thus pick a more appropriate hashing function.
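For instance, a replacement hasher can be any func(string) uint64. This sketch builds one from the standard library's FNV-1a hash (whether FNV's distribution is good enough for your data is yours to verify; the final assignment assumes the Hasher field described above):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fnvHash has the same shape as the default MMH3Hash: it takes a
// string and returns a uint64, here using the stdlib FNV-1a hash.
func fnvHash(value string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(value))
	return h.Sum64()
}

func main() {
	// A Hasher must be deterministic and spread distinct inputs apart.
	fmt.Println(fnvHash("foo") == fnvHash("foo"), fnvHash("foo") != fnvHash("bar"))
}
```

To use it, set h.Hasher = fnvHash on a freshly created HLL, before the first Add call.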

Resources

Documentation

Index

Constants

const (
	SPARSE byte = iota
	NORMAL
)

Defines the constants used to identify sparse vs. normal mode HLLs

Variables

var (
	// ErrInvalidP is returned if an invalid precision is requested
	ErrInvalidP = errors.New("invalid value of P, must be 4<=p<=25")

	// ErrSameP is returned if an operation is requested between two HLL
	// objects with different precisions
	ErrSameP = errors.New("both HLL instances must have the same value of P")

	// ErrErrorRateOutOfBounds is returned if an invalid error rate is
	// requested
	ErrErrorRateOutOfBounds = errors.New("error rate must be 0.26>=errorRate>=0.00025390625")
)

Functions

func MMH3Hash

func MMH3Hash(value string) uint64

MMH3Hash is the default hasher and uses murmurhash to return a uint64 NOTE: This hashing function will clobber the original hash

Types

type HLL

type HLL struct {
	P uint8

	Hasher func(string) uint64
	// contains filtered or unexported fields
}

HLL is the structure holding the HLL registers and maintaining state. State includes:

- whether we are in sparse or normal mode
- the register values
- the desired precision
- a reference to the hashing function used

func NewHLL

func NewHLL(p uint8) (*HLL, error)

NewHLL creates a new HLL object given a normal mode precision between 4 and 25

func NewHLLByError

func NewHLLByError(errorRate float64) (*HLL, error)

NewHLLByError creates a new HLL object with error rate given by `errorRate`. The error rate must be between 26% and 0.025390625%

func (*HLL) Add

func (h *HLL) Add(value string)

Add will add the given string value to the HLL using the currently set Hasher function

func (*HLL) AddHash

func (h *HLL) AddHash(hash uint64)

AddHash will add the given uint64 hash to the HLL

func (*HLL) AddWithHasher

func (h *HLL) AddWithHasher(value string, hasher func(string) uint64)

AddWithHasher will add the given string value to the HLL using the specified hasher function.

func (*HLL) Cardinality

func (h *HLL) Cardinality() float64

Cardinality returns the estimated cardinality of the current HLL object

func (*HLL) CardinalityIntersection

func (h *HLL) CardinalityIntersection(other *HLL) (float64, error)

CardinalityIntersection returns the estimated cardinality of the intersection between this HLL object and another one. That is, it returns an estimate of the number of unique items that occur in both this and the other HLL object. This is done with the Inclusion–exclusion principle and does not satisfy the error guarantee.

func (*HLL) CardinalityUnion

func (h *HLL) CardinalityUnion(other *HLL) (float64, error)

CardinalityUnion returns the estimated cardinality of the union between this and another HLL object. This result would be the same as first taking the union between this and the other object and then calling Cardinality. However, by calling this function we are not making any changes to the HLL object.

func (*HLL) MarshalBinary

func (h *HLL) MarshalBinary() ([]byte, error)

MarshalBinary implements encoding.BinaryMarshaler. Does not serialize hasher!

func (*HLL) ToNormal

func (h *HLL) ToNormal()

ToNormal will convert the current HLL to normal mode if it is in sparse mode, maintaining any data already inserted into the structure

func (*HLL) Union

func (h *HLL) Union(other *HLL) error

Union will merge all data in another HLL object into this one.

func (*HLL) UnmarshalBinary

func (h *HLL) UnmarshalBinary(data []byte) error

UnmarshalBinary implements encoding.BinaryUnmarshaler. Preserves the hasher.

Directories

Path Synopsis
MurmurHash implementation.
