cuckoo

Version: v1.0.6
Published: Oct 4, 2023 License: MIT Imports: 5 Imported by: 8

README

Cuckoo Filter

Well-tuned, production-ready cuckoo filter that performs best in class for low false positive rates (at around 0.01%). For details, see full evaluation.

Background

A cuckoo filter is a Bloom filter replacement for approximate set-membership queries. Bloom filters are well-known, space-efficient data structures for answering queries like "is item x in the set?", but they do not support deletion. Variants that do support deletion (like counting Bloom filters) usually require much more space.

Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (hence the name). It is essentially a cuckoo hash table that stores each key's fingerprint. Because cuckoo hash tables can be highly compact, a cuckoo filter can use less space than a conventional Bloom filter in applications that require low false positive rates (< 3%).

"Cuckoo Filter: Better Than Bloom" by Bin Fan, Dave Andersen and Michael Kaminsky
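
To make the fingerprint idea concrete, here is a minimal sketch of the partial-key cuckoo hashing described in the paper: the alternate bucket is derived from the current bucket and the fingerprint alone, so a stored fingerprint can be relocated without the original key. The hash function (FNV-1a), the table size and the helper names are assumptions chosen for illustration, not this package's internals.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBuckets is a hypothetical table size. It must be a power of two so that
// masking with numBuckets-1 keeps the XOR trick below reversible.
const numBuckets = 1 << 10

// hash64 is a stand-in hash (FNV-1a); the real package may use another one.
func hash64(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

// fingerprintAndIndex derives a non-zero 16-bit fingerprint and the primary
// bucket index from a single hash of the item.
func fingerprintAndIndex(item []byte) (fp uint16, i1 uint64) {
	h := hash64(item)
	fp = uint16(h >> 48) // top 16 bits of the hash
	if fp == 0 {
		fp = 1 // reserve 0 to mean "empty slot" in this sketch
	}
	i1 = h & (numBuckets - 1)
	return fp, i1
}

// altIndex returns the other candidate bucket: i XOR hash(fingerprint).
// Applying it twice yields the starting bucket again.
func altIndex(fp uint16, i uint64) uint64 {
	fpHash := hash64([]byte{byte(fp), byte(fp >> 8)})
	return (i ^ fpHash) & (numBuckets - 1)
}

func main() {
	fp, i1 := fingerprintAndIndex([]byte("pizza"))
	i2 := altIndex(fp, i1)
	fmt.Printf("fingerprint=%#04x buckets=(%d, %d)\n", fp, i1, i2)
	fmt.Println(altIndex(fp, i2) == i1) // true: the mapping is its own inverse
}
```

A lookup only has to compare the fingerprint against the slots of these two buckets, which is also why two different items with the same fingerprint and bucket produce a false positive.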

Implementation details

The paper cited above leaves several parameters open. In this implementation:

  1. Every element has 2 possible bucket indices
  2. Buckets have a static size of 4 fingerprints
  3. Fingerprints have a static size of 16 bits

Choices 1 and 2 are the optimum suggested by the authors. Choice 3 comes down to the desired false positive rate. Given a target false positive rate r and a bucket size b, they suggest choosing the fingerprint size f using

f >= log2(2b/r) bits

With the 16-bit fingerprints used in this repository, you can expect r ~= 0.0001. Other implementations use 8-bit fingerprints, which correspond to a false positive rate of r ~= 0.03.
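
The numbers above can be checked by solving the inequality for r, which gives r ~= 2b / 2^f. The following sketch does that arithmetic; the function names are made up for this example and are not part of the package.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveBound solves f >= log2(2b/r) for r, giving r = 2b / 2^f.
func falsePositiveBound(bucketSize, fingerprintBits int) float64 {
	return 2 * float64(bucketSize) / math.Exp2(float64(fingerprintBits))
}

// minFingerprintBits returns the smallest whole number of bits f that
// satisfies f >= log2(2b/r) for the target rate r.
func minFingerprintBits(bucketSize int, targetRate float64) int {
	return int(math.Ceil(math.Log2(2 * float64(bucketSize) / targetRate)))
}

func main() {
	fmt.Printf("%.6f\n", falsePositiveBound(4, 16)) // 0.000122 (8/65536)
	fmt.Printf("%.6f\n", falsePositiveBound(4, 8))  // 0.031250 (8/256)
	fmt.Println(minFingerprintBits(4, 0.0001))      // 17
}
```

Note that 16 bits yields roughly 0.00012, slightly above 0.0001, so a strict target of exactly 0.0001 would need 17 bits by this formula.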

Example usage

import (
	"fmt"

	cuckoo "github.com/panmari/cuckoofilter"
)

func Example() {
	cf := cuckoo.NewFilter(1000)

	cf.Insert([]byte("pizza"))
	cf.Insert([]byte("tacos"))
	cf.Insert([]byte("tacos")) // Re-insertion is possible.

	fmt.Println(cf.Lookup([]byte("pizza")))
	fmt.Println(cf.Lookup([]byte("missing")))

	cf.Reset()
	fmt.Println(cf.Lookup([]byte("pizza")))
	// Output:
	// true
	// false
	// false
}

For more examples, see the example tests. Operations on a filter are not thread-safe by default. See the ThreadSafe example in the documentation below for using the filter concurrently.

Documentation

Overview

Package cuckoo provides a Cuckoo Filter, a Bloom filter replacement for approximated set-membership queries.

"Cuckoo Filter: Better Than Bloom" by Bin Fan, Dave Andersen and Michael Kaminsky (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf)

Example
package main

import (
	"fmt"

	cuckoo "github.com/panmari/cuckoofilter"
)

func main() {
	cf := cuckoo.NewFilter(1000)

	cf.Insert([]byte("pizza"))
	cf.Insert([]byte("tacos"))
	cf.Insert([]byte("tacos")) // Re-insertion is possible.

	fmt.Println(cf.Lookup([]byte("pizza")))
	fmt.Println(cf.Lookup([]byte("missing")))

	cf.Reset()
	fmt.Println(cf.Lookup([]byte("pizza")))
}
Output:

true
false
false
Example (ThreadSafe)
package main

import (
	"fmt"
	"sync"

	cuckoo "github.com/panmari/cuckoofilter"
)

// Small wrapper around cuckoo filter making it thread safe.
type threadSafeFilter struct {
	cf *cuckoo.Filter
	mu sync.RWMutex
}

func (f *threadSafeFilter) insert(item []byte) {
	// Concurrent inserts need a Write lock.
	f.mu.Lock()
	defer f.mu.Unlock()
	f.cf.Insert(item)
}

func (f *threadSafeFilter) lookup(item []byte) bool {
	// Concurrent lookups need a read lock.
	f.mu.RLock()
	defer f.mu.RUnlock()
	return f.cf.Lookup(item)
}

func main() {
	cf := &threadSafeFilter{
		cf: cuckoo.NewFilter(1000),
	}

	var wg sync.WaitGroup
	// Insert items concurrently...
	for i := byte(0); i < 50; i++ {
		wg.Add(1)
		go func(item byte) {
			defer wg.Done()
			cf.insert([]byte{item})
		}(i)
	}

	// ...while also doing lookups concurrently.
	for i := byte(0); i < 100; i++ {
		wg.Add(1)
		go func(item byte) {
			defer wg.Done()
			// State is not well-defined here, so we can't define expectations.
			cf.lookup([]byte{item})
		}(i)
	}
	wg.Wait()

	// Simple lookups to verify initialization.
	fmt.Println(cf.lookup([]byte{1}))
	fmt.Println(cf.lookup([]byte{99}))

}
Output:

true
false

Types

type Filter

type Filter struct {
	// contains filtered or unexported fields
}

Filter is a probabilistic set-membership data structure (a cuckoo filter).

func Decode

func Decode(data []byte) (*Filter, error)

Decode returns a Cuckoofilter from a byte slice created using Encode.

func NewFilter

func NewFilter(numElements uint) *Filter

NewFilter returns a new cuckoofilter suitable for the given number of elements. When inserting more elements than that, insertion speed will drop significantly and insertions might fail altogether. A capacity of 1000000 is a normal default, which allocates about 2 MB on 64-bit machines.
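
The 2 MB figure can be reproduced with a back-of-the-envelope estimate. The bucket size of 4 and the 16-bit fingerprints come from the implementation details above; rounding the bucket count up to a power of two is an assumption of this sketch, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	bucketSize       = 4 // fingerprints per bucket
	fingerprintBytes = 2 // 16-bit fingerprints
)

// nextPow2 returns the smallest power of two that is >= n (1 for n <= 1).
func nextPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return uint64(1) << uint(bits.Len64(n-1))
}

// estimateTableBytes approximates the size of the fingerprint table for a
// filter sized for numElements. Assumption: the bucket count is rounded up
// to a power of two; the real constructor may size the table differently.
func estimateTableBytes(numElements uint64) uint64 {
	buckets := nextPow2((numElements + bucketSize - 1) / bucketSize)
	return buckets * bucketSize * fingerprintBytes
}

func main() {
	// 250000 buckets round up to 262144; 262144 * 4 slots * 2 bytes.
	fmt.Println(estimateTableBytes(1000000)) // 2097152
}
```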

func (*Filter) Count

func (cf *Filter) Count() uint

Count returns the number of items in the filter.

func (*Filter) Delete

func (cf *Filter) Delete(data []byte) bool

Delete data from the filter. Returns true if the data was found and deleted.

Example
package main

import (
	"fmt"

	cuckoo "github.com/panmari/cuckoofilter"
)

func main() {
	cf := cuckoo.NewFilter(1000)

	cf.Insert([]byte("pizza"))
	cf.Insert([]byte("tacos"))

	fmt.Println(cf.Lookup([]byte("pizza")))

	cf.Delete([]byte("pizza"))
	fmt.Println(cf.Lookup([]byte("pizza")))
}
Output:

true
false

func (*Filter) Encode

func (cf *Filter) Encode() []byte

Encode returns a byte slice representing a Cuckoofilter.

func (*Filter) Insert

func (cf *Filter) Insert(data []byte) bool

Insert data into the filter. Returns false if insertion failed. In the resulting state, the filter

  * Might return false negatives
  * Deletes are not guaranteed to work

To increase success rate of inserts, create a larger filter.
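
Because a failed insert leaves the filter in a degraded state, callers may want to check the return value rather than discard it. The sketch below collects rejected items through a small interface that matches the documented Insert signature, so it runs without the module; cappedFilter is a stand-in written for this example and does not reflect the real filter's behaviour.

```go
package main

import "fmt"

// inserter is the subset of the filter API this sketch relies on. The
// documented method `func (cf *Filter) Insert(data []byte) bool` has the
// same shape.
type inserter interface {
	Insert(data []byte) bool
}

// insertAll inserts every item and returns the ones the filter rejected, so
// the caller can react (for example by rebuilding into a larger filter).
func insertAll(f inserter, items [][]byte) (rejected [][]byte) {
	for _, item := range items {
		if !f.Insert(item) {
			rejected = append(rejected, item)
		}
	}
	return rejected
}

// cappedFilter is a hypothetical stand-in that rejects inserts once it holds
// max items. It only exists to exercise insertAll.
type cappedFilter struct {
	n, max int
}

func (c *cappedFilter) Insert(data []byte) bool {
	if c.n >= c.max {
		return false
	}
	c.n++
	return true
}

func main() {
	items := [][]byte{[]byte("pizza"), []byte("tacos"), []byte("ramen")}
	rejected := insertAll(&cappedFilter{max: 2}, items)
	fmt.Println(len(rejected), string(rejected[0])) // 1 ramen
}
```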

func (*Filter) LoadFactor added in v0.0.6

func (cf *Filter) LoadFactor() float64

LoadFactor returns the fraction of slots that are occupied.

func (*Filter) Lookup

func (cf *Filter) Lookup(data []byte) bool

Lookup returns true if data is in the filter.

Example
package main

import (
	"fmt"

	cuckoo "github.com/panmari/cuckoofilter"
)

func main() {
	cf := cuckoo.NewFilter(1000)

	cf.Insert([]byte("pizza"))
	cf.Insert([]byte("tacos"))

	fmt.Println(cf.Lookup([]byte("pizza")))
	fmt.Println(cf.Lookup([]byte("missing")))
}
Output:

true
false

func (*Filter) Reset

func (cf *Filter) Reset()

Reset removes all items from the filter, setting count to 0.
