countminsketch

package module
v0.0.0-...-b4482ac Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 19, 2016 License: MIT Imports: 9 Imported by: 3

README

countminsketch

An implementation of Count-Min Sketch in Golang.

Introduction of Count-Min Sketch, from Wikipedia[1]

The Count–min sketch (or CM sketch) is a probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream in many different ways. The algorithm was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan.

Count–min sketches are somewhat similar to Bloom filters; the main distinction is that Bloom filters represent sets, while CM sketches represent multisets and frequency tables. Spectral Bloom filters with multi-set policy, are conceptually isomorphic to the Count-Min Sketch.

The code is deeply inspired by an implementation of Bloom filters in golang, bloom.

Same to bloom, the hashing function used is FNV, provided by Go package (hash/fnv). For a item, the 64-bit FNV hash is computed, and upper and lower 32 bit numbers, call them h1 and h2, are used. Then, the i th hashing function is:

h1 + h2*i

Sketch Accuracy

Accuracy guarantees will be made in terms of a pair of user specified parameters, ε and δ, meaning that the error in answering a query is within a factor of ε with probability δ[2]

For a sketch of size w × d with total count N , it follows that any estimate has error at most 2N/w, with probability at least 1 - (1/2)^d. So setting the parameters w and d large enough allows us to achieve very high accuracy while using relatively little space[3].

Suppose we want an error of at most 0.1% (of the sum of all frequencies), with 99.9% certainty. Then we want 2/w = 1/1000, we set w = 2000, and = 0.001, i.e. d = log 0.001 / log 0.5 ≤ 10. Using uint counters, the space required by the array of counters is w × d × 4 = 80KB in 32 bit OS, and w × d × 8 = 160KB in 64 bit OS [3].

To create with given error rate and confidence, we could use constructor NewWithEstimates.

Parallelization

The parallelizing part of Count-Min Sketch is the hashing step. But in this implementation, only one basic hashing step is computed. So the parallelization is not necessary.

If you have to, try to split the data and count separately. And at last Merge them.

Install

This package is "go-gettable", just:

go get github.com/shenwei356/countminsketch

Usage

import "github.com/shenwei356/countminsketch"

var epsilon, delta float64
epsilon, delta = 0.1, 0.9
s := countminsketch.NewWithEstimates(epsilon, delta)
fmt.Printf("ε: %f, δ: %f -> d: %d, w: %d\n", epsilon, delta, s.D(), s.W())

epsilon, delta = 0.0001, 0.9999
s = countminsketch.NewWithEstimates(epsilon, delta)
fmt.Printf("ε: %f, δ: %f -> d: %d, w: %d\n", epsilon, delta, s.D(), s.W())

key := "abc"
s.UpdateString(key, 1)
fmt.Printf("%s:%d\n\n", key, s.EstimateString(key))

//////////////////////////////////////////////////
file := "data"
s.UpdateString(key, 2)
_, err := s.WriteToFile(file)
defer func() {
	err := os.Remove(file)
	checkerr(err)
}()

cm, err := countminsketch.NewFromFile(file)
checkerr(err)

fmt.Printf("%s:%d\n", key, cm.EstimateString(key))

//////////////////////////////////////////////////
s = countminsketch.NewWithEstimates(0.1, 0.9)
s.UpdateString(key, 10)
bytes, err := s.MarshalJSON()
checkerr(err)
fmt.Println(string(bytes))

err = s.UnmarshalJSON(bytes)
checkerr(err)
s.UpdateString(key, 10)

fmt.Printf("%s:%d\n", key, s.EstimateString(key))

Output

ε: 0.100000, δ: 0.900000 -> d: 4, w: 20
ε: 0.000100, δ: 0.999900 -> d: 14, w: 20000
abc:1

abc:3
{"d":4,"w":20,"count":[[0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0],[0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10]]}
abc:20

Benchmark

Benchmark_Update_ε0_001_δ0_999           5000000               515 ns/op
Benchmark_Estimates_ε0_001_δ0_999        5000000               481 ns/op
Benchmark_Update_ε0_000001_δ0_9999       2000000               941 ns/op
Benchmark_Estimates_ε0_000001_δ0_9999    2000000               841 ns/op

Documentation

GoDoc

Reference

  1. Wikipedia
  2. An Improved Data Stream Summary: The Count-Min Sketch and its Applications
  3. Approximating Data with the Count-Min Data Structure
  4. https://github.com/jehiah/countmin
  5. https://github.com/mtchavez/countmin

Copyright (c) 2014-2016, Wei Shen (shenwei356@gmail.com)

MIT License

Documentation

Overview

Package countminsketch is an implementation of Count-Min Sketch in Golang.

http://github.com/shenwei356/countmin/

The code is deeply inspired by an implementation of Bloom filters in golang, [bloom](https://github.com/willf/bloom).

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CountMinSketch

type CountMinSketch struct {
	// contains filtered or unexported fields
}

CountMinSketch struct. d is the number of hashing functions, w is the size of every hash table. count, a matrix, is used to store the count. uint is used to store count, the maximum count is 1<<32-1 in 32 bit OS, and 1<<64-1 in 64 bit OS.

func New

func New(d uint, w uint) (s *CountMinSketch, err error)

New creates a new Count-Min Sketch with _d_ hashing functions and _w_ hash value range

func NewFromFile

func NewFromFile(file string) (*CountMinSketch, error)

NewFromFile creates a new Count-Min Sketch from dumpped file

func NewWithEstimates

func NewWithEstimates(epsilon, delta float64) (*CountMinSketch, error)

NewWithEstimates creates a new Count-Min Sketch with given error rate and confidence. Accuracy guarantees will be made in terms of a pair of user specified parameters, ε and δ, meaning that the error in answering a query is within a factor of ε with probability δ

func (*CountMinSketch) D

func (s *CountMinSketch) D() uint

D returns the number of hashing functions

func (*CountMinSketch) Estimate

func (s *CountMinSketch) Estimate(key []byte) uint64

Estimate the frequency of a key. It is point query.

func (*CountMinSketch) EstimateString

func (s *CountMinSketch) EstimateString(key string) uint64

EstimateString estimate the frequency of a key of string

func (*CountMinSketch) GobDecode

func (s *CountMinSketch) GobDecode(data []byte) error

GobDecode implements gob.GobDecoder interface.

func (*CountMinSketch) GobEncode

func (s *CountMinSketch) GobEncode() ([]byte, error)

GobEncode implements gob.GobEncoder interface.

func (*CountMinSketch) MarshalJSON

func (s *CountMinSketch) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface. Based on https://github.com/willf/bloom/blob/master/bloom.go

func (*CountMinSketch) Merge

func (s *CountMinSketch) Merge(other *CountMinSketch) error

Merge combines this CountMinSketch with another one

func (*CountMinSketch) ReadFrom

func (s *CountMinSketch) ReadFrom(stream io.Reader) (int64, error)

ReadFrom a binary representation of the CountMinSketch from an i/o stream. Based on https://github.com/willf/bloom/blob/master/bloom.go

func (*CountMinSketch) ReadFromFile

func (s *CountMinSketch) ReadFromFile(file string) (int64, error)

ReadFromFile reads Count-Min Sketch from file

func (*CountMinSketch) UnmarshalJSON

func (s *CountMinSketch) UnmarshalJSON(data []byte) error

UnmarshalJSON implements json.Unmarshaler interface. Based on https://github.com/willf/bloom/blob/master/bloom.go

func (*CountMinSketch) Update

func (s *CountMinSketch) Update(key []byte, count uint64)

Update the frequency of a key

func (*CountMinSketch) UpdateString

func (s *CountMinSketch) UpdateString(key string, count uint64)

UpdateString updates the frequency of a key

func (*CountMinSketch) W

func (s *CountMinSketch) W() uint

W returns the size of hashing functions

func (*CountMinSketch) WriteTo

func (s *CountMinSketch) WriteTo(stream io.Writer) (int64, error)

WriteTo writes a binary representation of the CountMinSketch to an i/o stream. Based on https://github.com/willf/bloom/blob/master/bloom.go

func (*CountMinSketch) WriteToFile

func (s *CountMinSketch) WriteToFile(file string) (int64, error)

WriteToFile writes the Count-Min Sketch to file

type CountMinSketchJSON

type CountMinSketchJSON struct {
	D     uint       `json:"d"`
	W     uint       `json:"w"`
	Count [][]uint64 `json:"count"`
}

CountMinSketchJSON is the JSON struct of CountMinSketch for marshal and unmarshal

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL