rsqf

package module
v0.0.0-...-f4eb522 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 19, 2017 License: BSD-3-Clause Imports: 3 Imported by: 0

README

Rank and Seek Quotient Filter

Godoc Build Status Go Report Card codecov

A general purpose counting filter: making every bit count by Pandey et al., SIGMOD’17

A Rank and Seek Quotient Filter (RSQF) is an Approximate Membership Query data structure. It is similar to the popular Bloom Filter (BF) however where a BF only provides insert, and lookup.

An RSQF provides:

  • insert
  • delete
  • lookup
  • resize1
  • merge

1 - resize does not permit an increase in p (e.g. unique identity capacity). It does however increase the total number of available slots. This provides two benefits:

  1. Faster queries as the remainders can be aligned closer to their home slot with shorter runs.
  2. Increase in the number of entries allowing for deletion.

Overview

This implementation of RSQF has a fixed error rate of 1/512 or ~0.2%. This was done so that each block is ensured to be contiguous in memory.

r is therefore fixed at 9 (r = log2(1/0.001953)). q will vary relative to p (e.g. for 1m entries p = log2(1,000,000/0.001953)).

A filter that accepts 1 million entries will require ~11.67 bits per entry and require approximately 1.46MB (Si) of memory.

Status

  • Rank
  • Select
  • Hash (fnv, might consider Murmur3, CityHash, or xxHash)
  • FirstAvailableSlot
  • Insert
  • MayContain
  • Resize
  • Merge
  • Count (CQF)

Sizing

The following table shows the approximate sizing for this RSQF implementation at various values of n.

n δ p r q Q size
100,000 1/512 26 9 17 184.32 KB
1,000,000 1/512 29 9 20 1.47 MB
10,000,000 1/512 33 9 24 23.59 MB
100,000,000 1/512 36 9 27 188.74 MB
1,000,000,000 1/512 39 9 30 1.51 GB

Glossary

This glossary summarises the variables specified in the stonybrook paper.

Note: Where integers are specified, fractional values are rounded up to the nearest integer when calculated from floating point values.

n - (integer)
Maximum number of insertions (e.g. 1,000,000).
δ - (fraction)
Error rate or false-positive rate (e.g. 1/512 or 1/100).
p - (integer)
Number of bits required from the hashed input to achieve the target error for the given number of insertions (n). The p-bit hash is split into high bits (quotient) and low bits (remainder).
p = log2(n/δ)
r / remainder - (integer)
The number of remainder bits which are written to `Q.remainders`.
r = log2(1/δ)
q / quotient - (integer)
The number of quotient bits used to indicate the expected `home slot` in the filter.
q = p - r
run
A run is a consecutive group of remainders where the quotient is equal.
h0(a) = h0(b) = h0(c)
occupied - (bit)
A bit that indicates the position of a home slot for a given `run`.
runend - (bit)
A bit that indicates the end of a `run`.
Q - (struct)
The RSQF data structure which contains 2q r-bits of available space allocated by a `block` array. The memory in bits required by the struct can be calculated as follows:
2^q/64 * (8+64(r+2))
Q.occupieds - (bit vector)
Q.runends - (bit vector)
Q.remainders - (bit vector)
block
A block is `64(r + 2) + 8` bit structure. It is composed of the following fields:
  • offset - 1 x 8-bit.
  • runends - 1 x 64-bit.
  • occupieds - 1 x 64-bit.
  • remainders - r x 64-bit.
home slot - (array index)
The home slot is the location where a remainder would be placed if h0(x) is unoccupied by another value.
i = h0(x); Q[i].remainders == h1(x)
slot i - (array index)
i = h0(x)
h(x) - (integer)
A universal hashing function. For this library FNV-1a (64-bit) was employed as it is available in the standard library.
h0(x) / i - (integer)
The masked upper bits of the hash shifted right `r` times.
h0 = (h(x) >> r) & (2^q - 1)
h1(x) - (integer)
The masked lower half of the hash.
h1 = h(x) & (2^r - 1)
B - (bit vector)
Variable representing a bit-vector. Typically one of `Q.occupieds`, `Q.runends`, or `Q.remainders`.
RANK(B, i) - (integer)
Rank returns the number of 1s in B up to position i.
RANK(B, i) =
SELECT(B, i) - (integer)
Select returns the index of the ith 1 in B.
SELECT(B, i) =
Oi - (integer)
Is every 64th slot which is stored in `Q[i].offset` to save space. The offset is calculated with the algorithm that follows.
Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) - i
Oj - (integer)
Oj is a derived intermediate slot value which is discovered using the algorithm that follows.
i = h0(x)
Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) - i
d = RANK(q.occupieds[i + 1, ..., j], j - i - 1)
t = SELECT(Q.runends[i + Oi + 1,...,2^q - 1], d)
Oj = i + Oi + t - j

Documentation

Index

Constants

This section is empty.

Variables

View Source
var ErrFilterOverflow = errors.New("RSQF overflow")

ErrFilterOverflow is returned if an insert would result in an overflow within the filter.

Functions

func Rank

func Rank(B, i uint64) uint64

Rank returns the number of 1s in B up to position i. Where position i can be between 0 to 63.

func Select

func Select(B, i uint64) uint64

Select returns the index of the ith 1 in B. If return is 64 it spans into the next bit vector. ~83ns... too damn slow! References:

https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

Types

type Rsqf

type Rsqf struct {
	Q []block
	// contains filtered or unexported fields
}

Rsqf is the core datastructure for this filter. Might evolve to using a 64-bit array which expands the filters size to 3-bits + r per slot from 2.125 + r.

func New

func New(n float64) *Rsqf

New returns a new Rsqf with a fixed 1% error rate.

func (*Rsqf) Hash

func (q *Rsqf) Hash(b []byte) uint64

Hash applies a 64-bit hashing algorithm to b and then splits the result into h0 and h1. Shifting h0 to the right by the remainder size.

func (*Rsqf) Insert

func (q *Rsqf) Insert(x uint64) error

Insert places the hash x into the filter where space is available.

func Insert(Q, x)

r <- rank(Q.occupieds, b)
s <- select(Q.runends, t)
if h0(x) > s then // home slot advantage
	Q.remainders[h0(x)] <- h1(x)
	Q.runends[h0(x)] <- 1
else // oh noes someones in our home slot
	s <- s + 1 // next slot
	n <- FirstAvailableSlot(Q, x) // end of this run
	while n > s do // can prob do this as a block shift op on remainders/runends
		Q.remainders[n] <- Q.remainders[n - 1] // shift remainder right.
		Q.runends[n] <- Q.runends[n - 1] // shift runend value right
		n <- n - 1 // decrement to previous slot
	Q.remainders[s] <- h1(x) // insert slot
	if Q.occupieds[h0(x)] == 1 then
		Q.runends[s - 1] <- 0 // zero previous runend
	Q.runends[s] <- 1 // set current runend
Q.occupieds[h0(x)] <- 1 // force set occupieds for h0(x)
return

func (*Rsqf) MayContain

func (q *Rsqf) MayContain(x []byte)

MayContain tests if the hash exists in this filter. False positives are possible however false negatives cannot occur.

func MayContain(Q, x)

	b <- h0(x)
	if Q.occupieds[b] = 0 then
		return 0
	t <- rank(Q.occupieds, b)
	l <- select(Q.runends, t)
	v <- h1(x)
	repeat
		if Q.remainders[l] == v then
			return 1
		l <- l - 1
	until l < b or Q.runends[l] = 1
  return false

func (*Rsqf) Put

func (q *Rsqf) Put(h0, h1 uint64)

Put treats the Remainders block as a block of memory.

func (*Rsqf) Put2

func (q *Rsqf) Put2(h0, h1 uint64)

Put2 treats each row in the Remainders block as a bit field for the associated bit position in a given the remainder.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL