Rank and Seek Quotient Filter
A general purpose counting filter: making every bit count by Pandey et al., SIGMOD’17
A Rank and Seek Quotient Filter (RSQF) is an Approximate Membership Query data
structure. It is similar to the popular Bloom Filter (BF) however where a BF
only provides insert, and lookup.
An RSQF provides:
- insert
- delete
- lookup
- resize1
- merge
1 - resize does not permit an increase in p (e.g. unique identity capacity).
It does however increase the total number of available slots. This provides
two benefits:
- Faster queries as the remainders can be aligned closer to their home slot
with shorter runs.
- Increase in the number of entries allowing for deletion.
Overview
This implementation of RSQF has a fixed error rate of 1/512 or ~0.2%. This was
done so that each block is ensured to be contiguous in memory.
r is therefore fixed at 9 (r = log2(1/0.001953)
). q will vary relative to p
(e.g. for 1m entries p = log2(1,000,000/0.001953)
).
A filter that accepts 1 million entries will require ~11.67 bits per entry and
require approximately 1.46MB (Si) of memory.
Status
- Rank
- Select
- Hash (fnv, might consider Murmur3, CityHash, or xxHash)
- FirstAvailableSlot
- Insert
- MayContain
- Resize
- Merge
- Count (CQF)
Sizing
The following table shows the approximate sizing for this RSQF implementation
at various values of n.
n |
δ |
p |
r |
q |
Q size |
100,000 |
1/512 |
26 |
9 |
17 |
184.32 KB |
1,000,000 |
1/512 |
29 |
9 |
20 |
1.47 MB |
10,000,000 |
1/512 |
33 |
9 |
24 |
23.59 MB |
100,000,000 |
1/512 |
36 |
9 |
27 |
188.74 MB |
1,000,000,000 |
1/512 |
39 |
9 |
30 |
1.51 GB |
Glossary
This glossary summarises the variables specified in the stonybrook paper.
Note: Where integers are specified, fractional values are rounded up to
the nearest integer when calculated from floating point values.
- n - (integer)
- Maximum number of insertions (e.g. 1,000,000).
- δ - (fraction)
- Error rate or false-positive rate (e.g. 1/512 or 1/100).
- p - (integer)
- Number of bits required from the hashed input to achieve the target error
for the given number of insertions (n). The p-bit hash is split into high bits
(quotient) and low bits (remainder).
p = log2(n/δ)
- r / remainder - (integer)
- The number of remainder bits which are written to `Q.remainders`.
r = log2(1/δ)
- q / quotient - (integer)
- The number of quotient bits used to indicate the expected `home slot` in
the filter.
q = p - r
- run
- A run is a consecutive group of remainders where the quotient is
equal.
h0(a) = h0(b) = h0(c)
- occupied - (bit)
- A bit that indicates the position of a home slot for a given `run`.
- runend - (bit)
- A bit that indicates the end of a `run`.
- Q - (struct)
- The RSQF data structure which contains 2q r-bits of available
space allocated by a `block` array. The memory in bits required by the
struct can be calculated as follows:
2^q/64 * (8+64(r+2))
- Q.occupieds - (bit vector)
- Q.runends - (bit vector)
- Q.remainders - (bit vector)
- block
- A block is `64(r + 2) + 8` bit structure. It is composed of the
following fields:
- offset - 1 x 8-bit.
- runends - 1 x 64-bit.
- occupieds - 1 x 64-bit.
- remainders - r x 64-bit.
- home slot - (array index)
- The home slot is the location where a remainder would be placed if h0(x)
is unoccupied by another value.
i = h0(x); Q[i].remainders == h1(x)
- slot i - (array index)
i = h0(x)
- h(x) - (integer)
- A universal hashing function. For this library FNV-1a (64-bit) was
employed as it is available in the standard library.
- h0(x) / i - (integer)
- The masked upper bits of the hash shifted right `r` times.
h0 = (h(x) >> r) & (2^q - 1)
- h1(x) - (integer)
- The masked lower half of the hash.
h1 = h(x) & (2^r - 1)
- B - (bit vector)
- Variable representing a bit-vector. Typically one of `Q.occupieds`,
`Q.runends`, or `Q.remainders`.
- RANK(B, i) - (integer)
- Rank returns the number of 1s in B up to position i.
RANK(B, i) =
- SELECT(B, i) - (integer)
- Select returns the index of the ith 1 in B.
SELECT(B, i) =
- Oi - (integer)
- Is every 64th slot which is stored in `Q[i].offset` to save
space. The offset is calculated with the algorithm that follows.
Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) - i
- Oj - (integer)
- Oj is a derived intermediate slot value which is discovered
using the algorithm that follows.
i = h0(x)
Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) - i
d = RANK(q.occupieds[i + 1, ..., j], j - i - 1)
t = SELECT(Q.runends[i + Oi + 1,...,2^q - 1], d)
Oj = i + Oi + t - j