Documentation ¶
Index ¶
- Constants
- Variables
- func BytesToSig(data []byte) ([]uint64, error)
- func Containment(q, x []uint64, qSize, xSize int) float64
- func Recs2Chan(recs []*DomainRecord) <-chan *DomainRecord
- func SigToBytes(sig []uint64) []byte
- type BySize
- type DomainRecord
- type Lsh
- type LshEnsemble
- func BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, totalNumDomains int, ...) (*LshEnsemble, error)
- func BootstrapLshEnsembleOptimal(numPart, numHash, maxK int, sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)
- func BootstrapLshEnsemblePlusEquiDepth(numPart, numHash, maxK, totalNumDomains int, ...) (*LshEnsemble, error)
- func BootstrapLshEnsemblePlusOptimal(numPart, numHash, maxK int, sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)
- func NewLshEnsemble(parts []Partition, numHash, maxK int) *LshEnsemble
- func NewLshEnsemblePlus(parts []Partition, numHash, maxK int) *LshEnsemble
- func (e *LshEnsemble) Add(key interface{}, sig []uint64, partInd int)
- func (e *LshEnsemble) Index()
- func (e *LshEnsemble) Prepare(key interface{}, sig []uint64, size int) error
- func (e *LshEnsemble) Query(sig []uint64, size int, threshold float64, done <-chan struct{}) <-chan interface{}
- func (e *LshEnsemble) QueryTimed(sig []uint64, size int, threshold float64) (result []interface{}, dur time.Duration)
- type LshForest
- type LshForestArray
- type Minhash
- type Partition
Constants ¶
const HashValueSize = 8
HashValueSize is 8, the number of byte used for each hash value
Variables ¶
var NewLshForest = NewLshForest32
NewLshForest default constructor uses 32 bit hash value
Functions ¶
func BytesToSig ¶
BytesToSig converts a byte slice into a signature
func Containment ¶
Containment returns the estimated containment of |Q \intersect X| / |Q|. q and x are the signatures of Q and X respectively. If either size is 0, the result is defined to be 0.
func Recs2Chan ¶
func Recs2Chan(recs []*DomainRecord) <-chan *DomainRecord
Recs2Chan is a utility function that converts a DomainRecord slice in memory to a DomainRecord channel.
func SigToBytes ¶
SigToBytes serializes the signature into byte slice
Types ¶
type BySize ¶
type BySize []*DomainRecord
BySize is a wrapper for sorting domains.
func (BySize) Subset ¶
func (rs BySize) Subset(lower, upper int) []*DomainRecord
Subset returns a subset of the domains given the size lower bound and upper bound.
type DomainRecord ¶
type DomainRecord struct { // The unique key of this domain. Key interface{} // The domain size. Size int // The MinHash signature of this domain. Signature []uint64 }
DomainRecord represents a domain record.
type Lsh ¶
type Lsh interface { // Add addes a new key into the index, it won't be searchable // until the next time Index() is called since the add. Add(key interface{}, sig []uint64) // Index makes all keys added so far searchable. Index() // Query searches the index given a minhash signature, and // the LSH parameters k and l. Result keys will be written to // the channel out. // Closing channel done will cancels the query execution. Query(sig []uint64, k, l int, out chan<- interface{}, done <-chan struct{}) // OptimalKL computes the optimal LSH parameters k and l given // x, the index domain size, q, the query domain size, and t, // the containment threshold. The resulting false positive (fp) // and false negative (fn) probabilities are returned as well. OptimalKL(x, q int, t float64) (optK, optL int, fp, fn float64) }
Lsh interface is implemented by LshForst and LshForestArray.
type LshEnsemble ¶
type LshEnsemble struct { Partitions []Partition // contains filtered or unexported fields }
LshEnsemble represents an LSH Ensemble index.
func BootstrapLshEnsembleEquiDepth ¶
func BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, totalNumDomains int, sortedDomains <-chan *DomainRecord) (*LshEnsemble, error)
BootstrapLshEnsembleEquiDepth builds an index from a channel of domains using equi-depth partitions -- partitions have approximately the same number of domains. The returned index consists of MinHash LSH implemented using LshForest. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomains is a DomainRecord channel emitting domains in sorted order by their sizes.
func BootstrapLshEnsembleOptimal ¶
func BootstrapLshEnsembleOptimal(numPart, numHash, maxK int, sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)
BootstrapLshEnsembleOptimal builds an index from domains using optimal partitioning. The returned index consists of MinHash LSH implemented using LshForest. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomainFactory is factory function that returns a DomainRecord channel emitting domains in sorted order by their sizes.
func BootstrapLshEnsemblePlusEquiDepth ¶
func BootstrapLshEnsemblePlusEquiDepth(numPart, numHash, maxK, totalNumDomains int, sortedDomains <-chan *DomainRecord) (*LshEnsemble, error)
BootstrapLshEnsemblePlusEquiDepth builds an index from a channel of domains using equi-depth partitions -- partitions have approximately the same number of domains. The returned index consists of MinHash LSH implemented using LshForestArray. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomains is a DomainRecord channel emitting domains in sorted order by their sizes.
func BootstrapLshEnsemblePlusOptimal ¶
func BootstrapLshEnsemblePlusOptimal(numPart, numHash, maxK int, sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)
BootstrapLshEnsemblePlusOptimal builds an index from domains using optimal partitioning. The returned index consists of MinHash LSH implemented using LshForestArray. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomainFactory is factory function that returns a DomainRecord channel emitting domains in sorted order by their sizes.
func NewLshEnsemble ¶
func NewLshEnsemble(parts []Partition, numHash, maxK int) *LshEnsemble
NewLshEnsemble initializes a new index consists of MinHash LSH implemented using LshForest. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band".
func NewLshEnsemblePlus ¶
func NewLshEnsemblePlus(parts []Partition, numHash, maxK int) *LshEnsemble
NewLshEnsemblePlus initializes a new index consists of MinHash LSH implemented using LshForestArray. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band".
func (*LshEnsemble) Add ¶
func (e *LshEnsemble) Add(key interface{}, sig []uint64, partInd int)
Add a new domain to the index given its partition ID - the index of the partition. The added domain won't be searchable until the Index() function is called.
func (*LshEnsemble) Prepare ¶
func (e *LshEnsemble) Prepare(key interface{}, sig []uint64, size int) error
Prepare adds a new domain to the index given its size, and partition will be selected automatically. It could be more efficient to use Add(). The added domain won't be searchable until the Index() function is called.
func (*LshEnsemble) Query ¶
func (e *LshEnsemble) Query(sig []uint64, size int, threshold float64, done <-chan struct{}) <-chan interface{}
Query returns the candidate domain keys in a channel. This function is given the MinHash signature of the query domain, sig, the domain size, the containment threshold, and a cancellation channel. Closing channel done will cancel the query execution. The query signature must be generated using the same seed as the signatures of the indexed domains, and have the same number of hash functions.
func (*LshEnsemble) QueryTimed ¶
func (e *LshEnsemble) QueryTimed(sig []uint64, size int, threshold float64) (result []interface{}, dur time.Duration)
QueryTimed is similar to Query, returns the candidate domain keys in a slice as well as the running time.
type LshForest ¶
type LshForest struct {
// contains filtered or unexported fields
}
LshForest represents a MinHash LSH implemented using LSH Forest (http://ilpubs.stanford.edu:8090/678/1/2005-14.pdf). It supports query-time setting of the MinHash LSH parameters L (number of bands) and K (number of hash functions per band).
func NewLshForest16 ¶
NewLshForest16 uses 16-bit hash values. MinHash signatures with 64 or 32 bit hash values will have their hash values trimed.
func NewLshForest32 ¶
NewLshForest32 uses 32-bit hash values. MinHash signatures with 64 bit hash values will have their hash values trimed.
func NewLshForest64 ¶
NewLshForest64 uses 64-bit hash values.
func (*LshForest) Add ¶
Add a key with MinHash signature into the index. The key won't be searchable until Index() is called.
type LshForestArray ¶
type LshForestArray struct {
// contains filtered or unexported fields
}
LshForestArray represents a MinHash LSH implemented using an array of LshForest. It allows a wider range for the K and L parameters.
func NewLshForestArray ¶
func NewLshForestArray(maxK, numHash int) *LshForestArray
NewLshForestArray initializes with parameters: maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". numHash is the number of hash functions in MinHash.
func (*LshForestArray) Add ¶
func (a *LshForestArray) Add(key interface{}, sig []uint64)
Add a key with MinHash signature into the index. The key won't be searchable until Index() is called.
func (*LshForestArray) Index ¶
func (a *LshForestArray) Index()
Index makes all the keys added searchable.
func (*LshForestArray) OptimalKL ¶
func (a *LshForestArray) OptimalKL(x, q int, t float64) (optK, optL int, fp, fn float64)
OptimalKL returns the optimal K and L for containment search, and the false positive and negative probabilities. where x is the indexed domain size, q is the query domain size, and t is the containment threshold.
func (*LshForestArray) Query ¶
func (a *LshForestArray) Query(sig []uint64, K, L int, out chan<- interface{}, done <-chan struct{})
Query returns candidate keys given the query signature and parameters.
type Minhash ¶
type Minhash struct {
// contains filtered or unexported fields
}
Minhash represents a MinHash object
func NewMinhash ¶
NewMinhash initializes a MinHash object with a seed and the number of hash functions.