minhash

package module
v0.0.0-...-ad340ca Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 15, 2019 License: MIT Imports: 3 Imported by: 8

README

go-minhash: bottom-k and minhash LSH implementations

godoc: http://godoc.org/github.com/dgryski/go-minhash

Documentation

Overview

Package minhash implements the bottom-k sketch for streaming set similarity.

For more information,

http://research.neustar.biz/2012/07/09/sketch-of-the-day-k-minimum-values/

MinHashing:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
https://en.wikipedia.org/wiki/MinHash

BottomK:
http://www.math.tau.ac.il/~haimk/papers/p225-cohen.pdf
http://cohenwang.org/edith/Papers/metrics394-cohen.pdf

http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer07distinct.pdf

This package works best when provided with a strong 64-bit hash function, such as CityHash, Spooky, Murmur3, or SipHash.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SimilarityBbit

func SimilarityBbit(sig1, sig2 []uint64, b uint) float64

SimilarityBbit computes an estimate for the similarity between two b-bit signatures

Types

type BottomK

type BottomK struct {
	// contains filtered or unexported fields
}

BottomK is a bottom-k sketch of a set

func NewBottomK

func NewBottomK(h Hash64, k int) *BottomK

NewBottomK returns a new BottomK implementation.

func (*BottomK) Cardinality

func (m *BottomK) Cardinality() int

Cardinality estimates the cardinality of the set

func (*BottomK) Merge

func (m *BottomK) Merge(m2 *BottomK)

Merge combines the signatures of the second set, creating the signature of their union.

func (*BottomK) Push

func (m *BottomK) Push(b []byte)

Push adds an element to the set.

func (*BottomK) Signature

func (m *BottomK) Signature() []uint64

Signature returns a signature for the set.

func (*BottomK) Similarity

func (m *BottomK) Similarity(m2 *BottomK) float64

Similarity computes an estimate for the similarity between the two sets.

type Hash64

type Hash64 func([]byte) uint64

type MinWise

type MinWise struct {
	// contains filtered or unexported fields
}

MinWise is a collection of minimum hashes for a set

func NewMinWise

func NewMinWise(h1, h2 Hash64, size int) *MinWise

NewMinWise returns a new MinWise Hashing implementation

func NewMinWiseFromSignatures

func NewMinWiseFromSignatures(h1, h2 Hash64, signatures []uint64) *MinWise

NewMinWiseFromSignatures returns a new MinWise Hashing implementation using a user-provided set of signatures

func (*MinWise) Cardinality

func (m *MinWise) Cardinality() int

Cardinality estimates the cardinality of the set

func (*MinWise) Merge

func (m *MinWise) Merge(m2 *MinWise)

Merge combines the signatures of the second set, creating the signature of their union.

func (*MinWise) Push

func (m *MinWise) Push(b []byte)

Push adds an element to the set.

func (*MinWise) Signature

func (m *MinWise) Signature() []uint64

Signature returns a signature for the set.

func (*MinWise) SignatureBbit

func (m *MinWise) SignatureBbit(b uint) []uint64

SignatureBbit returns a b-bit reduction of the signature. This will result in unused bits at the high-end of the words if b does not divide 64 evenly.

func (*MinWise) Similarity

func (m *MinWise) Similarity(m2 *MinWise) float64

Similarity computes an estimate for the similarity between the two sets.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL