GoMinHash

module
v0.0.0-...-c9b0c51 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 17, 2019 License: MIT

README

GoMinHash

MinHash Background and Algorithm

MinHash is a locality sensitive hashing technique for hashing documents to determine similarity. This approach is approx. equal to the Jaccard Index of tokenized word sets. For comparison, we will use the MinHash algorithm to calculate short signature vectors to represent the documents. These MinHash signatures can then be compared quickly by counting the ratio of components in which the signatures agree over the total number of components. MinHash is approx. equal to Jaccard. We show this in the hash_test unittest file.

Definitions: Shingle - unsigned 32 bit integer which represents the id of the string hashed You can think of shingles as being unique ids for all possible strings before they are hashed Str = x, f(x) -> g where f'(g) -> x The same valued string will always produce the same shingle value Str = x = y, f(x) = f(y) = g, where x = y MinHash - Hashed signature of a document; It's computed by taking the minimum hashed value for all k hash functions in the hash family, H. Once the min hash value is found then it is appended to the signature. SingleSet - Set of shingles; A single shingle set represents one document

https://en.wikipedia.org/wiki/MinHash

How to test

$ go test

Example Usage

left := "I made this."
right := "You made this? I made this."
sim := minHashSimilarity(GenerateMinHash(left), GenerateMinHash(right))

sim will contain the Jaccard Similarity between the left and right strings. By generating the min hash signatures we can quickly obtain similarities between the two string segments.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL