simhash

package module

v1.0.0-...-ab6ea10 Latest Latest Go to latest Published: Jul 1, 2017 License: MIT Imports: 4 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/go-dedup/simhash

README ¶

simhash

simhash - Go simhash package

simhash is a Go implementation of Charikar's simhash algorithm.

simhash is a hash with the useful property that similar documents produce similar hashes. Therefore, if two documents are similar, the Hamming-distance between the simhash of the documents will be small.

This package only implements the simhash algorithm. To make use of this package to enable quickly identifying near-duplicate documents within a large collection of documents, check out the sho (SimHash Oracle) package at github.com/go-dedup/simhash/sho. It has a simple API that is easy to use.

Installation

go get github.com/go-dedup/simhash

Usage

Using simhash first requires tokenizing a document into a set of features (done through the FeatureSet interface). This package provides an implementation, WordFeatureSet, which breaks tokenizes the document into individual words. Better results are possible here, and future work will go towards this.

API

Example usage:

> example_test.go

package simhash_test

import (
	"fmt"

	"github.com/go-dedup/simhash"
)

// for standalone test, change package to main and the next func def to,
// func main() {
func Example_output() {
	var docs = [][]byte{
		[]byte("this is a test phrase"),
		[]byte("this is a test phrass"),
		[]byte("these are test phrases"),
		[]byte("foo bar"),
	}

	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = simhash.Simhash(simhash.NewWordFeatureSet(d))
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

	// Output:
	// Simhash of 'this is a test phrase': 8c3a5f7e9ecb3f35
	// Simhash of 'this is a test phrass': 8c3a5f7e9ecb3f21
	// Simhash of 'these are test phrases': ddfdbf7fbfaffb1d
	// Simhash of 'foo bar': d8dbe7186bad3db3
	// Comparison of `this is a test phrase` and `this is a test phrass`: 2
	// Comparison of `this is a test phrase` and `these are test phrases`: 22
	// Comparison of `this is a test phrase` and `foo bar`: 29

}

All patches welcome.

TODO

It does not support Chinese very well:

> chinese_test.go

package simhash_test

import (
	"fmt"

	"github.com/go-dedup/simhash"
	"github.com/go-dedup/simhash/sho"

	"golang.org/x/text/unicode/norm"
)

// for standalone test, change package to `main` and the next func def to,
// func main() {
func Example_Chinese_output() {
	var docs = [][]byte{
		[]byte("当山峰没有棱角的时候"),
		[]byte("当山谷没有棱角的时候"),
		[]byte("棱角的时候"),
		[]byte("你妈妈喊你回家吃饭哦，回家罗回家罗"),
		[]byte("你妈妈叫你回家吃饭啦，回家罗回家罗"),
	}

	// Code starts

	oracle := sho.NewOracle()
	r := uint8(3)
	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = simhash.Simhash(simhash.NewUnicodeWordFeatureSet(d, norm.NFC))
		hash := hashes[i]
		if oracle.Seen(hash, r) {
			fmt.Printf("=: Simhash of %x for '%s' ignored.\n", hash, d)
		} else {
			oracle.See(hash)
			fmt.Printf("+: Simhash of %x for '%s' added.\n", hash, d)
		}
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[3], docs[4], simhash.Compare(hashes[0], hashes[1]))

	// Code ends

	// Output:
	// +: Simhash of a5edea16c0c7a180 for '当山峰没有棱角的时候' added.
	// +: Simhash of 2e285bd230856c9 for '当山谷没有棱角的时候' added.
	// +: Simhash of 53ecd232f2383dee for '棱角的时候' added.
	// +: Simhash of e4e6edb1f89fa9ff for '你妈妈喊你回家吃饭哦，回家罗回家罗' added.
	// +: Simhash of ffe1e5ffffd7b9e7 for '你妈妈叫你回家吃饭啦，回家罗回家罗' added.
	// Comparison of `当山峰没有棱角的时候` and `当山谷没有棱角的时候`: 41
	// Comparison of `当山峰没有棱角的时候` and `棱角的时候`: 32
	// Comparison of `当山峰没有棱角的时候` and `你妈妈喊你回家吃饭哦，回家罗回家罗`: 27
	// Comparison of `你妈妈喊你回家吃饭哦，回家罗回家罗` and `你妈妈叫你回家吃饭啦，回家罗回家罗`: 41
}

All patches welcome.

Credits

mfonda/simhash forked source

The most high quality open-source Go simhash implementation available. it is even used internally by Yahoo Inc:

Similar Projects

All the following similar projects have been considered before adopting mfonda/simhash instead.

dgryski/go-simstore One of the earliest but "not very promising"
AllenDang/simhash Ported from C# code, but don't like its interface
yanyiwu/gosimhash For Chinese only. Don't like keeping two packages for the same purpose, and don't like its dependency on "结巴"中文分词 approach

Documentation ¶

Overview ¶

simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.

simhash fingerprints have the property that similar documents will have a similar fingerprint. Therefore, the hamming distance between two fingerprints will be small if the documents are similar

Example (Output) ¶

for standalone test, change package to main and the next func def to, func main() {

package main

import (
	"fmt"

	"github.com/go-dedup/simhash"
)

func main() {
	var docs = [][]byte{
		[]byte("this is a test phrase"),
		[]byte("this is a test phrass"),
		[]byte("these are test phrases"),
		[]byte("foo bar"),
	}

	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = simhash.Simhash(simhash.NewWordFeatureSet(d))
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

}

Output:

Simhash of 'this is a test phrase': 8c3a5f7e9ecb3f35
Simhash of 'this is a test phrass': 8c3a5f7e9ecb3f21
Simhash of 'these are test phrases': ddfdbf7fbfaffb1d
Simhash of 'foo bar': d8dbe7186bad3db3
Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `these are test phrases`: 22
Comparison of `this is a test phrase` and `foo bar`: 29

Index ¶

func Compare(a uint64, b uint64) uint8
func Fingerprint(v Vector) uint64
func NewFeature(f []byte) feature
func NewFeatureWithWeight(f []byte, weight int) feature
func Shingle(w int, b [][]byte) [][]byte
func Simhash(fs FeatureSet) uint64
func SimhashBytes(b [][]byte) uint64
type Feature
type FeatureSet
type UnicodeWordFeatureSet
- func NewUnicodeWordFeatureSet(b []byte, f norm.Form) *UnicodeWordFeatureSet
- func (w *UnicodeWordFeatureSet) GetFeatures() []Feature
type Vector
- func Vectorize(features []Feature) Vector
- func VectorizeBytes(features [][]byte) Vector
type WordFeatureSet
- func NewWordFeatureSet(b []byte) *WordFeatureSet
- func (w *WordFeatureSet) GetFeatures() []Feature

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Compare ¶

func Compare(a uint64, b uint64) uint8

Compare calculates the Hamming distance between two 64-bit integers

Currently, this is calculated using the Kernighan method [1]. Other methods exist which may be more efficient and are worth exploring at some point

[1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan

func Fingerprint ¶

func Fingerprint(v Vector) uint64

Fingerprint returns a 64-bit fingerprint of the given vector. The fingerprint f of a given 64-dimension vector v is defined as follows:

f[i] = 1 if v[i] >= 0
f[i] = 0 if v[i] < 0

func NewFeature ¶

func NewFeature(f []byte) feature

Returns a new feature representing the given byte slice, using a weight of 1

func NewFeatureWithWeight ¶

func NewFeatureWithWeight(f []byte, weight int) feature

Returns a new feature representing the given byte slice with the given weight

func Shingle ¶

func Shingle(w int, b [][]byte) [][]byte

Shingle returns the w-shingling of the given set of bytes. For example, if the given input was {"this", "is", "a", "test"}, this returns {"this is", "is a", "a test"}

func Simhash ¶

func Simhash(fs FeatureSet) uint64

Returns a 64-bit simhash of the given feature set

func SimhashBytes ¶

func SimhashBytes(b [][]byte) uint64

Returns a 64-bit simhash of the given bytes

Types ¶

type Feature ¶

type Feature interface {
	// Sum returns the 64-bit sum of this feature
	Sum() uint64

	// Weight returns the weight of this feature
	Weight() int
}

Feature consists of a 64-bit hash and a weight

type FeatureSet ¶

type FeatureSet interface {
	GetFeatures() []Feature
}

FeatureSet represents a set of features in a given document

type UnicodeWordFeatureSet ¶

type UnicodeWordFeatureSet struct {
	// contains filtered or unexported fields
}

UnicodeWordFeatureSet is a feature set in which each word is a feature, all equal weight.

See: http://blog.golang.org/normalization See: https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc

func NewUnicodeWordFeatureSet ¶

func NewUnicodeWordFeatureSet(b []byte, f norm.Form) *UnicodeWordFeatureSet

Example (InChinese) ¶

text := []byte("当山峰没有棱角的时候")
fs := NewUnicodeWordFeatureSet(text, norm.NFKC)
fmt.Printf("%#v\n", fs)
actual := fs.GetFeatures()
fmt.Printf("%#v\n", actual)

Output:

&simhash.UnicodeWordFeatureSet{b:[]uint8{0xe5, 0xbd, 0x93, 0xe5, 0xb1, 0xb1, 0xe5, 0xb3, 0xb0, 0xe6, 0xb2, 0xa1, 0xe6, 0x9c, 0x89, 0xe6, 0xa3, 0xb1, 0xe8, 0xa7, 0x92, 0xe7, 0x9a, 0x84, 0xe6, 0x97, 0xb6, 0xe5, 0x80, 0x99}, f:2}
[]simhash.Feature{simhash.feature{sum:0xa5edea16c0c7a180, weight:1}}

Example (InWestern) ¶

text := []byte("la fin d'un bel après-midi d'été")
fs := NewUnicodeWordFeatureSet(text, norm.NFKC)
fmt.Printf("%#v\n", fs)
actual := fs.GetFeatures()
fmt.Printf("%#v\n", actual)

Output:

&simhash.UnicodeWordFeatureSet{b:[]uint8{0x6c, 0x61, 0x20, 0x66, 0x69, 0x6e, 0x20, 0x64, 0x27, 0x75, 0x6e, 0x20, 0x62, 0x65, 0x6c, 0x20, 0x61, 0x70, 0x72, 0xc3, 0xa8, 0x73, 0x2d, 0x6d, 0x69, 0x64, 0x69, 0x20, 0x64, 0x27, 0xc3, 0xa9, 0x74, 0xc3, 0xa9}, f:2}
[]simhash.Feature{simhash.feature{sum:0x8325c07b4eb2548, weight:1}, simhash.feature{sum:0xd8cbc5186ba13198, weight:1}, simhash.feature{sum:0x15cdbd7eed98cfab, weight:1}, simhash.feature{sum:0xd8d9a1186bad324a, weight:1}, simhash.feature{sum:0x3adb901f8c8a7b5e, weight:1}, simhash.feature{sum:0x7e8f29c36ffb774e, weight:1}}

func (*UnicodeWordFeatureSet) GetFeatures ¶

func (w *UnicodeWordFeatureSet) GetFeatures() []Feature

Returns a []Feature representing each word in the byte slice

type Vector ¶

type Vector [64]int

func Vectorize ¶

func Vectorize(features []Feature) Vector

Vectorize generates 64 dimension vectors given a set of features. Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.

func VectorizeBytes ¶

func VectorizeBytes(features [][]byte) Vector

VectorizeBytes generates 64 dimension vectors given a set of [][]byte, where each []byte is a feature with even weight.

Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.

type WordFeatureSet ¶

type WordFeatureSet struct {
	// contains filtered or unexported fields
}

WordFeatureSet is a feature set in which each word is a feature, all equal weight.

func NewWordFeatureSet ¶

func NewWordFeatureSet(b []byte) *WordFeatureSet

Example ¶

text := []byte("a a abc abc test test string.")
fs := NewWordFeatureSet(text)
fmt.Printf("%#v\n", fs)
actual := fs.GetFeatures()
fmt.Printf("%#v\n", actual)

Output:

&simhash.WordFeatureSet{b:[]uint8{0x61, 0x20, 0x61, 0x20, 0x61, 0x62, 0x63, 0x20, 0x61, 0x62, 0x63, 0x20, 0x74, 0x65, 0x73, 0x74, 0x20, 0x74, 0x65, 0x73, 0x74, 0x20, 0x73, 0x74, 0x72, 0x69, 0x6e, 0x67, 0x2e}}
[]simhash.Feature{simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0xd8dcca186bafadcb, weight:1}, simhash.feature{sum:0xd8dcca186bafadcb, weight:1}, simhash.feature{sum:0x8c093f7e9fccbf69, weight:1}, simhash.feature{sum:0x8c093f7e9fccbf69, weight:1}, simhash.feature{sum:0x9926dcde0a17d48e, weight:1}}

func (*WordFeatureSet) GetFeatures ¶

func (w *WordFeatureSet) GetFeatures() []Feature

Returns a []Feature representing each word in the byte slice

Source Files ¶

View all Source files

simhash.go

Directories ¶

Path	Synopsis
sho Package html-distance is a go library for computing the proximity of the HTML pages.	Package html-distance is a go library for computing the proximity of the HTML pages.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

README ¶

simhash

TOC

simhash - Go simhash package

Installation

Usage

API

> example_test.go

TODO

> chinese_test.go

Credits

Similar Projects

Documentation ¶

Overview ¶

Index ¶

Examples ¶

Constants ¶

Variables ¶

Functions ¶

func Compare ¶

func Fingerprint ¶

func NewFeature ¶

func NewFeatureWithWeight ¶

func Shingle ¶

func Simhash ¶

func SimhashBytes ¶

Types ¶

type Feature ¶

type FeatureSet ¶

type UnicodeWordFeatureSet ¶

func NewUnicodeWordFeatureSet ¶

func (*UnicodeWordFeatureSet) GetFeatures ¶

type Vector ¶

func Vectorize ¶

func VectorizeBytes ¶

type WordFeatureSet ¶

func NewWordFeatureSet ¶

func (*WordFeatureSet) GetFeatures ¶

Source Files ¶

Directories ¶