simhash

package module
v0.0.0-...-9ecaca7
Published: Sep 4, 2017 License: MIT Imports: 4 Imported by: 14

README

simhash

simhash - Go simhash package

simhash is a Go implementation of Charikar's simhash algorithm.

simhash is a hash with the useful property that similar documents produce similar hashes. Therefore, if two documents are similar, the Hamming distance between their simhashes will be small.
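The property above can be illustrated with a minimal sketch of Charikar's scheme built only on the standard library. It is not this package's implementation: whitespace tokenization, the FNV-1a feature hash, and the names `simhashOf` and `hamming` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhashOf is a hypothetical helper: it hashes every whitespace-separated
// word with 64-bit FNV-1a and lets each hash vote on each of the 64 bits.
func simhashOf(doc string) uint64 {
	var votes [64]int
	for _, w := range strings.Fields(doc) {
		h := fnv.New64a()
		h.Write([]byte(w))
		sum := h.Sum64()
		for bit := 0; bit < 64; bit++ {
			if sum&(uint64(1)<<uint(bit)) != 0 {
				votes[bit]++ // this word votes for a 1 in this position
			} else {
				votes[bit]-- // this word votes for a 0 in this position
			}
		}
	}
	// A bit is set in the fingerprint when the 1-votes win.
	var out uint64
	for bit, v := range votes {
		if v > 0 {
			out |= uint64(1) << uint(bit)
		}
	}
	return out
}

// hamming counts the bit positions in which a and b differ.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	a := simhashOf("this is a test phrase")
	b := simhashOf("this is a test phrass")
	c := simhashOf("foo bar")
	fmt.Println("near-duplicate distance:", hamming(a, b))
	fmt.Println("unrelated distance:", hamming(a, c))
}
```

Each word's hash votes on every bit position, so documents that share most of their words agree on most votes and end up with fingerprints that differ in only a few bits.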

This package only implements the simhash algorithm. To quickly identify near-duplicate documents within a large collection, check out the sho (SimHash Oracle) package at github.com/go-dedup/simhash/sho. It has a simple API that is easy to use.
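To give a feel for what an oracle does, here is a deliberately naive stand-in that scans every stored hash on each lookup. It is not the sho package: the type `linearOracle` is hypothetical, its `See` and `Seen` method names simply mirror the calls used in the Chinese example later in this README, and treating the radius as an inclusive bound is an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
)

// linearOracle is a hypothetical, unoptimized oracle: it remembers every
// hash it is shown and checks all of them on lookup.
type linearOracle struct {
	seen []uint64
}

// See records a hash.
func (o *linearOracle) See(h uint64) {
	o.seen = append(o.seen, h)
}

// Seen reports whether any recorded hash lies within Hamming distance r of h.
// The inclusive comparison is an assumption of this sketch.
func (o *linearOracle) Seen(h uint64, r uint8) bool {
	for _, s := range o.seen {
		if bits.OnesCount64(h^s) <= int(r) {
			return true
		}
	}
	return false
}

func main() {
	o := &linearOracle{}
	// Hashes copied from the example output in this README.
	hashes := []uint64{0x8c3a5f7e9ecb3f35, 0x8c3a5f7e9ecb3f21, 0xd8dbe7186bad3db3}
	for _, h := range hashes {
		if o.Seen(h, 3) {
			fmt.Printf("=: %x ignored\n", h)
			continue
		}
		o.See(h)
		fmt.Printf("+: %x added\n", h)
	}
}
```

With a radius of 3, the second hash is ignored because it is only 2 bits away from the first, while the third is added. A linear scan costs one comparison per stored hash, which is the cost a real oracle is meant to avoid on large collections.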

Design principle

The design of these packages follows the "Unix philosophy": "Do One Thing and Do It Well". Storing and checking, as well as the handling of different languages, therefore live in separate building blocks that can be added on request or substituted at will, keeping the core code minimal.

You can thus use exactly what you want without being forced to accept a huge package with features you don't want.

Installation

go get github.com/go-dedup/simhash

Usage

Using simhash first requires tokenizing a document into a set of features (done through the FeatureSet interface). This package provides one implementation, WordFeatureSet, which tokenizes the document into individual words. Better results are possible here, and future work will go towards this.
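As a rough illustration of the tokenizing step, the sketch below splits a document into lowercase words. The pattern `wordPattern`, the lowercasing, and the helper `tokenize` are assumptions for illustration only; they are not necessarily what WordFeatureSet does.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordPattern is an assumed definition of a word: a run of letters, digits,
// underscores, or apostrophes.
var wordPattern = regexp.MustCompile(`[\w']+`)

// tokenize is a hypothetical helper that lowercases doc and returns its words.
func tokenize(doc []byte) []string {
	return wordPattern.FindAllString(strings.ToLower(string(doc)), -1)
}

func main() {
	fmt.Println(tokenize([]byte("This is a test phrase"))) // [this is a test phrase]
}
```

Each returned word would then be hashed into one feature; punctuation and spacing never reach the hash, so they do not affect the fingerprint.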

API

Example usage:

> example_test.go
//package main

package simhash_test

import (
	"fmt"

	"github.com/go-dedup/simhash"
)

// for standalone test, change package to `main` and the next func def to,
// func main() {
func Example_output() {
	hashes := make([]uint64, len(docs))
	sh := simhash.NewSimhash()
	for i, d := range docs {
		hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d))
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

	// Output:
	// Simhash of 'this is a test phrase': 8c3a5f7e9ecb3f35
	// Simhash of 'this is a test phrass': 8c3a5f7e9ecb3f21
	// Simhash of 'these are test phrases': ddfdbf7fbfaffb1d
	// Simhash of 'foo bar': d8dbe7186bad3db3
	// Comparison of `this is a test phrase` and `this is a test phrass`: 2
	// Comparison of `this is a test phrase` and `these are test phrases`: 22
	// Comparison of `this is a test phrase` and `foo bar`: 29
}

var docs = [][]byte{
	[]byte("this is a test phrase"),
	[]byte("this is a test phrass"),
	[]byte("these are test phrases"),
	[]byte("foo bar"),
}

All patches welcome.

Purpose

A few more words on the similarity checking and near-duplicate detection. The best article I found explaining it clearly is:

Near-Duplicate Detection
https://moz.com/devblog/near-duplicate-detection/

This article, from the Moz Developer Blog, explains in detail, and with diagrams:

  • Why Does Duplication Matter
  • What to Do About It, and
  • How to Identify Duplication

and goes on to explain the different algorithms for doing so.

Of the algorithms best suited to the problem, MinHash was the first I tried, but I found it bloated, cumbersome to use, and not working as I expected. The other is SimHash, which is what this package is all about. SimHash is Charikar's algorithm, which Google adopted for near-duplicate detection in web crawling. It is simple and straightforward, and therefore very efficient and powerful. I like it very much, and should have used it in the first place.

FYI, this is why I needed such similarity checking and near-duplicate detection algorithms in the first place: in a world where we cannot avoid rule-breakers and spammers, at least we can use technology to filter them out for ourselves.

Versions

Having been forked from mfonda/simhash, go-dedup/simhash has been through a series of interface changes. Detailed documentation of these changes and the reasons behind them, as well as how to use the original (v1) API, can be found here.

The key characteristics of the current design are:

  • most simhash-related functions are now provided as methods of the SimhashBase type, as opposed to package-level functions as before.
  • just as importantly, the UnicodeWordFeatureSet-related functions no longer exist in the core code, because
  • the language-specific handling has been refactored out into a thin language-handling layer.
  • the goal of version 2 is to give different languages a unified user interface (API).

This modular approach (the v2 design) helps limit the size of the core code, while making it easy to extend the core functionality with easy-to-use building blocks.

An added bonus concerns Chinese text, which the original (v1) design does not handle well:

> simhashUTF/chinese_test.go
package simhashUTF_test

import (
	"fmt"

	"github.com/go-dedup/simhash"
	"github.com/go-dedup/simhash/sho"
	"github.com/go-dedup/simhash/simhashUTF"

	"golang.org/x/text/unicode/norm"
)

// for standalone test, change package to `main` and the next func def to,
// func main() {
func Example_Chinese_output() {
	var docs = [][]byte{
		[]byte("当山峰没有棱角的时候"),
		[]byte("当山谷没有棱角的时候"),
		[]byte("棱角的时候"),
		[]byte("你妈妈喊你回家吃饭哦,回家罗回家罗"),
		[]byte("你妈妈叫你回家吃饭啦,回家罗回家罗"),
	}

	// Code starts

	oracle := sho.NewOracle()
	r := uint8(3)
	hashes := make([]uint64, len(docs))
	sh := simhashUTF.NewUTFSimhash(norm.NFKC)
	for i, d := range docs {
		hashes[i] = sh.GetSimhash(sh.NewUnicodeWordFeatureSet(d, norm.NFC))
		hash := hashes[i]
		if oracle.Seen(hash, r) {
			fmt.Printf("=: Simhash of %x for '%s' ignored.\n", hash, d)
		} else {
			oracle.See(hash)
			fmt.Printf("+: Simhash of %x for '%s' added.\n", hash, d)
		}
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[3], docs[4], simhash.Compare(hashes[3], hashes[4]))

	// Code ends

	// Output:
	// +: Simhash of a5edea16c0c7a180 for '当山峰没有棱角的时候' added.
	// +: Simhash of 2e285bd230856c9 for '当山谷没有棱角的时候' added.
	// +: Simhash of 53ecd232f2383dee for '棱角的时候' added.
	// +: Simhash of e4e6edb1f89fa9ff for '你妈妈喊你回家吃饭哦,回家罗回家罗' added.
	// +: Simhash of ffe1e5ffffd7b9e7 for '你妈妈叫你回家吃饭啦,回家罗回家罗' added.
	// Comparison of `当山峰没有棱角的时候` and `当山谷没有棱角的时候`: 41
	// Comparison of `当山峰没有棱角的时候` and `棱角的时候`: 32
	// Comparison of `当山峰没有棱角的时候` and `你妈妈喊你回家吃饭哦,回家罗回家罗`: 27
	// Comparison of `你妈妈喊你回家吃饭哦,回家罗回家罗` and `你妈妈叫你回家吃饭啦,回家罗回家罗`: 20
}

The similarity results for Chinese text are very poor: the first two sentences differ by a single character, yet their distance is 41. Thanks to version 2's architecture, however, it is very easy to extend simhash to deal with Chinese:

> simhashCJK/example_test.go
// package main

package simhashCJK_test

import (
	"fmt"

	"github.com/go-dedup/simhash"
	"github.com/go-dedup/simhash/simhashCJK"
)

// for standalone test, change package to `main` and the next func def to,
// func main() {
func Example_output() {
	hashes := make([]uint64, len(docs))
	sh := simhashCJK.NewSimhash()
	for i, d := range docs {
		fs := sh.NewWordFeatureSet(d)
		// fmt.Printf("%#v\n", fs)
		// actual := fs.GetFeatures()
		// fmt.Printf("%#v\n", actual)
		hashes[i] = sh.GetSimhash(fs)
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

	// Output:
	// Simhash of '当山峰没有棱角的时候': d7185f186a2eea5a
	// Simhash of '当山谷没有棱角的时候': d71a5f186a2eea5a
	// Simhash of '棱角的时候': d71a5f186a2ffa52
	// Simhash of '你妈妈喊你回家吃饭哦,回家罗回家罗': d71bf7186a32b9f0
	// Comparison of `当山峰没有棱角的时候` and `当山谷没有棱角的时候`: 1
	// Comparison of `当山峰没有棱角的时候` and `棱角的时候`: 4
	// Comparison of `当山峰没有棱角的时候` and `你妈妈喊你回家吃饭哦,回家罗回家罗`: 16
}

var docs = [][]byte{
	[]byte("当山峰没有棱角的时候"),
	[]byte("当山谷没有棱角的时候"),
	[]byte("棱角的时候"),
	[]byte("你妈妈喊你回家吃饭哦,回家罗回家罗"),
}

With the above, the problem is fixed. Check the result here.

Credits

The highest-quality open-source Go simhash implementation available. It is even used internally by Yahoo Inc:

Yahoo Inc

Similar Projects

All the following similar projects have been considered before adopting mfonda/simhash instead.

Documentation

Overview

simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.

simhash fingerprints have the property that similar documents have similar fingerprints. Therefore, the Hamming distance between two fingerprints will be small if the documents are similar.

Example (Output)

package main

import (
	"fmt"

	"github.com/go-dedup/simhash"
)

// main prints each document's simhash, then the Hamming distance from the
// first document to each of the others.
func main() {
	hashes := make([]uint64, len(docs))
	sh := simhash.NewSimhash()
	for i, d := range docs {
		hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d))
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

}

var docs = [][]byte{
	[]byte("this is a test phrase"),
	[]byte("this is a test phrass"),
	[]byte("these are test phrases"),
	[]byte("foo bar"),
}
Output:

Simhash of 'this is a test phrase': 8c3a5f7e9ecb3f35
Simhash of 'this is a test phrass': 8c3a5f7e9ecb3f21
Simhash of 'these are test phrases': ddfdbf7fbfaffb1d
Simhash of 'foo bar': d8dbe7186bad3db3
Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `these are test phrases`: 22
Comparison of `this is a test phrase` and `foo bar`: 29

Index

Examples

Constants

This section is empty.

Variables

Functions

func Compare

func Compare(a uint64, b uint64) uint8

Compare calculates the Hamming distance between two 64-bit integers.

Currently, this is calculated using the Kernighan method [1]. Other methods exist that may be more efficient and are worth exploring at some point.

[1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
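For readers unfamiliar with the Kernighan method, the following sketch shows the idea next to the standard library's `bits.OnesCount64`, which is one of the alternative methods alluded to. The function names `kernighan` and `distance` are hypothetical and are not this package's code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// kernighan counts set bits by clearing the lowest set bit on each pass, so
// the loop runs once per set bit rather than once per bit position.
func kernighan(v uint64) uint8 {
	var c uint8
	for ; v != 0; c++ {
		v &= v - 1
	}
	return c
}

// distance is a hypothetical stand-in for Compare: the Hamming distance is
// the number of set bits in the XOR of the two values.
func distance(a, b uint64) uint8 {
	return kernighan(a ^ b)
}

func main() {
	a, b := uint64(0x8c3a5f7e9ecb3f35), uint64(0x8c3a5f7e9ecb3f21)
	fmt.Println(distance(a, b))          // 2, matching the README example
	fmt.Println(bits.OnesCount64(a ^ b)) // 2, same count via the standard library
}
```

Since near-duplicate hashes differ in few bits, the Kernighan loop usually finishes after a handful of iterations for exactly the pairs of interest.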

Example

package main

import (
	"fmt"

	"github.com/go-dedup/simhash"
)

func testit() {
	hashes := make([]uint64, len(doc2))
	sh := simhash.NewSimhash()
	for i, d := range doc2 {
		hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d))
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", doc2[0], doc2[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", doc2[0], doc2[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", doc2[0], doc2[3], simhash.Compare(hashes[0], hashes[3]))
}

// main runs testit on the same four listings in two different orders; each
// document's simhash depends only on its own content, not on its position.
func main() {
	doc2 = [][]byte{
		[]byte("Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic"),
		[]byte("2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic"),
		[]byte("2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic"),
		[]byte("2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic"),
	}
	testit()

	fmt.Println("================")
	doc2 = [][]byte{
		[]byte("2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic"),
		[]byte("2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic"),
		[]byte("Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic"),
		[]byte("2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic"),
	}
	testit()

}

var doc2 = [][]byte{}
Output:

Simhash of 'Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic': 1832c51ee6eb2e3e
Simhash of '2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic': 832df1ef4eb2e3e
Simhash of '2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic': 8329706e4eb2f3d
Simhash of '2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic': 8b2df0ea6eb2f3c
Comparison of `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic` and `2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic`: 6
Comparison of `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic` and `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic`: 10
Comparison of `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic` and `2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic`: 9
================
Simhash of '2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic': 8329706e4eb2f3d
Simhash of '2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic': 8b2df0ea6eb2f3c
Simhash of 'Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic': 1832c51ee6eb2e3e
Simhash of '2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic': 832df1ef4eb2e3e
Comparison of `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic` and `2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic`: 7
Comparison of `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic` and `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic`: 10
Comparison of `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic` and `2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic`: 8

func NewFeature

func NewFeature(f []byte) feature

NewFeature returns a new feature representing the given byte slice, using a weight of 1.

func NewFeatureWithWeight

func NewFeatureWithWeight(f []byte, weight int) feature

NewFeatureWithWeight returns a new feature representing the given byte slice with the given weight.

Types

type Feature

type Feature interface {
	// Sum returns the 64-bit sum of this feature
	Sum() uint64

	// Weight returns the weight of this feature
	Weight() int
}

Feature consists of a 64-bit hash and a weight
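To show how a weight can influence the fingerprint, here is a small sketch of a custom type satisfying the interface above. The type `weighted`, the helpers `newWeighted` and `fingerprint`, and the choice of FNV-1a as the hash are all assumptions for illustration; the package's own feature type and hashing may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Feature mirrors the interface documented above.
type Feature interface {
	Sum() uint64
	Weight() int
}

// weighted is a hypothetical Feature implementation.
type weighted struct {
	sum    uint64
	weight int
}

func (w weighted) Sum() uint64 { return w.sum }
func (w weighted) Weight() int { return w.weight }

// newWeighted hashes f with 64-bit FNV-1a (an assumed choice of hash).
func newWeighted(f []byte, weight int) Feature {
	h := fnv.New64a()
	h.Write(f)
	return weighted{sum: h.Sum64(), weight: weight}
}

// fingerprint lets every feature add or subtract its weight per bit position
// and sets the bits whose total is positive.
func fingerprint(fs []Feature) uint64 {
	var votes [64]int
	for _, f := range fs {
		for bit := 0; bit < 64; bit++ {
			if f.Sum()&(uint64(1)<<uint(bit)) != 0 {
				votes[bit] += f.Weight()
			} else {
				votes[bit] -= f.Weight()
			}
		}
	}
	var out uint64
	for bit, v := range votes {
		if v > 0 {
			out |= uint64(1) << uint(bit)
		}
	}
	return out
}

func main() {
	title := newWeighted([]byte("ford"), 3)
	body := newWeighted([]byte("automatic"), 1)
	// With weights 3 and 1, the heavier feature wins every bit.
	fmt.Println(fingerprint([]Feature{title, body}) == title.Sum()) // true
}
```

Raising a feature's weight is thus a way to make important tokens, such as words in a title, count for more than incidental ones.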

func BuildFeatures

func BuildFeatures(doc string, doc2words text.Doc2Words) []Feature

BuildFeatures returns a []Feature representing each word in the document string, as split by doc2words.

Example
for _, d := range testDoc {
	fmt.Printf("%#v\n", BuildFeatures(string(d), Doc2words))
}
Output:

[]simhash.Feature{simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b9, weight:1}, simhash.feature{sum:0xd98001186c3a6c5d, weight:1}, simhash.feature{sum:0x7a37c1ae2e57fa88, weight:1}, simhash.feature{sum:0x8326407b4eb32ae, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0xd8d9b1186bad4d2f, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x93104c7ea350e1e1, weight:1}, simhash.feature{sum:0x8329307b4eb82ae, weight:1}, simhash.feature{sum:0x14dfbd7eecce8288, weight:1}, simhash.feature{sum:0x8325507b4eb192b, weight:1}, simhash.feature{sum:0xd8cbcd186ba13ffc, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c18, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x214486cdc2d73f89, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0xd8d299186ba70599, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x58bc5a1361284f0c, weight:1}, simhash.feature{sum:0xd8c8ad186b9ed323, weight:1}, simhash.feature{sum:0xd8a2cd186b7e3a1e, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8a8c7bb9849d48f6, weight:1}, simhash.feature{sum:0x34e6e73324cc4c1c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, 
simhash.feature{sum:0xbc78285d51f8f350, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x1c7c2e0d9eb9677d, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0x192cc0ca1d77458, weight:1}, simhash.feature{sum:0x6f5db37e8ecc76fd, weight:1}, simhash.feature{sum:0xdbfd3fbe6190d762, weight:1}, simhash.feature{sum:0x54c9ed4b266da2a5, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0xd8d5c1186ba97fdd, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xda392f7af918887b, weight:1}, simhash.feature{sum:0x357ef82f825da4b8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xb77e117eb8748afb, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x22b4b6630fb27c45, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x8fc1c6be36e055d6, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x2b94c0591a2848b9, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8326707b4eb37b4, weight:1}, simhash.feature{sum:0x91a1dacc76ac782e, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3736, weight:1}, simhash.feature{sum:0x150fbd7eecf7a6ce, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xc4e8fa88937cb69, weight:1}, simhash.feature{sum:0x8325907b4eb207e, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b6, weight:1}, 
simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8329607b4eb8787, weight:1}, simhash.feature{sum:0x2e047881b2f11bf2, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8f, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xfd0c9853db565f2f, weight:1}, simhash.feature{sum:0xd7f4302f4de077d2, weight:1}, simhash.feature{sum:0xf2b15d4ce63f5477, weight:1}, simhash.feature{sum:0xd8bac5186b92beb0, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x12d6ee02dfea32b8, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0x8325a07b4eb21ac, weight:1}, simhash.feature{sum:0xde2e60d07d4ebdb0, weight:1}, simhash.feature{sum:0xfff236c5f092af95, weight:1}, simhash.feature{sum:0xd8adc1186b882df7, weight:1}, simhash.feature{sum:0x656c734f40ac6679, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x5d60a51e6eb33462, weight:1}, simhash.feature{sum:0xe98a0708a4b03ab7, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8fde7a602c8faa3a, weight:1}, simhash.feature{sum:0x8329707b4eb895d, weight:1}, simhash.feature{sum:0xb8019c1cc35ecab1, weight:1}, simhash.feature{sum:0x37cbb6f821eaff03, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xa4fb0fc51c2e551b, weight:1}, simhash.feature{sum:0x8329707b4eb895c, weight:1}, simhash.feature{sum:0x246c8b28007c1970, weight:1}, simhash.feature{sum:0x3716c7ee2e72321, weight:1}, simhash.feature{sum:0xf8b0c02f5fe0b257, weight:1}, simhash.feature{sum:0xd89cb9186b79aca8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x4be610a6aef6c731, weight:1}, simhash.feature{sum:0x1cb6df7ef1041835, weight:1}, simhash.feature{sum:0xfcdeddf9a175b394, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x1cbe3da5da62b610, weight:1}, simhash.feature{sum:0x7aa9362fa9816155, weight:1}, simhash.feature{sum:0xd89fad186b7bce35, weight:1}, simhash.feature{sum:0x12b142e963d1682d, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x38b39e054e1c1b67, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x9ab4937ea75b5c59, weight:1}, simhash.feature{sum:0x268587373f1f77b5, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x41ef33d8e01cb16c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xec8584acc12fcf27, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x7306888cb4e8ab75, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x5a079a2f9797da68, weight:1}, simhash.feature{sum:0x9e8e79746e7ee735, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x16af988c443cff2b, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd88a107ce8dad5a6, weight:1}, simhash.feature{sum:0x5bf18352ec4156d9, weight:1}, 
simhash.feature{sum:0x8c1a4d7e9fdb4a74, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0x8326107b4eb2dc7, weight:1}, simhash.feature{sum:0xd8cbc7186ba1352e, weight:1}, simhash.feature{sum:0xf90ceea98fba79f6, weight:1}, simhash.feature{sum:0xe8860067f74f9fbc, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8f, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xfd0c9853db565f2f, weight:1}, simhash.feature{sum:0xd7f4302f4de077d2, weight:1}, simhash.feature{sum:0xf2b15d4ce63f5477, weight:1}, simhash.feature{sum:0xd8bac5186b92beb0, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x12d6ee02dfea32b8, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0x8325a07b4eb21ac, weight:1}, simhash.feature{sum:0xde2e60d07d4ebdb0, weight:1}, simhash.feature{sum:0xfff236c5f092af95, weight:1}, simhash.feature{sum:0xd8adc1186b882df7, weight:1}, simhash.feature{sum:0x656c734f40ac6679, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x5d60a51e6eb33462, weight:1}, simhash.feature{sum:0xe98a0708a4b03ab7, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8fde7a602c8faa3a, weight:1}, simhash.feature{sum:0x8329707b4eb895d, weight:1}, simhash.feature{sum:0xb8019c1cc35ecab1, weight:1}, simhash.feature{sum:0x37cbb6f821eaff03, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xa4fb0fc51c2e551b, weight:1}, simhash.feature{sum:0x8329707b4eb895c, weight:1}, simhash.feature{sum:0x246c8b28007c1970, weight:1}, simhash.feature{sum:0x3716c7ee2e72321, weight:1}, simhash.feature{sum:0xf8b0c02f5fe0b257, weight:1}, simhash.feature{sum:0xd89cb9186b79aca8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x4be610a6aef6c731, weight:1}, simhash.feature{sum:0x1cb6df7ef1041835, weight:1}, simhash.feature{sum:0xfcdeddf9a175b394, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x1cbe3da5da62b610, weight:1}, simhash.feature{sum:0x7aa9362fa9816155, weight:1}, simhash.feature{sum:0xd89fad186b7bce35, weight:1}, simhash.feature{sum:0x12b142e963d1682d, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x38b39e054e1c1b67, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x9ab4937ea75b5c59, weight:1}, simhash.feature{sum:0x268587373f1f77b5, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x41ef33d8e01cb16c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xec8584acc12fcf27, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x7306888cb4e8ab75, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x5a079a2f9797da68, weight:1}, simhash.feature{sum:0x9e8e79746e7ee735, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x16af988c443cff2b, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd88a107ce8dad5a6, weight:1}, simhash.feature{sum:0x5bf18352ec4156d9, weight:1}, 
simhash.feature{sum:0x8c1a4d7e9fdb4a74, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0x8326107b4eb2dc7, weight:1}, simhash.feature{sum:0xd8cbc7186ba1352e, weight:1}, simhash.feature{sum:0xf90ceea98fba79f6, weight:1}, simhash.feature{sum:0xe8860067f74f9fbc, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b9, weight:1}, simhash.feature{sum:0xd98001186c3a6c5d, weight:1}, simhash.feature{sum:0x7a37c1ae2e57fa88, weight:1}, simhash.feature{sum:0x8326407b4eb32ae, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0xd8d9b1186bad4d2f, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x93104c7ea350e1e1, weight:1}, simhash.feature{sum:0x8329307b4eb82ae, weight:1}, simhash.feature{sum:0x14dfbd7eecce8288, weight:1}, simhash.feature{sum:0x8325507b4eb192b, weight:1}, simhash.feature{sum:0xd8cbcd186ba13ffc, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c18, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x214486cdc2d73f89, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0xd8d299186ba70599, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x58bc5a1361284f0c, weight:1}, simhash.feature{sum:0xd8c8ad186b9ed323, weight:1}, simhash.feature{sum:0xd8a2cd186b7e3a1e, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8a8c7bb9849d48f6, weight:1}, simhash.feature{sum:0x34e6e73324cc4c1c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, 
simhash.feature{sum:0xbc78285d51f8f350, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x1c7c2e0d9eb9677d, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
[]simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0x192cc0ca1d77458, weight:1}, simhash.feature{sum:0x6f5db37e8ecc76fd, weight:1}, simhash.feature{sum:0xdbfd3fbe6190d762, weight:1}, simhash.feature{sum:0x54c9ed4b266da2a5, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0xd8d5c1186ba97fdd, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xda392f7af918887b, weight:1}, simhash.feature{sum:0x357ef82f825da4b8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xb77e117eb8748afb, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x22b4b6630fb27c45, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x8fc1c6be36e055d6, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x2b94c0591a2848b9, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8326707b4eb37b4, weight:1}, simhash.feature{sum:0x91a1dacc76ac782e, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3736, weight:1}, simhash.feature{sum:0x150fbd7eecf7a6ce, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xc4e8fa88937cb69, weight:1}, simhash.feature{sum:0x8325907b4eb207e, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b6, weight:1}, 
simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8329607b4eb8787, weight:1}, simhash.feature{sum:0x2e047881b2f11bf2, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}

func DoGetFeatures

func DoGetFeatures(b []byte, r *regexp.Regexp) []Feature

DoGetFeatures splits the given []byte using the given regexp, then returns a slice containing one Feature constructed from each piece matched by the regexp.
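The following is a minimal standalone sketch of that idea, not the package's code. The names `feature`, `getFeatures`, and `wordRe` are hypothetical, the word pattern is illustrative, and the use of the standard library's FNV-1 64-bit hash (`hash/fnv`) is an assumption about how a piece might be hashed.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"regexp"
)

// feature is a hypothetical stand-in for a hashed token with a weight.
type feature struct {
	sum    uint64
	weight int
}

// wordRe is an illustrative pattern; any regexp that matches tokens works.
var wordRe = regexp.MustCompile(`[\w']+`)

// getFeatures hashes every piece of b matched by r and gives it weight 1.
// Hashing with FNV-1 is an assumption made for this sketch.
func getFeatures(b []byte, r *regexp.Regexp) []feature {
	pieces := r.FindAll(b, -1)
	out := make([]feature, len(pieces))
	for i, p := range pieces {
		h := fnv.New64()
		h.Write(p) // Write on a hash.Hash never returns an error
		out[i] = feature{sum: h.Sum64(), weight: 1}
	}
	return out
}

func main() {
	for _, f := range getFeatures([]byte("a rose is a rose"), wordRe) {
		fmt.Printf("%#x %d\n", f.sum, f.weight)
	}
}
```

Repeated words produce repeated features, which is why identical `sum` values appear several times in the output listings above.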

type FeatureSet

type FeatureSet interface {
	GetFeatures() []Feature
}

FeatureSet represents a set of features in a given document

type Simhash

type Simhash interface {
	NewSimhash() *Simhash
	Vectorize(features []Feature) Vector
	VectorizeBytes(features [][]byte) Vector
	Fingerprint(v Vector) uint64
	BuildSimhash(doc string, doc2words text.Doc2Words) uint64
	GetSimhash(fs FeatureSet) uint64
	SimhashBytes(b [][]byte) uint64
	NewWordFeatureSet(b []byte) *WordFeatureSet
	Shingle(w int, b [][]byte) [][]byte
}

type SimhashBase

type SimhashBase struct {
}

func NewSimhash

func NewSimhash() *SimhashBase

NewSimhash makes a new Simhash

func (*SimhashBase) BuildSimhash

func (st *SimhashBase) BuildSimhash(doc string, doc2words text.Doc2Words) uint64

BuildSimhash returns a 64-bit simhash of the given string

func (*SimhashBase) Fingerprint

func (st *SimhashBase) Fingerprint(v Vector) uint64

Fingerprint returns a 64-bit fingerprint of the given vector. The fingerprint f of a given 64-dimension vector v is defined as follows:

f[i] = 1 if v[i] >= 0
f[i] = 0 if v[i] < 0
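
The rule above can be sketched in a few lines. This is an illustration rather than the package's implementation: `vector` and `fingerprint` are hypothetical names, and mapping element i to bit `1 << i` (least-significant bit first) is an assumption about bit order.

```go
package main

import "fmt"

// vector mirrors the documented `type Vector [64]int`.
type vector [64]int

// fingerprint sets bit i of the result when v[i] >= 0 and leaves it
// clear when v[i] < 0. The bit order (element i -> 1<<i) is assumed.
func fingerprint(v vector) uint64 {
	var f uint64
	for i, x := range v {
		if x >= 0 {
			f |= uint64(1) << uint(i)
		}
	}
	return f
}

func main() {
	var v vector
	for i := range v {
		v[i] = -1 // start with every element negative
	}
	v[0], v[2] = 3, 0 // zero counts as non-negative, so bit 2 is set too
	fmt.Printf("%#x\n", fingerprint(v)) // prints 0x5
}
```

Note that a zero element yields a 1 bit, so the all-zero vector maps to a fingerprint with every bit set.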

func (*SimhashBase) GetSimhash

func (st *SimhashBase) GetSimhash(fs FeatureSet) uint64

GetSimhash returns a 64-bit simhash of the given feature set
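
Two simhashes are usually compared by their Hamming distance, the number of bit positions in which they differ. A minimal sketch using only the standard library follows; `hammingDistance` is a hypothetical helper, not part of this package, and the two fingerprints are made-up values.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions in which a and b differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Made-up fingerprints, for illustration only.
	a, b := uint64(0xF0F0), uint64(0xF0FF)
	fmt.Println(hammingDistance(a, b)) // prints 4
}
```

What counts as "small" depends on the application and has to be chosen by the caller.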

func (*SimhashBase) NewWordFeatureSet

func (st *SimhashBase) NewWordFeatureSet(b []byte) *WordFeatureSet

func (*SimhashBase) Shingle

func (st *SimhashBase) Shingle(w int, b [][]byte) [][]byte

Shingle returns the w-shingling of the given set of bytes. For example, if the given input was {"this", "is", "a", "test"}, this returns {"this is", "is a", "a test"}
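
A standalone sketch of w-shingling that reproduces the example above. `shingle` is a hypothetical function, not the package's code; joining tokens with a single space follows the example, while returning nil for `w < 1` or for fewer than `w` tokens is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// shingle joins every run of w consecutive tokens with a single space.
// Returning nil for w < 1 or too few tokens is an assumption of this sketch.
func shingle(w int, tokens [][]byte) [][]byte {
	if w < 1 || len(tokens) < w {
		return nil
	}
	out := make([][]byte, 0, len(tokens)-w+1)
	for i := 0; i+w <= len(tokens); i++ {
		out = append(out, bytes.Join(tokens[i:i+w], []byte(" ")))
	}
	return out
}

func main() {
	tokens := [][]byte{[]byte("this"), []byte("is"), []byte("a"), []byte("test")}
	for _, s := range shingle(2, tokens) {
		fmt.Printf("%q\n", s) // "this is", "is a", "a test"
	}
}
```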

func (*SimhashBase) SimhashBytes

func (st *SimhashBase) SimhashBytes(b [][]byte) uint64

Returns a 64-bit simhash of the given bytes

func (*SimhashBase) Vectorize

func (st *SimhashBase) Vectorize(features []Feature) Vector

Vectorize generates a 64-dimension vector from a set of features. The vector is initialized to zero. Then, for each feature, the i-th element of the vector is incremented by the feature's weight if the i-th bit of the feature's hash is set, and decremented by that weight otherwise.
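
The accumulation can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `feature` and `vectorize` are hypothetical names, and bit i is taken to mean `(sum >> i) & 1`.

```go
package main

import "fmt"

// feature is a hypothetical stand-in for a 64-bit hash plus a weight.
type feature struct {
	sum    uint64
	weight int
}

// vectorize adds each feature's weight to element i when bit i of its
// hash is set and subtracts the weight otherwise.
func vectorize(features []feature) [64]int {
	var v [64]int // zero-initialized
	for _, f := range features {
		for i := 0; i < 64; i++ {
			if (f.sum>>uint(i))&1 == 1 {
				v[i] += f.weight
			} else {
				v[i] -= f.weight
			}
		}
	}
	return v
}

func main() {
	v := vectorize([]feature{
		{sum: 0b01, weight: 1}, // only bit 0 set
		{sum: 0b11, weight: 2}, // bits 0 and 1 set
	})
	fmt.Println(v[0], v[1], v[2]) // prints 3 1 -3
}
```

Element 0 receives both weights, element 1 receives +2 and -1, and every element from 2 upward receives -1 and -2.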

func (*SimhashBase) VectorizeBytes

func (st *SimhashBase) VectorizeBytes(features [][]byte) Vector

VectorizeBytes generates a 64-dimension vector from a [][]byte, where each []byte is a feature and all features carry equal weight.

The vector is initialized to zero. Then, for each feature, the i-th element of the vector is incremented by the feature's weight if the i-th bit of the feature's hash is set, and decremented by that weight otherwise.

type Vector

type Vector [64]int

type WordFeatureSet

type WordFeatureSet struct {
	B []byte
}

WordFeatureSet is a feature set in which each word is a feature, all equal weight.

func (*WordFeatureSet) GetFeatures

func (w *WordFeatureSet) GetFeatures() []Feature

Returns a []Feature representing each word in the byte slice

func (*WordFeatureSet) Normalize

func (w *WordFeatureSet) Normalize()

Directories

Path Synopsis
sho -- SimHash Oracle, checks if a fingerprint is similar to existing ones.
simhashCJK -- simhash language-specific handling for CJK.
simhashEng -- simhash language-specific handling for English.
simhashUTF -- simhash language-specific handling for UTF.
