gryffin: github.com/yahoo/gryffin/html-distance

package distance

import "github.com/yahoo/gryffin/html-distance"

Package html-distance is a Go library for computing the proximity of HTML pages. The similarity fingerprint is implemented with Charikar's simhash.
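For readers unfamiliar with simhash, the following is a minimal sketch of the general technique, not this package's code. The function name simhash, the choice of 64-bit FNV-1a, the equal weight given to every feature, and the sample feature strings are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// simhash folds a set of string features into one 64-bit fingerprint.
// Each feature is hashed with 64-bit FNV-1a (an assumption; the package
// only states that the size comes from hash/fnv). Every bit position
// collects a vote of +1 (bit set) or -1 (bit clear), and the sign of
// the total decides the corresponding output bit.
func simhash(features []string) uint64 {
	var votes [64]int
	for _, f := range features {
		h := fnv.New64a()
		h.Write([]byte(f)) // Write on a hash.Hash never returns an error
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

func main() {
	// Hypothetical feature lists standing in for two similar pages.
	a := simhash([]string{"html", "body", "div", "p"})
	b := simhash([]string{"html", "body", "div", "span"})
	fmt.Printf("%016x\n%016x\n", a, b)
}
```

Because each bit is decided by a majority vote over all features, inputs that share most of their features tend to agree in most bit positions, which is what makes the Hamming distance between fingerprints a useful proximity measure.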

Distance is the Hamming distance between two fingerprints. Since a fingerprint is 64 bits wide (inherited from hash/fnv), similarity is defined as 1 - d / 64.

In a normal scenario, pages with similarity > 95% (i.e. d <= 3, since 1 - 3/64 is about 95.3%) could be considered duplicate HTML pages.
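The definitions above can be checked with a few lines of Go. This is a sketch under stated assumptions: hamming and similarity are hypothetical helper names, not functions of this package, and the fingerprint values are arbitrary.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming counts the bit positions in which a and b differ.
func hamming(a, b uint64) uint8 {
	return uint8(bits.OnesCount64(a ^ b))
}

// similarity maps a Hamming distance d to 1 - d/64.
func similarity(d uint8) float64 {
	return 1 - float64(d)/64
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0) // arbitrary example fingerprint
	b := a ^ 0x7                    // flip the three lowest bits
	d := hamming(a, b)
	fmt.Println(d, similarity(d)) // prints: 3 0.953125
}
```

A distance of 3 is the largest value that keeps the similarity above 95%; a distance of 4 gives 1 - 4/64 = 93.75%.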

Package Files

bktree.go feature.go

func Distance

func Distance(a, b uint64) uint8

Distance returns the similarity distance (the Hamming distance) between two fingerprints.

func Fingerprint

func Fingerprint(r io.Reader, shingle int) uint64

Fingerprint generates the fingerprint of an HTML document read from the io.Reader r, using the given shingle factor. The shingle factor is the number of consecutive inputs joined into each feature. E.g. with a shingle factor of 2, the input "a", "b", "c" is converted to "a b", "b c".
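The shingling step in the example above can be sketched as follows. The helper name shingle, the single-space separator, and the behaviour for inputs shorter than the shingle factor are assumptions for illustration; the package's internal handling may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// shingle joins every run of k consecutive tokens with a single space.
// It returns nil when k < 1 or when there are fewer than k tokens
// (an assumption of this sketch).
func shingle(tokens []string, k int) []string {
	if k < 1 || len(tokens) < k {
		return nil
	}
	out := make([]string, 0, len(tokens)-k+1)
	for i := 0; i+k <= len(tokens); i++ {
		out = append(out, strings.Join(tokens[i:i+k], " "))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", shingle([]string{"a", "b", "c"}, 2)) // prints: ["a b" "b c"]
}
```

A larger shingle factor makes each feature carry more ordering context, so fingerprints become more sensitive to reordered content.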

type Oracle

type Oracle struct {
    // contains filtered or unexported fields
}

Oracle answers the query of whether a fingerprint has been seen.

func NewOracle

func NewOracle() *Oracle

NewOracle returns an oracle that can tell whether a fingerprint has been seen.

func (*Oracle) See

func (n *Oracle) See(f uint64) *Oracle

See asks the oracle to see the fingerprint.

func (*Oracle) Seen

func (n *Oracle) Seen(f uint64, r uint8) bool

Seen asks the oracle whether anything within range r of the fingerprint has been seen before.
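To make the See/Seen contract concrete, here is a self-contained stand-in that mirrors the two method signatures with a plain linear scan. It is not the package's Oracle: the type name linearOracle is hypothetical, the inclusive comparison (distance <= r) is an assumption, and returning the receiver from See is a guess made so that calls can be chained. The file name bktree.go suggests the real type is backed by a BK-tree, which would avoid scanning every stored fingerprint.

```go
package main

import (
	"fmt"
	"math/bits"
)

// linearOracle is a hypothetical stand-in for Oracle. It keeps every
// fingerprint in a slice and scans all of them on each query.
type linearOracle struct {
	seen []uint64
}

// See records f and returns the same oracle (an assumption of this sketch).
func (o *linearOracle) See(f uint64) *linearOracle {
	o.seen = append(o.seen, f)
	return o
}

// Seen reports whether any recorded fingerprint lies within Hamming
// distance r of f. The inclusive bound is an assumption.
func (o *linearOracle) Seen(f uint64, r uint8) bool {
	for _, s := range o.seen {
		if bits.OnesCount64(s^f) <= int(r) {
			return true
		}
	}
	return false
}

func main() {
	o := &linearOracle{}
	o.See(0xFF00)
	fmt.Println(o.Seen(0xFF01, 2)) // one bit differs: true
	fmt.Println(o.Seen(0x00FF, 2)) // sixteen bits differ: false
}
```

With the package itself, the equivalent flow would presumably be NewOracle, then See for each known fingerprint, then Seen with a range such as 3 to detect near-duplicates.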

Package distance imports 4 packages and is imported by 1 package. Updated 2016-07-21.