distance

package
v0.0.0-...-e540a08 Latest Latest
Published: Feb 12, 2021 License: BSD-3-Clause Imports: 5 Imported by: 2

README

html-distance

html-distance is a Go library for computing the proximity of HTML pages. The similarity fingerprint is implemented with Charikar's simhash.
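As a rough illustration of the simhash idea, the sketch below folds a list of string features into one 64-bit fingerprint. It is not the package's implementation: the `simhash` helper, the choice of FNV-1a from hash/fnv, and the equal weighting of features are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// simhash is a hypothetical helper (not part of the package API).
// Each feature is hashed to 64 bits; every bit position collects a vote
// of +1 when the bit is set and -1 when it is clear. The fingerprint has
// a 1 in every position whose vote total is positive.
func simhash(features []string) uint64 {
	var votes [64]int
	for _, f := range features {
		h := fnv.New64a() // assumption: FNV-1a; the package may use another variant
		h.Write([]byte(f))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

func main() {
	a := simhash([]string{"html", "body", "div", "p"})
	b := simhash([]string{"html", "body", "div", "span"})
	fmt.Printf("%016x\n%016x\n", a, b)
}
```

Because the fingerprint is built from per-bit votes, it does not depend on the order of the features, and inputs that share most of their features tend to produce fingerprints that differ in only a few bits.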

A BK-tree (Burkhard and Keller) is used to check whether a fingerprint is close to any fingerprint in a set, within a defined proximity distance.

Distance is the Hamming distance between two fingerprints. Since a fingerprint is 64 bits wide (inherited from hash/fnv), similarity is defined as 1 - d / 64.

In a normal scenario, pages with similarity > 95% (i.e. d <= 3) can be considered duplicate HTML pages.
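The distance and similarity definitions above can be sketched in a few lines of Go. The `hamming` and `similarity` helpers are hypothetical names used only for this example. They mirror the formula in the text and are not taken from the package source.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming counts the bit positions in which two 64-bit fingerprints differ.
func hamming(a, b uint64) uint8 {
	return uint8(bits.OnesCount64(a ^ b))
}

// similarity maps a Hamming distance d to 1 - d/64.
func similarity(d uint8) float64 {
	return 1 - float64(d)/64
}

func main() {
	// Print the similarity for a few distances around the 95% threshold.
	for _, d := range []uint8{0, 2, 3, 4, 9} {
		fmt.Printf("d=%d similarity=%.2f%%\n", d, similarity(d)*100)
	}
}
```

With this formula a distance of 3 gives a similarity just above 95% and a distance of 4 falls below it, which is why the duplicate threshold corresponds to d <= 3.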

Get the source

go get github.com/yahoo/gryffin/html-distance/...

Install

go install github.com/yahoo/gryffin/html-distance/cmd/html-distance

Command Line Interface

Usage of html-distance:

    html-distance url1 url2

Example 1

$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/

Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/, Got 200
Feature distance is 0. HTML Similarity is 100.00%

Example 2

$ html-distance https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html

Fetching https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 2. HTML Similarity is 96.88%

Example 3

$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html

Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 9. HTML Similarity is 85.94%

Documentation

Overview

Package distance is a Go library for computing the proximity of HTML pages. The similarity fingerprint is implemented with Charikar's simhash.

Distance is the Hamming distance between two fingerprints. Since a fingerprint is 64 bits wide (inherited from hash/fnv), similarity is defined as 1 - d / 64.

In a normal scenario, pages with similarity > 95% (i.e. d <= 3) can be considered duplicate HTML pages.

Functions

func Distance

func Distance(a, b uint64) uint8

Distance returns the similarity distance (the Hamming distance) between two fingerprints.

func Fingerprint

func Fingerprint(r io.Reader, shingle int) uint64

Fingerprint generates the fingerprint of an HTML document read from the io.Reader r, using the given shingle factor. The shingle factor is the number of consecutive features joined into one shingle. E.g. with a shingle factor of 2, the input "a", "b", "c" is converted to "a b", "b c".
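The shingling step in that example can be sketched as follows. The `shingles` helper is hypothetical and only reproduces the "a", "b", "c" example from the text. How the package tokenizes HTML, and what it does when there are fewer tokens than the shingle factor, is not stated here, so returning nil in that case is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// shingles joins every run of n consecutive tokens with a single space.
// It returns nil when n < 1 or when there are fewer than n tokens
// (an assumption made for this sketch).
func shingles(tokens []string, n int) []string {
	if n < 1 || len(tokens) < n {
		return nil
	}
	out := make([]string, 0, len(tokens)-n+1)
	for i := 0; i+n <= len(tokens); i++ {
		out = append(out, strings.Join(tokens[i:i+n], " "))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", shingles([]string{"a", "b", "c"}, 2))
}
```

A larger shingle factor makes each feature capture more local ordering, so two pages must share longer runs of tokens to produce matching features.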

Types

type Oracle

type Oracle struct {
	// contains filtered or unexported fields
}

Oracle answers queries about whether a fingerprint has been seen.

func NewOracle

func NewOracle() *Oracle

NewOracle returns an oracle that can tell whether a fingerprint has been seen.

func (*Oracle) See

func (n *Oracle) See(f uint64) *Oracle

See asks the oracle to see the fingerprint.

func (*Oracle) Seen

func (n *Oracle) Seen(f uint64, r uint8) bool

Seen asks the oracle whether any fingerprint within distance r of the given fingerprint has been seen before.
