distance

package module
v0.0.0-...-ce56720 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 30, 2017 License: BSD-3-Clause Imports: 4 Imported by: 0

README

html-distance

html-distance is a go library for computing the proximity of the HTML pages. The implementation similiarity fingerprint is Charikar's simhash.

We used BK Tree (Burkhard and Keller) for verifying if a fingerprint is closed to a set of fingerprint within a defined proximity distance.

Distance is the hamming distance of the fingerprints. Since fingerprint is of size 64 (inherited from hash/fnv), Similiarity is defined as 1 - d / 64.

In normal scenario, similarity > 95% (i.e. d>3) could be considered as duplicated html pages.

Get the source

go get github.com/yahoo/gryffin/html-distance/...

Install

go install github.com/yahoo/gryffin/html-distance/cmd/html-distance

Command Line Interface

Usage of html-distance:

    html-distance url1 url2

Example 1

$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/

Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/, Got 200
Feature distance is 0. HTML Similarity is 100.00%

Example 2

$ html-distance https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html

Fetching https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 2. HTML Similarity is 96.88%

Example 3

$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html

Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 9. HTML Similarity is 85.94%

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Fingerprint

func Fingerprint(r io.Reader, shingle int) uint64

Fingerprint generates the fingerprint of an HTML from the io.Reader r and a shingle factor. Shingle refers to the level of shuffling. E.g. with shingle factor =2, input "a", "b", "c" will be converted to "a b", "b c"

Types

This section is empty.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL