runewise

package
v0.0.0-...-c3bdb9d Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 31, 2013 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Copyright 2013 Zack Pierce. Use of this source code is governed by a MIT-style license that can be found in the LICENSE file.

Package stralgo/runewise implements various string algorithms with an emphasis on similarity metrics, implemented with support for multi-byte runes.

Index

Constants

View Source
const (
	WinklerBoostThreshold  = 0.7 // JaroWinklerSimilarity suggested parameter. If the JaroSimilarity for the compared strings is above this value, add an additional boost factor based on the shared prefix length and prefix scale.
	WinklerMaxPrefixLength = 4   // JaroWinklerSimilarity suggested parameter. Used to control the maximum size of identical prefixes used in the prefix boost factor.
	WinklerPrefixScale     = 0.1 // JaroWinklerSimilarity suggested parameter. Used to control the scale of bonus added for a pair having a JaroSimilarity above the threshold and with shared string prefixes.
)

Variables

This section is empty.

Functions

func DamerauLevenshteinDistance

func DamerauLevenshteinDistance(a, b []rune) (int, error)

DamerauLevenshteinDistance calculates the magnitude of difference between two strings using the Damerau- Levenshtein algorithm with adjacent-only transpositions, runewise.

This edit distance is the minimum number of single-rune edits (insertions, deletions, substitutions, or transpositions) to transform one string into the other. DamerauLevenshtein differs from Levenshtein primarily in that DamerauLevenshtein considers adjacent-rune transpositions.

The larger the result, the more different the strings.

See: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance

func DiceCoefficient

func DiceCoefficient(a, b []rune) (float64, error)

DiceCoefficent calculates the simiarlity of two strings per the Sorensen-Dice coefficient, runewise.

The resulting value is scaled between 0 and 1.0, and a higher value means a higher similarity.

This algorithm is also known as the Sorensen Index, and is very close to the White Similarity metric, with the key distinctions that DiceCoefficient does not differentiate between whitespace and other characters and also does not account for bigram frequency count differences between the compared strings.

See: http://en.wikipedia.org/wiki/Dice_coefficient

Returns an error if both of the input strings contain less than two runes.

func HammingDistance

func HammingDistance(a, b []rune) (uint, error)

See: http://en.wikipedia.org/wiki/Hamming_distance

Returns an error if the string rune counts are not equal.

func JaroSimilarity

func JaroSimilarity(a, b []rune) float64

JaroSimilarity calculates the similarity between two strings using the original Jaro distance formula.

The result is between 0 and 1.0, and the higher the score, the more similar the two strings are. 1.0 is a perfect match.

If either input argument is empty ([]rune("")) or nil, the result will be 0.0. This is due to a quirk in the formal definition of the algorithm which counts the number of matching characters. In the empty or nil cases, no matches may be found at all.

See (the first half of) : http://en.wikipedia.org/wiki/Jaro-Winkler_distance

See also : http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html

func JaroWinklerSimilarity

func JaroWinklerSimilarity(a, b []rune) float64

JaroWinklerSimilarity calculates the similarity between two input strings using the Jaro-Winkler distance formula.

Winkler's suggested constants for max considered common prefix length (4), common prefix scaling factor (0.1), and boost threshold (0.7) are used.

The result is between 0 and 1.0, and the higher the score, the more similar the two strings are. 1.0 is a perfect match.

If either input argument is empty ([]rune("")) or nil, the result will be 0.0. This is due to a quirk in the formal definition of the algorithm which counts the number of matching characters. In the empty or nil cases, no matches may be found at all.

See : http://en.wikipedia.org/wiki/Jaro-Winkler_distance

See : http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html

Note that the wikipedia article does not include a description of Winkler's boost threshold, an explanation of which can be found in the lingpipe documentation (linked above), and is demonstrated in Winkler's original code.

In short, the boost threshold has the following effect:

if calculatedJaroSimilarity < WinklerBoostThreshold {
	return calculatedJaroSimilarity
} else {
	return calculatedJaroSimilarity + prefixSimilarityBonus
}

The prefixSimilarityBonus is the modification to the original Jaro formula described on the Wikipedia article, and is equivalent to:

Min(calculatedLengthOfCommonPrefix, WinklerMaxPrefixLength)*WinklerPrefixScale*(1 - calculatedJaroSimilarity)

func JaroWinklerSimilarityParametric

func JaroWinklerSimilarityParametric(a, b []rune, prefixScale float64, maxPrefixLength int, boostThreshold float64) float64

JaroWinklerSimilarityParametric calculates similarity between two input strings using the Jaro-Winkler distance formula.

The product of prefixScale and maxPrefixLength should be between 0.0 and 1.0. Assuming this is true, the result will be between 0 and 1.0.

The higher the score, the more similar the two strings are. 1.0 is a perfect match.

See : http://en.wikipedia.org/wiki/Jaro-Winkler_distance

See : http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html

Note that the wikipedia article does not include a description of Winkler's boost threshold, an explanation of which can be found in the lingpipe documentation (linked above), and is demonstrated in Winkler's original code.

In short, the boost threshold has the following effect:

if calculatedJaroSimilarity < boostThreshold {
	return calculatedJaroSimilarity
} else {
	return calculatedJaroSimilarity + prefixSimilarityBonus
}

The prefixSimilarityBonus is the modification to the original Jaro formula described on the Wikipedia article, and is equivalent to:

Min(calculatedLengthOfCommonPrefix, maxPrefixLength)*prefixScale*(1 - calculatedJaroSimilarity)/

func LevenshteinDistance

func LevenshteinDistance(a, b []rune) (int, error)

LevenshteinDistance calculates the magnitude of difference between two strings using the Levenshtein Distance metric.

This edit distance is the minimum number of single-rune edits (insertions, deletions, or substitutions) needed to transform one string into the other.

The larger the result, the more different the strings.

See: http://en.wikipedia.org/wiki/Levenshtein_distance

func WhiteSimilarity

func WhiteSimilarity(a, b []rune) (float64, error)

WhiteSimilarity calculates the similarity of two strings through a variation on the Sorensen-Dice Coefficient algorithm.

The resulting value is scaled between 0 and 1.0, and a higher value means a higher similarity.

WhiteSimilarity differs from DiceCoefficient in that it disregards bigrams that include whitespace, and applies an upper-case filter, and accounts for bigram frequency.

See: http://www.catalysoft.com/articles/strikeamatch.html

Returns an error if neither of the input strings contains at least one rune bigram without whitespace.

Types

This section is empty.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL