runewise

package

v0.0.0-...-c3bdb9d Latest Latest Go to latest Published: Oct 31, 2013 License: MIT Imports: 4 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/ZackPierce/stralgo

Links

Open Source Insights

Documentation ¶

Overview ¶

Package stralgo/runewise implements various string algorithms with an emphasis on similarity metrics, implemented with support for multi-byte runes.

Index ¶

Constants
func DamerauLevenshteinDistance(a, b []rune) (int, error)
func DiceCoefficient(a, b []rune) (float64, error)
func HammingDistance(a, b []rune) (uint, error)
func JaroSimilarity(a, b []rune) float64
func JaroWinklerSimilarity(a, b []rune) float64
func JaroWinklerSimilarityParametric(a, b []rune, prefixScale float64, maxPrefixLength int, boostThreshold float64) float64
func LevenshteinDistance(a, b []rune) (int, error)
func WhiteSimilarity(a, b []rune) (float64, error)

Constants ¶

View Source

const (
	WinklerBoostThreshold  = 0.7 // JaroWinklerSimilarity suggested parameter. If the JaroSimilarity for the compared strings is above this value, add an additional boost factor based on the shared prefix length and prefix scale.
	WinklerMaxPrefixLength = 4   // JaroWinklerSimilarity suggested parameter. Used to control the maximum size of identical prefixes used in the prefix boost factor.
	WinklerPrefixScale     = 0.1 // JaroWinklerSimilarity suggested parameter. Used to control the scale of bonus added for a pair having a JaroSimilarity above the threshold and with shared string prefixes.
)

Variables ¶

This section is empty.

Functions ¶

func DamerauLevenshteinDistance ¶

func DamerauLevenshteinDistance(a, b []rune) (int, error)

DamerauLevenshteinDistance calculates the magnitude of difference between two strings using the Damerau- Levenshtein algorithm with adjacent-only transpositions, runewise.

This edit distance is the minimum number of single-rune edits (insertions, deletions, substitutions, or transpositions) to transform one string into the other. DamerauLevenshtein differs from Levenshtein primarily in that DamerauLevenshtein considers adjacent-rune transpositions.

The larger the result, the more different the strings.

See: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance

func DiceCoefficient ¶

func DiceCoefficient(a, b []rune) (float64, error)

DiceCoefficent calculates the simiarlity of two strings per the Sorensen-Dice coefficient, runewise.

The resulting value is scaled between 0 and 1.0, and a higher value means a higher similarity.

This algorithm is also known as the Sorensen Index, and is very close to the White Similarity metric, with the key distinctions that DiceCoefficient does not differentiate between whitespace and other characters and also does not account for bigram frequency count differences between the compared strings.

See: http://en.wikipedia.org/wiki/Dice_coefficient

Returns an error if both of the input strings contain less than two runes.

func HammingDistance ¶

func HammingDistance(a, b []rune) (uint, error)

See: http://en.wikipedia.org/wiki/Hamming_distance

Returns an error if the string rune counts are not equal.

func JaroSimilarity ¶

func JaroSimilarity(a, b []rune) float64

JaroSimilarity calculates the similarity between two strings using the original Jaro distance formula.

The result is between 0 and 1.0, and the higher the score, the more similar the two strings are. 1.0 is a perfect match.

If either input argument is empty ([]rune("")) or nil, the result will be 0.0. This is due to a quirk in the formal definition of the algorithm which counts the number of matching characters. In the empty or nil cases, no matches may be found at all.

See (the first half of) : http://en.wikipedia.org/wiki/Jaro-Winkler_distance

func JaroWinklerSimilarity ¶

func JaroWinklerSimilarity(a, b []rune) float64

JaroWinklerSimilarity calculates the similarity between two input strings using the Jaro-Winkler distance formula.

Winkler's suggested constants for max considered common prefix length (4), common prefix scaling factor (0.1), and boost threshold (0.7) are used.

The result is between 0 and 1.0, and the higher the score, the more similar the two strings are. 1.0 is a perfect match.

If either input argument is empty ([]rune("")) or nil, the result will be 0.0. This is due to a quirk in the formal definition of the algorithm which counts the number of matching characters. In the empty or nil cases, no matches may be found at all.

See : http://en.wikipedia.org/wiki/Jaro-Winkler_distance

See : http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html

Note that the wikipedia article does not include a description of Winkler's boost threshold, an explanation of which can be found in the lingpipe documentation (linked above), and is demonstrated in Winkler's original code.

In short, the boost threshold has the following effect:

if calculatedJaroSimilarity < WinklerBoostThreshold {
	return calculatedJaroSimilarity
} else {
	return calculatedJaroSimilarity + prefixSimilarityBonus
}

The prefixSimilarityBonus is the modification to the original Jaro formula described on the Wikipedia article, and is equivalent to:

Min(calculatedLengthOfCommonPrefix, WinklerMaxPrefixLength)*WinklerPrefixScale*(1 - calculatedJaroSimilarity)

func JaroWinklerSimilarityParametric ¶

func JaroWinklerSimilarityParametric(a, b []rune, prefixScale float64, maxPrefixLength int, boostThreshold float64) float64

JaroWinklerSimilarityParametric calculates similarity between two input strings using the Jaro-Winkler distance formula.

The product of prefixScale and maxPrefixLength should be between 0.0 and 1.0. Assuming this is true, the result will be between 0 and 1.0.

The higher the score, the more similar the two strings are. 1.0 is a perfect match.

See : http://en.wikipedia.org/wiki/Jaro-Winkler_distance

See : http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html

Note that the wikipedia article does not include a description of Winkler's boost threshold, an explanation of which can be found in the lingpipe documentation (linked above), and is demonstrated in Winkler's original code.

In short, the boost threshold has the following effect:

if calculatedJaroSimilarity < boostThreshold {
	return calculatedJaroSimilarity
} else {
	return calculatedJaroSimilarity + prefixSimilarityBonus
}

The prefixSimilarityBonus is the modification to the original Jaro formula described on the Wikipedia article, and is equivalent to:

Min(calculatedLengthOfCommonPrefix, maxPrefixLength)*prefixScale*(1 - calculatedJaroSimilarity)/

func LevenshteinDistance ¶

func LevenshteinDistance(a, b []rune) (int, error)

LevenshteinDistance calculates the magnitude of difference between two strings using the Levenshtein Distance metric.

This edit distance is the minimum number of single-rune edits (insertions, deletions, or substitutions) needed to transform one string into the other.

The larger the result, the more different the strings.

See: http://en.wikipedia.org/wiki/Levenshtein_distance

func WhiteSimilarity ¶

func WhiteSimilarity(a, b []rune) (float64, error)

WhiteSimilarity calculates the similarity of two strings through a variation on the Sorensen-Dice Coefficient algorithm.

The resulting value is scaled between 0 and 1.0, and a higher value means a higher similarity.

WhiteSimilarity differs from DiceCoefficient in that it disregards bigrams that include whitespace, and applies an upper-case filter, and accounts for bigram frequency.

See: http://www.catalysoft.com/articles/strikeamatch.html

Returns an error if neither of the input strings contains at least one rune bigram without whitespace.

Types ¶

This section is empty.

Source Files ¶

View all Source files

runewise.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL