bytewise

package
v0.0.0-...-c3bdb9d
Warning

This package is not in the latest version of its module.

Published: Oct 31, 2013 License: MIT Imports: 2 Imported by: 0

Documentation

Overview

Copyright 2013 Zack Pierce. Use of this source code is governed by an MIT-style license that can be found in the LICENSE file.

Package stralgo/bytewise implements various string algorithms with an emphasis on similarity metrics, implemented in per-byte fashion.

This bytewise approach is suited for speedy comparisons when the target strings contain no multi-byte runes.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DamerauLevenshteinDistance

func DamerauLevenshteinDistance(a, b string) (int, error)

DamerauLevenshteinDistance calculates the magnitude of difference between two strings using the Damerau-Levenshtein algorithm with adjacent-only transpositions, bytewise.

This edit distance is the minimum number of single-byte edits (insertions, deletions, substitutions, or transpositions) needed to transform one string into the other. It differs from the Levenshtein distance primarily in that it also counts a transposition of two adjacent bytes as a single edit.

The larger the result, the more different the strings.

See: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance
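
As a rough illustration of the recurrence, here is a minimal sketch of a bytewise edit distance with adjacent-only transpositions. It assumes the restricted ("optimal string alignment") variant, which may or may not match this package's implementation. The function name osaDistance is hypothetical, and the sketch omits the error return of DamerauLevenshteinDistance.

```go
package main

import "fmt"

// osaDistance is a hypothetical sketch, not this package's code.
// It computes a bytewise edit distance that allows insertions, deletions,
// substitutions, and transpositions of two adjacent bytes.
func osaDistance(a, b string) int {
	// d[i][j] is the distance between the first i bytes of a
	// and the first j bytes of b.
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if v := d[i][j-1] + 1; v < best { // insertion
				best = v
			}
			if v := d[i-1][j-1] + cost; v < best { // substitution or match
				best = v
			}
			// Transposition of two adjacent bytes.
			if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				if v := d[i-2][j-2] + 1; v < best {
					best = v
				}
			}
			d[i][j] = best
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(osaDistance("ab", "ba"))
	fmt.Println(osaDistance("kitten", "sitting"))
}
```

In this sketch a swap of two adjacent bytes costs one edit, where a plain Levenshtein distance would count two substitutions.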

func DiceCoefficient

func DiceCoefficient(a, b string) (float64, error)

DiceCoefficient calculates the similarity of two strings per the Sorensen-Dice coefficient, bytewise.

The resulting value is scaled between 0 and 1.0, and a higher value means a higher similarity.

This algorithm is also known as the Sorensen Index, and is very close to the White Similarity metric, with the key distinctions that DiceCoefficient does not differentiate between whitespace and other characters and also does not account for bigram frequency count differences between the compared strings.

See: http://en.wikipedia.org/wiki/Sorensen-Dice_coefficient

Note that this algorithm implementation operates upon individual bytes and does not account for multibyte Unicode runes.

Returns an error if both of the input strings contain fewer than two bytes.
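
The description above can be sketched as follows. This is an illustrative approximation, not this package's code. It assumes that "does not account for bigram frequency" means each string is reduced to a set of distinct byte bigrams, and it returns a score of 0 when only one input has fewer than two bytes. The names bigramSet and diceSketch are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// bigramSet returns the set of distinct adjacent byte pairs in s.
func bigramSet(s string) map[[2]byte]struct{} {
	set := make(map[[2]byte]struct{})
	for i := 0; i+1 < len(s); i++ {
		set[[2]byte{s[i], s[i+1]}] = struct{}{}
	}
	return set
}

// diceSketch is a hypothetical sketch of a set-based, bytewise
// Sorensen-Dice coefficient: 2*|A and B| / (|A| + |B|).
func diceSketch(a, b string) (float64, error) {
	if len(a) < 2 && len(b) < 2 {
		return 0, errors.New("both inputs contain fewer than two bytes")
	}
	setA, setB := bigramSet(a), bigramSet(b)
	shared := 0
	for bg := range setA {
		if _, ok := setB[bg]; ok {
			shared++
		}
	}
	return 2 * float64(shared) / float64(len(setA)+len(setB)), nil
}

func main() {
	score, err := diceSketch("night", "nacht")
	fmt.Println(score, err)
}
```

For "night" and "nacht", each string yields four distinct bigrams and only "ht" is shared, so the sketch computes 2*1/(4+4).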

func HammingDistance

func HammingDistance(a, b string) (uint, error)

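HammingDistance calculates the number of byte positions at which two strings of equal length differ.

See: http://en.wikipedia.org/wiki/Hamming_distance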

Returns an error if the string lengths are not equal.

Note that this algorithm implementation operates upon individual bytes, and does not account for multibyte Unicode runes.
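
A minimal sketch of a bytewise Hamming distance, mirroring the documented signature and error condition. The function name hammingSketch is hypothetical and this is not this package's code.

```go
package main

import (
	"errors"
	"fmt"
)

// hammingSketch is a hypothetical sketch. It counts the byte positions
// at which a and b differ and rejects inputs of unequal byte length.
func hammingSketch(a, b string) (uint, error) {
	if len(a) != len(b) {
		return 0, errors.New("string lengths are not equal")
	}
	var distance uint
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			distance++
		}
	}
	return distance, nil
}

func main() {
	d, err := hammingSketch("karolin", "kathrin")
	fmt.Println(d, err)

	_, err = hammingSketch("short", "longer")
	fmt.Println(err)
}
```

Because the length check is on bytes, two strings with the same number of runes can still be rejected if one of them contains multi-byte runes.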

func LevenshteinDistance

func LevenshteinDistance(a, b string) (int, error)

LevenshteinDistance calculates the magnitude of difference between two strings using the Levenshtein Distance metric, bytewise.

This edit distance is the minimum number of single-byte edits (insertions, deletions, or substitutions) needed to transform one string into another.

The larger the result, the more different the strings.

See: http://en.wikipedia.org/wiki/Levenshtein_distance
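
The edit distance described above can be sketched with the usual dynamic-programming recurrence, keeping only two rows of the table. This is an illustration under stated assumptions, not this package's implementation: the name levenshteinSketch is hypothetical, and the error return of LevenshteinDistance is omitted.

```go
package main

import "fmt"

// levenshteinSketch is a hypothetical sketch of a bytewise Levenshtein
// distance using two rolling rows instead of a full matrix.
func levenshteinSketch(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := prev[j] + 1 // deletion
			if v := curr[j-1] + 1; v < best { // insertion
				best = v
			}
			if v := prev[j-1] + cost; v < best { // substitution or match
				best = v
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(levenshteinSketch("kitten", "sitting"))
	// U+00E9 occupies two bytes in UTF-8, so a bytewise comparison
	// sees more than one edit between "a" and "\u00e9".
	fmt.Println(levenshteinSketch("a", "\u00e9"))
}
```

The second call shows the practical effect of working bytewise: replacing one ASCII letter with one two-byte rune is counted as two single-byte edits.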

func WhiteSimilarity

func WhiteSimilarity(a, b string) (float64, error)

WhiteSimilarity calculates the similarity of two strings through a variation on the Sorensen-Dice Coefficient algorithm, bytewise.

The resulting value is scaled between 0 and 1.0, and a higher value means a higher similarity.

WhiteSimilarity differs from DiceCoefficient in that it disregards bigrams that include (single-byte) whitespace, applies an upper-case filter, and accounts for bigram frequency.

See: http://www.catalysoft.com/articles/strikeamatch.html

Note that this algorithm implementation operates upon individual bytes and does not account for multibyte Unicode runes.

Returns an error if neither of the input strings contains at least one byte bigram without whitespace.
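
The three differences from DiceCoefficient can be sketched as follows. This is an approximation under stated assumptions, not this package's code: upper-casing is limited to ASCII letters, whitespace means the six ASCII whitespace bytes, and shared bigrams are counted as the minimum of the two frequency counts. All function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// isSpace reports whether c is a single-byte ASCII whitespace character.
func isSpace(c byte) bool {
	switch c {
	case ' ', '\t', '\n', '\v', '\f', '\r':
		return true
	}
	return false
}

// upper converts an ASCII lower-case letter to upper case.
func upper(c byte) byte {
	if c >= 'a' && c <= 'z' {
		return c - ('a' - 'A')
	}
	return c
}

// bigramCounts returns the frequency of each upper-cased byte bigram
// that contains no whitespace, plus the total number of such bigrams.
func bigramCounts(s string) (map[[2]byte]int, int) {
	counts := make(map[[2]byte]int)
	total := 0
	for i := 0; i+1 < len(s); i++ {
		if isSpace(s[i]) || isSpace(s[i+1]) {
			continue
		}
		counts[[2]byte{upper(s[i]), upper(s[i+1])}]++
		total++
	}
	return counts, total
}

// whiteSketch is a hypothetical sketch of a frequency-aware,
// case-insensitive, whitespace-skipping bigram similarity.
func whiteSketch(a, b string) (float64, error) {
	countsA, totalA := bigramCounts(a)
	countsB, totalB := bigramCounts(b)
	if totalA == 0 && totalB == 0 {
		return 0, errors.New("neither input contains a whitespace-free bigram")
	}
	shared := 0
	for bg, n := range countsA {
		if m := countsB[bg]; m < n {
			shared += m
		} else {
			shared += n
		}
	}
	return 2 * float64(shared) / float64(totalA+totalB), nil
}

func main() {
	fmt.Println(whiteSketch("France", "French"))
	fmt.Println(whiteSketch("aaaa", "aa"))
}
```

The second call shows the effect of frequency counting: "aaaa" contributes three "AA" bigrams and "aa" only one, so the score drops below 1, whereas a set-based comparison would treat the two strings as having identical bigram sets.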

Types

This section is empty.
