matchr

Version: v0.0.0-...-7bed6ef
Warning

This package is not in the latest version of its module.

Published: Nov 6, 2022 License: GPL-2.0 Imports: 5 Imported by: 19

README

matchr

An approximate string matching library for the Go programming language.

Rationale

Data used in record linkage is often of dubious quality. Typographical errors and data elements that change over time, among other things, make it difficult to establish similarity between two sets of data. Rather than rely on exact string comparison in such situations, it is vital to have a way to measure how similar two strings are. Similarity functions can be tailored to particular data sets in order to make better matching decisions. The matchr library provides several of these similarity functions.

Documentation

Constants

const GAP_COST = float64(0.5)

Variables

This section is empty.

Functions

func DamerauLevenshtein

func DamerauLevenshtein(s1 string, s2 string) (distance int)

DamerauLevenshtein computes the Damerau-Levenshtein distance between two strings. The returned value - distance - is the minimum number of insertions, deletions, substitutions, and transpositions of adjacent characters it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to the Optimal String Alignment algorithm, but is more complex because it allows a substring to be edited more than once.

This implementation is based on the one found on Wikipedia at http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions as well as Kevin Stern's Java implementation found at https://github.com/KevinStern/software-and-algorithms.
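To make the recurrence concrete, here is a minimal, self-contained sketch of the unrestricted Damerau-Levenshtein distance, following the Wikipedia formulation referenced above. The name damerauLevenshtein is hypothetical and this is not the package's own code; it only illustrates the technique.

```go
package main

import "fmt"

// damerauLevenshtein is an illustrative sketch (not matchr's code) of the
// unrestricted Damerau-Levenshtein distance, computed over runes.
func damerauLevenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	maxDist := len(ra) + len(rb)

	// d is shifted by one row and column so that index 0 holds a sentinel
	// (maxDist) that can never win the minimum.
	d := make([][]int, len(ra)+2)
	for i := range d {
		d[i] = make([]int, len(rb)+2)
	}
	d[0][0] = maxDist
	for i := 0; i <= len(ra); i++ {
		d[i+1][0] = maxDist
		d[i+1][1] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j+1] = maxDist
		d[1][j+1] = j
	}

	lastRow := map[rune]int{} // last row (1-based) in which each rune of a was seen
	for i := 1; i <= len(ra); i++ {
		lastCol := 0 // last column in this row where the runes matched
		for j := 1; j <= len(rb); j++ {
			k, l := lastRow[rb[j-1]], lastCol
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
				lastCol = j
			}
			best := d[i][j] + cost // substitution or match
			if v := d[i+1][j] + 1; v < best { // insertion
				best = v
			}
			if v := d[i][j+1] + 1; v < best { // deletion
				best = v
			}
			// Transposition, allowing edits between the swapped runes.
			if v := d[k][l] + (i - k - 1) + 1 + (j - l - 1); v < best {
				best = v
			}
			d[i+1][j+1] = best
		}
		lastRow[ra[i-1]] = i
	}
	return d[len(ra)+1][len(rb)+1]
}

func main() {
	fmt.Println(damerauLevenshtein("CA", "ABC")) // 2
}
```

The pair "CA" and "ABC" is the usual example that separates this algorithm from Optimal String Alignment: transposing "CA" to "AC" and then inserting "B" between the two letters costs 2, which is only possible because the transposed substring is edited a second time.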

func DoubleMetaphone

func DoubleMetaphone(s1 string) (string, string)

DoubleMetaphone computes the Double-Metaphone value of the input string. This value is a phonetic representation of how the string sounds, with allowances for many different language dialects. The algorithm yields a primary and an alternate encoding, hence the two return values. It was originally developed by Lawrence Philips in the 1990s.

More information about this algorithm can be found on Wikipedia at http://en.wikipedia.org/wiki/Metaphone.

func Hamming

func Hamming(s1 string, s2 string) (distance int, err error)

Hamming computes the Hamming distance between two equal-length strings. This is the number of positions at which the corresponding characters of the two strings differ. This implementation is based on the algorithm description found at http://en.wikipedia.org/wiki/Hamming_distance.
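The definition is short enough to sketch directly. The following is an illustration only, with a hypothetical function name; returning an error for unequal lengths is an assumption suggested by the (distance int, err error) signature above, and comparing runes rather than bytes is a choice of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming is an illustrative sketch (not matchr's code). It counts the
// positions at which two equal-length strings differ, comparing runes so
// that a multi-byte character counts as one position.
func hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, errors.New("hamming: strings must have equal length")
	}
	distance := 0
	for i := range ra {
		if ra[i] != rb[i] {
			distance++
		}
	}
	return distance, nil
}

func main() {
	d, err := hamming("karolin", "kathrin")
	fmt.Println(d, err) // 3 <nil>
}
```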

func Jaro

func Jaro(r1 string, r2 string) (distance float64)

Jaro computes the Jaro similarity between two strings. The result is a float64 between 0 and 1 inclusive, with 0 indicating the two strings are not at all similar and 1 indicating the two strings are exact matches. Although the return value is named distance, higher values mean more similar strings.

See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for a full description.
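For readers who want to see how the score is assembled, here is a minimal sketch of the textbook Jaro formula, (m/|r1| + m/|r2| + (m-t)/m) / 3, where m is the number of matching characters and t is half the number of matched characters that appear in a different order. The name jaro and the handling of empty inputs are assumptions of this sketch, not a description of the package's behavior.

```go
package main

import "fmt"

// jaro is an illustrative sketch (not matchr's code) of the textbook Jaro
// similarity, computed over runes.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1 // assumption: two empty strings count as identical
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}

	// Runes match only if they are equal and at most `window` positions apart.
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}

	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}

	// Walk both match sequences in order and count out-of-order pairs.
	outOfOrder, j := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			outOfOrder++
		}
		j++
	}

	m := float64(matches)
	t := float64(outOfOrder / 2)
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaro("MARTHA", "MARHTA")) // 0.9444
}
```

In the "MARTHA" and "MARHTA" example all six characters match and the "TH"/"HT" pair contributes one transposition, so the score is (1 + 1 + 5/6) / 3.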

func JaroWinkler

func JaroWinkler(r1 string, r2 string, longTolerance bool) (distance float64)

JaroWinkler computes the Jaro-Winkler similarity between two strings. This is a modification of the Jaro algorithm that gives additional weight to matching prefixes.
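The prefix weighting can be sketched on its own. The snippet below applies the commonly published Winkler adjustment, jw = j + l * p * (1 - j), to a Jaro score that has already been computed. The function name, the scaling factor p = 0.1, and the prefix cap of 4 are the textbook defaults and are assumptions here; the package's own implementation, including the effect of longTolerance, may differ.

```go
package main

import "fmt"

// winklerBoost is an illustrative sketch (not matchr's code). Given a Jaro
// score for a and b, it adds the standard Winkler bonus for a shared prefix.
func winklerBoost(jaroScore float64, a, b string) float64 {
	const (
		scaling   = 0.1 // assumed textbook scaling factor p
		maxPrefix = 4   // assumed textbook cap on the prefix length l
	)
	ra, rb := []rune(a), []rune(b)
	prefix := 0
	for prefix < len(ra) && prefix < len(rb) && prefix < maxPrefix && ra[prefix] == rb[prefix] {
		prefix++
	}
	return jaroScore + float64(prefix)*scaling*(1-jaroScore)
}

func main() {
	// 17/18 is the Jaro score of "MARTHA" and "MARHTA"; they share "MAR".
	fmt.Printf("%.4f\n", winklerBoost(17.0/18.0, "MARTHA", "MARHTA")) // 0.9611
}
```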

func Levenshtein

func Levenshtein(s1 string, s2 string) (distance int)

Levenshtein computes the Levenshtein distance between two strings. The returned value - distance - is the minimum number of insertions, deletions, and substitutions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point.
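As an illustration of the dynamic program behind this distance, here is a minimal sketch that keeps only two rows of the edit matrix. The names levenshtein and min3 are hypothetical, and the code is not taken from the package.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein is an illustrative sketch (not matchr's code). It returns the
// minimum number of single-rune insertions, deletions, and substitutions
// needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the row being filled
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
}
```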

func LongestCommonSubsequence

func LongestCommonSubsequence(s1, s2 string) int

LongestCommonSubsequence computes the longest common subsequence of two strings. The returned value is the length of that subsequence: the longest sequence of letters that appears in both strings in the same order, though not necessarily contiguously.
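A short sketch may help distinguish a subsequence from a substring. The function below, with the hypothetical name lcsLength, uses the classic two-row dynamic program; it is an illustration and not the package's code.

```go
package main

import "fmt"

// lcsLength is an illustrative sketch (not matchr's code). It returns the
// length of the longest sequence of runes that appears in both strings in
// the same order, not necessarily contiguously.
func lcsLength(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			switch {
			case ra[i-1] == rb[j-1]:
				curr[j] = prev[j-1] + 1 // extend the common subsequence
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j] // best result without ra[i-1]
			default:
				curr[j] = curr[j-1] // best result without rb[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	// "GTAB" is common to both strings even though it is contiguous in neither.
	fmt.Println(lcsLength("AGGTAB", "GXTXAYB")) // 4
}
```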

func NYSIIS

func NYSIIS(s1 string) string

NYSIIS computes the NYSIIS (New York State Identification and Intelligence System) phonetic encoding of the input string. It is a modification of the traditional Soundex algorithm.

func OSA

func OSA(s1 string, s2 string) (distance int)

OSA computes the Optimal String Alignment distance between two strings. The returned value - distance - is the minimum number of insertions, deletions, substitutions, and transpositions of adjacent characters it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to Damerau-Levenshtein, but is simpler because it does not allow any substring to be edited more than once.
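The restriction is easiest to see in code. The sketch below, with the hypothetical name osa, is the Levenshtein recurrence plus a single extra case for swapping two adjacent runes; it is an illustration and not the package's implementation.

```go
package main

import "fmt"

// osa is an illustrative sketch (not matchr's code) of the Optimal String
// Alignment distance, computed over runes with a full matrix.
func osa(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if v := d[i][j-1] + 1; v < best { // insertion
				best = v
			}
			if v := d[i-1][j-1] + cost; v < best { // substitution or match
				best = v
			}
			// Transposition of two adjacent runes; nothing may be edited
			// between them, which is the OSA restriction.
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if v := d[i-2][j-2] + 1; v < best {
					best = v
				}
			}
			d[i][j] = best
		}
	}
	return d[len(ra)][len(rb)]
}

func main() {
	fmt.Println(osa("ab", "ba"))  // 1
	fmt.Println(osa("CA", "ABC")) // 3
}
```

The second call shows the restriction at work: reaching "ABC" from "CA" in two steps would require transposing "CA" and then inserting a letter inside the transposed pair, which OSA forbids, so the result is 3.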

func Phonex

func Phonex(s1 string) string

Phonex computes the Phonex phonetic encoding of the input string. Phonex is a modification of the venerable Soundex algorithm. It accounts for a few more letter combinations to improve accuracy on some data sets.

This implementation is based on the original C implementation by the algorithm's creator, A. J. Lait, as found in his research paper entitled "An Assessment of Name Matching Algorithms."

func SmithWaterman

func SmithWaterman(s1 string, s2 string) float64

SmithWaterman computes the Smith-Waterman local sequence alignment for the two input strings. This was originally designed to find similar regions in strings representing DNA or protein sequences.
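A minimal sketch of a Smith-Waterman score follows. The scoring scheme is an assumption made for illustration: +1 for a match, -2 for a mismatch, and a linear gap penalty of 0.5, chosen to mirror the value of the GAP_COST constant listed above. Whether the package scores matches and mismatches this way, or applies GAP_COST in this manner, is not stated in this documentation. The function name is hypothetical.

```go
package main

import "fmt"

// smithWaterman is an illustrative sketch (not matchr's code). It returns
// the best local alignment score between a and b under an assumed scoring
// scheme: match +1, mismatch -2, gap -0.5 per skipped rune.
func smithWaterman(a, b string) float64 {
	const (
		match    = 1.0
		mismatch = -2.0
		gap      = 0.5
	)
	ra, rb := []rune(a), []rune(b)
	prev := make([]float64, len(rb)+1)
	curr := make([]float64, len(rb)+1)
	best := 0.0
	for i := 1; i <= len(ra); i++ {
		curr[0] = 0
		for j := 1; j <= len(rb); j++ {
			s := mismatch
			if ra[i-1] == rb[j-1] {
				s = match
			}
			v := prev[j-1] + s                 // align ra[i-1] with rb[j-1]
			if g := prev[j] - gap; g > v {     // gap in b
				v = g
			}
			if g := curr[j-1] - gap; g > v {   // gap in a
				v = g
			}
			if v < 0 {
				v = 0 // a local alignment may restart anywhere
			}
			curr[j] = v
			if v > best {
				best = v
			}
		}
		prev, curr = curr, prev
	}
	return best
}

func main() {
	fmt.Println(smithWaterman("xxabcxx", "yyabcyy")) // 3
	fmt.Println(smithWaterman("abcde", "abde"))      // 3.5
}
```

Because cell scores are clamped at zero, the surrounding "xx" and "yy" in the first call do not drag the score down; only the shared region "abc" contributes. In the second call, four matches and one gap give 4 - 0.5.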

func Soundex

func Soundex(s1 string) string

Soundex computes the Soundex phonetic representation of the input string. It attempts to encode homophones with the same characters. More information can be found at http://en.wikipedia.org/wiki/Soundex.
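The encoding rules can be sketched as follows, assuming the standard American Soundex variant: keep the first letter, map the remaining consonants to digits, collapse repeated codes (including across H and W), and pad to four characters. The function name is hypothetical, non-letters are simply ignored, and the package's own handling of edge cases may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// soundexCodes maps consonants to their Soundex digit. Vowels, H, W, and Y
// are absent and therefore yield the zero byte on lookup.
var soundexCodes = map[rune]byte{
	'B': '1', 'F': '1', 'P': '1', 'V': '1',
	'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
	'D': '3', 'T': '3',
	'L': '4',
	'M': '5', 'N': '5',
	'R': '6',
}

// soundex is an illustrative sketch (not matchr's code) of American Soundex.
func soundex(s string) string {
	out := make([]byte, 0, 4)
	var last byte // code of the previous coded letter; 0 means none
	for _, r := range strings.ToUpper(s) {
		if r < 'A' || r > 'Z' {
			continue // assumption: ignore anything that is not A-Z
		}
		code, coded := soundexCodes[r]
		if len(out) == 0 {
			out = append(out, byte(r)) // the first letter is kept as-is
			last = code
			continue
		}
		switch {
		case r == 'H' || r == 'W':
			// H and W do not separate letters that share a code.
		case !coded:
			last = 0 // a vowel (or Y) separates equal codes
		case code != last:
			out = append(out, code)
			last = code
		}
		if len(out) == 4 {
			break
		}
	}
	if len(out) == 0 {
		return ""
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Robert"), soundex("Rupert")) // R163 R163
	fmt.Println(soundex("Ashcraft"))                  // A261
}
```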

Types

type String

type String struct {
	// contains filtered or unexported fields
}

String wraps a regular string with a small structure that provides more efficient indexing by code point index, as opposed to byte index. Scanning incrementally forwards or backwards is O(1) per index operation (although not as fast as a range clause going forwards). Random access is O(N) in the length of the string, but the overhead is less than always scanning from the beginning. If the string is ASCII, random access is O(1). Unlike the built-in string type, String has internal mutable state and is not thread-safe.
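To show why code point indexing differs from byte indexing, here is a small standard-library sketch. It does not use the String type; the helper runeAt is hypothetical and implements the naive scan-from-the-start lookup that a wrapper of this kind is meant to improve on.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt is an illustrative helper (not part of matchr). It returns the
// i-th code point of s by scanning from the beginning, which costs O(N)
// on every call.
func runeAt(s string, i int) (rune, bool) {
	for _, r := range s {
		if i == 0 {
			return r, true
		}
		i--
	}
	return 0, false
}

func main() {
	s := "héllo"

	// The built-in len and index operations work on bytes, and é takes
	// two bytes in UTF-8.
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 5

	r, _ := runeAt(s, 1)
	fmt.Printf("%c %d\n", r, s[1]) // é 195
}
```

The byte at s[1] is only the first half of the two-byte encoding of é, which is why algorithms that compare characters position by position need rune-based access.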

func NewString

func NewString(contents string) *String

NewString returns a new UTF-8 string with the provided contents.

func (*String) At

func (s *String) At(i int) int

At returns the rune with index i in the String. The sequence of runes is the same as iterating over the contents with a "for range" clause.

func (*String) Init

func (s *String) Init(contents string) *String

Init initializes an existing String to hold the provided contents. It returns a pointer to the initialized String.

func (*String) IsASCII

func (s *String) IsASCII() bool

IsASCII returns a boolean indicating whether the String contains only ASCII bytes.

func (*String) RuneCount

func (s *String) RuneCount() int

RuneCount returns the number of runes (Unicode code points) in the String.

func (*String) Slice

func (s *String) Slice(i, j int) string

Slice returns the string sliced at rune positions [i:j].

func (*String) String

func (s *String) String() string

String returns the contents of the String. This method also means the String is directly printable by fmt.Print.
