jellyfish

v0.0.0-...-81d50dd
Published: Aug 21, 2019 License: BSD-3-Clause Imports: 2 Imported by: 4

README

go-jellyfish

go-jellyfish is a Go library for approximate and phonetic matches of strings.

go-jellyfish is based on the C/Python version of jellyfish.

Written by James Turk <dev@jamesturk.net> and released under a BSD-style license. (See LICENSE for details.)

Porter Stemmer implementation based upon Alex Gonopolskiy's go-stem, with permission.

Requirements

Tests require Go >= 1.4

Included Algorithms

String comparison:

  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Match Rating Approach Comparison
  • Hamming Distance

Phonetic encoding:

  • American Soundex
  • Metaphone
  • NYSIIS (New York State Identification and Intelligence System)
  • Match Rating Codex

Documentation

Overview

Package jellyfish provides Go implementations of common string comparison and phonetic encoding algorithms.

Source code and other details are available at GitHub: https://github.com/jamesturk/go-jellyfish

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DamerauLevenshtein

func DamerauLevenshtein(s1, s2 string) int

DamerauLevenshtein computes the Damerau-Levenshtein distance between two strings.

A modification of Levenshtein distance, Damerau-Levenshtein distance counts the number of edits (insertions, deletions, and substitutions) but, unlike Levenshtein, treats a transposition of two adjacent characters (such as ifsh for fish) as a single edit.

For example:

Levenshtein("fish", "ifsh") == 2          // one deletion, one insertion
                                          // but...
DamerauLevenshtein("fish", "ifsh") == 1   // one transposition

See the Damerau-Levenshtein distance article at Wikipedia (http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance) for more details.
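
The transposition rule can be illustrated with a small dynamic-programming sketch. This is not the package's implementation: the function name osaDistance is hypothetical, and the code implements the restricted "optimal string alignment" variant, which can give a larger result than unrestricted Damerau-Levenshtein on some inputs (for example "ca" and "abc").

```go
package main

import "fmt"

// osaDistance is a hypothetical helper for illustration only. It computes the
// optimal string alignment distance: Levenshtein edits plus transposition of
// two adjacent characters, each counted as one edit.
func osaDistance(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	// d[i][j] is the distance between the first i runes of a and the first j runes of b.
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if v := d[i][j-1] + 1; v < best { // insertion
				best = v
			}
			if v := d[i-1][j-1] + cost; v < best { // substitution or match
				best = v
			}
			// Adjacent transposition: the last two runes are swapped.
			if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				if v := d[i-2][j-2] + 1; v < best {
					best = v
				}
			}
			d[i][j] = best
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(osaDistance("fish", "ifsh")) // prints 1
}
```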

func Hamming

func Hamming(s1, s2 string) int

Hamming computes the Hamming distance between s1 and s2.

Hamming distance is the number of characters that differ between two strings.

Typically Hamming distance is undefined when strings are of different lengths, but this implementation counts the extra characters as differing. Thus Hamming("abc", "abcd") == 1.

See the Hamming distance article at Wikipedia (http://en.wikipedia.org/wiki/Hamming_distance) for more details.
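
A minimal sketch of the behavior described above, assuming the comparison is done rune by rune. The helper name hamming is hypothetical and the code is not taken from the package.

```go
package main

import "fmt"

// hamming is a hypothetical helper for illustration only. It counts differing
// positions and treats every extra rune in the longer string as a difference.
func hamming(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	if len(a) > len(b) {
		a, b = b, a // make a the shorter slice
	}
	dist := len(b) - len(a) // extra characters all count as differing
	for i := range a {
		if a[i] != b[i] {
			dist++
		}
	}
	return dist
}

func main() {
	fmt.Println(hamming("abc", "abcd"))        // prints 1
	fmt.Println(hamming("karolin", "kathrin")) // prints 3
}
```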

func Jaro

func Jaro(s1, s2 string) float64

Jaro computes the Jaro distance between two strings.

Jaro distance is a string-edit distance that gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.
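
To show where the [0,1] value comes from, here is a sketch of the textbook Jaro formula: with m matching characters and t transpositions, the score is (m/|s1| + m/|s2| + (m-t)/m) / 3. The function name jaro and the handling of empty input are assumptions of this sketch, not details taken from the package.

```go
package main

import "fmt"

// jaro is a hypothetical helper for illustration only.
func jaro(s1, s2 string) float64 {
	a, b := []rune(s1), []rune(s2)
	if len(a) == 0 || len(b) == 0 {
		return 0 // assumption of this sketch: empty input scores 0
	}
	// Characters match only if they are within this many positions of each other.
	window := len(a)
	if len(b) > window {
		window = len(b)
	}
	window = window/2 - 1
	if window < 0 {
		window = 0
	}
	aMatched := make([]bool, len(a))
	bMatched := make([]bool, len(b))
	matches := 0
	for i := range a {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(b) {
			hi = len(b)
		}
		for j := lo; j < hi; j++ {
			if !bMatched[j] && a[i] == b[j] {
				aMatched[i], bMatched[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Walk the matched characters in order; each out-of-order pair is half a transposition.
	k, halfTranspositions := 0, 0
	for i := range a {
		if !aMatched[i] {
			continue
		}
		for !bMatched[k] {
			k++
		}
		if a[i] != b[k] {
			halfTranspositions++
		}
		k++
	}
	m := float64(matches)
	t := float64(halfTranspositions / 2)
	return (m/float64(len(a)) + m/float64(len(b)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaro("MARTHA", "MARHTA")) // prints 0.9444
}
```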

func JaroWinkler

func JaroWinkler(s1, s2 string) float64

JaroWinkler computes the Jaro-Winkler distance between two strings.

Jaro-Winkler is a modification of, and improvement on, Jaro distance. Like Jaro, it gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.

See the Jaro-Winkler distance article at Wikipedia (http://en.wikipedia.org/wiki/Jaro-Winkler_distance) for more details.
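
The Winkler modification is commonly stated as jw = j + l*p*(1 - j), where j is the Jaro score, l is the length of the common prefix (capped at 4), and p is a scaling factor (conventionally 0.1). The sketch below applies that formula to an already-computed Jaro score. The name winklerAdjust is hypothetical, and the package may apply additional conditions that are not shown here.

```go
package main

import "fmt"

// winklerAdjust is a hypothetical helper for illustration only. It boosts a
// precomputed Jaro score according to the length of the shared prefix.
func winklerAdjust(jaroScore float64, s1, s2 string) float64 {
	a, b := []rune(s1), []rune(s2)
	const maxPrefix = 4 // conventional cap on the prefix length
	const p = 0.1       // conventional scaling factor
	prefix := 0
	for prefix < len(a) && prefix < len(b) && prefix < maxPrefix && a[prefix] == b[prefix] {
		prefix++
	}
	return jaroScore + float64(prefix)*p*(1-jaroScore)
}

func main() {
	// 17/18 is the Jaro score of MARTHA and MARHTA; the shared prefix is "MAR".
	fmt.Printf("%.4f\n", winklerAdjust(17.0/18.0, "MARTHA", "MARHTA")) // prints 0.9611
}
```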

func Levenshtein

func Levenshtein(s1, s2 string) int

Levenshtein computes the Levenshtein distance between two strings.

Levenshtein distance represents the number of insertions, deletions, and substitutions required to change one word to another.

For example:

Levenshtein("berne", "born") == 2

representing the transformation of the first e to o and the deletion of the second e.

See the Levenshtein distance article at Wikipedia (http://en.wikipedia.org/wiki/Levenshtein_distance) for more details.
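
As a sketch of how such a count can be computed, the following uses the classic two-row dynamic-programming approach. The helper names levenshtein and min3 are hypothetical, and the code is not the package's implementation.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(x, y, z int) int {
	if y < x {
		x = y
	}
	if z < x {
		x = z
	}
	return x
}

// levenshtein is a hypothetical helper for illustration only.
func levenshtein(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	// prev holds distances for the previous row of the edit matrix,
	// cur is the row being filled in.
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(levenshtein("berne", "born")) // prints 2
	fmt.Println(levenshtein("fish", "ifsh"))  // prints 2
}
```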

func MatchRatingCodex

func MatchRatingCodex(str string) string

MatchRatingCodex calculates the match rating approach value (also called PNI) for a string.

The match rating approach is an algorithm for determining whether or not two names are pronounced similarly. It consists of an encoding function (similar to Soundex or NYSIIS), which is implemented here, and MatchRatingComparison, which does the actual comparison.

See the Match Rating Approach article at Wikipedia (http://en.wikipedia.org/wiki/Match_rating_approach) for more details.
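
The encoding step can be sketched from the rules as they are usually summarized: drop vowels unless they begin the word, drop the second of two identical adjacent letters, and shorten anything over six letters to its first three plus last three. The helper name mraCodex is hypothetical, the handling of doubled letters is one reading of those rules, and the package's own output may differ on edge cases such as spaces or punctuation.

```go
package main

import (
	"fmt"
	"strings"
)

// mraCodex is a hypothetical helper for illustration only.
func mraCodex(name string) string {
	var out []rune
	var prev rune
	for i, c := range []rune(strings.ToUpper(name)) {
		isVowel := strings.ContainsRune("AEIOU", c)
		// Keep a leading vowel; keep a consonant unless it repeats the previous letter.
		if (i == 0 && isVowel) || (!isVowel && c != prev) {
			out = append(out, c)
		}
		prev = c
	}
	if len(out) > 6 {
		// Join the first three and last three letters.
		out = append(out[:3:3], out[len(out)-3:]...)
	}
	return string(out)
}

func main() {
	fmt.Println(mraCodex("Byrne"))     // prints BYRN
	fmt.Println(mraCodex("Catherine")) // prints CTHRN
}
```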

func MatchRatingComparison

func MatchRatingComparison(s1, s2 string) bool

MatchRatingComparison compares two strings using the match rating approach algorithm. Returns true if strings are considered equivalent or false if not.

The Match rating approach algorithm is an algorithm for determining whether or not two names are pronounced similarly. Strings are first encoded using MatchRatingCodex then compared according to the MRA algorithm.

See the Match Rating Approach article at Wikipedia (http://en.wikipedia.org/wiki/Match_rating_approach) for more details.

func Metaphone

func Metaphone(s string) string

Metaphone calculates the metaphone code for a string.

The Metaphone algorithm was designed as an improvement on Soundex. It transforms a word into a string consisting of '0BFHJKLMNPRSTWXY' where '0' is pronounced 'th' and 'X' is a '[sc]h' sound.

For example:

Metaphone("Klumpz") == Metaphone("Clumps")    // KLMPS

See the Metaphone article at Wikipedia (http://en.wikipedia.org/wiki/Metaphone) for more details.

func Nysiis

func Nysiis(s string) string

Nysiis calculates the NYSIIS code for a string.

The NYSIIS algorithm is an algorithm developed by the New York State Identification and Intelligence System. It transforms a word into a phonetic code. Like Soundex and Metaphone it is primarily intended for use on names (as they would be pronounced in English).

For example:

Nysiis("John") == Nysiis("Jan") 	// JAN

See the NYSIIS article at Wikipedia (http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System) for more details.

func Porter

func Porter(str string) string

Porter returns the stem of the given string using the common Porter stemmer algorithm.

Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.

Martin Porter's algorithm is a common algorithm used for stemming English words that works for many purposes.

See the official homepage for the Porter Stemming Algorithm (http://tartarus.org/martin/PorterStemmer/) for more details.

This Go implementation takes inspiration from Alex Gonopolskiy's go-stem (https://github.com/agonopol/go-stem).

func Soundex

func Soundex(str string) string

Soundex is an algorithm to convert a word (typically a name) to a four-character code in the form 'A123', where 'A' is the first letter of the name and the digits represent similar sounds.

For example:

Soundex("Ann") == Soundex("Anne")      // A500
Soundex("Rupert") == Soundex("Robert") // R163

See the Soundex article at Wikipedia (http://en.wikipedia.org/wiki/Soundex) for more details.
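
The 'A123' form can be reproduced with a short sketch of the American Soundex rules: keep the first letter, map consonants to digit classes, skip a letter whose class repeats the previous one (H and W do not break such a run, vowels do), and pad with zeros. The helper name soundex is hypothetical and the code is not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// soundex is a hypothetical helper for illustration only.
func soundex(name string) string {
	codes := map[rune]byte{
		'B': '1', 'F': '1', 'P': '1', 'V': '1',
		'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
		'D': '3', 'T': '3',
		'L': '4',
		'M': '5', 'N': '5',
		'R': '6',
	}
	var out []byte
	var last byte // class of the previous letter; 0 after a vowel
	for _, c := range strings.ToUpper(name) {
		if c < 'A' || c > 'Z' {
			continue // ignore anything that is not an ASCII letter
		}
		code := codes[c] // 0 for vowels, H, W and Y
		if len(out) == 0 {
			out = append(out, byte(c)) // the first letter is kept as-is
			last = code
			continue
		}
		if c == 'H' || c == 'W' {
			continue // H and W do not separate letters of the same class
		}
		if code != 0 && code != last {
			out = append(out, code)
			if len(out) == 4 {
				break
			}
		}
		last = code
	}
	if len(out) == 0 {
		return ""
	}
	for len(out) < 4 {
		out = append(out, '0') // pad short codes with zeros
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Ann"), soundex("Anne"))      // prints A500 A500
	fmt.Println(soundex("Rupert"), soundex("Robert")) // prints R163 R163
}
```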

Types

This section is empty.
