stringdist

package module
v0.0.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 16, 2019 License: Apache-2.0 Imports: 0 Imported by: 0

README

stringdist

stringdist package contains several string metrics for calculating edit distance between two different strings. This includes the Levenshtein Distance, Damerau Levenshtein (both Optimal String Alignment, OSA and true damerau levenshtein), Jaro, Jaro Winkler and additionally a BK-Tree that can be used for autocorrect.

Algorithms

  • Levenshtein: A string metric for measuring the difference between two sequence. Done by computing the minimum number of single-edit character edit (insertion, substitution and deletion) required to change from one word to another.
  • Damerau-Levenshteim: similar to Levenshtein, but allows transposition of two adjacent characters. Can be computed with two different algorithm - Optimal String Alignment, (OSA) and true damerau-levenshtein. The assumption for ASA is taht no substring is edited more than once.
  • Jaro: Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.
  • Jaro-Winkler: Similar to Jaro, but uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.
  • BK-Tree: A tree data structure specialized to index data in a metric space. Can be used for approximate string matching in a dictionary.

Other algorithms to explore:

  • Sift3/4 algorithm
  • Soundex
  • Metaphone
  • Hamming Distance
  • Symspell
  • Linspell

Thoughts

  • Autocorrect can be implemented using any of the distance metrics (such as levenshtein) with BK-Tree
  • Distance metric can be supplied to bk-tree through an interface.
  • Dictionary words can first be supplied to the tree, and subsequent words can be added later through other means (syncing, streaming, pub-sub)
  • The tree can be snapshotted periodically to avoid rebuild (e.g. using gob), test should be conducted to see if rebuilding the tree is faster than reloading the whole tree.
  • Build tree through prefix (A-Z) would result in better performance (?). How to avoid hotspots (more characters in A than Z)?
  • Can part of the tree be transmitted through the network?
  • How to blacklist words that are not supposed to be searchable? (profanity words)

References

Documentation

Index

Constants

View Source
const Tolerance = 2

The difference in edit distance.

View Source
const WordLen = 32

Maximum word length that is supported.

Variables

View Source
var BoostThreshold float64 = 0.7
View Source
var PrefixLen = 4.

Functions

func Jaro

func Jaro(source, target string) float64

func JaroWinkler

func JaroWinkler(s1, s2 string) float64

Types

type BKTree

type BKTree struct {
	// contains filtered or unexported fields
}

func NewBKTree

func NewBKTree(calculator Calculator) *BKTree

func (*BKTree) Add

func (b *BKTree) Add(word string)

func (*BKTree) Search

func (b *BKTree) Search(word string, d int) []string

type Calculator

type Calculator interface {
	Calculate(source, target string) int
}

Calculator calculates the distance between two string.

type DamerauLevenshtein

type DamerauLevenshtein struct {
	// contains filtered or unexported fields
}

func NewDamerauLevenshtein

func NewDamerauLevenshtein(size int) *DamerauLevenshtein

func (*DamerauLevenshtein) Calculate

func (d *DamerauLevenshtein) Calculate(s, t string) int

type Levenshtein

type Levenshtein struct {
	// contains filtered or unexported fields
}

Levenshtein represents the levenshtein operation.

func NewLevenshtein

func NewLevenshtein(size int) *Levenshtein

NewLevenshtein returns a new levenshtein operation.

func (*Levenshtein) Calculate

func (l *Levenshtein) Calculate(s, t string) int

Calculate the levenshtein distance for the given source and target string.

type Node

type Node struct {
	// contains filtered or unexported fields
}

func NewNode

func NewNode(word string) *Node

type TrueDamerauLevenshtein

type TrueDamerauLevenshtein struct {
	// contains filtered or unexported fields
}

func NewTrueDamerauLevenshtein

func NewTrueDamerauLevenshtein() *TrueDamerauLevenshtein

func (*TrueDamerauLevenshtein) Calculate

func (dl *TrueDamerauLevenshtein) Calculate(s, t string) int

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL