closestmatch

package module
v2.1.0+incompatible Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 24, 2017 License: MIT Imports: 5 Imported by: 70

README ¶

closestmatch 📃

Version Build Status Code Coverage GoDoc

closestmatch is a simple and fast Go library for fuzzy matching an input string to a list of target strings. closestmatch is useful for handling input from a user where the input (which could be mispelled or out of order) needs to match a key in a database. closestmatch uses a bag-of-words approach to precompute character n-grams to represent each possible target string. The closest matches have highest overlap between the sets of n-grams. The precomputation scales well and is much faster and more accurate than Levenshtein for long strings.

Getting Started

Install

go get -u -v github.com/schollz/closestmatch

Use

Create a closestmatch object from a list words
// Take a slice of keys, say band names that are similar
// http://www.tonedeaf.com.au/412720/38-bands-annoyingly-similar-names.htm
wordsToTest := []string{"King Gizzard", "The Lizard Wizard", "Lizzard Wizzard"}

// Choose a set of bag sizes, more is more accurate but slower
bagSizes := []int{2}

// Create a closestmatch object
cm := closestmatch.New(wordsToTest, bagSizes)
Find the closest match, or find the N closest matches
fmt.Println(cm.Closest("kind gizard"))
// returns 'King Gizzard'

fmt.Println(cm.ClosestN("kind gizard",3))
// returns [King Gizzard Lizzard Wizzard The Lizard Wizard]
Calculate the accuracy
// Calculate accuracy
fmt.Println(cm.AccuracyMutatingWords())
// ~ 66 % (still way better than Levenshtein which hits 0% with this particular set)

// Improve accuracy by adding more bags
bagSizes = []int{2, 3, 4}
cm = closestmatch.New(wordsToTest, bagSizes)
fmt.Println(cm.AccuracyMutatingWords())
// accuracy improves to ~ 76 %
Save/Load
// Save your current calculated bags
cm.Save("closestmatches.gob")

// Open it again
cm2, _ := closestmatch.Load("closestmatches.gob")
fmt.Println(cm2.Closest("lizard wizard"))
// prints "The Lizard Wizard"
Advantages

closestmatch is more accurate than Levenshtein for long strings (like in the test corpus).

closestmatch is ~20x faster than a fast implementation of Levenshtein. Try it yourself with the benchmarks:

cd $GOPATH/src/github.com/schollz/closestmatch && go test -run=None -bench=. > closestmatch.bench
cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test -run=None -bench=. > levenshtein.bench
benchcmp levenshtein.bench ../closestmatch.bench

which gives the following benchmark (on Intel i7-3770 CPU @ 3.40GHz w/ 8 processors):

benchmark                 old ns/op     new ns/op     delta
BenchmarkNew-8            1.47          1933870       +131555682.31%
BenchmarkClosestOne-8     104603530     4855916       -95.36%

The New() function in closestmatch is so slower than levenshtein because there is precomputation needed.

Disadvantages

closestmatch does worse for matching lists of single words, like a dictionary. For comparison:

$ cd $GOPATH/src/github.com/schollz/closestmatch && go test
Accuracy with mutating words in book list:      90.0%
Accuracy with mutating letters in book list:    100.0%
Accuracy with mutating letters in dictionary:   38.9%

while levenshtein performs slightly better for a single-word dictionary (but worse for longer names, like book titles):

$ cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test
Accuracy with mutating words in book list:      40.0%
Accuracy with mutating letters in book list:    100.0%
Accuracy with mutating letters in dictionary:   64.8%

License

MIT

Documentation ¶

Index ¶

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type ClosestMatch ¶

type ClosestMatch struct {
	SubstringSizes []int
	SubstringToID  map[string]map[uint32]struct{}
	ID             map[uint32]IDInfo
}

ClosestMatch is the structure that contains the substring sizes and carrys a map of the substrings for easy lookup

func Load ¶

func Load(filename string) (*ClosestMatch, error)

Load can load a previously saved ClosestMatch object from disk

func New ¶

func New(possible []string, subsetSize []int) *ClosestMatch

New returns a new structure for performing closest matches

func (*ClosestMatch) AccuracyMutatingLetters ¶

func (cm *ClosestMatch) AccuracyMutatingLetters() float64

AccuracyMutatingLetters runs some basic tests against the wordlist to see how accurate this bag-of-characters method is against the target dataset when mutating individual letters (adding, removing, changing)

func (*ClosestMatch) AccuracyMutatingWords ¶

func (cm *ClosestMatch) AccuracyMutatingWords() float64

AccuracyMutatingWords runs some basic tests against the wordlist to see how accurate this bag-of-characters method is against the target dataset

func (*ClosestMatch) Closest ¶

func (cm *ClosestMatch) Closest(searchWord string) string

Closest searches for the `searchWord` and returns the closest match

func (*ClosestMatch) ClosestN ¶

func (cm *ClosestMatch) ClosestN(searchWord string, n int) []string

ClosestN searches for the `searchWord` and returns the n closests matches

func (*ClosestMatch) Save ¶

func (cm *ClosestMatch) Save(filename string) error

Save writes the current ClosestSave object as a gzipped JSON file

type IDInfo ¶

type IDInfo struct {
	Key           string
	NumSubstrings int
}

IDInfo carries the information about the keys

type Pair ¶

type Pair struct {
	Key   string
	Value int
}

type PairList ¶

type PairList []Pair

func (PairList) Len ¶

func (p PairList) Len() int

func (PairList) Less ¶

func (p PairList) Less(i, j int) bool

func (PairList) Swap ¶

func (p PairList) Swap(i, j int)

Directories ¶

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL