ferret

package module

v0.0.0-...-14de0b6 Latest Latest Go to latest Published: Feb 19, 2019 License: Apache-2.0 Imports: 3 Imported by: 7

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/argusdusty/Ferret

Links

Open Source Insights

README ¶

Ferret

An optimized substring search engine written in Go.

Ferret makes use of a combination of an Inverted Index and a Suffix Array to allow log-time lookups with a relatively small memory footprint. Also incorporates error-correction (Levenshtein distance 1) and simple Unicode-to-ASCII conversion. Allows for arbitrary sorting functions Allows you to map arbitrary data to your results, and quickly update this data.

Author: Mark Canning
Developed at/for: Tamber - http://www.tamber.com/

Installing

Install: go get github.com/argusdusty/Ferret
Update: go get -u github.com/argusdusty/Ferret
User: import "github.com/argusdusty/Ferret"

Performance

Uses linear memory (~10-18 bytes per character) Searches performed in log time with the number of characters in the dictionary. Sorted searches can be slow, taking ~linear time with the number of matches, rather than linear time with the results limit. Initialization takes linearithmic (ln(n)*n) time (being a sorting algorithm)

The code is meant to be as fast as possible for a substring dictionary search, and as such is best suited for medium-large dictionaries with ~1-100 million total characters. I've timed 10s initialization for 3.5 million characters on a modern CPU, and 10us search time (4000us with error-correction), so this system is capable of ~100,000 queries per second on a single processor - feel free to try the benchmarks in dictionaryexample.go.

Sample usage

Initializing the search engine:

// Allows for exact (case-sensitive) substring searches over a list of songs 
// mapping their respective artists, allowing sorting by the song popularity
SearchEngine := ferret.New(Songs, Artists, SongPopularities, func(s string) []byte { return []byte(s) })

// Allows for lowercase-ASCII substring searches over a list of songs
// mapping their respective artists, allowing sorting by the song popularity
SearchEngine := ferret.New(Songs, Artists, SongPopularities, ferret.UnicodeToLowerASCII)

// Allows for lowercase-ASCII substring searches over a list of artists,
// allowing sorting by the artist popularity
SearchEngine := ferret.New(Artists, Artists, ArtistPopularities, ferret.UnicodeToLowerASCII)

Inserting a new element into the search engine:

// Add a song to an existing SearchEngine, written by Artist,
// and with popularity SongPopularity
SearchEngine.Insert(Song, Artist, SongPopularity)

Performing simple unsorted substring search:

// For songs - returns a list of up to 25 artists of the matching songs,
// and the song popularities
SearchEngine.Query(SongQuery, 25)

Performing a sorted substring search:

// For songs - returns a list of up to 25 artists of the matching songs,
// and the song popularities, sorted by the song popularities
// assuming the song popularities are float64s
SearchEngine.SortedQuery(SongQuery, 25, func(s string, v interface{}, l int, i int) float64 { return v.(float64) })

More examples

Check out example/example.go and example/dictionaryexample.go for more example usage.

Documentation ¶

Overview ¶

Package ferret implements a fast in-memory substring search engine

Index ¶

Variables
func ErrorCorrect(Word []byte, AllowedBytes []byte) [][]byte
func ToASCII(r rune) rune
func UnicodeToLowerASCII(s string) []byte
type InvertedSuffix
- func New(Words, Results []string, Data []interface{}, Converter func(string) []byte) *InvertedSuffix

Constants ¶

This section is empty.

Variables ¶

View Source

var AllASCII = []byte{}/* 129 elements not displayed */

AllASCII is all ASCII bytes (0-127)

View Source

var LowercaseASCII = []byte{
	32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
	49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
	91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,
	106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
	119, 120, 121, 122, 123, 124, 125, 126,
}

LowercaseASCII is all printable ASCII bytes excluding capitalized letters (A-Z)

View Source

var LowercaseLetters = []byte{
	97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
	111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
}

LowercaseLetters is All lowercase ASCII bytes (a-z/97-122)

View Source

var PrintableASCII = []byte{
	32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
	49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
	66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
	83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
	100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
	114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
}

PrintableASCII is all printable ASCII bytes

View Source

var UnicodeToASCII = map[rune]rune{
	'À': 'A', 'Á': 'A', 'Â': 'A', 'Ã': 'A', 'Ä': 'A', 'Å': 'A',
	'Ç': 'C',
	'È': 'E', 'É': 'E', 'Ê': 'E', 'Ë': 'E',
	'Ì': 'I', 'Í': 'I', 'Î': 'I', 'Ï': 'I',
	'Ñ': 'N',
	'Ò': 'O', 'Ó': 'O', 'Ô': 'O', 'Õ': 'O', 'Ö': 'O', 'Ø': 'O',
	'Ù': 'U', 'Ú': 'U', 'Û': 'U', 'Ü': 'U',
	'Ý': 'Y',
	'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'å': 'a',
	'ç': 'c',
	'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
	'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
	'ð': 'o',
	'ñ': 'n',
	'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'ø': 'o',
	'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
	'ý': 'y', 'ÿ': 'y',
}

UnicodeToASCII maps the unicode Latin-1 supplement to ASCII characters without accents

Functions ¶

func ErrorCorrect ¶

func ErrorCorrect(Word []byte, AllowedBytes []byte) [][]byte

ErrorCorrect returns all byte-arrays which are Levenshtein distance of 1 away from Word within an allowed array of byte characters.

func ToASCII ¶

func ToASCII(r rune) rune

ToASCII converts a single unicode rune to the ASCII equivalent using the UnicodeToASCII table

func UnicodeToLowerASCII ¶

func UnicodeToLowerASCII(s string) []byte

UnicodeToLowerASCII converts a unicode string to ASCII bytes using the UnicodeToASCII table

Types ¶

type InvertedSuffix ¶

type InvertedSuffix struct {
	WordIndex   []int               // WordIndex and SuffixIndex are sorted by Words[WordIndex[i]][SuffixIndex[i]:]
	SuffixIndex []int               // WordIndex and SuffixIndex are sorted by Words[WordIndex[i]][SuffixIndex[i]:]
	Words       [][]byte            // Words is the list of words (in []byte form) to perform substring searches over
	Results     []string            // Results is the string value of the words. Used as a return value
	Values      []interface{}       // Values is some data mapped to the words. Can be used for sorting, or as a return value
	Converter   func(string) []byte // Converter converts an inserted word/query to a byte array to search for/with
}

InvertedSuffix implements the data structure for substring searches

func New ¶

func New(Words, Results []string, Data []interface{}, Converter func(string) []byte) *InvertedSuffix

New creates an inverted suffix from a dictionary of byte arrays, mapping data, and a string->[]byte converter

func (*InvertedSuffix) ErrorCorrectingQuery ¶

func (IS *InvertedSuffix) ErrorCorrectingQuery(Word string, ResultsLimit int, ErrorCorrection func([]byte) [][]byte) ([]string, []interface{})

ErrorCorrectingQuery returns the strings which contain the query Unsorted, I think it's partially sorted alphabetically Will search for all substrings defined by ErrorCorrection if no results are found on the initial query Input:

Query: The substring to search for.
ResultsLimit: Limit the results so you don't return your whole dictionary by accident. Set to -1 for no limit
ErrorCorrection: Returns a list of alternate queries

func (*InvertedSuffix) Insert ¶

func (IS *InvertedSuffix) Insert(Word, Result string, Data interface{})

Insert adds a word to the dictionary that IS was built on. This is pretty slow, because of linear-time insertion into an array, so stick to New when you can

func (*InvertedSuffix) Query ¶

func (IS *InvertedSuffix) Query(Word string, ResultsLimit int) ([]string, []interface{})

Query returns the strings which contain the query, and their stored values unsorted Input:

Word: The substring to search for.
ResultsLimit: Limit the results to some number of values. Set to -1 for no limit

func (*InvertedSuffix) Search ¶

func (IS *InvertedSuffix) Search(Query []byte) (int, int)

Search performs an exact substring search for the query in the word dictionary Returns the boundaries (low/high) of sorted suffixes which have the query as a prefix This is a low-level interface. I wouldn't recommend using this yourself

func (*InvertedSuffix) SortedErrorCorrectingQuery ¶

func (IS *InvertedSuffix) SortedErrorCorrectingQuery(Word string, ResultsLimit int, ErrorCorrection func([]byte) [][]byte, Sorter func(string, interface{}, int, int) float64) ([]string, []interface{}, []float64)

SortedErrorCorrectingQuery returns the strings which contain the query Sorted. The function sorter produces a value to sort by (largest first) Will search for all substrings defined by ErrorCorrection if no results are found on the initial query Input:

Query: The substring to search for.
ResultsLimit: Limit the results so you don't return your whole dictionary by accident. Set to -1 for no limit
ErrorCorrection: Returns a list of alternate queries
Sorter: Takes (Result, Value, Length, Index (where Query begins in Result))
    (string, []byte, int, int), and produces a value (float64) to sort by (largest first).

func (*InvertedSuffix) SortedQuery ¶

func (IS *InvertedSuffix) SortedQuery(Word string, ResultsLimit int, Sorter func(string, interface{}, int, int) float64) ([]string, []interface{}, []float64)

SortedQuery returns the strings which contain the query sorted The function sorter produces a value to sort by (largest first) Input:

Word: The substring to search for.
ResultsLimit: Limit the results to some number of values. Set to -1 for no limit
Sorter: Takes (Result, Value, Length, Index (where Query begins in Result)) (string, []byte, int, int)
    and produces a value (float64) to sort by (largest first).

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
example

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL