ferret

package module
v0.0.0-...-14de0b6 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 19, 2019 License: Apache-2.0 Imports: 3 Imported by: 7

README

Ferret

An optimized substring search engine written in Go.

Ferret makes use of a combination of an Inverted Index and a Suffix Array to allow log-time lookups with a relatively small memory footprint. Also incorporates error-correction (Levenshtein distance 1) and simple Unicode-to-ASCII conversion. Allows for arbitrary sorting functions Allows you to map arbitrary data to your results, and quickly update this data.

Author: Mark Canning
Developed at/for: Tamber - http://www.tamber.com/

Installing

Install: go get github.com/argusdusty/Ferret
Update: go get -u github.com/argusdusty/Ferret
User: import "github.com/argusdusty/Ferret"

Performance

Uses linear memory (~10-18 bytes per character) Searches performed in log time with the number of characters in the dictionary. Sorted searches can be slow, taking ~linear time with the number of matches, rather than linear time with the results limit. Initialization takes linearithmic (ln(n)*n) time (being a sorting algorithm)

The code is meant to be as fast as possible for a substring dictionary search, and as such is best suited for medium-large dictionaries with ~1-100 million total characters. I've timed 10s initialization for 3.5 million characters on a modern CPU, and 10us search time (4000us with error-correction), so this system is capable of ~100,000 queries per second on a single processor - feel free to try the benchmarks in dictionaryexample.go.

Sample usage

Initializing the search engine:
// Allows for exact (case-sensitive) substring searches over a list of songs 
// mapping their respective artists, allowing sorting by the song popularity
SearchEngine := ferret.New(Songs, Artists, SongPopularities, func(s string) []byte { return []byte(s) })

// Allows for lowercase-ASCII substring searches over a list of songs
// mapping their respective artists, allowing sorting by the song popularity
SearchEngine := ferret.New(Songs, Artists, SongPopularities, ferret.UnicodeToLowerASCII)

// Allows for lowercase-ASCII substring searches over a list of artists,
// allowing sorting by the artist popularity
SearchEngine := ferret.New(Artists, Artists, ArtistPopularities, ferret.UnicodeToLowerASCII)
Inserting a new element into the search engine:
// Add a song to an existing SearchEngine, written by Artist,
// and with popularity SongPopularity
SearchEngine.Insert(Song, Artist, SongPopularity)
// For songs - returns a list of up to 25 artists of the matching songs,
// and the song popularities
SearchEngine.Query(SongQuery, 25)
// For songs - returns a list of up to 25 artists of the matching songs,
// and the song popularities, sorted by the song popularities
// assuming the song popularities are float64s
SearchEngine.SortedQuery(SongQuery, 25, func(s string, v interface{}, l int, i int) float64 { return v.(float64) })
More examples

Check out example/example.go and example/dictionaryexample.go for more example usage.

Documentation

Overview

Package ferret implements a fast in-memory substring search engine

Index

Constants

This section is empty.

Variables

View Source
var AllASCII = []byte{}/* 129 elements not displayed */

AllASCII is all ASCII bytes (0-127)

View Source
var LowercaseASCII = []byte{
	32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
	49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
	91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,
	106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
	119, 120, 121, 122, 123, 124, 125, 126,
}

LowercaseASCII is all printable ASCII bytes excluding capitalized letters (A-Z)

View Source
var LowercaseLetters = []byte{
	97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
	111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
}

LowercaseLetters is All lowercase ASCII bytes (a-z/97-122)

View Source
var PrintableASCII = []byte{
	32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
	49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
	66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
	83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
	100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
	114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
}

PrintableASCII is all printable ASCII bytes

View Source
var UnicodeToASCII = map[rune]rune{
	'À': 'A', 'Á': 'A', 'Â': 'A', 'Ã': 'A', 'Ä': 'A', 'Å': 'A',
	'Ç': 'C',
	'È': 'E', 'É': 'E', 'Ê': 'E', 'Ë': 'E',
	'Ì': 'I', 'Í': 'I', 'Î': 'I', 'Ï': 'I',
	'Ñ': 'N',
	'Ò': 'O', 'Ó': 'O', 'Ô': 'O', 'Õ': 'O', 'Ö': 'O', 'Ø': 'O',
	'Ù': 'U', 'Ú': 'U', 'Û': 'U', 'Ü': 'U',
	'Ý': 'Y',
	'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'å': 'a',
	'ç': 'c',
	'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
	'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
	'ð': 'o',
	'ñ': 'n',
	'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'ø': 'o',
	'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
	'ý': 'y', 'ÿ': 'y',
}

UnicodeToASCII maps the unicode Latin-1 supplement to ASCII characters without accents

Functions

func ErrorCorrect

func ErrorCorrect(Word []byte, AllowedBytes []byte) [][]byte

ErrorCorrect returns all byte-arrays which are Levenshtein distance of 1 away from Word within an allowed array of byte characters.

func ToASCII

func ToASCII(r rune) rune

ToASCII converts a single unicode rune to the ASCII equivalent using the UnicodeToASCII table

func UnicodeToLowerASCII

func UnicodeToLowerASCII(s string) []byte

UnicodeToLowerASCII converts a unicode string to ASCII bytes using the UnicodeToASCII table

Types

type InvertedSuffix

type InvertedSuffix struct {
	WordIndex   []int               // WordIndex and SuffixIndex are sorted by Words[WordIndex[i]][SuffixIndex[i]:]
	SuffixIndex []int               // WordIndex and SuffixIndex are sorted by Words[WordIndex[i]][SuffixIndex[i]:]
	Words       [][]byte            // Words is the list of words (in []byte form) to perform substring searches over
	Results     []string            // Results is the string value of the words. Used as a return value
	Values      []interface{}       // Values is some data mapped to the words. Can be used for sorting, or as a return value
	Converter   func(string) []byte // Converter converts an inserted word/query to a byte array to search for/with
}

InvertedSuffix implements the data structure for substring searches

func New

func New(Words, Results []string, Data []interface{}, Converter func(string) []byte) *InvertedSuffix

New creates an inverted suffix from a dictionary of byte arrays, mapping data, and a string->[]byte converter

func (*InvertedSuffix) ErrorCorrectingQuery

func (IS *InvertedSuffix) ErrorCorrectingQuery(Word string, ResultsLimit int, ErrorCorrection func([]byte) [][]byte) ([]string, []interface{})

ErrorCorrectingQuery returns the strings which contain the query Unsorted, I think it's partially sorted alphabetically Will search for all substrings defined by ErrorCorrection if no results are found on the initial query Input:

Query: The substring to search for.
ResultsLimit: Limit the results so you don't return your whole dictionary by accident. Set to -1 for no limit
ErrorCorrection: Returns a list of alternate queries

func (*InvertedSuffix) Insert

func (IS *InvertedSuffix) Insert(Word, Result string, Data interface{})

Insert adds a word to the dictionary that IS was built on. This is pretty slow, because of linear-time insertion into an array, so stick to New when you can

func (*InvertedSuffix) Query

func (IS *InvertedSuffix) Query(Word string, ResultsLimit int) ([]string, []interface{})

Query returns the strings which contain the query, and their stored values unsorted Input:

Word: The substring to search for.
ResultsLimit: Limit the results to some number of values. Set to -1 for no limit

func (*InvertedSuffix) Search

func (IS *InvertedSuffix) Search(Query []byte) (int, int)

Search performs an exact substring search for the query in the word dictionary Returns the boundaries (low/high) of sorted suffixes which have the query as a prefix This is a low-level interface. I wouldn't recommend using this yourself

func (*InvertedSuffix) SortedErrorCorrectingQuery

func (IS *InvertedSuffix) SortedErrorCorrectingQuery(Word string, ResultsLimit int, ErrorCorrection func([]byte) [][]byte, Sorter func(string, interface{}, int, int) float64) ([]string, []interface{}, []float64)

SortedErrorCorrectingQuery returns the strings which contain the query Sorted. The function sorter produces a value to sort by (largest first) Will search for all substrings defined by ErrorCorrection if no results are found on the initial query Input:

Query: The substring to search for.
ResultsLimit: Limit the results so you don't return your whole dictionary by accident. Set to -1 for no limit
ErrorCorrection: Returns a list of alternate queries
Sorter: Takes (Result, Value, Length, Index (where Query begins in Result))
    (string, []byte, int, int), and produces a value (float64) to sort by (largest first).

func (*InvertedSuffix) SortedQuery

func (IS *InvertedSuffix) SortedQuery(Word string, ResultsLimit int, Sorter func(string, interface{}, int, int) float64) ([]string, []interface{}, []float64)

SortedQuery returns the strings which contain the query sorted The function sorter produces a value to sort by (largest first) Input:

Word: The substring to search for.
ResultsLimit: Limit the results to some number of values. Set to -1 for no limit
Sorter: Takes (Result, Value, Length, Index (where Query begins in Result)) (string, []byte, int, int)
    and produces a value (float64) to sort by (largest first).

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL