cleo

package module
v0.0.0-...-1ae44fe Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 19, 2021 License: Apache-2.0 Imports: 9 Imported by: 1

README

gocleo

The Cleo search is explained here: Linked in original article

The source for Jingwei Wu's version can be found here: Jingwei's version

Basically, this is a golang version of the original program. The original program is written in Java. I have included a corpus of words to search for. I downloaded this corpus from http://www.wordfrequency.info/

Algorithm overview
  • The algorithm starts out by searching for matches in the inverted index. The inverted index contains a map of the word's prefix (up to 4 chars). Each word prefix maps to an array of document ID, bloom filter tuples.
  • The bloom filter of each candidate is compared against the query's bloom filter. If it matches successfully, the candidate makes it to the next round.
  • The remaining words are scored by their levenshtein distance to the query, then normalized using the Jaccard coefficient.
  • The final words are returned as JSON
  • You can also change how scoring works if you like. You just need to provide a function that conforms to func(s1, s2 string) (score float64)
Instructions

This is a sample app:

package main
import "github.com/jamra/gocleo"

func main(){
  cleo.InitAndRun("w1_fixed.txt", "8080", nil) //The last parameter is optional. Defaults to Levenshtein distance normalized by Jaccard coefficient
}

Run the program and navigate to localhost:8080/cleo/{query}

{query} is your search. e.g.("tractor", "nightingale", "pizza")

Your own corpus

You can have the search run off of your own corpus so long as each term is separated by a new line. w1_fixed.txt is provided as an example.

Setup

This should work with go get

go get github.com/jamra/gocleo
TODO
  • Give the user the ability to add and remove words from the index.
  • More robust Unit testing

Documentation

Overview

Package cleo provides a fast search algorithm for prefix matching large amounts of text.

Index

Constants

View Source
const (
	FNV_BASIS_64 = uint64(14695981039346656037)
	FNV_PRIME_64 = uint64((1 << 40) + 435)
	FNV_MASK_64  = uint64(^uint64(0) >> 1)
	NUM_BITS     = 64

	FNV_BASIS_32 = uint32(0x811c9dc5)
	FNV_PRIME_32 = uint32((1 << 24) + 403)
	FNV_MASK_32  = uint32(^uint32(0) >> 1)
)

Used for the bloom filter

Variables

This section is empty.

Functions

func BuildIndexes

func BuildIndexes(corpusPath string, scoringFunction fn_score)

func InitIndex

func InitIndex(iIndex *InvertedIndex, fIndex *ForwardIndex, corpusPath string)

func LevenshteinDistance

func LevenshteinDistance(s, t string) int

Levenshtein distance is the number of inserts, deletions, and substitutions that differentiate one word from another. This algorithm is dynamic programming found at http://en.wikipedia.org/wiki/Levenshtein_distance

func Max

func Max(a ...int) int

func Min

func Min(a ...int) int

func Score

func Score(query, candidate string) float64

func TestBytesFromQuery

func TestBytesFromQuery(bf int, qBloom int) bool

Iterates through all of the 8 bytes (64 bits) and tests each bit that is set to 1 in the query's filter against the bit in the comparison's filter. If the bit is not also 1, you do not have a match.

Types

type ByScore

type ByScore struct{ RankedResults }

func (ByScore) Less

func (s ByScore) Less(i, j int) bool

type Document

type Document struct {
	// contains filtered or unexported fields
}

type ForwardIndex

type ForwardIndex map[int]string

Forward Index - Maps the document id to the document

func NewForwardIndex

func NewForwardIndex() *ForwardIndex

func (*ForwardIndex) AddDoc

func (x *ForwardIndex) AddDoc(docId int, doc string)

type InvertedIndex

type InvertedIndex map[string][]Document

Inverted Index - Maps the query prefix to the matching documents

func NewInvertedIndex

func NewInvertedIndex() *InvertedIndex

func (*InvertedIndex) AddDoc

func (x *InvertedIndex) AddDoc(docId int, doc string, bloom int)

func (*InvertedIndex) Search

func (x *InvertedIndex) Search(query string) []Document

func (*InvertedIndex) Size

func (x *InvertedIndex) Size() int

type RankedResult

type RankedResult struct {
	Word  string
	Score float64
}

func CleoSearch

func CleoSearch(iIndex *InvertedIndex, fIndex *ForwardIndex, query string) []RankedResult

This is the meat of the search. It first checks the inverted index for matches, then filters the potentially numerous results using the bloom filter. Finally, it ranks the word using a Levenshtein distance.

type RankedResults

type RankedResults []RankedResult

func (RankedResults) Len

func (s RankedResults) Len() int

func (RankedResults) Swap

func (s RankedResults) Swap(i, j int)

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL