cleo

package module

v0.0.0-...-1ae44fe Latest Latest Go to latest Published: Apr 19, 2021 License: Apache-2.0 Imports: 9 Imported by: 1

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/jamra/gocleo

Links

Open Source Insights

README ¶

gocleo

A golang implementation of the Cleo search.

The Cleo search is explained here: Linked in original article

The source for Jingwei Wu's version can be found here: Jingwei's version

Basically, this is a golang version of the original program. The original program is written in Java. I have included a corpus of words to search for. I downloaded this corpus from http://www.wordfrequency.info/

Algorithm overview

The algorithm starts out by searching for matches in the inverted index. The inverted index contains a map of the word's prefix (up to 4 chars). Each word prefix maps to an array of document ID, bloom filter tuples.
The bloom filter of each candidate is compared against the query's bloom filter. If it matches successfully, the candidate makes it to the next round.
The remaining words are scored by their levenshtein distance to the query, then normalized using the Jaccard coefficient.
The final words are returned as JSON
You can also change how scoring works if you like. You just need to provide a function that conforms to func(s1, s2 string) (score float64)

Instructions

This is a sample app:

package main
import "github.com/jamra/gocleo"

func main(){
  cleo.InitAndRun("w1_fixed.txt", "8080", nil) //The last parameter is optional. Defaults to Levenshtein distance normalized by Jaccard coefficient
}

Run the program and navigate to localhost:8080/cleo/{query}

{query} is your search. e.g.("tractor", "nightingale", "pizza")

Your own corpus

You can have the search run off of your own corpus so long as each term is separated by a new line. w1_fixed.txt is provided as an example.

Setup

This should work with go get

go get github.com/jamra/gocleo

TODO

Give the user the ability to add and remove words from the index.
More robust Unit testing

Documentation ¶

Overview ¶

Package cleo provides a fast search algorithm for prefix matching large amounts of text.

Index ¶

Constants
func BuildIndexes(corpusPath string, scoringFunction fn_score)
func InitIndex(iIndex *InvertedIndex, fIndex *ForwardIndex, corpusPath string)
func LevenshteinDistance(s, t string) int
func Max(a ...int) int
func Min(a ...int) int
func Score(query, candidate string) float64
func TestBytesFromQuery(bf int, qBloom int) bool
type ByScore
- func (s ByScore) Less(i, j int) bool
type Document
type ForwardIndex
- func NewForwardIndex() *ForwardIndex
- func (x *ForwardIndex) AddDoc(docId int, doc string)
type InvertedIndex
- func NewInvertedIndex() *InvertedIndex
- func (x *InvertedIndex) AddDoc(docId int, doc string, bloom int)
- func (x *InvertedIndex) Search(query string) []Document
- func (x *InvertedIndex) Size() int
type RankedResult
- func CleoSearch(iIndex *InvertedIndex, fIndex *ForwardIndex, query string) []RankedResult
type RankedResults
- func (s RankedResults) Len() int
- func (s RankedResults) Swap(i, j int)

Constants ¶

View Source

const (
	FNV_BASIS_64 = uint64(14695981039346656037)
	FNV_PRIME_64 = uint64((1 << 40) + 435)
	FNV_MASK_64  = uint64(^uint64(0) >> 1)
	NUM_BITS     = 64

	FNV_BASIS_32 = uint32(0x811c9dc5)
	FNV_PRIME_32 = uint32((1 << 24) + 403)
	FNV_MASK_32  = uint32(^uint32(0) >> 1)
)

Used for the bloom filter

Variables ¶

This section is empty.

Functions ¶

func BuildIndexes ¶

func BuildIndexes(corpusPath string, scoringFunction fn_score)

func InitIndex ¶

func InitIndex(iIndex *InvertedIndex, fIndex *ForwardIndex, corpusPath string)

func LevenshteinDistance ¶

func LevenshteinDistance(s, t string) int

Levenshtein distance is the number of inserts, deletions, and substitutions that differentiate one word from another. This algorithm is dynamic programming found at http://en.wikipedia.org/wiki/Levenshtein_distance

func Max ¶

func Max(a ...int) int

func Min ¶

func Min(a ...int) int

func Score ¶

func Score(query, candidate string) float64

func TestBytesFromQuery ¶

func TestBytesFromQuery(bf int, qBloom int) bool

Iterates through all of the 8 bytes (64 bits) and tests each bit that is set to 1 in the query's filter against the bit in the comparison's filter. If the bit is not also 1, you do not have a match.

Types ¶

type ByScore ¶

type ByScore struct{ RankedResults }

func (ByScore) Less ¶

func (s ByScore) Less(i, j int) bool

type Document ¶

type Document struct {
	// contains filtered or unexported fields
}

type ForwardIndex ¶

type ForwardIndex map[int]string

Forward Index - Maps the document id to the document

func NewForwardIndex ¶

func NewForwardIndex() *ForwardIndex

func (*ForwardIndex) AddDoc ¶

func (x *ForwardIndex) AddDoc(docId int, doc string)

type InvertedIndex ¶

type InvertedIndex map[string][]Document

Inverted Index - Maps the query prefix to the matching documents

func NewInvertedIndex ¶

func NewInvertedIndex() *InvertedIndex

func (*InvertedIndex) AddDoc ¶

func (x *InvertedIndex) AddDoc(docId int, doc string, bloom int)

func (*InvertedIndex) Search ¶

func (x *InvertedIndex) Search(query string) []Document

func (*InvertedIndex) Size ¶

func (x *InvertedIndex) Size() int

type RankedResult ¶

type RankedResult struct {
	Word  string
	Score float64
}

func CleoSearch ¶

func CleoSearch(iIndex *InvertedIndex, fIndex *ForwardIndex, query string) []RankedResult

This is the meat of the search. It first checks the inverted index for matches, then filters the potentially numerous results using the bloom filter. Finally, it ranks the word using a Levenshtein distance.

type RankedResults ¶

type RankedResults []RankedResult

func (RankedResults) Len ¶

func (s RankedResults) Len() int

func (RankedResults) Swap ¶

func (s RankedResults) Swap(i, j int)

Source Files ¶

View all Source files

cleo.go

Directories ¶

Path	Synopsis
examples

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL