fulltext

Version: v0.0.0-...-05ab8b1
Published: Jul 21, 2015 License: MIT Imports: 11 Imported by: 4

README

Overview

This is a simple, pure-Go, full text indexing and search library.

I made it for use on small to medium websites, although there is nothing web-specific about its API or operation.

Cdb (http://github.com/jbarham/go-cdb) performs the indexing and lookups, and github.com/spf13/afero provides an in-memory filesystem for the search indexes.

Status

This project is experimental. Breaking changes may well occur.

Notes on Building

fulltext requires jbarham/go-cdb and spf13/afero:

go get github.com/jbarham/go-cdb
go get github.com/spf13/afero

Usage

First, create an index:

import "github.com/bradleypeabody/fulltext"

// create a new indexer (NewIndexer takes no arguments)
idx, err := fulltext.NewIndexer()
if err != nil {
	panic(err)
}
defer idx.Close()

// for each document you want to add, you do something like this:
doc := fulltext.IndexDoc{
	Id: []byte(uuid), // unique identifier (the path to a webpage works...)
	StoreValue: []byte(title), // bytes you want to be able to retrieve from search results
	IndexValue: []byte(data), // bytes you want to be split into words and indexed
}

if err := idx.AddDoc(doc); err != nil { // add it
	panic(err)
}

// when done, write out the final index; f is any io.Writer, such as an open file
if err := idx.FinalizeAndWrite(f); err != nil {
	panic(err)
}

Once you have an index file, you can search it like this:

// create an in-memory index file; the output of FinalizeAndWrite
// must be written to it before it is searched
var indexFs afero.Fs = &afero.MemMapFs{}
indexFile, err := indexFs.Create("idxout")
if err != nil {
	panic(err)
}

s, err := fulltext.NewSearcher(indexFile)
if err != nil {
	panic(err)
}

defer s.Close()

sr, err := s.SimpleSearch("Horatio", 20)
if err != nil {
	panic(err)
}

for k, v := range sr.Items {
	fmt.Printf("----------- #:%d\n", k)
	fmt.Printf("Id: %s\n", v.Id)
	fmt.Printf("Score: %d\n", v.Score)
	fmt.Printf("StoreValue: %s\n", v.StoreValue)
}

It's rather simplistic. But it's fast and it works.

TODOs

  • Will likely need some sort of "stop word" functionality.

  • Wordize(), IndexizeWord() and the scoring aggregation logic should be extracted to callback functions with the existing functionality as default.

  • If there is some decent b-tree disk storage that is portable then it would be worth looking at using that instead of CDB and implementing LIKE-style matching. As it is, CDB is quite efficient, but it is a hash index.
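
As a rough illustration of the first two items, stop words could be handled by wrapping a word-splitting callback. This is a minimal sketch, not part of the library: the local WordSplitter type only mirrors the signature of fulltext.WordSplitter, and simpleSplit, withStopWords, and the stop list are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// WordSplitter mirrors the signature of fulltext.WordSplitter so that this
// sketch compiles on its own.
type WordSplitter func(string) []string

// simpleSplit is a hypothetical stand-in for a default splitter: it
// lowercases the text and splits it on whitespace.
func simpleSplit(t string) []string {
	return strings.Fields(strings.ToLower(t))
}

// withStopWords wraps a splitter and drops any word found in the stop list.
func withStopWords(split WordSplitter, stop []string) WordSplitter {
	skip := make(map[string]struct{}, len(stop))
	for _, w := range stop {
		skip[strings.ToLower(w)] = struct{}{}
	}
	return func(t string) []string {
		words := split(t)
		kept := make([]string, 0, len(words))
		for _, w := range words {
			if _, found := skip[strings.ToLower(w)]; !found {
				kept = append(kept, w)
			}
		}
		return kept
	}
}

func main() {
	split := withStopWords(simpleSplit, []string{"the", "of", "and"})
	fmt.Println(split("The Tragedy of Hamlet and Horatio")) // [tragedy hamlet horatio]
}
```

Since Indexer exposes a WordSplit field of this function type, a wrapper like this could presumably be assigned there before calling AddDoc. Whether search queries are split the same way is not covered by this sketch.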

Implementation Notes

I originally tried doing this on top of Sqlite. It was dreadfully slow. Cdb is orders of magnitude faster.

The two main disadvantages of going the Cdb route are that the index cannot be edited once it is built (you have to recreate it in full), and that, being hash-based, it will not support any sort of fuzzy matching unless those variations are included in the index (which they are not in the current implementation). For my purposes these two disadvantages are overshadowed by the fact that it's blindingly fast, easy to use, and portable (pure Go), and that its interface allowed me to build the indexes I needed into a single file.
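
The exact-match limitation can be seen with any hash table. The sketch below uses a plain Go map as a stand-in for Cdb (it makes no Cdb calls and does not reflect the library's storage format); the index type and the add and addWithPrefixes helpers are hypothetical. It shows that a partial word misses unless its variations were stored at indexing time.

```go
package main

import "fmt"

// index maps a word to the IDs of documents containing it. A Go map stands
// in for the Cdb hash table; this is not the library's storage format.
type index map[string][]string

// add stores docID under the exact word only.
func (ix index) add(word, docID string) {
	ix[word] = append(ix[word], docID)
}

// addWithPrefixes also stores docID under every prefix of word that is at
// least minLen runes long, trading index size for prefix matching.
func (ix index) addWithPrefixes(word, docID string, minLen int) {
	r := []rune(word)
	if len(r) < minLen {
		ix.add(word, docID)
		return
	}
	for n := minLen; n <= len(r); n++ {
		ix.add(string(r[:n]), docID)
	}
}

func main() {
	exact := index{}
	exact.add("horatio", "hamlet.html")
	// The exact key hits; the partial key misses.
	fmt.Println(len(exact["horatio"]), len(exact["horat"]))

	expanded := index{}
	expanded.addWithPrefixes("horatio", "hamlet.html", 3)
	// The prefix was stored at indexing time, so it now hits.
	fmt.Println(len(expanded["horat"]))
}
```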

The test suite includes a copy of the complete works of William Shakespeare (thanks to Jeremy Hylton's http://shakespeare.mit.edu/), and this library is used to create a simple search engine on top of that corpus. By default it only runs for 10 seconds, but you can run it for longer by doing something like:

SEARCHER_WEB_TIMEOUT_SECONDS=120 go test fulltext -v

Documentation

Overview

A simple cross-platform, full-text search engine, backed by cdb. Intended for use on small- to medium-sized websites.

See README.md for usage.

Constants

const HEADER_SIZE = 4096

Size of the header block to prepend; it is 4k so that disk reads stay aligned.

Functions

func HTMLExtractDescription

func HTMLExtractDescription(html string) string

Helper to extract an HTML description from the meta[name=description] tag

func HTMLExtractTitle

func HTMLExtractTitle(html string) string

Helper to extract an HTML title from the title tag
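
For readers curious how a helper like this can work, the sketch below pulls the title out with a regular expression from the standard library. It is an assumption-laden illustration, not the package's implementation: extractTitle and titleRe are hypothetical names, and a regular expression is only a rough approach to HTML.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// titleRe matches the contents of the first <title> element, ignoring case
// and allowing the text to span lines.
var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// extractTitle is a hypothetical stand-in, not fulltext.HTMLExtractTitle.
// It returns the trimmed title text, or "" when no title element is found.
func extractTitle(html string) string {
	m := titleRe.FindStringSubmatch(html)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(m[1])
}

func main() {
	page := "<html><head><TITLE> Hamlet, Act 1 </TITLE></head><body>...</body></html>"
	fmt.Printf("%q\n", extractTitle(page)) // "Hamlet, Act 1"
}
```

A title extracted this way is a natural candidate for the StoreValue of an IndexDoc, as in the README example.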

func HTMLStripTags

func HTMLStripTags(s string) (output string)

Strip HTML tags, replace common entities, and escape <>&;'" in the result. Note that the returned text may contain entities, as it is escaped by HTMLEscapeString, and most entities are not translated.

This function is copied from https://github.com/kennygrant/sanitize/blob/master/sanitize.go (license: https://github.com/kennygrant/sanitize/blob/master/License-BSD.txt).

func IndexizeWord

func IndexizeWord(w string) string

Make word appropriate for indexing

func Wordize

func Wordize(t string) []string

Split a string up into words
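
The following sketch shows one plausible way to split and normalize text with functions of the same shapes as Wordize and IndexizeWord. The names splitWords and cleanWord are hypothetical, and the rules (split on non-alphanumeric runes, lowercase each word) are assumptions, not the package's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords is a hypothetical splitter with the same signature as Wordize.
// It breaks text on any rune that is not a letter or a digit.
func splitWords(t string) []string {
	return strings.FieldsFunc(t, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// cleanWord is a hypothetical cleaner with the same signature as
// IndexizeWord. It lowercases the word so lookups are case-insensitive.
func cleanWord(w string) string {
	return strings.ToLower(w)
}

func main() {
	for _, w := range splitWords("Alas, poor Yorick! I knew him, Horatio.") {
		fmt.Println(cleanWord(w))
	}
}
```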

Types

type IndexDoc

type IndexDoc struct {
	Id         []byte // the id, this is usually the path to the document
	IndexValue []byte // index this data
	StoreValue []byte // store this data
}

Contents of a single document to be indexed

type Indexer

type Indexer struct {
	WordSplit WordSplitter
	WordClean WordCleaner
	// contains filtered or unexported fields
}

Produces a set of cdb files from a series of AddDoc() calls

func NewIndexer

func NewIndexer() (*Indexer, error)

NewIndexer creates a new indexer.

func (*Indexer) AddDoc

func (idx *Indexer) AddDoc(idoc IndexDoc) error

Add a document to the index - writes to temporary files and stores some data in memory while building the index.

func (*Indexer) Close

func (idx *Indexer) Close()

Close the indexer and remove all of its resources

func (*Indexer) FinalizeAndWrite

func (idx *Indexer) FinalizeAndWrite(w io.Writer) error

Builds a final single index file, which consists of some simple header info, followed by the cdb binary files that comprise the full index.

type SearchResultItem

type SearchResultItem struct {
	Id         []byte // id of this item (document)
	StoreValue []byte // the stored value of this document
	Score      int64  // the total score
}

A single item in a search result

type SearchResultItems

type SearchResultItems []SearchResultItem

Implements sort.Interface

func (SearchResultItems) Len

func (s SearchResultItems) Len() int

func (SearchResultItems) Less

func (s SearchResultItems) Less(i, j int) bool

func (SearchResultItems) Swap

func (s SearchResultItems) Swap(i, j int)

type SearchResults

type SearchResults struct {
	Items SearchResultItems
}

What happened during the search

type Searcher

type Searcher struct {
	// contains filtered or unexported fields
}

Interface for search. Not thread-safe, but it has low overhead, so using a separate one per thread (goroutine) should be workable.
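
The one-per-goroutine pattern might look like the sketch below. The searcher type here is a hypothetical stand-in with unsynchronized state, not *fulltext.Searcher, and newSearcher, search, and searchAll are names invented for this illustration. With the real package, each goroutine would call NewSearcher and Close on its own instance.

```go
package main

import (
	"fmt"
	"sync"
)

// searcher is a hypothetical stand-in for *fulltext.Searcher. Its lookups
// counter is unsynchronized, so sharing one instance across goroutines
// would be a data race.
type searcher struct {
	lookups int
}

func newSearcher() *searcher { return &searcher{} }

func (s *searcher) search(q string) string {
	s.lookups++
	return "results for " + q
}

// searchAll runs each query in its own goroutine with its own searcher and
// returns the results in query order.
func searchAll(queries []string) []string {
	out := make([]string, len(queries))
	var wg sync.WaitGroup
	for i, q := range queries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			s := newSearcher() // never shared with another goroutine
			out[i] = s.search(q)
		}(i, q)
	}
	wg.Wait()
	return out
}

func main() {
	for _, r := range searchAll([]string{"Horatio", "Yorick"}) {
		fmt.Println(r)
	}
}
```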

func NewSearcher

func NewSearcher(indexFile afero.File) (*Searcher, error)

NewSearcher creates a new searcher instance from the given index file.

func (*Searcher) Close

func (s *Searcher) Close() error

Close and release resources

func (*Searcher) SimpleSearch

func (s *Searcher) SimpleSearch(search string, maxn int) (SearchResults, error)

Perform a search

type WordCleaner

type WordCleaner func(string) string

type WordSplitter

type WordSplitter func(string) []string
