fulltext

package module

Version: v0.0.0-...-05ab8b1 | Published: Jul 21, 2015 | License: MIT | Imports: 11 | Imported by: 4

Repository

github.com/andreaskoch/fulltext

README ¶

Overview

This is a simple, pure-Go, full text indexing and search library.

I made it for use on small to medium websites, although there is nothing web-specific about its API or operation.

Cdb (http://github.com/jbarham/go-cdb) is used to perform the indexing and lookups, and github.com/spf13/afero is used as an in-memory filesystem for search indexes.

Status

This project is experimental. Breaking changes may well occur.

Notes on Building

fulltext requires jbarham/go-cdb and spf13/afero:

go get github.com/jbarham/go-cdb
go get github.com/spf13/afero

Usage

First, you must create an index, like this:

import (
	"fmt"

	"github.com/andreaskoch/fulltext"
	"github.com/spf13/afero"
)

// create a new indexer
idx, err := fulltext.NewIndexer()
if err != nil {
	panic(err)
}
defer idx.Close()

// for each document you want to add, you do something like this
// (uuid, title and data are placeholders for your own values):
doc := fulltext.IndexDoc{
	Id: []byte(uuid), // unique identifier (the path to a webpage works...)
	StoreValue: []byte(title), // bytes you want to be able to retrieve from search results
	IndexValue: []byte(data), // bytes you want to be split into words and indexed
}

// add it
if err := idx.AddDoc(doc); err != nil {
	panic(err)
}

// create an in-memory index file to hold the final index
var indexFs afero.Fs = &afero.MemMapFs{}
indexFile, err := indexFs.Create("idxout")
if err != nil {
	panic(err)
}

// when done, write out to final index
if err := idx.FinalizeAndWrite(indexFile); err != nil {
	panic(err)
}

Once you have an index file, you can search it like this:

// open a searcher on the index file written above
s, err := fulltext.NewSearcher(indexFile)
if err != nil {
	panic(err)
}

defer s.Close()

// return at most 20 results for the search term
sr, err := s.SimpleSearch("Horatio", 20)
if err != nil {
	panic(err)
}

for k, v := range sr.Items {
	fmt.Printf("----------- #:%d\n", k)
	fmt.Printf("Id: %s\n", v.Id)
	fmt.Printf("Score: %d\n", v.Score)
	fmt.Printf("StoreValue: %s\n", v.StoreValue)
}

It's rather simplistic. But it's fast and it works.

TODOs

Will likely need some sort of "stop word" functionality.
The scoring aggregation logic should be extracted to a callback function with the existing functionality as default (already done for Wordize() and IndexizeWord()).
If there is some decent b-tree disk storage that is portable then it would be worth looking at using that instead of CDB and implementing LIKE-style matching. As it is, CDB is quite efficient, but it is a hash index.

Implementation Notes

I originally tried doing this on top of SQLite. It was dreadfully slow. Cdb is orders of magnitude faster.

The two main disadvantages of the Cdb route are that the index cannot be edited once it is built (you have to recreate it in full) and that, being hash-based, it will not support any sort of fuzzy matching unless those variations are included in the index (which, in the current implementation, they are not). For my purposes these two disadvantages are outweighed by the fact that it's blindingly fast, easy to use, and portable (pure Go), and that its interface allowed me to build the indexes I needed into a single file.
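To illustrate the fuzzy-matching limitation, here is a minimal sketch that uses a Go map as a stand-in for a hash-based index. The hashIndex type and its methods are hypothetical and are not part of this library; addWithPrefixes shows one possible way of including variations in the index, at the cost of a larger index.

```go
package main

import (
	"fmt"
	"strings"
)

// hashIndex is a stand-in for a hash-based index such as cdb:
// it maps an exact key to the ids of the documents containing that word.
type hashIndex map[string][]string

// add registers docID under the exact, lowercased word.
func (h hashIndex) add(word, docID string) {
	key := strings.ToLower(word)
	h[key] = append(h[key], docID)
}

// addWithPrefixes additionally registers every prefix of at least minLen
// bytes (hypothetical; assumes ASCII words for simplicity).
func (h hashIndex) addWithPrefixes(word, docID string, minLen int) {
	key := strings.ToLower(word)
	for n := minLen; n < len(key); n++ {
		h[key[:n]] = append(h[key[:n]], docID)
	}
	h[key] = append(h[key], docID)
}

// lookup performs the only query a hash index supports: an exact key match.
func (h hashIndex) lookup(word string) []string {
	return h[strings.ToLower(word)]
}

func main() {
	exact := hashIndex{}
	exact.add("Horatio", "hamlet.html")
	fmt.Println(len(exact.lookup("horatio"))) // the exact word is found
	fmt.Println(len(exact.lookup("hora")))    // a prefix is not

	prefixed := hashIndex{}
	prefixed.addWithPrefixes("Horatio", "hamlet.html", 3)
	fmt.Println(len(prefixed.lookup("hora"))) // found once prefixes are indexed
}
```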

The test suite includes a copy of the complete works of William Shakespeare (thanks to Jeremy Hylton's http://shakespeare.mit.edu/), and the library is used to build a simple search engine on top of that corpus. By default the search engine runs for only 10 seconds, but you can run it for longer with something like:

SEARCHER_WEB_TIMEOUT_SECONDS=120 go test fulltext -v

Documentation ¶

Overview ¶

A simple cross-platform, full-text search engine, backed by cdb. Intended for use on small- to medium-sized websites.

See README.md for usage.

Index ¶

Constants
func HTMLExtractDescription(html string) string
func HTMLExtractTitle(html string) string
func HTMLStripTags(s string) (output string)
func IndexizeWord(w string) string
func Wordize(t string) []string
type IndexDoc
type Indexer
- func NewIndexer() (*Indexer, error)
type SearchResultItem
type SearchResultItems
type SearchResults
type Searcher
- func NewSearcher(indexFile afero.File) (*Searcher, error)
- func (s *Searcher) Close() error
- func (s *Searcher) SimpleSearch(search string, maxn int) (SearchResults, error)
type WordCleaner
type WordSplitter

Constants ¶

const HEADER_SIZE = 4096

Size of header block to prepend - make it 4k to align disk reads

Variables ¶

This section is empty.

Functions ¶

func HTMLExtractDescription ¶

func HTMLExtractDescription(html string) string

Helper to extract an HTML description from the meta[name=description] tag

func HTMLExtractTitle ¶

func HTMLExtractTitle(html string) string

Helper to extract an HTML title from the title tag
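As a rough idea of what such a helper might do, here is a regexp-based sketch. It is not the library's implementation: extractTitle and titleRe are hypothetical names, and a regular expression is only an approximation of real HTML parsing.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// titleRe matches the contents of the first <title> element,
// case-insensitively and across line breaks.
var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// extractTitle returns the unescaped, trimmed text of the <title> element,
// or the empty string if the document has none.
func extractTitle(doc string) string {
	m := titleRe.FindStringSubmatch(doc)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(html.UnescapeString(m[1]))
}

func main() {
	page := "<html><head><TITLE> Hamlet &amp; Horatio </TITLE></head></html>"
	fmt.Println(extractTitle(page))
}
```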

func HTMLStripTags ¶

func HTMLStripTags(s string) (output string)

This function is copied from https://github.com/kennygrant/sanitize/blob/master/sanitize.go (license: https://github.com/kennygrant/sanitize/blob/master/License-BSD.txt). It strips HTML tags, replaces common entities, and escapes <>&;'" in the result. Note that the returned text may contain entities, as it is escaped by HTMLEscapeString, and most entities are not translated.

func IndexizeWord ¶

func IndexizeWord(w string) string

Make word appropriate for indexing

func Wordize ¶

func Wordize(t string) []string

Split a string up into words
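The exact rules used by Wordize and IndexizeWord are not documented here. The sketch below shows one plausible pair of functions with the same signatures; splitWords and normalizeWord are hypothetical names, and the splitting and lowercasing rules are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords has the same shape as Wordize: it breaks the text on every
// rune that is neither a letter nor a digit.
func splitWords(t string) []string {
	return strings.FieldsFunc(t, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// normalizeWord has the same shape as IndexizeWord: it lowercases the
// word so that lookups become case-insensitive.
func normalizeWord(w string) string {
	return strings.ToLower(w)
}

func main() {
	for _, w := range splitWords("Alas, poor Yorick! I knew him, Horatio.") {
		fmt.Println(normalizeWord(w))
	}
}
```

Functions of this shape also match the WordSplitter and WordCleaner types, so they could presumably be assigned to the Indexer's WordSplit and WordClean fields.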

Types ¶

type IndexDoc ¶

type IndexDoc struct {
	Id         []byte // the id, this is usually the path to the document
	IndexValue []byte // index this data
	StoreValue []byte // store this data
}

Contents of a single document to be indexed

type Indexer ¶

type Indexer struct {
	WordSplit WordSplitter
	WordClean WordCleaner
	// contains filtered or unexported fields
}

Produces a set of cdb files from a series of AddDoc() calls

func NewIndexer ¶

func NewIndexer() (*Indexer, error)

NewIndexer creates a new indexer.

func (*Indexer) AddDoc ¶

func (idx *Indexer) AddDoc(idoc IndexDoc) error

Add a document to the index - writes to temporary files and stores some data in memory while building the index.

func (*Indexer) Close ¶

func (idx *Indexer) Close()

Close the indexer and remove all resources it holds.

func (*Indexer) FinalizeAndWrite ¶

func (idx *Indexer) FinalizeAndWrite(w io.Writer) error

Builds a final single index file, which consists of some simple header info, followed by the cdb binary files that comprise the full index.

type SearchResultItem ¶

type SearchResultItem struct {
	Id         []byte // id of this item (document)
	StoreValue []byte // the stored value of this document
	Score      int64  // the total score
}

A single item in a search result

type SearchResultItems ¶

type SearchResultItems []SearchResultItem

Implement sort.Interface

func (SearchResultItems) Len ¶

func (s SearchResultItems) Len() int

func (SearchResultItems) Less ¶

func (s SearchResultItems) Less(i, j int) bool

func (SearchResultItems) Swap ¶

func (s SearchResultItems) Swap(i, j int)

type SearchResults ¶

type SearchResults struct {
	Items SearchResultItems
}

What happened during the search

type Searcher ¶

type Searcher struct {
	// contains filtered or unexported fields
}

Interface for search. Not thread-safe, but low overhead so having a separate one per thread should be workable.
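One way to follow this advice in Go is to give every worker goroutine its own searcher. The sketch below uses a stand-in searcher type so that it runs on its own; in real code each worker would presumably create its searcher with NewSearcher and close it when done. The names searcher and searchAll are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// searcher is a stand-in for a searcher that is not safe for concurrent use.
type searcher struct{ queries int }

func (s *searcher) search(q string) string {
	s.queries++ // unsynchronized state: safe only within one goroutine
	return "results for " + q
}

// searchAll answers the queries on the given number of worker goroutines,
// each of which owns a private searcher.
func searchAll(queries []string, workers int) []string {
	out := make([]string, len(queries))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := &searcher{} // one searcher per goroutine, never shared
			for i := range jobs {
				out[i] = s.search(queries[i]) // each index is written once
			}
		}()
	}
	for i := range queries {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	for _, r := range searchAll([]string{"Horatio", "Yorick", "Ophelia"}, 2) {
		fmt.Println(r)
	}
}
```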

func NewSearcher ¶

func NewSearcher(indexFile afero.File) (*Searcher, error)

NewSearcher creates a new searcher instance from the given index file.

func (*Searcher) Close ¶

func (s *Searcher) Close() error

Close and release resources

func (*Searcher) SimpleSearch ¶

func (s *Searcher) SimpleSearch(search string, maxn int) (SearchResults, error)

Perform a search

type WordCleaner ¶

type WordCleaner func(string) string

type WordSplitter ¶

type WordSplitter func(string) []string
