minsearch

Version: v0.0.0-...-b850876
Published: Dec 18, 2019 License: Apache-2.0 Imports: 15 Imported by: 8

README

go-minsearch

Package minsearch implements a minimal solution to index text and retrieve search results with score.

Documentation at https://godoc.org/github.com/tim-st/go-minsearch.

Download and install package minsearch and its tools with go get -u github.com/tim-st/go-minsearch/...

Commands

wikiindex

wikiindex creates a full-text index of a MediaWiki xml.bz2 dump file. The indexing process is interruptible.

Indexing Example

wikiindex -filename="dewiki-20190601-pages-articles.xml.bz2" -fullText -idLimit=1000 -noSync

creates the index file dewiki-20190601-pages-articles.xml.bz2.idx (file size: 3.00 GB; number of segments: 13628084; average number of IDs per segment: 14.05).

wikisearch

wikisearch prints the sorted search results of a query against an index file created by wikiindex.

The ID in the search results is the page ID used by MediaWiki (hostname/w/index.php?curid=ID).
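As a small illustration of the curid pattern, the sketch below turns a page ID into a URL. The helper name pageURL, the https scheme, and the example host are assumptions made for this example. They are not part of wikisearch.

```go
package main

import "fmt"

// pageURL builds a MediaWiki "curid" URL for a page ID.
// host is supplied by the caller; the https scheme is an assumption.
func pageURL(host string, id uint32) string {
	return fmt.Sprintf("https://%s/w/index.php?curid=%d", host, id)
}

func main() {
	// Hypothetical page ID on the German Wikipedia host.
	fmt.Println(pageURL("de.wikipedia.org", 12345))
}
```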

Search Example

wikisearch -filename="dewiki-20190601-pages-articles.xml.bz2.idx" -intersection -limit=10 -query="word1 word2 word3..."

Documentation

Overview

Package minsearch implements a minimal solution to index text and retrieve search results with score.

Index

Constants

const DefaultMaxResults = 1000000

DefaultMaxResults is a default value for the maximum temporary results during calculation of a search.

Variables

This section is empty.

Functions

This section is empty.

Types

type File

type File struct {
	// contains filtered or unexported fields
}

File is the index file.

func Open

func Open(filename string, noSync bool) (*File, error)

Open opens the File, creating it if it does not exist. Setting the noSync flag causes the database to skip fsync() calls after each commit. This makes indexing much faster but is unsafe: data can be lost in the event of a system failure.

func (*File) AvgCount

func (f *File) AvgCount() (float32, error)

AvgCount returns the average number of IDs per key in the database as of the last calculation. If the value has not been calculated yet (UpdateStatistics does this), an error is returned.

func (*File) Close

func (f *File) Close()

Close closes the file.

func (*File) IndexBatch

func (f *File) IndexBatch(pairs []Pair, maxIDs int) error

IndexBatch indexes all relevant segments for each Pair as a batch operation. See IndexPair for more information.

func (*File) IndexPair

func (f *File) IndexPair(pair Pair, maxIDs int) error

IndexPair indexes all relevant segments of the given Pair. If maxIDs > 0, each indexed segment keeps at most maxIDs distinct (ID, Score) pairs, and only the highest scores are kept. If maxIDs is chosen too small, result quality can degrade noticeably. A value in [1000, 10000] may be a good choice, since it limits very common words of a language such as "the" or "a" in English. If maxIDs <= 0, the number of scores per segment is not limited. This yields the best results (assuming the result set is not limited) but definitely the largest file size and higher temporary memory usage.
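The effect of maxIDs can be pictured with a minimal sketch that keeps only the highest-scoring entries of one segment. This is an illustration of the documented behaviour, not the package's implementation. The posting type and the capPostings function are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// posting is a hypothetical (ID, Score) pair stored for one segment.
type posting struct {
	ID    uint32
	Score float32
}

// capPostings returns a copy of postings sorted by descending score and
// truncated to maxIDs entries. maxIDs <= 0 means "no limit".
func capPostings(postings []posting, maxIDs int) []posting {
	out := append([]posting(nil), postings...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if maxIDs > 0 && len(out) > maxIDs {
		out = out[:maxIDs]
	}
	return out
}

func main() {
	segment := []posting{{1, 0.2}, {2, 0.9}, {3, 0.5}, {4, 0.7}}
	fmt.Println(capPostings(segment, 2)) // keeps the postings for IDs 2 and 4
	fmt.Println(len(capPostings(segment, 0)))
}
```

With a cap in place, a very common word stores only its best-scoring IDs, which bounds both the file size and the work done per segment at search time.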

func (*File) KeyCount

func (f *File) KeyCount() (uint32, error)

KeyCount returns the number of keys in the database as of the last calculation. If the value has not been calculated yet (UpdateStatistics does this), an error is returned.

func (*File) LastID

func (f *File) LastID() (ID, error)

LastID returns the last ID that was saved using SetLastID. This function can be helpful to get the last state of an operation.

func (*File) Search

func (f *File) Search(query []byte, setOp SetOperation, maxResults int) ([]Result, error)

Search looks up the relevant segments of the query in the index file and returns a result set ordered by score. If maxResults > 0, the number of temporary results held during calculation of the search results, which can be much higher than the size of the final result set, is limited to maxResults. If the number of results for at least one segment exceeds maxResults, the result set may miss results with higher scores. If maxResults <= 0, memory usage is not limited. It is recommended to set maxResults > 0 to limit the maximum RAM usage (especially if the SetOperation is Union or the query is user input).

func (*File) SetLastID

func (f *File) SetLastID(id ID) error

SetLastID stores the given ID (that can be some unrelated type with same byte length) in the statistics of the database. This function can be helpful to store the last state of some operation. The stored value can be retrieved using a call to LastID. Setting the value has no effect on the indexed data.
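One way to use LastID and SetLastID is as a checkpoint for an interruptible job. The sketch below is self-contained and does not import the package: the checkpointer interface only mirrors the two documented method signatures (ID is an alias for uint32), and memCheckpoint is a hypothetical in-memory stand-in for an index file. Treating an error from LastID as "nothing stored yet" is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpointer mirrors the documented LastID and SetLastID signatures.
type checkpointer interface {
	LastID() (uint32, error)
	SetLastID(id uint32) error
}

// memCheckpoint is a hypothetical in-memory stand-in for an index file.
type memCheckpoint struct {
	id  uint32
	set bool
}

func (m *memCheckpoint) LastID() (uint32, error) {
	if !m.set {
		return 0, errors.New("no last ID stored")
	}
	return m.id, nil
}

func (m *memCheckpoint) SetLastID(id uint32) error {
	m.id, m.set = id, true
	return nil
}

// resume handles ids (assumed ascending), skipping every id at or below the
// stored checkpoint, and records each handled id as the new checkpoint.
func resume(c checkpointer, ids []uint32, handle func(uint32) error) ([]uint32, error) {
	last, err := c.LastID()
	hasLast := err == nil // assumption: an error means no checkpoint exists yet
	var done []uint32
	for _, id := range ids {
		if hasLast && id <= last {
			continue
		}
		if err := handle(id); err != nil {
			return done, err
		}
		if err := c.SetLastID(id); err != nil {
			return done, err
		}
		done = append(done, id)
	}
	return done, nil
}

func main() {
	c := &memCheckpoint{}
	noop := func(uint32) error { return nil }
	first, _ := resume(c, []uint32{1, 2, 3}, noop)
	second, _ := resume(c, []uint32{1, 2, 3, 4, 5}, noop)
	fmt.Println(first, second) // the second run only handles IDs 4 and 5
}
```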

func (File) String

func (f File) String() string

func (*File) UpdateStatistics

func (f *File) UpdateStatistics() error

UpdateStatistics calculates the current number of keys and the average data length.

type ID

type ID = uint32

ID is a unique uint32 number like a position or an FNV hash, which is indexed together with a Score.
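When the indexed items have no natural numeric position, an ID can be derived from a key with the standard library's hash/fnv package. The following is a minimal sketch. The choice of the FNV-1a variant and the name idFromKey are assumptions, and distinct keys can collide in a 32-bit hash, so uniqueness is not guaranteed.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// idFromKey derives a uint32 ID from an arbitrary key using 32-bit FNV-1a.
func idFromKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32()
}

func main() {
	// Hypothetical key, for example a page title.
	fmt.Printf("%#08x\n", idFromKey("Berlin"))
}
```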

type Pair

type Pair struct {
	ID   ID
	Text []byte
}

Pair is a pair of an ID and the text which should get indexed for the ID.

type Result

type Result struct {
	ID    ID
	Score Score
}

Result is a single search result of a result set. It stores the ID and the score depending on the search query.

type Score

type Score = float32

Score is a priority value calculated for each indexed segment per ID.

type SetOperation

type SetOperation uint8

SetOperation is the operation that is done on the result set when the query consists of multiple relevant segments.

const (
	// Union collects all search results that match at least one relevant segment of the query.
	Union SetOperation = iota
	// Intersection collects all results that match each relevant segment of the query.
	Intersection
)
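The difference between the two operations can be sketched on plain slices. This is an illustration only and does not reflect the package's internals: the result type mirrors the documented Result fields, combine is a hypothetical helper, and summing the scores of matching segments is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// result mirrors the documented Result fields for this standalone sketch.
type result struct {
	ID    uint32
	Score float32
}

// combine merges one result list per query segment. With intersection set to
// false it keeps IDs found in at least one list (union); otherwise it keeps
// only IDs found in every list. Scores are summed, which is an assumption.
func combine(lists [][]result, intersection bool) []result {
	scores := make(map[uint32]float32)
	hits := make(map[uint32]int)
	for _, list := range lists {
		seen := make(map[uint32]bool) // count each ID once per list
		for _, r := range list {
			scores[r.ID] += r.Score
			if !seen[r.ID] {
				seen[r.ID] = true
				hits[r.ID]++
			}
		}
	}
	var out []result
	for id, s := range scores {
		if !intersection || hits[id] == len(lists) {
			out = append(out, result{ID: id, Score: s})
		}
	}
	// Order by descending score, breaking ties by ascending ID.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	word1 := []result{{1, 1}, {2, 2}}
	word2 := []result{{2, 1}, {3, 4}}
	fmt.Println(combine([][]result{word1, word2}, false)) // IDs 3, 2, 1
	fmt.Println(combine([][]result{word1, word2}, true))  // only ID 2
}
```

The union can grow with every additional query segment, while the intersection can only shrink, which is why limiting temporary results matters most for Union.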

Directories

Path Synopsis
cmd
