minsearch

Version: v0.0.0-...-b850876 (latest) | Published: Dec 18, 2019 | License: Apache-2.0 | Imports: 15 | Imported by: 8

Repository

github.com/tim-st/go-minsearch

README

go-minsearch

Package minsearch implements a minimal solution to index text and retrieve search results with score.

Documentation at https://godoc.org/github.com/tim-st/go-minsearch.

Download and install package minsearch and its tools with go get -u github.com/tim-st/go-minsearch/...

Commands

wikiindex

wikiindex creates a full-text index of a MediaWiki xml.bz2 dump file. The indexing process is interruptible.

Indexing Example

wikiindex -filename="dewiki-20190601-pages-articles.xml.bz2" -fullText -idLimit=1000 -noSync

This creates the index file dewiki-20190601-pages-articles.xml.bz2.idx (file size: 3.00 GB; number of segments: 13628084; average number of IDs per segment: 14.05).

wikisearch

wikisearch prints the sorted search results for a query against an index file created by wikiindex.

The ID in the search results is the page ID used by MediaWiki (hostname/w/index.php?curid=ID).
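As an illustration of the curid pattern above, the following sketch turns a page ID into a link. The helper name pageURL, the https scheme, and the example hostname are assumptions made for this example; they are not part of wikisearch.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// pageURL builds a MediaWiki link for a page ID, following the
// hostname/w/index.php?curid=ID pattern. The https scheme is an assumption.
func pageURL(hostname string, id uint32) string {
	u := url.URL{
		Scheme:   "https",
		Host:     hostname,
		Path:     "/w/index.php",
		RawQuery: url.Values{"curid": {strconv.FormatUint(uint64(id), 10)}}.Encode(),
	}
	return u.String()
}

func main() {
	// de.wikipedia.org is only an example hostname.
	fmt.Println(pageURL("de.wikipedia.org", 12345))
}
```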

Search Example

wikisearch -filename="dewiki-20190601-pages-articles.xml.bz2.idx" -intersection -limit=10 -query="word1 word2 word3..."

Documentation

Index

Constants
type File
- func Open(filename string, noSync bool) (*File, error)
type ID
type Pair
type Result
type Score
type SetOperation

Constants

const DefaultMaxResults = 1000000

DefaultMaxResults is a default value for the maximum number of temporary results held during the calculation of a search.

Types

func Open

func Open(filename string, noSync bool) (*File, error)

Open opens the File, or creates a new File if it doesn't exist. Setting the noSync flag causes the database to skip fsync() calls after each commit. In the event of a system failure data can be lost, so setting the flag is unsafe, but it makes indexing much faster.

func (*File) AvgCount

func (f *File) AvgCount() (float32, error)

AvgCount returns the average number of IDs per key in the database as of the last calculation. If the value has not been calculated yet (UpdateStatistics does this), an error is returned.

func (*File) Close

func (f *File) Close()

Close closes the file.

func (*File) IndexBatch

func (f *File) IndexBatch(pairs []Pair, maxIDs int) error

IndexBatch indexes all relevant segments for each Pair as a batch operation. See IndexPair for more information.

func (*File) IndexPair

func (f *File) IndexPair(pair Pair, maxIDs int) error

IndexPair indexes all relevant segments of the given Pair. If maxIDs > 0, each indexed segment keeps at most maxIDs different (ID, Score) pairs, and only the highest scores are kept. If maxIDs is chosen too small, result quality can degrade noticeably. A value of maxIDs in [1000, 10000] may be a good choice; it limits very common words of a language, such as "the" or "a" in English. If maxIDs <= 0, the number of scores per segment is not limited. This yields the best results (assuming the result set is not limited) but definitely the largest file size and higher temporary memory usage.
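The effect of maxIDs on a single segment can be pictured with the sketch below. It is not the package's implementation: the entry type and the limitEntries helper are hypothetical, and sorting in memory is only one way to keep the highest scores.

```go
package main

import (
	"fmt"
	"sort"
)

type ID = uint32
type Score = float32

// entry is a hypothetical in-memory (ID, Score) pair for one segment.
type entry struct {
	ID    ID
	Score Score
}

// limitEntries returns at most maxIDs entries with the highest scores.
// If maxIDs <= 0 all entries are kept, mirroring the documented unlimited case.
func limitEntries(entries []entry, maxIDs int) []entry {
	sorted := make([]entry, len(entries))
	copy(sorted, entries)
	// Highest score first; ties are broken by ID so the output is deterministic.
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Score != sorted[j].Score {
			return sorted[i].Score > sorted[j].Score
		}
		return sorted[i].ID < sorted[j].ID
	})
	if maxIDs > 0 && len(sorted) > maxIDs {
		sorted = sorted[:maxIDs]
	}
	return sorted
}

func main() {
	segment := []entry{{1, 0.2}, {2, 0.9}, {3, 0.5}, {4, 0.7}}
	fmt.Println(limitEntries(segment, 2))      // keeps IDs 2 and 4
	fmt.Println(len(limitEntries(segment, 0))) // no limit: all 4 entries remain
}
```

With a small limit, IDs with low scores for a very common segment are dropped first, which is why a limit that is too small can hide relevant results.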

func (*File) KeyCount

func (f *File) KeyCount() (uint32, error)

KeyCount returns the number of keys in the database as of the last calculation. If the value has not been calculated yet (UpdateStatistics does this), an error is returned.

func (*File) LastID

func (f *File) LastID() (ID, error)

LastID returns the last ID that was saved using SetLastID. This function can be helpful to get the last state of an operation.

func (*File) Search

func (f *File) Search(query []byte, setOp SetOperation, maxResults int) ([]Result, error)

Search looks up the relevant segments of the query in the index file and returns a result set ordered by score. If maxResults > 0, the number of temporary results held during the calculation of the search results, which can be much higher than the final result count, is limited to maxResults. If the number of results for at least one segment exceeds maxResults, the result set may miss results with a higher score. If maxResults <= 0, memory usage is not limited. It is recommended to set maxResults > 0 to bound RAM usage, especially if the SetOperation is Union or the query is user input.

func (*File) SetLastID

func (f *File) SetLastID(id ID) error

SetLastID stores the given ID (which can be any unrelated value of the same byte length) in the statistics of the database. This function can be helpful to store the last state of some operation. The stored value can be retrieved with a call to LastID. Setting the value has no effect on the indexed data.

func (File) String

func (f File) String() string

func (*File) UpdateStatistics

func (f *File) UpdateStatistics() error

UpdateStatistics calculates the current number of keys and the average data length.

type ID

type ID = uint32

ID is a unique uint32 number like a position or an FNV hash, which is indexed together with a Score.

type Pair

type Pair struct {
	ID   ID
	Text []byte
}

Pair is a pair of an ID and the text to be indexed for that ID.

type Result

type Result struct {
	ID    ID
	Score Score
}

Result is a single search result of a result set. It stores the ID and the score depending on the search query.

type Score

type Score = float32

Score is a priority value calculated for each indexed segment per ID.

type SetOperation

type SetOperation uint8

SetOperation is the operation that is done on the result set when the query consists of multiple relevant segments.

const (
	// Union collects all search results that match at least one relevant segment of the query.
	Union SetOperation = iota
	// Intersection collects all results that match each relevant segment of the query.
	Intersection
)
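The difference between the two operations can be shown on small per-segment result lists. The sketch below redeclares the documented types locally so that it is self-contained; the combine helper is hypothetical, and summing the scores of an ID across segments is an assumption made for illustration, since the source does not say how scores are combined.

```go
package main

import (
	"fmt"
	"sort"
)

type ID = uint32
type Score = float32

type Result struct {
	ID    ID
	Score Score
}

type SetOperation uint8

const (
	Union SetOperation = iota
	Intersection
)

// combine merges per-segment results. Summing scores is an assumption of this sketch.
func combine(segments [][]Result, op SetOperation) []Result {
	scores := make(map[ID]Score)
	hits := make(map[ID]int) // number of segments in which an ID occurs
	for _, seg := range segments {
		seen := make(map[ID]bool) // count each ID once per segment
		for _, r := range seg {
			scores[r.ID] += r.Score
			if !seen[r.ID] {
				seen[r.ID] = true
				hits[r.ID]++
			}
		}
	}
	var out []Result
	for id, s := range scores {
		// Intersection keeps only IDs that matched every segment.
		if op == Intersection && hits[id] != len(segments) {
			continue
		}
		out = append(out, Result{ID: id, Score: s})
	}
	// Highest score first; ties are broken by ID for a stable order.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	segments := [][]Result{
		{{1, 0.5}, {2, 0.25}}, // matches for the first query segment
		{{2, 0.5}, {3, 1}},    // matches for the second query segment
	}
	fmt.Println(combine(segments, Union))        // IDs 3, 2, 1
	fmt.Println(combine(segments, Intersection)) // only ID 2
}
```

Union can produce far more temporary results than Intersection, which is consistent with the recommendation to set maxResults > 0 when searching with Union.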

Directories

Path	Synopsis
cmd/wikiindex
cmd/wikisearch