dictionary

package
v1.4.6

This package is not in the latest version of its module.

Published: Jan 28, 2021 License: MIT Imports: 15 Imported by: 0

Documentation

Overview

Package dictionary contains code for looking up and storing words in dictionaries. The parser currently supports the Project Gutenberg edition of Webster's Unabridged Dictionary (1913).

Constants

const FileVer = "DICT6\x00" // note: can currently handle "DICT5\x00" too

FileVer is the current compatibility level of saved Files.

Variables

This section is empty.

Functions

func CreateFile

func CreateFile(wm WordMap, dictfile string) error

CreateFile exports a WordMap to a file. The specified file will be overwritten if it exists.

Types

type File

type File struct {
	// contains filtered or unexported fields
}

File implements an efficient Store which is faster to initialize and uses a lot less memory (~15 MB total) than WordMap.

There needs to be enough memory to store the whole index. Reading a dict is also completely thread-safe. Corrupt files will be detected during the read of the corrupted word (or the initialization in the case of index corruption) or during Verify.

The dict file is stored in the following format:

    +---------+------------+---------------------------------------------+----------+-------------------------------------------------+
    |         |            | +------+------------------------------+     |          |                                                 |
    | FileVer | idx offset | | size | zlib compressed Word msgpack | ... | idx size | zlib compressed idx map[string][]offset msgpack |
    |         |            | +------+------------------------------+     |          |                                                 |
    +---------+------------+---------------------------------------------+----------+-------------------------------------------------+

    All sizes and offsets are little-endian int64. All sizes are the size of the size plus the data.
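The size rule above can be illustrated with a minimal sketch of one record. It is not the package's own code: the helper names frame and unframe are hypothetical, and a plain byte string stands in for the msgpack-encoded Word because the standard library has no msgpack encoder.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// frame zlib-compresses payload and prefixes it with a little-endian int64
// size. Per the rule above, the size counts its own 8 bytes plus the data.
// (Hypothetical helper, not part of the package.)
func frame(payload []byte) ([]byte, error) {
	var z bytes.Buffer
	zw := zlib.NewWriter(&z)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	out := make([]byte, 8, 8+z.Len())
	binary.LittleEndian.PutUint64(out, uint64(8+z.Len()))
	return append(out, z.Bytes()...), nil
}

// unframe parses one record from the start of b and returns the
// decompressed payload.
func unframe(b []byte) ([]byte, error) {
	if len(b) < 8 {
		return nil, errors.New("record too short")
	}
	size := int64(binary.LittleEndian.Uint64(b[:8]))
	if size < 8 || size > int64(len(b)) {
		return nil, fmt.Errorf("invalid size %d", size)
	}
	zr, err := zlib.NewReader(bytes.NewReader(b[8:size]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	rec, err := frame([]byte("stand-in for a msgpack-encoded Word"))
	if err != nil {
		panic(err)
	}
	got, err := unframe(rec)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes framed, payload %q\n", len(rec), got)
}
```

Because the size includes its own prefix, adding it to a record's offset gives the offset of whatever follows that record.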

The file is opened using the following steps:

1. The FileVer is read and checked. It must match exactly.
2. The idx offset is read.
3. The file is seeked to the beginning plus the idx offset.
4. The idx size is read.
5. The bytes for the idx are decompressed using zlib, and the resulting msgpack is decoded into an in-memory map[string][]int64 of the words to offsets.
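The opening steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: readIndexBlob, parse, and buildSample are hypothetical names, only the "DICT6\x00" marker is accepted, and the final msgpack decoding is left out because it needs a third-party decoder.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

const fileVer = "DICT6\x00" // mirrors the documented FileVer constant

// readIndexBlob follows the steps above up to, but not including, msgpack
// decoding. It returns the decompressed index bytes.
func readIndexBlob(r io.ReadSeeker) ([]byte, error) {
	// 1. Read and check the version marker.
	ver := make([]byte, len(fileVer))
	if _, err := io.ReadFull(r, ver); err != nil {
		return nil, err
	}
	if string(ver) != fileVer {
		return nil, errors.New("incompatible file version")
	}
	// 2. Read the idx offset.
	var idxOff int64
	if err := binary.Read(r, binary.LittleEndian, &idxOff); err != nil {
		return nil, err
	}
	// 3. Seek to the beginning plus the idx offset.
	if _, err := r.Seek(idxOff, io.SeekStart); err != nil {
		return nil, err
	}
	// 4. Read the idx size, which counts its own 8 bytes.
	var idxSize int64
	if err := binary.Read(r, binary.LittleEndian, &idxSize); err != nil {
		return nil, err
	}
	if idxSize < 8 {
		return nil, fmt.Errorf("invalid idx size %d", idxSize)
	}
	// 5. Decompress the idx bytes (msgpack decoding would follow here).
	zr, err := zlib.NewReader(io.LimitReader(r, idxSize-8))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// parse is a convenience wrapper for in-memory data.
func parse(data []byte) ([]byte, error) {
	return readIndexBlob(bytes.NewReader(data))
}

// buildSample assembles a tiny file with no word records, only a header and
// an index blob. Writes to a bytes.Buffer cannot fail, so errors are ignored.
func buildSample(idx []byte) []byte {
	var z bytes.Buffer
	zw := zlib.NewWriter(&z)
	zw.Write(idx)
	zw.Close()

	var f bytes.Buffer
	f.WriteString(fileVer)
	// The index starts right after the version marker and the offset field.
	binary.Write(&f, binary.LittleEndian, int64(len(fileVer)+8))
	binary.Write(&f, binary.LittleEndian, int64(8+z.Len()))
	f.Write(z.Bytes())
	return f.Bytes()
}

func main() {
	blob, err := parse(buildSample([]byte("index bytes")))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", blob)
}
```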

To read a word:

1. The offset is retrieved from the in-memory idx.
2. The file is seeked to the beginning plus the offset.
3. The size of the compressed word is read.
4. The bytes for the word are decompressed using zlib, and the resulting msgpack is decoded into an in-memory *Word.
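A word read can be sketched the same way. The version below uses io.ReaderAt instead of seeking; that is an assumption made for this example (it keeps no shared file position, which is one way to make concurrent reads safe), not a statement about how File is implemented. The names readRecordAt, lookup, appendRecord, and sample are hypothetical, and plain strings stand in for msgpack-encoded words.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/binary"
	"fmt"
	"io"
)

// readRecordAt reads the size-prefixed, zlib-compressed record starting at
// off and returns its decompressed bytes.
func readRecordAt(r io.ReaderAt, off int64) ([]byte, error) {
	var hdr [8]byte
	if n, err := r.ReadAt(hdr[:], off); n < len(hdr) {
		return nil, err
	}
	size := int64(binary.LittleEndian.Uint64(hdr[:]))
	if size < 8 {
		return nil, fmt.Errorf("invalid record size %d at offset %d", size, off)
	}
	zr, err := zlib.NewReader(io.NewSectionReader(r, off+8, size-8))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// lookup takes the offsets for word from the in-memory index and reads each
// record they point to.
func lookup(data []byte, idx map[string][]int64, word string) ([]string, error) {
	var out []string
	for _, off := range idx[word] {
		rec, err := readRecordAt(bytes.NewReader(data), off)
		if err != nil {
			return nil, err
		}
		out = append(out, string(rec))
	}
	return out, nil
}

// appendRecord compresses payload, appends it to buf, and returns the offset
// at which the record starts.
func appendRecord(buf *bytes.Buffer, payload string) int64 {
	off := int64(buf.Len())
	var z bytes.Buffer
	zw := zlib.NewWriter(&z)
	zw.Write([]byte(payload))
	zw.Close()
	binary.Write(buf, binary.LittleEndian, int64(8+z.Len()))
	buf.Write(z.Bytes())
	return off
}

// sample builds two records and their index. For brevity the data has no
// file header, so offsets start at zero.
func sample() ([]byte, map[string][]int64) {
	var buf bytes.Buffer
	idx := map[string][]int64{}
	idx["cat"] = append(idx["cat"], appendRecord(&buf, "cat: a small feline"))
	idx["dog"] = append(idx["dog"], appendRecord(&buf, "dog: a domesticated canine"))
	return buf.Bytes(), idx
}

func main() {
	data, idx := sample()
	recs, err := lookup(data, idx, "dog")
	if err != nil {
		panic(err)
	}
	fmt.Println(recs)
}
```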

For more details, see the source code.

It is up to the creator to ensure there aren't duplicate references to entries for headwords in the index. If duplicates are found, they will be returned as-is.

func OpenFile

func OpenFile(dictfile string) (*File, error)

OpenFile opens a dictionary file. It will return errors if there are errors reading the files or critical errors in the structure.

func (*File) Close

func (d *File) Close() error

Close closes the files associated with the dictionary file and clears the in-memory index. Usage of the File afterwards may result in a panic.

func (*File) GetWord

func (d *File) GetWord(word string) (*Word, bool, error)

GetWord is deprecated.

func (*File) GetWords added in v1.4.0

func (d *File) GetWords(word string) ([]*Word, bool, error)

GetWords implements Store, and will return an error if the data structure is invalid or the underlying files are inaccessible.

func (*File) HasWord

func (d *File) HasWord(word string) bool

HasWord implements Store.

func (*File) Lookup

func (d *File) Lookup(word string) (*Word, bool, error)

Lookup is deprecated.

func (*File) LookupWord added in v1.4.0

func (d *File) LookupWord(word string) ([]*Word, bool, error)

LookupWord is a shortcut for calling the package-level LookupWord with this File as the store.

func (*File) NumWords

func (d *File) NumWords() int

NumWords implements Store.

func (*File) Verify

func (d *File) Verify() error

Verify verifies the consistency of the data structures in the dict file. WARNING: Verify takes a few seconds to run.

type Store

type Store interface {
	// NumWords returns the number of words in the Store.
	NumWords() int
	// HasWord checks if the Store contains a word as-is (i.e. do not do any additional processing or trimming).
	HasWord(word string) bool

	// GetWords gets a word, which can have multiple instances, from the Store.
	// If it does not exist, exists will be false, and w and err will be nil.
	GetWords(word string) (w []*Word, exists bool, err error)
	// GetWord is deprecated.
	GetWord(word string) (w *Word, exists bool, err error)

	// LookupWord should call the package-level LookupWord with the Store itself.
	LookupWord(word string) ([]*Word, bool, error)
	// Lookup is deprecated.
	Lookup(word string) (*Word, bool, error)
}

Store is a backend for storing dictionary entries. Implementations should not return duplicate entries, but it is not a bug to do so.

type Word

type Word struct {
	Word            string        `json:"word,omitempty" msgpack:"w"`
	Alternates      []string      `json:"alternates,omitempty" msgpack:"a"`
	Info            string        `json:"info,omitempty" msgpack:"i"`
	Etymology       string        `json:"etymology,omitempty" msgpack:"e"`
	Meanings        []WordMeaning `json:"meanings,omitempty" msgpack:"m"`
	Notes           []string      `json:"notes,omitempty" msgpack:"n"`
	Extra           string        `json:"extra,omitempty" msgpack:"x"`
	Credit          string        `json:"credit,omitempty" msgpack:"c"`
	ReferencedWords []string      `json:"referenced_words" msgpack:"r"` // note: this does not include words referenced within meanings
}

Word represents a word.
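The struct tags determine how a Word is serialized. As a small illustration of the JSON side only, the sketch below copies the two documented types locally so that it compiles on its own; the encode helper is hypothetical. Note that referenced_words has no omitempty option, so it is always emitted, and a nil slice is encoded as null.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types so the example is self-contained.
type WordMeaning struct {
	Text            string   `json:"text,omitempty" msgpack:"t"`
	Example         string   `json:"example,omitempty" msgpack:"e"`
	ReferencedWords []string `json:"referenced_words" msgpack:"r"`
}

type Word struct {
	Word            string        `json:"word,omitempty" msgpack:"w"`
	Alternates      []string      `json:"alternates,omitempty" msgpack:"a"`
	Info            string        `json:"info,omitempty" msgpack:"i"`
	Etymology       string        `json:"etymology,omitempty" msgpack:"e"`
	Meanings        []WordMeaning `json:"meanings,omitempty" msgpack:"m"`
	Notes           []string      `json:"notes,omitempty" msgpack:"n"`
	Extra           string        `json:"extra,omitempty" msgpack:"x"`
	Credit          string        `json:"credit,omitempty" msgpack:"c"`
	ReferencedWords []string      `json:"referenced_words" msgpack:"r"`
}

// encode marshals w with encoding/json, which ignores the msgpack tags.
func encode(w Word) string {
	b, err := json.Marshal(w)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Empty fields tagged omitempty are dropped; referenced_words is not.
	fmt.Println(encode(Word{Word: "cat"}))
	// prints {"word":"cat","referenced_words":null}

	// A non-nil empty slice is encoded as [] instead of null.
	fmt.Println(encode(Word{Word: "cat", ReferencedWords: []string{}}))
	// prints {"word":"cat","referenced_words":[]}
}
```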

func Lookup

func Lookup(store Store, word string) (*Word, bool, error)

Lookup looks up the first entry for a word in the dictionary (deprecated). It applies normalization and stemming to the word if no direct match is found.

func LookupWord added in v1.4.0

func LookupWord(store Store, word string) ([]*Word, bool, error)

LookupWord looks up a word in the dictionary. It applies normalization and stemming to the word if no direct match is found.
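The page does not say which normalization or stemming rules are applied, so the following is only a sketch of the general fallback pattern: try the word as-is, then a normalized form, then stemmed forms. The store type, the candidates and lookupWord helpers, and the suffix list are all hypothetical and far simpler than a real stemmer.

```go
package main

import (
	"fmt"
	"strings"
)

// store is a stand-in for the Store interface; values are definitions.
type store map[string][]string

// candidates returns the forms to try, in order: the word as-is, a
// lowercased and trimmed form, then naive stems. The suffix list is
// illustrative only.
func candidates(word string) []string {
	out := []string{word}
	norm := strings.ToLower(strings.TrimSpace(word))
	out = append(out, norm)
	for _, suf := range []string{"ing", "ed", "es", "s"} {
		// Require a few remaining letters so short words are left alone.
		if strings.HasSuffix(norm, suf) && len(norm) > len(suf)+2 {
			out = append(out, strings.TrimSuffix(norm, suf))
		}
	}
	return out
}

// lookupWord returns the entries for the first candidate form found.
func lookupWord(s store, word string) ([]string, bool) {
	for _, c := range candidates(word) {
		if defs, ok := s[c]; ok {
			return defs, true
		}
	}
	return nil, false
}

func main() {
	s := store{"walk": {"To move along on foot."}}
	for _, w := range []string{"walk", " Walked ", "walking", "run"} {
		defs, ok := lookupWord(s, w)
		fmt.Printf("%q found=%v %v\n", w, ok, defs)
	}
}
```

Trying the direct match first means exact headwords are never altered by the fallback rules.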

type WordMap

type WordMap map[string][]*Word

WordMap is an in-memory word Store used and returned by Parse. Although fast, it consumes huge amounts of memory and should be avoided where possible. It is up to the creator to ensure there aren't duplicate references to entries for headwords.
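To make the duplicate-reference caveat concrete, here is a minimal sketch of a map-backed store with the same shape as WordMap. The wordMap type, the trimmed-down Word, and the add helper are hypothetical. Counting headwords in NumWords is also an assumption, since the page does not say whether headwords or entries are counted.

```go
package main

import "fmt"

// Word is a trimmed-down stand-in for the documented Word type.
type Word struct{ Word, Info string }

// wordMap mirrors the documented WordMap shape: headword -> entries.
type wordMap map[string][]*Word

// NumWords counts headwords here (an assumption for this sketch).
func (wm wordMap) NumWords() int { return len(wm) }

// HasWord checks for the word as-is, with no trimming or normalization.
func (wm wordMap) HasWord(word string) bool {
	_, ok := wm[word]
	return ok
}

// GetWords returns every entry for word. A map lookup cannot fail, so the
// error is always nil.
func (wm wordMap) GetWords(word string) ([]*Word, bool, error) {
	w, ok := wm[word]
	return w, ok, nil
}

// add registers w under headword unless that exact entry is already listed,
// which is one way a creator could avoid duplicate references.
func (wm wordMap) add(headword string, w *Word) {
	for _, e := range wm[headword] {
		if e == w {
			return
		}
	}
	wm[headword] = append(wm[headword], w)
}

func main() {
	wm := wordMap{}
	c := &Word{Word: "colour", Info: "n."}
	wm.add("colour", c)
	wm.add("color", c)  // the same entry under an alternate headword
	wm.add("colour", c) // ignored: already referenced under this headword

	ws, ok, _ := wm.GetWords("colour")
	fmt.Println(wm.NumWords(), ok, len(ws))
}
```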

func Parse

func Parse(r io.Reader) (WordMap, error)

Parse parses Webster's Unabridged Dictionary of 1913 into a WordMap. Note: For dictserver > v1.3.1, this now uses the parser I implemented for dictutil, which is much more efficient and accurate.

func (WordMap) GetWord

func (wm WordMap) GetWord(word string) (*Word, bool, error)

GetWord is deprecated.

func (WordMap) GetWords added in v1.4.0

func (wm WordMap) GetWords(word string) ([]*Word, bool, error)

GetWords implements Store, but will never return an error.

func (WordMap) HasWord

func (wm WordMap) HasWord(word string) bool

HasWord implements Store.

func (WordMap) Lookup

func (wm WordMap) Lookup(word string) (*Word, bool, error)

Lookup is deprecated.

func (WordMap) LookupWord added in v1.4.0

func (wm WordMap) LookupWord(word string) ([]*Word, bool, error)

LookupWord is a shortcut for calling the package-level LookupWord with this WordMap as the store.

func (WordMap) NumWords

func (wm WordMap) NumWords() int

NumWords implements Store.

type WordMeaning added in v1.4.0

type WordMeaning struct {
	Text            string   `json:"text,omitempty" msgpack:"t"`
	Example         string   `json:"example,omitempty" msgpack:"e"`
	ReferencedWords []string `json:"referenced_words" msgpack:"r"`
}
