corpus

v0.0.0-...-4c7443a
Warning

This package is not in the latest version of its module.

Published: Jan 19, 2021 License: MIT Imports: 12 Imported by: 1

README

corpus

Corpus provides vocabulary management data structures and utilities.

Documentation

Constants

This section is empty.

Variables

var NumberWords = map[string]int{
	"zero":        0,
	"one":         1,
	"two":         2,
	"three":       3,
	"four":        4,
	"five":        5,
	"six":         6,
	"seven":       7,
	"eight":       8,
	"nine":        9,
	"ten":         10,
	"eleven":      11,
	"twelve":      12,
	"thirteen":    13,
	"fourteen":    14,
	"fifteen":     15,
	"sixteen":     16,
	"seventeen":   17,
	"eighteen":    18,
	"nineteen":    19,
	"twenty":      20,
	"thirty":      30,
	"forty":       40,
	"fifty":       50,
	"sixty":       60,
	"seventy":     70,
	"eighty":      80,
	"ninety":      90,
	"hundred":     100,
	"thousand":    1000,
	"million":     1000000,
	"billion":     1000000000,
	"trillion":    1000000000000,
	"quadrillion": 1000000000000000,
}

NumberWords was generated with the following Python code (shown here with Python 3's range in place of xrange):

numberWords = {}

# zero..twenty map to 0..20
simple = '''zero one two three four five six seven eight nine ten eleven twelve
        thirteen fourteen fifteen sixteen seventeen eighteen nineteen
        twenty'''.split()
for i, word in zip(range(0, 20+1), simple):
    numberWords[word] = i

# thirty..hundred map to 30, 40, ..., 100
tens = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
for i, word in zip(range(30, 100+1, 10), tens):
    numberWords[word] = i

# thousand..septillion map to 10**3, 10**6, ..., 10**24
# (the Go map above keeps only the entries up to quadrillion)
larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
for i, word in zip(range(3, 24+1, 3), larges):
    numberWords[word] = 10**i
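
For readers who prefer to stay in Go, the same table can be rebuilt with a few loops. This is a minimal sketch, not code from the package: buildNumberWords is a hypothetical helper, and it stops at "quadrillion" to match the map shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildNumberWords is a hypothetical helper (not part of the package) that
// rebuilds the NumberWords table up to "quadrillion".
func buildNumberWords() map[string]int {
	m := make(map[string]int)

	// zero..twenty map to 0..20.
	simple := strings.Fields(`zero one two three four five six seven eight nine ten
		eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty`)
	for i, w := range simple {
		m[w] = i
	}

	// thirty..hundred map to 30, 40, ..., 100.
	tens := strings.Fields(`thirty forty fifty sixty seventy eighty ninety hundred`)
	for i, w := range tens {
		m[w] = 30 + 10*i
	}

	// thousand..quadrillion map to successive powers of 1000.
	// Note: 10^15 only fits when int is 64 bits wide.
	v := 1
	for _, w := range strings.Fields(`thousand million billion trillion quadrillion`) {
		v *= 1000
		m[w] = v
	}
	return m
}

func main() {
	m := buildNumberWords()
	fmt.Println(len(m), m["twenty"], m["ninety"], m["hundred"], m["quadrillion"])
}
```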

Functions

func ToDict

func ToDict(c *Corpus) map[string]int

ToDict returns a marshalable map from each word to its ID. The result is a copy of the corpus's ID mapping.

func ToDictWithFreq

func ToDictWithFreq(c *Corpus) map[string]struct{ ID, Freq int }

ToDictWithFreq returns a simple marshalable type. Conceptually it is a JSON object keyed by word, where each value is a pair of ID and Freq.
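
To make the "JSON object keyed by word" description concrete, the sketch below marshals a hand-written value of the same type with encoding/json. The map contents are made up for illustration and marshalDict is a hypothetical helper; nothing here calls the package itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalDict encodes a value shaped like the result of ToDictWithFreq.
// It is a hypothetical helper, not part of the package.
func marshalDict(d map[string]struct{ ID, Freq int }) (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Hand-written stand-in for what ToDictWithFreq might return.
	d := map[string]struct{ ID, Freq int }{
		"hello": {ID: 0, Freq: 2},
		"world": {ID: 1, Freq: 1},
	}
	s, err := marshalDict(d)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Because ID and Freq are exported field names, encoding/json includes both in each value, and it writes map keys in sorted order.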

func ViterbiSplit

func ViterbiSplit(input string, c *Corpus) []string

ViterbiSplit applies the Viterbi algorithm to split the input into words, using the given corpus.
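
The general technique can be sketched without the package. The code below is an illustrative Viterbi-style segmenter over a plain frequency map, not the package's implementation: viterbiSplit, the frequency table, and the unknownLogProb penalty for characters outside the vocabulary are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"math"
)

// unknownLogProb is an assumed log-probability for a single character that
// is not in the vocabulary. It only needs to be much lower than any real word.
const unknownLogProb = -25.0

// viterbiSplit returns the most probable segmentation of input, scoring each
// candidate word by log(freq/total). Hypothetical helper for illustration.
func viterbiSplit(input string, freq map[string]int) []string {
	total, maxLen := 0, 1
	for w, f := range freq {
		total += f
		if n := len([]rune(w)); n > maxLen {
			maxLen = n
		}
	}

	rs := []rune(input)
	n := len(rs)
	best := make([]float64, n+1) // best[i]: best log-probability of any split of rs[:i]
	back := make([]int, n+1)     // back[i]: start index of the last word in that split

	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		start := i - maxLen
		if start < 0 {
			start = 0
		}
		for j := start; j < i; j++ {
			var lp float64
			if f, ok := freq[string(rs[j:i])]; ok && f > 0 {
				lp = math.Log(float64(f) / float64(total))
			} else if i-j == 1 {
				lp = unknownLogProb // unknown single character
			} else {
				continue // unknown multi-character span: not a candidate
			}
			if s := best[j] + lp; s > best[i] {
				best[i], back[i] = s, j
			}
		}
	}

	// Walk the back-pointers from the end, then reverse.
	var words []string
	for i := n; i > 0; i = back[i] {
		words = append(words, string(rs[back[i]:i]))
	}
	for l, r := 0, len(words)-1; l < r; l, r = l+1, r-1 {
		words[l], words[r] = words[r], words[l]
	}
	return words
}

func main() {
	freq := map[string]int{"the": 50, "cat": 10, "cats": 2, "sat": 10, "at": 20, "a": 30}
	fmt.Println(viterbiSplit("thecatsat", freq))
}
```

With this table, "the cat sat" scores higher than "the cats at" because "cats" is much rarer than "cat", which is the kind of trade-off the dynamic program resolves.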

Types

type ConsOpt

type ConsOpt func(c *Corpus) error

ConsOpt is a construction option for manual creation of a Corpus

func FromDict

func FromDict(d map[string]int) ConsOpt

FromDict is a construction option that takes a map[string]int in which each int is the word's ID. It is useful for constructing corpuses from foreign sources where the ID mappings are important.

func FromDictWithFreq

func FromDictWithFreq(d map[string]struct{ ID, Freq int }) ConsOpt

FromDictWithFreq is like FromDict, but each entry also carries a frequency.

func WithOrderedWords

func WithOrderedWords(a []string) ConsOpt

WithOrderedWords creates a Corpus with the given word order

func WithSize

func WithSize(size int) ConsOpt

WithSize preallocates the Corpus's internal storage for the given size.

func WithWords

func WithWords(a []string) ConsOpt

WithWords creates a corpus from a word list. The list may contain repeated words.

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary that maps words to IDs for lookup. This is useful because neural networks operate on the IDs rather than on the text itself.

func Construct

func Construct(opts ...ConsOpt) (*Corpus, error)

Construct creates a Corpus from the given construction options, which allows for a more flexible setup.
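
ConsOpt and Construct follow the functional-options pattern. The sketch below shows the shape of that pattern on a stand-in type; toyVocab, option, withWords, and construct are hypothetical names, and the real options may behave differently (for example in how they treat repeated or empty words).

```go
package main

import (
	"errors"
	"fmt"
)

// toyVocab is a hypothetical stand-in for Corpus.
type toyVocab struct {
	words []string
	ids   map[string]int
}

// option mirrors the shape of ConsOpt: a function that configures the value
// under construction and may fail.
type option func(v *toyVocab) error

// withWords is a hypothetical option that registers each distinct word once.
func withWords(ws []string) option {
	return func(v *toyVocab) error {
		for _, w := range ws {
			if w == "" {
				return errors.New("empty word") // assumed rule, for illustration
			}
			if _, ok := v.ids[w]; ok {
				continue // repeated words keep their first ID
			}
			v.ids[w] = len(v.words)
			v.words = append(v.words, w)
		}
		return nil
	}
}

// construct applies the options in order and stops at the first error.
func construct(opts ...option) (*toyVocab, error) {
	v := &toyVocab{ids: make(map[string]int)}
	for _, opt := range opts {
		if err := opt(v); err != nil {
			return nil, err
		}
	}
	return v, nil
}

func main() {
	v, err := construct(withWords([]string{"a", "b", "a"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(v.words, v.ids["b"])
}
```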

func FromTextCorpus

func FromTextCorpus(r io.Reader, tokenizer func(a string) []string, normalizer func(a string) string) (*Corpus, error)

FromTextCorpus is a utility function that reads text from r and returns a Corpus, using the given tokenizer and normalizer.

func New

func New() *Corpus

New creates a new *Corpus

func (*Corpus) Add

func (c *Corpus) Add(word string) int

Add adds a word to the corpus and returns its ID. If the word is already in the corpus, Add only updates its frequency count and returns the existing ID.
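
The behavior described for Add can be modeled with a slice and a map. This is a toy model under the assumption that IDs are assigned sequentially from 0; toyCorpus and its add method are hypothetical and say nothing about how the package stores its data.

```go
package main

import "fmt"

// toyCorpus is a hypothetical model of a vocabulary with frequency counts.
type toyCorpus struct {
	words []string       // ID -> word
	freqs []int          // ID -> frequency
	ids   map[string]int // word -> ID
}

func newToyCorpus() *toyCorpus {
	return &toyCorpus{ids: make(map[string]int)}
}

// add returns the ID of word. A new word gets the next free ID; a known
// word only has its frequency incremented.
func (c *toyCorpus) add(word string) int {
	if id, ok := c.ids[word]; ok {
		c.freqs[id]++
		return id
	}
	id := len(c.words)
	c.words = append(c.words, word)
	c.freqs = append(c.freqs, 1)
	c.ids[word] = id
	return id
}

func main() {
	c := newToyCorpus()
	fmt.Println(c.add("cat"), c.add("dog"), c.add("cat"))
	fmt.Println(c.freqs)
}
```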

func (*Corpus) GobDecode

func (c *Corpus) GobDecode(buf []byte) error

GobDecode implements GobDecoder for *Corpus

func (*Corpus) GobEncode

func (c *Corpus) GobEncode() ([]byte, error)

GobEncode implements GobEncoder for *Corpus
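
A type whose fields are all unexported, as Corpus's are, cannot be encoded by encoding/gob unless it implements GobEncoder and GobDecoder itself. The sketch below shows that pattern on a stand-in type; toyCorpus, wire, and roundTrip are hypothetical, and the package's actual encoding is not shown in this documentation.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// toyCorpus is a hypothetical type with only unexported fields.
type toyCorpus struct {
	words []string
	freqs []int
}

// wire is the exported shape actually written to the gob stream.
type wire struct {
	Words []string
	Freqs []int
}

// GobEncode implements gob.GobEncoder.
func (c *toyCorpus) GobEncode() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(wire{Words: c.words, Freqs: c.freqs}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// GobDecode implements gob.GobDecoder.
func (c *toyCorpus) GobDecode(b []byte) error {
	var w wire
	if err := gob.NewDecoder(bytes.NewReader(b)).Decode(&w); err != nil {
		return err
	}
	c.words, c.freqs = w.Words, w.Freqs
	return nil
}

// roundTrip encodes c with gob and decodes it into a fresh value.
func roundTrip(c *toyCorpus) (*toyCorpus, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(c); err != nil {
		return nil, err
	}
	out := new(toyCorpus)
	if err := gob.NewDecoder(&buf).Decode(out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	out, err := roundTrip(&toyCorpus{words: []string{"cat", "dog"}, freqs: []int{2, 1}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.words, out.freqs)
}
```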

func (*Corpus) IDFreq

func (c *Corpus) IDFreq(id int) int

IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.

func (*Corpus) Id

func (c *Corpus) Id(word string) (int, bool)

Id returns the ID of a word and whether or not it was found in the corpus.

func (*Corpus) LoadOneGram

func (c *Corpus) LoadOneGram(r io.Reader) error

LoadOneGram loads a 1_gram.txt file, a tab-separated file that lists the frequency count of each word. Example:

the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709

func (*Corpus) MaxWordLength

func (c *Corpus) MaxWordLength() int

MaxWordLength returns the length of the longest known word in the corpus.

func (*Corpus) Merge

func (c *Corpus) Merge(other *Corpus)

Merge combines two corpuses. The receiver is the one that is mutated.

func (*Corpus) Replace

func (c *Corpus) Replace(a, with string) error

Replace replaces the content of a word. The old reference remains.

For example, after c.Replace("foo", "bar"), c.Id("foo") still returns an ID, and it is the same ID that c.Id("bar") returns.
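
One way to picture "the old reference remains" is a word-to-ID map in which the old key is never deleted. The toy model below is only a guess at the semantics described above: toyCorpus and replace are hypothetical, and details such as the error for a missing word or what happens when the new word already exists are assumptions.

```go
package main

import "fmt"

// toyCorpus is a hypothetical model: a slice for ID -> word and a map for
// word -> ID.
type toyCorpus struct {
	words []string
	ids   map[string]int
}

// replace points the ID of a at the new word. The old key is deliberately
// left in the map, so both words resolve to the same ID afterwards.
func (c *toyCorpus) replace(a, with string) error {
	id, ok := c.ids[a]
	if !ok {
		return fmt.Errorf("word %q not found", a)
	}
	c.words[id] = with
	c.ids[with] = id
	return nil
}

func main() {
	c := &toyCorpus{words: []string{"foo"}, ids: map[string]int{"foo": 0}}
	if err := c.replace("foo", "bar"); err != nil {
		panic(err)
	}
	fmt.Println(c.ids["foo"], c.ids["bar"], c.words[0])
}
```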

func (*Corpus) ReplaceWord

func (c *Corpus) ReplaceWord(id int, with string) error

ReplaceWord replaces the word associated with the given ID. The old reference remains.

func (*Corpus) Size

func (c *Corpus) Size() int

Size returns the size of the corpus.

func (*Corpus) TotalFreq

func (c *Corpus) TotalFreq() int

TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.

func (*Corpus) Word

func (c *Corpus) Word(id int) (string, bool)

Word returns the word given the ID, and whether or not it was found in the corpus

func (*Corpus) WordFreq

func (c *Corpus) WordFreq(word string) int

WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.

func (*Corpus) WordProb

func (c *Corpus) WordProb(word string) (float64, bool)

WordProb returns the probability of a word appearing in the corpus.
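
A natural reading is that the probability is the word's frequency divided by the total frequency, that is, WordFreq over TotalFreq. That formula is an assumption, since the documentation does not state it. The sketch below computes it over a plain map, with wordProb as a hypothetical helper.

```go
package main

import "fmt"

// wordProb returns freq(word) / sum of all frequencies, and false when the
// word is unknown or the table is empty. Hypothetical helper.
func wordProb(freqs map[string]int, word string) (float64, bool) {
	f, ok := freqs[word]
	if !ok {
		return 0, false
	}
	total := 0
	for _, n := range freqs {
		total += n
	}
	if total == 0 {
		return 0, false
	}
	return float64(f) / float64(total), true
}

func main() {
	freqs := map[string]int{"a": 3, "b": 1}
	p, ok := wordProb(freqs, "a")
	fmt.Println(p, ok)
}
```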
