Documentation ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BagOfWordsTokenizer ¶
type BagOfWordsTokenizer struct {
// contains filtered or unexported fields
}
BagOfWordsTokenizer outputs a list of words suitable for building a bag-of-words matrix.
It uses a Treebank tokenizer internally but removes punctuation and stop words.
Example ¶
tokenizer := NewBagOfWordsTokenizer("fixtures/stop_words.txt")
tokens := tokenizer.Tokenize(
	// Example string from http://nlp.stanford.edu/software/tokenizer.shtml
	`"Oh, no," she's saying, "our $400 blender can't handle something this hard!"`)
fmt.Println(tokens)
Output: [oh saying blender handle something hard]
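The type comment says the tokens are meant for building a bag-of-words matrix. As a minimal sketch in the same in-package example style, the returned slice can be folded into one row of term counts; the count map is an illustration, not part of this package.

tokenizer := NewBagOfWordsTokenizer("fixtures/stop_words.txt")
tokens := tokenizer.Tokenize(
	`"Oh, no," she's saying, "our $400 blender can't handle something this hard!"`)

// Fold the token slice into term counts, i.e. one row of a bag-of-words matrix.
counts := make(map[string]int)
for _, tok := range tokens {
	counts[tok]++
}
fmt.Println(counts["blender"]) // 1, per the example output above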
func NewBagOfWordsTokenizer ¶
func NewBagOfWordsTokenizer(pathToStopWords string) *BagOfWordsTokenizer
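This page does not document the stop-word file format; the sketch below assumes a plain-text list with one lowercase word per line, which is a common convention but is not confirmed here. The temporary file is purely illustrative.

// Hypothetical stop-word list; one word per line is an assumption,
// not a format documented on this page.
path := filepath.Join(os.TempDir(), "stop_words.txt")
if err := os.WriteFile(path, []byte("a\nan\nthe\nis\nour\n"), 0o644); err != nil {
	log.Fatal(err)
}
tokenizer := NewBagOfWordsTokenizer(path)
fmt.Println(tokenizer.Tokenize("The blender is loud"))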
func (*BagOfWordsTokenizer) Tokenize ¶
func (t *BagOfWordsTokenizer) Tokenize(text string) []string
type TreebankWordTokenizer ¶
type TreebankWordTokenizer struct {
// contains filtered or unexported fields
}
TreebankWordTokenizer is an English-specific tokenizer which uses regular expressions to tokenize text as in the Penn Treebank.
Ported from NLTK's implementation: http://www.nltk.org/_modules/nltk/tokenize/treebank.html
Regexp initialization happens on the first call to Tokenize(). You can initialize in advance by creating the tokenizer with NewTreebankWordTokenizer.
Example ¶
tokenizer := NewTreebankWordTokenizer()
tokens := tokenizer.Tokenize(
	// Example string from http://nlp.stanford.edu/software/tokenizer.shtml
	`"Oh, no," she's saying, "our $400 blender can't handle something this hard!"`)
fmt.Println(tokens)
Output: [`` Oh , no , '' she 's saying , `` our $ 400 blender ca n't handle something this hard ! '']
func NewTreebankWordTokenizer ¶
func NewTreebankWordTokenizer() *TreebankWordTokenizer
func (*TreebankWordTokenizer) Tokenize ¶
func (t *TreebankWordTokenizer) Tokenize(text string) []string
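The lazy-initialization note above suggests two construction paths. A minimal sketch of both, assuming the zero value is usable (an inference from that note, not stated explicitly on this page):

// Zero value: the regexps are compiled on the first Tokenize call
// (assumed from the lazy-initialization note above).
var lazy TreebankWordTokenizer
_ = lazy.Tokenize("It's a test.")

// Constructor: initialization happens up front, so the first real
// Tokenize call pays no one-time regexp-compilation cost.
eager := NewTreebankWordTokenizer()
fmt.Println(eager.Tokenize("It's a test."))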