tokenizer

package module
Published: Dec 30, 2014 License: MIT Imports: 5 Imported by: 1

README

Overview

Implementation of various natural language tokenizers in Go.

Tokenizers:

  • TreebankWordTokenizer
  • BagOfWordsTokenizer

Documentation: http://godoc.org/github.com/srom/tokenizer

License

MIT License.

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BagOfWordsTokenizer

type BagOfWordsTokenizer struct {
	// contains filtered or unexported fields
}

BagOfWordsTokenizer outputs a list of words suitable for building a bag-of-words matrix.

It uses a Treebank tokenizer internally but removes punctuation and stop words.

Example
tokenizer := NewBagOfWordsTokenizer("fixtures/stop_words.txt")

tokens := tokenizer.Tokenize(
	// Example string from http://nlp.stanford.edu/software/tokenizer.shtml
	`"Oh, no," she's saying, "our $400 blender can't handle something this hard!"`)

fmt.Println(tokens)
Output:

[oh saying blender handle something hard]

func NewBagOfWordsTokenizer

func NewBagOfWordsTokenizer(pathToStopWords string) *BagOfWordsTokenizer

func (*BagOfWordsTokenizer) Tokenize

func (t *BagOfWordsTokenizer) Tokenize(text string) []string

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) []string
}
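
Both tokenizers in this package satisfy this interface. As a minimal, self-contained sketch of how the interface can be implemented and used (WhitespaceTokenizer is a hypothetical example type, not part of this package):

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer mirrors the package's interface: any tokenizer
// returns a slice of string tokens for the input text.
type Tokenizer interface {
	Tokenize(text string) []string
}

// WhitespaceTokenizer is a hypothetical, minimal implementation
// used only to illustrate the interface.
type WhitespaceTokenizer struct{}

// Tokenize splits the text on runs of whitespace.
func (WhitespaceTokenizer) Tokenize(text string) []string {
	return strings.Fields(text)
}

func main() {
	// Code written against the interface works with any tokenizer.
	var t Tokenizer = WhitespaceTokenizer{}
	fmt.Println(t.Tokenize("hello tokenizer world"))
	// Prints: [hello tokenizer world]
}
```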

type TreebankWordTokenizer

type TreebankWordTokenizer struct {
	// contains filtered or unexported fields
}

TreebankWordTokenizer is an English specific tokenizer which uses regular expressions to tokenize text as in Penn Treebank.

Ported from NLTK's implementation: http://www.nltk.org/_modules/nltk/tokenize/treebank.html

Regexp initialization happens at the first call of Tokenize(). You can initialize in advance by creating the tokenizer via the NewTreebankWordTokenizer function.

Example
tokenizer := NewTreebankWordTokenizer()

tokens := tokenizer.Tokenize(
	// Example string from http://nlp.stanford.edu/software/tokenizer.shtml
	`"Oh, no," she's saying, "our $400 blender can't handle something this hard!"`)

fmt.Println(tokens)
Output:

[`` Oh , no , '' she 's saying , `` our $ 400 blender ca n't handle something this hard ! '']

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() *TreebankWordTokenizer

func (*TreebankWordTokenizer) Tokenize

func (t *TreebankWordTokenizer) Tokenize(text string) []string
