
package tokenize

import "github.com/jdkato/prose/tokenize"

Package tokenize implements functions to split strings into slices of substrings.

Package Files

pragmatic.go punkt.go regexp.go tokenize.go treebank.go

func TextToWords

func TextToWords(text string) []string

TextToWords converts the string text into a slice of words.

It does so by tokenizing text into sentences (using a port of NLTK's punkt tokenizer; see https://github.com/neurosnap/sentences) and then tokenizing the sentences into words via TreebankWordTokenizer.
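For example, applying TextToWords to the sentence used in the TreebankWordTokenizer example below produces the same word-level split:

Code:

fmt.Println(TextToWords("They'll save and invest more."))

Output:

[They 'll save and invest more .]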

type PragmaticSegmenter

type PragmaticSegmenter struct {
    // contains filtered or unexported fields
}

PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.

This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).

func NewPragmaticSegmenter

func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)

NewPragmaticSegmenter creates a new PragmaticSegmenter for the specified language. If the given language is not supported, an error is returned.

Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)

func (*PragmaticSegmenter) Tokenize

func (p *PragmaticSegmenter) Tokenize(text string) []string

Tokenize splits text into sentences.
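A minimal usage sketch; the sentence boundaries are illustrative, and exact whitespace handling in the returned slice may differ:

Code:

seg, err := NewPragmaticSegmenter("en")
if err != nil {
    log.Fatal(err)
}
fmt.Println(seg.Tokenize("Hello world. My name is Jonas."))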

type ProseTokenizer

type ProseTokenizer interface {
    Tokenize(text string) []string
}

ProseTokenizer is the interface implemented by an object that takes a string and returns a slice of substrings.
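Because every tokenizer in this package satisfies ProseTokenizer, they can be used interchangeably behind the interface. For example, using the TreebankWordTokenizer (output as shown in its example below):

Code:

var t ProseTokenizer = NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They 'll save and invest more .]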

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
    // contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Tokenize

func (p PunktSentenceTokenizer) Tokenize(text string) []string

Tokenize splits text into sentences.
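A usage sketch; Punkt's trained model should prevent the abbreviation "Dr." from being treated as a sentence boundary, though the exact formatting of the returned sentences is not shown here:

Code:

t := NewPunktSentenceTokenizer()
fmt.Println(t.Tokenize("Dr. Smith arrived. He was late."))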

type RegexpTokenizer

type RegexpTokenizer struct {
    // contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

Code:

t := NewBlanklineTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more.\n\nThanks!"))

Output:

[They'll save and invest more. Thanks!]

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern on which to base the tokenizer, a boolean value indicating whether the pattern matches the separators between tokens rather than the tokens themselves, and a boolean value indicating whether to discard empty tokens.
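As an illustration, assuming gaps=true treats the pattern as a separator (mirroring NLTK's RegexpTokenizer, on which this design appears to be based), a simple whitespace-splitting tokenizer could look like this; confirm the flag semantics against the source before relying on them:

Code:

t := NewRegexpTokenizer(`\s+`, true, true)
fmt.Println(t.Tokenize("They'll save and invest more."))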

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

Code:

t := NewWordBoundaryTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They'll save and invest more]

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic tokens.

Code:

t := NewWordPunctTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They ' ll save and invest more .]

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type TreebankWordTokenizer

type TreebankWordTokenizer struct{}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

Code:

t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They 'll save and invest more .]

func (TreebankWordTokenizer) Tokenize

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
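Step (1), for instance, separates a contraction exactly as described above:

Code:

t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("don't"))

Output:

[do n't]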

Package tokenize imports 7 packages and is imported by 10 packages. Updated 2018-03-02.