tokenize

package
v1.2.1

Published: Dec 22, 2020 License: MIT Imports: 7 Imported by: 33

Documentation

Overview

Package tokenize implements functions to split strings into slices of substrings.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func TextToWords

func TextToWords(text string) []string

TextToWords converts the string text into a slice of words.

It does so by tokenizing text into sentences (using a port of NLTK's punkt tokenizer; see https://github.com/neurosnap/sentences) and then tokenizing the sentences into words via TreebankWordTokenizer.
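
A minimal usage sketch; the output comment is illustrative, derived from the Treebank conventions documented below rather than verified:

words := TextToWords("They'll save and invest more. Thanks!")
fmt.Println(words)
// Likely output: [They 'll save and invest more . Thanks !]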

Types

type PragmaticSegmenter

type PragmaticSegmenter struct {
	// contains filtered or unexported fields
}

PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.

This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).

func NewPragmaticSegmenter

func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)

NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.

Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)
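
A sketch of the constructor's error handling; the input text is illustrative:

seg, err := NewPragmaticSegmenter("en")
if err != nil {
	log.Fatal(err) // returned for unsupported language codes
}
fmt.Println(seg.Tokenize("Hello, Dr. Smith. How are you?"))
// A rule-based segmenter should treat "Dr." as an abbreviation rather
// than a sentence boundary, yielding two sentences here.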

func (*PragmaticSegmenter) Tokenize

func (p *PragmaticSegmenter) Tokenize(text string) []string

Tokenize splits text into sentences.

type ProseTokenizer

type ProseTokenizer interface {
	Tokenize(text string) []string
}

ProseTokenizer is the interface implemented by an object that takes a string and returns a slice of substrings.
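
Because every tokenizer in this package implements this interface, callers can accept a ProseTokenizer and swap implementations freely. The helper below, countTokens, is hypothetical and not part of the package:

func countTokens(t ProseTokenizer, text string) int {
	return len(t.Tokenize(text))
}

n := countTokens(NewWordPunctTokenizer(), "They'll save and invest more.")
fmt.Println(n) // 8, per the WordPunct example below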

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
	// contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.
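
A usage sketch; the output comment assumes the tokenizer splits at the sentence-final period:

t := NewPunktSentenceTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more. Thanks!"))
// Expected: [They'll save and invest more. Thanks!] (two sentences)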

func (PunktSentenceTokenizer) Tokenize

func (p PunktSentenceTokenizer) Tokenize(text string) []string

Tokenize splits text into sentences.

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

Example
t := NewBlanklineTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more.\n\nThanks!"))
Output:

[They'll save and invest more. Thanks!]

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern on which to base the tokenizer, a boolean value indicating whether the pattern matches the separators between tokens rather than the tokens themselves, and a boolean value indicating whether to discard empty tokens.
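
A sketch of the two flags, assuming they mirror NLTK's RegexpTokenizer semantics (the pattern matches separators when gaps is true, and empty tokens are dropped when discard is true):

// Split on commas plus any trailing whitespace.
t := NewRegexpTokenizer(`,\s*`, true, true)
fmt.Println(t.Tokenize("one, two, three"))
// Expected: [one two three]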

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

Example
t := NewWordBoundaryTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))
Output:

[They'll save and invest more]

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.

Example
t := NewWordPunctTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))
Output:

[They ' ll save and invest more .]

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type TreebankWordTokenizer

type TreebankWordTokenizer struct {
}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

Example
t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))
Output:

[They 'll save and invest more .]

func (TreebankWordTokenizer) Tokenize

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
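
A sketch illustrating steps (1) and (2) on a single sentence; the output comment follows the rules just listed:

t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("Don't split sentences, only words."))
// Expected: [Do n't split sentences , only words .]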
