
package tokenize

import "github.com/jdkato/prose/tokenize"

Package tokenize implements functions to split strings into slices of substrings.

Package Files

pragmatic.go punkt.go regexp.go tokenize.go treebank.go

func TextToWords

func TextToWords(text string) []string

TextToWords converts the string text into a slice of words.

It does so by tokenizing text into sentences (using a port of NLTK's punkt tokenizer; see https://github.com/neurosnap/sentences) and then tokenizing the sentences into words via TreebankWordTokenizer.
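For example, applying TextToWords to the sentence used in the TreebankWordTokenizer example below produces the same word-level split:

Code:

fmt.Println(TextToWords("They'll save and invest more."))

Output:

[They 'll save and invest more .]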

type PragmaticSegmenter

type PragmaticSegmenter struct {
    // contains filtered or unexported fields
}

PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.

This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).

func NewPragmaticSegmenter

func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)

NewPragmaticSegmenter creates a new PragmaticSegmenter for the specified language. If the given language is not supported, an error is returned.

Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)

func (*PragmaticSegmenter) Tokenize

func (p *PragmaticSegmenter) Tokenize(text string) []string

Tokenize splits text into sentences.
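A minimal usage sketch; the sentence boundaries are illustrative, and exact whitespace handling in the returned slice may differ:

Code:

seg, err := NewPragmaticSegmenter("en")
if err != nil {
    log.Fatal(err)
}
fmt.Println(seg.Tokenize("Hello world. My name is Jonas."))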

type ProseTokenizer

type ProseTokenizer interface {
    Tokenize(text string) []string
}

ProseTokenizer is the interface implemented by an object that takes a string and returns a slice of substrings.
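Because every tokenizer in this package satisfies ProseTokenizer, they can be used interchangeably behind the interface. For example, using the TreebankWordTokenizer (output as shown in its example below):

Code:

var t ProseTokenizer = NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They 'll save and invest more .]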

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
    // contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Tokenize

func (p PunktSentenceTokenizer) Tokenize(text string) []string

Tokenize splits text into sentences.
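A usage sketch; Punkt's trained model should prevent the abbreviation "Dr." from being treated as a sentence boundary, though the exact formatting of the returned sentences is not shown here:

Code:

t := NewPunktSentenceTokenizer()
fmt.Println(t.Tokenize("Dr. Smith arrived. He was late."))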

type RegexpTokenizer

type RegexpTokenizer struct {
    // contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

Code:

t := NewBlanklineTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more.\n\nThanks!"))

Output:

[They'll save and invest more. Thanks!]

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern on which to base the tokenizer, a boolean value indicating whether the pattern matches the separators between tokens rather than the tokens themselves, and a boolean value indicating whether to discard empty tokens.
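As an illustration, assuming gaps=true treats the pattern as a separator (mirroring NLTK's RegexpTokenizer, on which this design appears to be based), a simple whitespace-splitting tokenizer could look like this; confirm the flag semantics against the source before relying on them:

Code:

t := NewRegexpTokenizer(`\s+`, true, true)
fmt.Println(t.Tokenize("They'll save and invest more."))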

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

Code:

t := NewWordBoundaryTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They'll save and invest more]

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic tokens.

Code:

t := NewWordPunctTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They ' ll save and invest more .]

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type TreebankWordTokenizer

type TreebankWordTokenizer struct{}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

Code:

t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))

Output:

[They 'll save and invest more .]

func (TreebankWordTokenizer) Tokenize

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
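Step (1), for instance, separates a contraction exactly as described above:

Code:

t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("don't"))

Output:

[do n't]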

Package tokenize imports 7 packages and is imported by 10 packages. Updated 2018-03-02.