whitespacepretokenizer

package
v0.2.0
This package is not in the latest version of its module.
Published: Dec 12, 2020; License: BSD-2-Clause; Imports: 5; Imported by: 0

Documentation

Constants

This section is empty.

Variables

var DefaultWordRegexp = regexp2.MustCompile(`\w+|[^\w\s]+`, regexp2.IgnoreCase|regexp2.Multiline)

(readonly)
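
The pattern has two alternatives: `\w+` matches a run of word characters, and `[^\w\s]+` matches a run of characters that are neither word characters nor whitespace. The following is a minimal sketch of that behavior, assuming the standard library `regexp` package as a dependency-free stand-in for `regexp2`. The name `matchGroups` is hypothetical, and the standard library's `\w` and `\s` are ASCII-only, so results on non-ASCII input may differ from the package's.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordOrSymbol reuses the same pattern text with the standard library's
// regexp package (an assumption for this sketch; the package itself uses
// regexp2). Here \w and \s only cover ASCII characters.
var wordOrSymbol = regexp.MustCompile(`\w+|[^\w\s]+`)

// matchGroups (hypothetical helper) returns every word run or symbol run
// found in s, in order. Whitespace never appears in the result.
func matchGroups(s string) []string {
	return wordOrSymbol.FindAllString(s, -1)
}

func main() {
	fmt.Printf("%q\n", matchGroups("Hello, world!!  ok_1"))
	// prints ["Hello" "," "world" "!!" "ok_1"]
}
```

Digits and underscores count as word characters, which is why `ok_1` stays in one piece, while the consecutive `!!` forms a single symbol group.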

Functions

This section is empty.

Types

type WhiteSpacePreTokenizer

type WhiteSpacePreTokenizer struct {
	// contains filtered or unexported fields
}

WhiteSpacePreTokenizer generates pre-tokens from distinct groups of Unicode letters (words) and non-letter characters (such as punctuation marks or other symbols). Whitespace-like characters are always treated as explicit token separators.

func New

New returns a new WhiteSpacePreTokenizer.

func NewDefault

func NewDefault() *WhiteSpacePreTokenizer

func (*WhiteSpacePreTokenizer) PreTokenize

PreTokenize splits the NormalizedString into word and non-word groups separated by whitespace-like characters.
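
To illustrate the kind of split described above, here is a rough sketch that records each group together with its byte offsets in the input. It does not call this package: the `preToken` type and the `splitWithOffsets` function are hypothetical names, the standard library `regexp` package (ASCII-only `\w` and `\s`) stands in for `regexp2`, and a plain `string` stands in for the NormalizedString. Whether the package's own pre-tokens carry offsets is not stated on this page.

```go
package main

import (
	"fmt"
	"regexp"
)

// preToken is a hypothetical type for this sketch: a piece of text plus
// its half-open byte range [Start, End) in the original input.
type preToken struct {
	Text       string
	Start, End int
}

// groupPattern matches either a run of word characters or a run of
// non-word, non-whitespace characters (standard library stand-in).
var groupPattern = regexp.MustCompile(`\w+|[^\w\s]+`)

// splitWithOffsets (hypothetical) splits s into word and non-word groups.
// Whitespace is skipped and only acts as a separator.
func splitWithOffsets(s string) []preToken {
	spans := groupPattern.FindAllStringIndex(s, -1)
	tokens := make([]preToken, 0, len(spans))
	for _, sp := range spans {
		tokens = append(tokens, preToken{Text: s[sp[0]:sp[1]], Start: sp[0], End: sp[1]})
	}
	return tokens
}

func main() {
	for _, t := range splitWithOffsets("don't stop") {
		fmt.Printf("%q [%d,%d)\n", t.Text, t.Start, t.End)
	}
	// prints:
	// "don" [0,3)
	// "'" [3,4)
	// "t" [4,5)
	// "stop" [6,10)
}
```

Note that the apostrophe becomes its own group, so a contraction such as `don't` is split into three pieces under this pattern.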
