Documentation ¶
Constants ¶
This section is empty.
Variables ¶
var DefaultWordRegexp = regexp2.MustCompile(`\w+|[^\w\s]+`, regexp2.IgnoreCase|regexp2.Multiline)
(readonly)
Functions ¶
This section is empty.
Types ¶
type WhiteSpacePreTokenizer ¶
type WhiteSpacePreTokenizer struct {
// contains filtered or unexported fields
}
WhiteSpacePreTokenizer generates pre-tokens from distinct groups of Unicode letters (words) and groups of non-letter characters (such as punctuation marks or other symbols). Whitespace-like characters are always treated as explicit token separators and never appear in a pre-token.
func New ¶
func New(r *regexp2.Regexp) *WhiteSpacePreTokenizer
New returns a new WhiteSpacePreTokenizer.
func NewDefault ¶
func NewDefault() *WhiteSpacePreTokenizer
func (*WhiteSpacePreTokenizer) PreTokenize ¶
func (w *WhiteSpacePreTokenizer) PreTokenize(pts *pretokenizedstring.PreTokenizedString) error
PreTokenize splits the NormalizedString into word and non-word groups separated by whitespace-like characters.