bytelevelpretokenizer

package
v0.2.0
Published: Dec 12, 2020 License: BSD-2-Clause Imports: 6 Imported by: 1

Documentation

Index

Constants

This section is empty.

Variables

var DefaultSplittingRegexp = regexp2.MustCompile(
	`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
	regexp2.IgnoreCase|regexp2.Multiline)

DefaultSplittingRegexp is a simple default regular expression that can be used with ByteLevelPreTokenizer.

This MUST be treated as a read-only constant value.
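For illustration, the sketch below iterates the pre-token candidates that DefaultSplittingRegexp produces for a short sentence. It assumes the regexp2 package referenced above is github.com/dlclark/regexp2, whose syntax supports the negative lookahead used in the pattern.

package main

import (
	"fmt"

	"github.com/dlclark/regexp2" // assumed to be the regexp2 package referenced above
)

func main() {
	// Same pattern and options as DefaultSplittingRegexp.
	re := regexp2.MustCompile(
		`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
		regexp2.IgnoreCase|regexp2.Multiline)

	// Each match corresponds to one pre-token candidate.
	m, _ := re.FindStringMatch("Hello, world! It's 2020.")
	for m != nil {
		fmt.Printf("%q\n", m.String())
		m, _ = re.FindNextMatch(m)
	}
	// Prints: "Hello" "," " world" "!" " It" "'s" " 2020" "."
}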

Functions

This section is empty.

Types

type ByteLevelPreTokenizer

type ByteLevelPreTokenizer struct {
	// contains filtered or unexported fields
}

ByteLevelPreTokenizer allows the generation of pre-tokens suitable for use with BPE (Byte Pair Encoding) models.

This pre-tokenizer splits the input string into tokens according to the given regular expression. It also transforms the string, expanding each UTF-8 character (rune) into its bytes, which are then re-mapped to new custom runes.

A whitespace prefix (' ') can optionally be prepended to the input string, unless the first rune of the string is already a Unicode whitespace character.

Offsets trimming can be enabled to exclude whitespaces from the token offsets during the post-processing step.
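The byte-to-rune remapping itself is unexported, but the general idea, as popularized by GPT-2's byte-level BPE, can be sketched as follows. This is an illustrative reimplementation of the technique, not this package's internal table, and the exact mapping may differ.

package main

import "fmt"

// byteToRune builds a GPT-2-style remapping table: bytes whose code points are
// already printable keep their value, while the remaining bytes are shifted to
// unused code points starting at 256, so every byte becomes a distinct, visible rune.
// Illustrative sketch only; this package's actual table is unexported and may differ.
func byteToRune() [256]rune {
	printable := func(b int) bool {
		return (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF)
	}
	var table [256]rune
	next := 256
	for b := 0; b < 256; b++ {
		if printable(b) {
			table[b] = rune(b)
		} else {
			table[b] = rune(next)
			next++
		}
	}
	return table
}

func main() {
	table := byteToRune()
	var out []rune
	// Expand the string into its UTF-8 bytes and remap each byte to its rune.
	for _, b := range []byte(" héllo") {
		out = append(out, table[b])
	}
	fmt.Println(string(out)) // "ĠhÃ©llo": space becomes 'Ġ', 'é' expands to two mapped runes
}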

func New

func New(
	splittingRegexp *regexp2.Regexp,
	prefixSpaceEnabled bool,
	offsetsTrimmingEnabled bool,
) *ByteLevelPreTokenizer

New returns a new ByteLevelPreTokenizer.

func NewDefault

func NewDefault() *ByteLevelPreTokenizer

NewDefault returns a new ByteLevelPreTokenizer using DefaultSplittingRegexp and enabling prefix space insertion.
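A minimal construction sketch based only on the signatures documented above; the import path is assumed to follow the gotokenizers module layout and may need adjusting.

package main

import (
	// Assumed import path; adjust to the actual module layout.
	"github.com/nlpodyssey/gotokenizers/pretokenizers/bytelevelpretokenizer"
)

func main() {
	// Default configuration: DefaultSplittingRegexp with prefix space insertion enabled.
	_ = bytelevelpretokenizer.NewDefault()

	// Explicit configuration via New: splitting regexp, prefix space disabled,
	// offsets trimming enabled.
	_ = bytelevelpretokenizer.New(bytelevelpretokenizer.DefaultSplittingRegexp, false, true)
}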

func (*ByteLevelPreTokenizer) PreTokenize

PreTokenize transforms all the Unicode characters of the input into their byte-level counterparts and splits the input according to the configured regular expression.
