Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var DefaultSplittingRegexp = regexp2.MustCompile(
	`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
	regexp2.IgnoreCase|regexp2.Multiline,
)
DefaultSplittingRegexp is a simple default regular expression that can be used for ByteLevelPreTokenizer.
This MUST be treated as a read-only constant value.
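To illustrate how this pattern splits text, here is a sketch using Go's standard regexp package. Note this is a simplified variant: the standard library does not support the `\s+(?!\S)` lookahead alternative that regexp2 provides, so trailing-whitespace handling differs slightly from the real DefaultSplittingRegexp.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified variant of DefaultSplittingRegexp for Go's standard regexp
// package (no lookahead support). For illustration only.
var splitRe = regexp.MustCompile(`(?i)'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`)

func main() {
	// Contractions, leading spaces, digits, and punctuation each become
	// separate pre-tokens.
	fmt.Printf("%q\n", splitRe.FindAllString("Hello, world! It's 2024.", -1))
	// → ["Hello" "," " world" "!" " It" "'s" " 2024" "."]
}
```

Note how a leading space is kept attached to the following word (` world`), which is what makes the later byte-level remapping of whitespace meaningful.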
Functions ¶
This section is empty.
Types ¶
type ByteLevelPreTokenizer ¶
type ByteLevelPreTokenizer struct {
// contains filtered or unexported fields
}
ByteLevelPreTokenizer generates pre-tokens suitable for BPE (Byte Pair Encoding) models.
This pre-tokenizer splits the input string into tokens according to the given regular expression. It also transforms the string, expanding each UTF-8 character (rune) into its bytes, which are then re-mapped to new custom runes.
A whitespace prefix (' ') can optionally be prepended to the input string, unless the first rune of the string is already a Unicode whitespace.
Offset trimming can be enabled to exclude whitespace from the token offsets during the post-processing step.
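The byte-to-rune remapping can be sketched with a GPT-2-style table: printable ASCII and most Latin-1 bytes map to themselves, while the remaining byte values are shifted into the range starting at U+0100, so every byte gets a printable rune. This table is illustrative of the technique; the package's actual mapping may differ.

```go
package main

import "fmt"

// byteToRuneTable builds a GPT-2-style byte-to-rune table (an assumption,
// not necessarily this package's exact table): printable bytes map to
// themselves, the rest are shifted past U+00FF in order of byte value.
func byteToRuneTable() [256]rune {
	var table [256]rune
	n := 0
	for b := 0; b < 256; b++ {
		if (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF) {
			table[b] = rune(b)
		} else {
			table[b] = rune(256 + n) // e.g. the space byte 0x20 becomes 'Ġ' (U+0120)
			n++
		}
	}
	return table
}

func main() {
	table := byteToRuneTable()
	var out []rune
	for _, b := range []byte(" héllo") { // é is two UTF-8 bytes: 0xC3 0xA9
		out = append(out, table[b])
	}
	fmt.Println(string(out)) // → ĠhÃ©llo
}
```

Every token produced this way is a sequence of printable runes, one per original byte, which is what allows a downstream BPE model to operate on a closed vocabulary of 256 base symbols.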
func New ¶
func New(
	splittingRegexp *regexp2.Regexp,
	prefixSpaceEnabled bool,
	offsetsTrimmingEnabled bool,
) *ByteLevelPreTokenizer
New returns a new ByteLevelPreTokenizer.
func NewDefault ¶
func NewDefault() *ByteLevelPreTokenizer
NewDefault returns a new ByteLevelPreTokenizer using DefaultSplittingRegexp, with prefix space insertion enabled.
func (*ByteLevelPreTokenizer) PreTokenize ¶
func (b *ByteLevelPreTokenizer) PreTokenize(pts *pretokenizedstring.PreTokenizedString) error
PreTokenize transforms all Unicode characters into their byte-level counterparts and splits the input according to the configured regular expression.
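The overall pipeline described above (optional prefix space, regex split, byte-level remapping) can be approximated by a self-contained function. This is a sketch of the concept, not the package's implementation: `preTokenize` and `byteToRune` are hypothetical names, the splitting pattern drops the `\s+(?!\S)` lookahead (unsupported by Go's standard regexp), and the byte table is the GPT-2-style one, which may differ from the package's.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// Simplified splitting pattern (no lookahead in the standard library).
var splitRe = regexp.MustCompile(`(?i)'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`)

// byteToRune sketches a GPT-2-style byte-to-rune mapping: printable bytes
// map to themselves, the rest are shifted past U+00FF. Illustrative only.
func byteToRune(b byte) rune {
	isPrintable := func(c byte) bool {
		return (c >= '!' && c <= '~') || (c >= 0xA1 && c <= 0xAC) || c >= 0xAE
	}
	if isPrintable(b) {
		return rune(b)
	}
	// The shifted rune is 256 plus the count of non-printable bytes below b.
	n := 0
	for c := 0; c < int(b); c++ {
		if !isPrintable(byte(c)) {
			n++
		}
	}
	return rune(256 + n)
}

// preTokenize mimics the described pipeline: optionally prepend a space,
// split with the regex, then remap every byte of every token.
func preTokenize(s string, prefixSpace bool) []string {
	if prefixSpace && len(s) > 0 && !unicode.IsSpace([]rune(s)[0]) {
		s = " " + s
	}
	var tokens []string
	for _, tok := range splitRe.FindAllString(s, -1) {
		mapped := make([]rune, 0, len(tok))
		for _, b := range []byte(tok) {
			mapped = append(mapped, byteToRune(b))
		}
		tokens = append(tokens, string(mapped))
	}
	return tokens
}

func main() {
	fmt.Printf("%q\n", preTokenize("Hello world", true))
	// → ["ĠHello" "Ġworld"]
}
```

The real PreTokenize additionally records byte-level offsets on the pretokenizedstring.PreTokenizedString (and can trim whitespace from them), which this sketch omits.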