bytelevelpretokenizer

package
v0.2.0
Published: Dec 12, 2020 License: BSD-2-Clause Imports: 6 Imported by: 1

Documentation

Index

Constants

This section is empty.

Variables

var DefaultSplittingRegexp = regexp2.MustCompile(
	`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
	regexp2.IgnoreCase|regexp2.Multiline)

DefaultSplittingRegexp is a simple default regular expression that can be used with ByteLevelPreTokenizer.

This MUST be treated as a read-only constant value.
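For illustration, the sketch below iterates the pre-token candidates that DefaultSplittingRegexp produces for a short sentence. It assumes the regexp2 package referenced above is github.com/dlclark/regexp2, whose syntax supports the negative lookahead used in the pattern.

package main

import (
	"fmt"

	"github.com/dlclark/regexp2" // assumed to be the regexp2 package referenced above
)

func main() {
	// Same pattern and options as DefaultSplittingRegexp.
	re := regexp2.MustCompile(
		`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
		regexp2.IgnoreCase|regexp2.Multiline)

	// Each match corresponds to one pre-token candidate.
	m, _ := re.FindStringMatch("Hello, world! It's 2020.")
	for m != nil {
		fmt.Printf("%q\n", m.String())
		m, _ = re.FindNextMatch(m)
	}
	// Prints: "Hello" "," " world" "!" " It" "'s" " 2020" "."
}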

Functions

This section is empty.

Types

type ByteLevelPreTokenizer

type ByteLevelPreTokenizer struct {
	// contains filtered or unexported fields
}

ByteLevelPreTokenizer allows the generation of pre-tokens suitable for use with BPE (Byte Pair Encoding) models.

This pre-tokenizer splits the input string into tokens according to the given regular expression. It also transforms the string, expanding each UTF-8 character (rune) into its bytes, which are then re-mapped to new custom runes.

A whitespace prefix (' ') can optionally be prepended to the input string, unless the first rune of the string is already a Unicode whitespace character.

Offsets trimming can be enabled to exclude whitespaces from the token offsets during the post-processing step.
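The byte-to-rune remapping itself is unexported, but the general idea, as popularized by GPT-2's byte-level BPE, can be sketched as follows. This is an illustrative reimplementation of the technique, not this package's internal table, and the exact mapping may differ.

package main

import "fmt"

// byteToRune builds a GPT-2-style remapping table: bytes whose code points are
// already printable keep their value, while the remaining bytes are shifted to
// unused code points starting at 256, so every byte becomes a distinct, visible rune.
// Illustrative sketch only; this package's actual table is unexported and may differ.
func byteToRune() [256]rune {
	printable := func(b int) bool {
		return (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF)
	}
	var table [256]rune
	next := 256
	for b := 0; b < 256; b++ {
		if printable(b) {
			table[b] = rune(b)
		} else {
			table[b] = rune(next)
			next++
		}
	}
	return table
}

func main() {
	table := byteToRune()
	var out []rune
	// Expand the string into its UTF-8 bytes and remap each byte to its rune.
	for _, b := range []byte(" héllo") {
		out = append(out, table[b])
	}
	fmt.Println(string(out)) // "ĠhÃ©llo": space becomes 'Ġ', 'é' expands to two mapped runes
}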

func New

func New(
	splittingRegexp *regexp2.Regexp,
	prefixSpaceEnabled bool,
	offsetsTrimmingEnabled bool,
) *ByteLevelPreTokenizer

New returns a new ByteLevelPreTokenizer.

func NewDefault

func NewDefault() *ByteLevelPreTokenizer

NewDefault returns a new ByteLevelPreTokenizer using DefaultSplittingRegexp and enabling prefix space insertion.
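A minimal construction sketch based only on the signatures documented above; the import path is assumed to follow the gotokenizers module layout and may need adjusting.

package main

import (
	// Assumed import path; adjust to the actual module layout.
	"github.com/nlpodyssey/gotokenizers/pretokenizers/bytelevelpretokenizer"
)

func main() {
	// Default configuration: DefaultSplittingRegexp with prefix space insertion enabled.
	_ = bytelevelpretokenizer.NewDefault()

	// Explicit configuration via New: splitting regexp, prefix space disabled,
	// offsets trimming enabled.
	_ = bytelevelpretokenizer.New(bytelevelpretokenizer.DefaultSplittingRegexp, false, true)
}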

func (*ByteLevelPreTokenizer) PreTokenize

PreTokenize transforms all the Unicode characters of the input into their byte-level counterparts and splits the input according to the configured regular expression.
