tokenmonster


README

Click here for the complete documentation on pkg.go.dev.

Basic Usage

import "github.com/alasdairforsythe/tokenmonster/go"

func example(vocabFilename string, text []byte) {

	// Load the vocabulary from a local file.
	vocab, err := tokenmonster.Load(vocabFilename)
	if err != nil {
		panic(err)
	}

	// Tokenize returns the token IDs and the number of bytes
	// for which there were no tokens.
	tokens, missing, err := vocab.Tokenize(text)
	if err != nil {
		panic(err)
	}
	_ = missing

	// Decode the token IDs back into bytes.
	decoder := vocab.NewDecoder()
	decodedText := decoder.Decode(tokens)
	_ = decodedText
}

`missing` is the number of bytes for which there were no tokens.

`text` must be a slice of bytes. If you are using the UTF-16 charset, that slice of bytes should already be UTF-16 encoded.

`decodedText` will also be a slice of bytes in the charset encoding. If you are using the UTF-8 charset, you can convert it to a string with string().

When using vocab.Tokenize(text), please note that if the vocabulary uses any normalization other than NFD, that normalization may be applied to the underlying text data. Pass a copy if you don't want the underlying data to be modified, as sketched below. This applies only to the Go package (the Python library always uses a copy).
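A minimal sketch of passing a copy, assuming `vocab` is a loaded *tokenmonster.Vocab and `text` holds bytes you want to keep unmodified:

textCopy := make([]byte, len(text))
copy(textCopy, text)

// Any normalization is applied to textCopy; text itself is untouched.
tokens, missing, err := vocab.Tokenize(textCopy)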


Documentation

Index

Constants

const (
	DOES_NOT_EXIST = 16777215
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

A decoder object for sequential decoding. Use the NewDecoder function of the Vocab struct.

func (*Decoder) Decode

func (d *Decoder) Decode(tokens []uint32) []byte

Decodes token IDs back into bytes.

func (*Decoder) DecodeSerialized

func (d *Decoder) DecodeSerialized(b []byte, encodingLength uint8, buffer []byte) []byte

Decodes tokens from a serialized bytes slice. `encodingLength` must be one of: 0, 2, 3, 4. Pass 0 for `encodingLength` and it will be determined from the vocabulary size. `buffer` is optional; pass `nil` and a new slice will be allocated.

func (*Decoder) Deserialize

func (d *Decoder) Deserialize(data []byte, encodingLength uint8) []uint32

Deserializes tokens encoded in a bytes stream into a slice of uint32 token IDs. `encodingLength` must be one of: 0, 2, 3, 4. Pass 0 for `encodingLength` and it will be determined from the vocabulary size.

func (*Decoder) Flush

func (d *Decoder) Flush() []byte

Flushes the remainder from the Decoder instance. This remainder will be any trailing incomplete UTF-8 sequences or capcode encoding marks.
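A minimal sketch of sequential decoding with Flush, assuming `vocab` is a loaded *tokenmonster.Vocab and `batches` is a hypothetical [][]uint32 of token IDs arriving in order:

decoder := vocab.NewDecoder()
var out []byte
for _, batch := range batches {
	out = append(out, decoder.Decode(batch)...)
}
// Collect any trailing incomplete UTF-8 sequence or capcode marks.
out = append(out, decoder.Flush()...)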

type Info

type Info struct {
	Id           uint32
	Token        []byte
	TokenDecoded []byte
	Type         uint8 // 0 = regular, 1 = character, 2 = special, 3 = unk
	Score        float32
}

Info struct allows access to detailed information about each token from TokensDetailed(). Token is the token still encoded with capcode. TokenDecoded is the decoded form of the token; however, a token can be modified by a previous token in a sequence, so this cannot be used for decoding. Type is 0 for regular tokens, 1 for character tokens, 2 for special tokens, 3 for the UNK token. Score is the percentage of the training dataset that this token covered and is used for sorting the tokens by their importance.
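A minimal sketch of reading the Info fields, assuming `vocab` is a loaded *tokenmonster.Vocab:

var regular, character, special int
for _, info := range vocab.TokensDetailed() {
	switch info.Type {
	case 0:
		regular++
	case 1:
		character++
	case 2:
		special++
	}
}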

type Vocab

type Vocab struct {
	// contains filtered or unexported fields
}

The main struct for the vocabulary.

func Load

func Load(filename string) (*Vocab, error)

Load the vocabulary from a local file.

func NewVocab

func NewVocab(tokens [][]byte, specialTokens [][]byte, charset uint8, normalization string, usingCapcode uint8, include256bytes bool, include128bytes bool, includeUTF8bytes bool, includeASCIIbytes bool, includeExtendedBytes bool, excludeOtherBytes bool) (*Vocab, error)

NewVocab makes a fresh vocabulary from a custom list of tokens. If you generated your vocabulary with TokenMonster tools, you will not be using this function but instead using `Load`.
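A minimal sketch, assuming the UTF-8 charset (1) and capcode disabled (0); the token list, the "nfd" normalization string and the flag choices are illustrative assumptions:

tokens := [][]byte{[]byte("hello"), []byte(" world"), []byte("\n")}
vocab, err := tokenmonster.NewVocab(tokens, nil, 1, "nfd", 0,
	true,  // include256bytes: give every byte its own token
	false, // include128bytes
	false, // includeUTF8bytes
	false, // includeASCIIbytes
	false, // includeExtendedBytes
	false) // excludeOtherBytes
if err != nil {
	panic(err)
}
_ = vocab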

func NewVocabFromYAML

func NewVocabFromYAML(yml []byte) (*Vocab, error)

NewVocabFromYAML makes a fresh vocabulary from a YAML file.
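A minimal sketch; the keys follow the YamlVocab struct documented below, and the charset and normalization values shown are assumptions:

yml := []byte(`charset: utf-8
normalization: nfd
tokens:
  - token: "hello"
  - token: " world"
`)
vocab, err := tokenmonster.NewVocabFromYAML(yml)
if err != nil {
	panic(err)
}
_ = vocab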

func (*Vocab) AddSpecialToken

func (vocab *Vocab) AddSpecialToken(token []byte)

Adds a single special token to the vocabulary. A special token is special because only this token is allowed to tokenize text containing it. If any regular tokens contain your special token within them, they will be deleted. Modifying a vocabulary does not change existing token IDs. All normalization and capcode is applied automatically.

func (*Vocab) AddSpecialTokens

func (vocab *Vocab) AddSpecialTokens(specialTokens [][]byte, size int)

Add multiple special tokens and optionally resize. Enter `size` 0 to not resize. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) AddToken

func (vocab *Vocab) AddToken(token []byte)

Adds a single token to the vocabulary. Modifying a vocabulary does not change existing token IDs. All normalization and capcode is applied automatically.

func (*Vocab) AddTokens

func (vocab *Vocab) AddTokens(addTokens [][]byte, specialTokens [][]byte, size int)

Adds multiple regular and optionally special tokens. You can use `size` to resize the vocabulary to keep it at a specific size. Enter `size` 0 to not resize. Modifying a vocabulary does not change existing token IDs.
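A minimal sketch, assuming `vocab` is a loaded *tokenmonster.Vocab; the tokens and the target size are illustrative:

addTokens := [][]byte{[]byte("foo"), []byte("bar")}
specialTokens := [][]byte{[]byte("<eos>")}
vocab.AddTokens(addTokens, specialTokens, 32000) // resize to 32000 tokens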

func (*Vocab) Capcode

func (vocab *Vocab) Capcode() uint8

The capcode level. 0 = disabled, 1 = deleteToken only, 2 = fully enabled.

func (*Vocab) Charset

func (vocab *Vocab) Charset() uint8

The charset code for the vocabulary. 0 = None, 1 = UTF-8, 2 = UTF-16.

func (*Vocab) Count

func (vocab *Vocab) Count(data []byte) (int, int, error)

Tokenizes but returns the number of tokens instead of the tokens.

func (*Vocab) Decode

func (vocab *Vocab) Decode(tokens []uint32) []byte

Decodes tokens back into bytes. If you are decoding a stream of tokens individually or in batches, instead of all at once, you should use the Decode method of the Decoder struct instead.

func (*Vocab) DecodeSerialized

func (vocab *Vocab) DecodeSerialized(b []byte, encodingLength uint8, buffer []byte) []byte

Decodes tokens from a serialized bytes slice. `encodingLength` must be one of: 0, 2, 3, 4. Pass 0 for `encodingLength` and it will be determined from the vocabulary size. `buffer` is optional; pass `nil` and a new slice will be allocated. If you are decoding a stream of tokens individually or in batches, instead of all at once, you should use the Decode method of the Decoder struct instead.

func (*Vocab) DeleteToken

func (vocab *Vocab) DeleteToken(token []byte)

Deletes a single token from the vocabulary. Tokens to delete can be capcode encoded or not; it will look for both. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) DeleteTokenID

func (vocab *Vocab) DeleteTokenID(id uint32)

Deletes a single token from the vocabulary by specifying the ID. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) DeleteTokens

func (vocab *Vocab) DeleteTokens(deleteTokens [][]byte, size int)

Delete multiple tokens and optionally resize. Tokens to delete can be capcode encoded or not; it will look for both. Enter `size` 0 to not resize. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) Denormalize

func (vocab *Vocab) Denormalize(b []byte) []byte

Decodes capcode from the bytes.

func (*Vocab) Deserialize

func (vocab *Vocab) Deserialize(data []byte, encodingLength uint8) (tokens []uint32)

Deserializes tokens encoded in a bytes stream into a slice of uint32 token IDs, in the same way as the Decoder's Deserialize method. `encodingLength` must be one of: 0, 2, 3, 4.

func (*Vocab) DisableUnkToken

func (vocab *Vocab) DisableUnkToken()

Disables the UNK token. Without an UNK token, a character that has no token to represent it will be ignored.

func (*Vocab) EnableUnkToken

func (vocab *Vocab) EnableUnkToken() bool

Enables the UNK token. Returns true if successful; returns false if an UNK token is not applicable to this vocabulary (all bytes have tokens). If enabled, the UNK token will be inserted for every character for which there is no token. You can resize after this if you want to keep the vocabulary sized as it was before; otherwise it will be 1 larger.
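A minimal sketch of enabling UNK without growing the vocabulary, assuming `vocab` is a loaded *tokenmonster.Vocab:

size := vocab.Len()
if vocab.EnableUnkToken() {
	// The vocabulary is now 1 larger; trim the worst-scoring token.
	vocab.Resize(size)
}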

func (*Vocab) ExportYAML

func (vocab *Vocab) ExportYAML(writer io.Writer, orderByScore bool)

Exports the vocabulary to a human-readable YAML file. It writes to an io.Writer. You can import from YAML with NewVocabFromYAML().
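A minimal sketch of exporting to a file ordered by score, assuming `vocab` is a loaded *tokenmonster.Vocab and the os package is imported; the filename is illustrative:

f, err := os.Create("vocab.yaml")
if err != nil {
	panic(err)
}
defer f.Close()
vocab.ExportYAML(f, true) // true = order tokens by score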

func (*Vocab) HasUnk

func (vocab *Vocab) HasUnk() bool

Returns true if the vocabulary is using the UNK token. If used, the UNK token ID is used whenever a character being tokenized doesn't exist in the vocabulary.

func (*Vocab) HighestTokenID

func (vocab *Vocab) HighestTokenID() int

Returns the value of the highest token ID.

func (*Vocab) IdToToken

func (vocab *Vocab) IdToToken(id uint32) []byte

Returns the encoded token for the token ID, or nil if it does not exist.

func (*Vocab) Len

func (vocab *Vocab) Len() int

Returns the number of tokens in the vocabulary, including the UNK token if it is used.

func (*Vocab) MaxTokenLength

func (vocab *Vocab) MaxTokenLength() int

The length of the longest (encoded) token in the vocabulary. This can be lower than the maximum length set during training if none of the longer tokens were chosen.

func (*Vocab) Mode

func (vocab *Vocab) Mode() uint8

The original filter for training the vocabulary. 0 = unfiltered, 1 = clean, 2 = balanced, 3 = consistent, 4 = strict, 5 = not trained with trainvocab.

func (*Vocab) ModifyVocabulary

func (vocab *Vocab) ModifyVocabulary(addTokens [][]byte, specialTokens [][]byte, deleteTokens [][]byte, size int, resetTokenIds bool)

Add regular & special tokens, delete tokens and resize, all in one. Modifying a vocabulary does not change existing token IDs. Pass resetTokenIds = true to ensure there are no gaps in the token IDs.

func (*Vocab) ModifyVocabularyFromYAML

func (vocab *Vocab) ModifyVocabularyFromYAML(yml []byte, size int, resetTokenIds bool)

Add regular & special tokens, delete tokens and resize, all in one. Modifying a vocabulary does not change existing token IDs. Pass resetTokenIds = true to ensure there are no gaps in the token IDs.

func (*Vocab) NewDecoder

func (vocab *Vocab) NewDecoder() *Decoder

Creates a new Decoder instance. This is for decoding tokens in a sequence when they are to be decoded individually or in batches. If you are decoding all in one go, you can use the Vocab's Decode method.

func (*Vocab) Normalization

func (vocab *Vocab) Normalization() string

The type of normalization applied automatically when tokenizing. Returns a string.

func (*Vocab) NormalizationCode

func (vocab *Vocab) NormalizationCode() uint8

The type of normalization applied automatically when tokenizing. Returns a uint8.

func (*Vocab) Normalize

func (vocab *Vocab) Normalize(data []byte) ([]byte, error)

Applies all normalizations to the bytes, including capcode and NFD.

func (*Vocab) NumDeletedTokens

func (vocab *Vocab) NumDeletedTokens() int

The number of tokens deleted from the vocabulary. These can be restored by resizing the vocabulary to be larger.

func (*Vocab) NumSingleByteTokens

func (vocab *Vocab) NumSingleByteTokens() int

The number of single byte tokens in the vocabulary.

func (*Vocab) NumSpecialTokens

func (vocab *Vocab) NumSpecialTokens() int

Returns the number of special tokens in the vocabulary.

func (*Vocab) PrivateGenerateVocab

func (vocab *Vocab) PrivateGenerateVocab(yamlData []byte, tokens [][]byte, scores []float32, addTokens [][]byte, deleteTokens [][]byte, specialTokens [][]byte, specialTokensEncoded [][]byte, charset uint8, normalizeString string, usingCapcode uint8, level uint8, reserve uint8, resize int, resetTokenIds bool) error

Don't use this function; it's exported because it's used by the exportvocab tool.

func (*Vocab) ResetTokenIds

func (vocab *Vocab) ResetTokenIds(token []byte)

Resets all the IDs of the tokens to be assigned alphabetically, starting from 0, with no gaps.

func (*Vocab) Resize

func (vocab *Vocab) Resize(size int)

Resize the vocabulary by deleting the worst scoring tokens. You can also resize the vocabulary to be larger if any tokens have previously been deleted. Modifying a vocabulary does not change existing token IDs.

func (Vocab) Save

func (vocab Vocab) Save(outputFilename string) error

Save the vocabulary to a local file.

func (*Vocab) SingleByteTokens

func (vocab *Vocab) SingleByteTokens() []byte

A slice that contains all the single byte tokens in the vocabulary. Note that this is returned as a single slice of bytes, not a slice of byte slices.

func (*Vocab) SingleBytesTrainingCode

func (vocab *Vocab) SingleBytesTrainingCode() uint8

Returns the uint8 code corresponding to the training parameters for single byte tokens.

func (*Vocab) SpecialTokens

func (vocab *Vocab) SpecialTokens() []Info

Returns an Info struct for each of the special tokens in the vocabulary. Each Info includes both the encoded (Token) and decoded (TokenDecoded) forms of the token.

func (*Vocab) TokenToId

func (vocab *Vocab) TokenToId(b []byte) (uint32, bool)

Returns the ID of the token from bytes. This only works for capcode encoded tokens. Apply `Normalize` to the bytes first to use this with decoded tokens.
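A minimal sketch of looking up a token by its decoded form, assuming `vocab` is a loaded *tokenmonster.Vocab:

normalized, err := vocab.Normalize([]byte("hello"))
if err != nil {
	panic(err)
}
if id, ok := vocab.TokenToId(normalized); ok {
	_ = id // the token exists; id is its token ID
}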

func (*Vocab) Tokenize

func (vocab *Vocab) Tokenize(data []byte) ([]uint32, int, error)

Tokenizes text from a bytes slice to token IDs. The 2nd return value (int) is the number of characters for which there were no tokens and which were replaced with the UNK token.

func (*Vocab) TokenizeToSerialized

func (vocab *Vocab) TokenizeToSerialized(data []byte, encodingLength uint8, buffer []byte) ([]byte, uint8, int, error)

Tokenizes directly into serialized bytes with either 16-bit, 24-bit or 32-bit encoded unsigned integers depending on the vocabulary size. Set `encodingLength` to 0 for it to be chosen automatically, or set it to 2, 3 or 4. The 2nd return value is the encodingLength that was used, and the 3rd is the number of characters for which there were no tokens. `buffer` is an optional reusable buffer; you can send nil.
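A minimal sketch of the serialized round trip, assuming `vocab` is a loaded *tokenmonster.Vocab and `data` holds the input bytes:

serialized, encLen, missing, err := vocab.TokenizeToSerialized(data, 0, nil)
if err != nil {
	panic(err)
}
_ = missing
// Decode directly from the serialized form, reusing the reported length.
decoded := vocab.DecodeSerialized(serialized, encLen, nil)
_ = decoded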

func (*Vocab) Tokens

func (vocab *Vocab) Tokens() [][]byte

Returns a slice of all tokens in the vocabulary (excluding UNK), in their encoded capcode form.

func (*Vocab) TokensDetailed

func (vocab *Vocab) TokensDetailed() []Info

Returns a slice of Info structs where the index is the token ID.

func (*Vocab) Unk

func (vocab *Vocab) Unk() uint32

Returns the ID of the UNK token. It will return 16777215 (DOES_NOT_EXIST) if there is no UNK token. You can use HasUnk() to first check whether there is an UNK token.
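A minimal sketch of a guarded lookup using the DOES_NOT_EXIST constant, assuming `vocab` is a loaded *tokenmonster.Vocab:

if id := vocab.Unk(); id != tokenmonster.DOES_NOT_EXIST {
	_ = id // the vocabulary has an UNK token
}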

type YamlItem

type YamlItem struct {
	Encoded bool    `yaml:"encoded,omitempty"`
	Token   string  `yaml:",omitempty"`
	Id      *int    `yaml:"id,omitempty"`
	Score   float32 `yaml:"score,omitempty"`
}

type YamlVocab

type YamlVocab struct {
	Charset              string     `yaml:"charset,omitempty"`
	Normalization        string     `yaml:"normalization,omitempty"`
	Capcode              int        `yaml:"capcode,omitempty"`
	TrainingParam        *int       `yaml:"training-param,omitempty"`
	ResetTokenIds        bool       `yaml:"reset-token-ids,omitempty"`
	Include256Bytes      bool       `yaml:"include-256-bytes,omitempty"`
	Include128Bytes      bool       `yaml:"include-128-bytes,omitempty"`
	IncludeUtf8Bytes     bool       `yaml:"include-utf8-bytes,omitempty"`
	IncludeAsciiBytes    bool       `yaml:"include-ascii-bytes,omitempty"`
	IncludeExtendedBytes bool       `yaml:"include-extended-bytes,omitempty"`
	ExcludeOtherBytes    bool       `yaml:"exclude-other-bytes,omitempty"`
	Unk                  bool       `yaml:"unk,omitempty"`
	UnkId                *int       `yaml:"unk-id,omitempty"`
	Regular              []YamlItem `yaml:"tokens,omitempty"`
	Special              []YamlItem `yaml:"special,omitempty"`
	Delete               []YamlItem `yaml:"delete,omitempty"`
}
