tiktoken

Version: v0.0.6 (latest) | Published: Mar 28, 2024 | License: MIT | Imports: 13 | Imported by: 2

Repository

github.com/hupe1980/go-tiktoken

README ¶

✂️ go-tiktoken

OpenAI's tiktoken tokenizer written in Go. The vocabularies are embedded and do not need to be downloaded at runtime.

Installation

go get github.com/hupe1980/go-tiktoken

How to use

package main

import (
	"fmt"
	"log"

	"github.com/hupe1980/go-tiktoken"
)

func main() {
	encoding, err := tiktoken.NewEncodingForModel("gpt-3.5-turbo")
	if err != nil {
		log.Fatal(err)
	}

	ids, tokens, err := encoding.Encode("Hello World", nil, nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("IDs:", ids)
	fmt.Println("Tokens:", tokens)
}

Output:

IDs: [9906 4435]
Tokens: [Hello  World]

For more example usage, see _examples.

Supported Encodings

✅ cl100k_base
✅ p50k_base
✅ p50k_edit
✅ r50k_base
✅ gpt2
✅ claude

License

MIT

Documentation ¶

Overview ¶

Package tiktoken tokenizes text with the byte pair encoding (BPE) schemes used by OpenAI's tiktoken. It provides codecs for the supported vocabularies and an Encoding type that encodes text to token IDs and decodes token IDs back to bytes.

Index ¶

Constants
Variables
func ConvertToMergeableBPERanks(bpe io.Reader) (map[string]uint, error)
func CovertVocabBPEAndEncoderJSONToMergeableBPERanks(vocabBPE io.Reader, encoderJSON io.Reader) (map[string]uint, error)
type Codec
type Encoding

Constants ¶

const (
	StartOfText string = "<|startoftext|>"
	EndOfText   string = "<|endoftext|>"
	FimPrefix   string = "<|fim_prefix|>"
	FimMiddle   string = "<|fim_middle|>"
	FimSuffix   string = "<|fim_suffix|>"
	EndOfPrompt string = "<|endofprompt|>"
)

Constants for special tokens.

const (
	CL100kBase string = "cl100k_base"
	P50kBase   string = "p50k_base"
	P50kEdit   string = "p50k_edit"
	R50kBase   string = "r50k_base"
	GPT2       string = "gpt2"
)

Constants for different encodings.

Variables ¶

var AllSpecial = []string{"all"}

var ModelPrefixToEncoding = map[string]string{

	"gpt-4-":         CL100kBase,
	"gpt-3.5-turbo-": CL100kBase,
	"gpt-35-turbo":   CL100kBase,
}

ModelPrefixToEncoding maps model prefixes to encodings.

var ModelToEncoding = map[string]string{

	"gpt-4":         CL100kBase,
	"gpt-3.5-turbo": CL100kBase,
	"gpt-35-turbo":  CL100kBase,

	"text-davinci-003": P50kBase,
	"text-davinci-002": P50kBase,
	"text-davinci-001": R50kBase,
	"text-curie-001":   R50kBase,
	"text-babbage-001": R50kBase,
	"text-ada-001":     R50kBase,
	"davinci":          R50kBase,
	"curie":            R50kBase,
	"babbage":          R50kBase,
	"ada":              R50kBase,

	"code-davinci-002": P50kBase,
	"code-davinci-001": P50kBase,
	"code-cushman-002": P50kBase,
	"code-cushman-001": P50kBase,
	"davinci-codex":    P50kBase,
	"cushman-codex":    P50kBase,

	"text-davinci-edit-001": P50kEdit,
	"code-davinci-edit-001": P50kEdit,

	"text-embedding-ada-002": CL100kBase,
	"text-embedding-3-small": CL100kBase,
	"text-embedding-3-large": CL100kBase,

	"text-similarity-davinci-001":  R50kBase,
	"text-similarity-curie-001":    R50kBase,
	"text-similarity-babbage-001":  R50kBase,
	"text-similarity-ada-001":      R50kBase,
	"text-search-davinci-doc-001":  R50kBase,
	"text-search-curie-doc-001":    R50kBase,
	"text-search-babbage-doc-001":  R50kBase,
	"text-search-ada-doc-001":      R50kBase,
	"code-search-babbage-code-001": R50kBase,
	"code-search-ada-code-001":     R50kBase,

	"gpt2": GPT2,
}

ModelToEncoding maps models to encodings.

Functions ¶

func ConvertToMergeableBPERanks ¶

func ConvertToMergeableBPERanks(bpe io.Reader) (map[string]uint, error)

ConvertToMergeableBPERanks converts the BPE file to mergeable BPE ranks.
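
As an illustration of what such a conversion involves, the sketch below parses the upstream `.tiktoken` layout, in which each line holds a base64-encoded token followed by its rank. The helpers `parseRanks` and `parseRanksString` are hypothetical and are not the package's implementation; that ConvertToMergeableBPERanks expects exactly this layout is an assumption.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"io"
	"log"
	"strconv"
	"strings"
)

// parseRanks is a hypothetical stand-in, not the package's implementation.
// It assumes one "<base64 token> <rank>" pair per line.
func parseRanks(r io.Reader) (map[string]uint, error) {
	ranks := make(map[string]uint)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue // skip blank lines
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		token, err := base64.StdEncoding.DecodeString(fields[0])
		if err != nil {
			return nil, fmt.Errorf("decode token %q: %w", fields[0], err)
		}
		rank, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parse rank %q: %w", fields[1], err)
		}
		// Keys are the raw token bytes stored as a Go string.
		ranks[string(token)] = uint(rank)
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return ranks, nil
}

// parseRanksString is a convenience wrapper for in-memory input.
func parseRanksString(text string) (map[string]uint, error) {
	return parseRanks(strings.NewReader(text))
}

func main() {
	// "IQ==" is base64 for "!" and "Ig==" is base64 for a double quote.
	ranks, err := parseRanksString("IQ== 0\nIg== 1\n")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(ranks), ranks["!"], ranks["\""])
}
```

The resulting map has the same `map[string]uint` shape as the `MergeableRanks` field of Codec.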

func CovertVocabBPEAndEncoderJSONToMergeableBPERanks ¶

func CovertVocabBPEAndEncoderJSONToMergeableBPERanks(vocabBPE io.Reader, encoderJSON io.Reader) (map[string]uint, error)

CovertVocabBPEAndEncoderJSONToMergeableBPERanks converts the vocabulary BPE and encoder JSON to mergeable BPE ranks.

Types ¶

type Codec ¶

type Codec struct {
	Name           string          `json:"name"`
	ExplicitNVocab int             `json:"explicit_n_vocab"`
	PatStr         string          `json:"pat_str"`
	MergeableRanks map[string]uint `json:"mergeable_ranks"`
	SpecialTokens  map[string]uint `json:"special_tokens"`
}

Codec represents a token encoding codec.

func NewCL100kBase ¶

func NewCL100kBase() (*Codec, error)

NewCL100kBase creates a new Codec for the cl100k_base tokenization scheme, loading the mergeable ranks from the embedded cl100kBase resource. It returns a pointer to the Codec, or an error if one occurs.

func NewClaude ¶ added in v0.0.5

func NewClaude() (*Codec, error)

NewClaude creates a new Codec for the claude tokenization scheme, loading the mergeable ranks from the embedded claude resource. It returns a pointer to the Codec, or an error if one occurs.

func NewGPT2 ¶

func NewGPT2() (*Codec, error)

NewGPT2 creates a new Codec for the GPT-2 tokenization scheme, loading the mergeable ranks from the embedded gpt2Vocab and gpt2Encode resources. It returns a pointer to the Codec, or an error if one occurs.

func NewP50kBase ¶

func NewP50kBase() (*Codec, error)

NewP50kBase creates a new Codec for the p50k_base tokenization scheme, loading the mergeable ranks from the embedded p50kBase resource. It returns a pointer to the Codec, or an error if one occurs.

func NewP50kEdit ¶

func NewP50kEdit() (*Codec, error)

NewP50kEdit creates a new Codec for the p50k_edit tokenization scheme, loading the mergeable ranks from the embedded p50kBase resource. It returns a pointer to the Codec, or an error if one occurs.

func NewR50kBase ¶

func NewR50kBase() (*Codec, error)

NewR50kBase creates a new Codec for the r50k_base tokenization scheme, loading the mergeable ranks from the embedded r50kBase resource. It returns a pointer to the Codec, or an error if one occurs.

type Encoding ¶

type Encoding struct {
	// contains filtered or unexported fields
}

Encoding represents a text encoding scheme.

func NewEncoding ¶

func NewEncoding(codec *Codec) (*Encoding, error)

NewEncoding creates a new Encoding instance based on the provided Codec.

func NewEncodingByName ¶

func NewEncodingByName(encoding string) (*Encoding, error)

NewEncodingByName creates a new Encoding instance based on the given encoding name.

func NewEncodingForModel ¶

func NewEncodingForModel(model string) (*Encoding, error)

NewEncodingForModel returns a new Encoding based on the given model. It checks the ModelToEncoding map and ModelPrefixToEncoding map to find a matching encoding.
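
The two-map lookup can be sketched with the standard library alone. The helper `encodingNameForModel` below is hypothetical, the tables are trimmed copies of the ones documented above, and trying the exact match before the prefix match is an assumption about the order rather than a statement about the package's source.

```go
package main

import (
	"fmt"
	"strings"
)

// Trimmed copies of ModelToEncoding and ModelPrefixToEncoding; only a few
// entries are reproduced to keep the sketch short.
var modelToEncoding = map[string]string{
	"gpt-4":            "cl100k_base",
	"gpt-3.5-turbo":    "cl100k_base",
	"text-davinci-003": "p50k_base",
	"davinci":          "r50k_base",
	"gpt2":             "gpt2",
}

var modelPrefixToEncoding = map[string]string{
	"gpt-4-":         "cl100k_base",
	"gpt-3.5-turbo-": "cl100k_base",
	"gpt-35-turbo":   "cl100k_base",
}

// encodingNameForModel is a hypothetical helper, not part of the package.
// It tries an exact model match first and falls back to a prefix match.
func encodingNameForModel(model string) (string, bool) {
	if name, ok := modelToEncoding[model]; ok {
		return name, true
	}
	for prefix, name := range modelPrefixToEncoding {
		if strings.HasPrefix(model, prefix) {
			return name, true
		}
	}
	return "", false
}

func main() {
	models := []string{"gpt-4", "gpt-4-0613", "gpt-3.5-turbo-16k", "unknown-model"}
	for _, m := range models {
		name, ok := encodingNameForModel(m)
		fmt.Printf("%-20s -> %q (found=%t)\n", m, name, ok)
	}
}
```

Under this scheme a dated or suffixed model name such as "gpt-4-0613" resolves through the prefix table, while a name that matches neither table yields no encoding.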

func (*Encoding) Decode ¶

func (enc *Encoding) Decode(tokens []uint) []byte

Decode decodes the given tokens using the Encoding's core BPE.

func (*Encoding) Encode ¶

func (enc *Encoding) Encode(text string, allowedSpecial, disallowedSpecial []string) ([]uint, []string, error)

Encode encodes the given text with the specified allowed and disallowed special tokens.
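
In the upstream Python tiktoken, a special token found in the input is rejected unless it has been explicitly allowed. Whether this package applies the same rule is an assumption; the sketch below only illustrates that kind of check with a hypothetical helper, `findDisallowed`, and a short hand-copied list of the special tokens documented above.

```go
package main

import (
	"fmt"
	"strings"
)

// specialTokens lists a few of the special-token constants documented above.
var specialTokens = []string{"<|endoftext|>", "<|fim_prefix|>", "<|endofprompt|>"}

// findDisallowed is a hypothetical helper, not part of the package.
// It returns the first special token that occurs in text and is not in
// allowed. An allowed list of exactly {"all"} permits every special token.
func findDisallowed(text string, allowed []string) (string, bool) {
	if len(allowed) == 1 && allowed[0] == "all" {
		return "", false
	}
	permitted := make(map[string]bool, len(allowed))
	for _, tok := range allowed {
		permitted[tok] = true
	}
	for _, tok := range specialTokens {
		if !permitted[tok] && strings.Contains(text, tok) {
			return tok, true
		}
	}
	return "", false
}

func main() {
	text := "first document<|endoftext|>second document"

	if tok, found := findDisallowed(text, nil); found {
		fmt.Println("rejected:", tok)
	}
	if _, found := findDisallowed(text, []string{"all"}); !found {
		fmt.Println("accepted when all special tokens are allowed")
	}
}
```

The `{"all"}` sentinel mirrors the shape of the AllSpecial variable; how Encode interprets nil or AllSpecial for either parameter should be confirmed against the package source.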

func (*Encoding) EncodeOrdinary ¶

func (enc *Encoding) EncodeOrdinary(text string) ([]uint, []string)

EncodeOrdinary encodes the given text using the Encoding's core BPE.

func (*Encoding) Name ¶

func (enc *Encoding) Name() string

Name returns the name of the Encoding.

Directories ¶

Path	Synopsis
_examples
decode
encode
