Documentation ¶
Constants ¶
const DefaultCacheCapacity = 10_000
DefaultCacheCapacity is the default capacity of the BPEModel's internal cache.
Variables ¶
var ErrUnknownTokenOutOfVocabulary = fmt.Errorf("the provided unk token is out of vocabulary")
Functions ¶
This section is empty.
Types ¶
type BPEModel ¶
type BPEModel struct {
// contains filtered or unexported fields
}
BPEModel is a Byte Pair Encoding (BPE) model.
See: https://www.aclweb.org/anthology/P16-1162/
func New ¶
func New(
	vocab *vocabulary.Vocabulary,
	merges *MergeMap,
	cacheCapacity int,
	dropout float64,
	unknownToken string,
	continuingSubwordPrefix string,
	endOfWordSuffix string,
	unknownFusionEnabled bool,
) *BPEModel
New returns a new BPEModel initialized with the given options.
func NewDefault ¶
func NewDefault() *BPEModel
type MergeMap ¶
type MergeMap map[symbolIDPair]MergeValue
MergeMap maps pairs of Symbol IDs to (Rank, ID) values.
func MergeMapFromFile ¶
func MergeMapFromFile(
	filename string,
	vocab *vocabulary.Vocabulary,
	prefixLength int,
) (m *MergeMap, err error)
MergeMapFromFile reads merges from file.
func (*MergeMap) Get ¶
func (m *MergeMap) Get(firstID, secondID int) (MergeValue, bool)
Get returns the value associated with the given pair of IDs, and whether that value exists in the map.
func (*MergeMap) Set ¶
func (m *MergeMap) Set(firstID, secondID int, v MergeValue)
type MergeValue ¶
type MergeValue struct {
	// Rank determines the order in which a merge is applied during
	// tokenization.
	Rank int
	// ID is the vocabulary ID of the symbol resulting from merging a pair of
	// symbols.
	ID int
}
MergeValue is a (Rank, ID) pair.
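The following is a minimal, self-contained sketch of how Get and Set might behave on a MergeMap. It redeclares the documented types locally instead of importing the package. The field layout of symbolIDPair is an assumption (the real type is unexported), and the method bodies are illustrative, not the package's actual implementation.

```go
package main

import "fmt"

// MergeValue mirrors the documented (Rank, ID) pair.
type MergeValue struct {
	Rank int
	ID   int
}

// symbolIDPair is an ASSUMED layout for the package's unexported key type.
type symbolIDPair struct {
	first, second int
}

// MergeMap mirrors the documented map type.
type MergeMap map[symbolIDPair]MergeValue

// Get returns the value stored for the ordered pair (firstID, secondID),
// and whether such a value exists.
func (m *MergeMap) Get(firstID, secondID int) (MergeValue, bool) {
	v, ok := (*m)[symbolIDPair{firstID, secondID}]
	return v, ok
}

// Set stores v for the ordered pair (firstID, secondID).
func (m *MergeMap) Set(firstID, secondID int, v MergeValue) {
	(*m)[symbolIDPair{firstID, secondID}] = v
}

func main() {
	m := make(MergeMap)
	// Hypothetical IDs: merging symbols 10 and 11 yields vocabulary ID 42.
	m.Set(10, 11, MergeValue{Rank: 0, ID: 42})

	if v, ok := m.Get(10, 11); ok {
		fmt.Println("rank:", v.Rank, "id:", v.ID)
	}

	// In this sketch the pair is ordered, so the reversed pair is a miss.
	_, ok := m.Get(11, 10)
	fmt.Println("reversed pair found:", ok)
}
```

In this sketch the key is an ordered pair, so looking up (11, 10) does not return the merge registered for (10, 11).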
type Symbol ¶
type Symbol struct {
	// Unique identifier, which implicitly refers to a sequence of characters.
	// For example, it might be the ID of a word in a vocabulary.
	ID int
	// The length in bytes of the implicit sequence of characters.
	Length int
}
Symbol is an abstract reference to a sequence of characters.
type Word ¶
type Word []*WordSymbol
Word is a slice of WordSymbol pointers.
func NewWordWithCapacity ¶
NewWordWithCapacity returns a new empty Word with the given capacity.
type WordCache ¶
type WordCache struct {
// contains filtered or unexported fields
}
func NewCache ¶
NewCache returns a new WordCache initialized with the given capacity.
If capacity is set to zero, the cache becomes ineffective (is disabled).
func NewDefaultCache ¶
func NewDefaultCache() *WordCache
NewDefaultCache returns a new WordCache initialized with the default capacity.
type WordMerge ¶
type WordMerge struct {
	MergeValue
	Pos int
}
type WordMergeHeap ¶
type WordMergeHeap []WordMerge
func (*WordMergeHeap) Len ¶
func (h *WordMergeHeap) Len() int
func (*WordMergeHeap) Less ¶
func (h *WordMergeHeap) Less(i, j int) bool
func (*WordMergeHeap) Pop ¶
func (h *WordMergeHeap) Pop() interface{}
func (*WordMergeHeap) Push ¶
func (h *WordMergeHeap) Push(x interface{})
func (*WordMergeHeap) Swap ¶
func (h *WordMergeHeap) Swap(i, j int)
type WordSymbol ¶
type WordSymbol struct {
	Symbol
	// Prev is the index of the previous symbol in the Word.
	// -1 means no previous symbol.
	Prev int
	// Next is the index of the next symbol in the Word.
	// -1 means no next symbol.
	Next int
}
WordSymbol expands a Symbol with contextual information related to the Word that contains it.
func (*WordSymbol) HasNext ¶
func (s *WordSymbol) HasNext() bool
func (*WordSymbol) HasPrev ¶
func (s *WordSymbol) HasPrev() bool
func (*WordSymbol) MergeWith ¶
func (s *WordSymbol) MergeWith(other *WordSymbol, newSymbolID int)
MergeWith merges the current WordSymbol with the other one. In order to update prev/next, we consider the receiver to be the WordSymbol on the left, and other to be the next one on the right.
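As a rough illustration of the left/right convention described above, the sketch below redeclares Symbol and WordSymbol locally and gives one plausible MergeWith body. The exact field updates (replacing the ID, summing the byte lengths, and inheriting the right symbol's Next) are assumptions, not the package's confirmed implementation.

```go
package main

import "fmt"

// Symbol and WordSymbol mirror the documented types.
type Symbol struct {
	ID     int
	Length int
}

type WordSymbol struct {
	Symbol
	Prev int // index of the previous symbol, -1 if none
	Next int // index of the next symbol, -1 if none
}

func (s *WordSymbol) HasPrev() bool { return s.Prev != -1 }
func (s *WordSymbol) HasNext() bool { return s.Next != -1 }

// MergeWith is an ASSUMED implementation: the receiver is the left symbol
// and other is its right neighbour. The receiver takes the new ID, covers
// both byte spans, and links to whatever followed other.
func (s *WordSymbol) MergeWith(other *WordSymbol, newSymbolID int) {
	s.ID = newSymbolID
	s.Length += other.Length
	s.Next = other.Next
}

func main() {
	// A two-symbol word with hypothetical IDs 1 and 2.
	left := &WordSymbol{Symbol: Symbol{ID: 1, Length: 1}, Prev: -1, Next: 1}
	right := &WordSymbol{Symbol: Symbol{ID: 2, Length: 2}, Prev: 0, Next: -1}

	left.MergeWith(right, 9)

	fmt.Println(left.ID, left.Length, left.HasPrev(), left.HasNext())
}
```

Because neighbours are stored as indices into the Word, a merge of this shape only touches the left symbol; in this sketch the right symbol is left in place and simply becomes unreachable through the Prev/Next links.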