gobert: github.com/buckhx/gobert/tokenize

package tokenize

import "github.com/buckhx/gobert/tokenize"

Package tokenize supplies tokenization operations for BERT. It ports the tokenizer.py capabilities from the core BERT repo.

NOTE: All definitions are BERT-specific and may vary from the Unicode definitions; for example, BERT considers '$' punctuation, but Unicode does not.

Index

Package Files

basic.go feature.go full.go tokenizer.go unicode.go wordpiece.go

Constants

const (
    ClassToken        = "[CLS]"
    SeparatorToken    = "[SEP]"
    SequenceSeparator = " ||| "
)

Static tokens
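
SequenceSeparator matches the " ||| " sentence-pair convention of the reference extract_features script, so a pair of texts can be joined into one input string. A minimal sketch (how downstream feature code consumes the separator is not shown here):

package main

import (
    "fmt"

    "github.com/buckhx/gobert/tokenize"
)

func main() {
    // Join a sentence pair into a single input string using the exported constant.
    pair := "How old are you?" + tokenize.SequenceSeparator + "I am six."
    fmt.Println(pair) // How old are you? ||| I am six.
}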

const DefaultMaxWordChars = 200

DefaultMaxWordChars is the maximum length of a token for it to be tokenized; longer tokens are marked as unknown.

const DefaultUnknownToken = "[UNK]"

DefaultUnknownToken is the token used to signify an unknown token.

type Basic

type Basic struct {
    // Lower will apply a lower case filter to input
    Lower bool
}

Basic is a BasicTokenizer that runs basic tokenization (punctuation splitting, lower casing, etc.).

func NewBasic

func NewBasic() Basic

NewBasic returns a basic tokenizer. The constructor is supplied to match the constructors of the other tokenizers.

func (Basic) Tokenize

func (bt Basic) Tokenize(text string) []string

Tokenize will segment a text into individual tokens. Follows the algorithm from the reference implementation: Clean, PadChinese, whitespace split, optional lowering, SplitPunc, whitespace split.
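
A minimal sketch of using Basic on its own; the token behavior noted in the comment is illustrative, not verified output:

package main

import (
    "fmt"

    "github.com/buckhx/gobert/tokenize"
)

func main() {
    bt := tokenize.NewBasic()
    bt.Lower = true // lower-case the input before splitting
    fmt.Println(bt.Tokenize("Héllo, World! $5"))
    // punctuation such as ',' '!' and '$' is split into its own token
}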

type Feature

type Feature struct {
    ID       int32
    Text     string
    Tokens   []string
    TokenIDs []int32
    Mask     []int32 // short?
    TypeIDs  []int32 // sequence ids, short?
}

Feature is an input feature for a BERT model. Maps to extract_features.InputFeature in ref-impl

func (Feature) Count

func (f Feature) Count() int

Count will return the number of tokens in the feature by counting the mask bits
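
A small sketch of the relationship between the mask and Count; the field values below are hand-written for illustration, not produced by the library:

package main

import (
    "fmt"

    "github.com/buckhx/gobert/tokenize"
)

func main() {
    // Real features come from a FeatureFactory; these values are hypothetical.
    f := tokenize.Feature{
        TokenIDs: []int32{101, 7592, 102, 0, 0}, // illustrative IDs padded out to a SeqLen of 5
        Mask:     []int32{1, 1, 1, 0, 0},        // 1 = real token, 0 = padding
        TypeIDs:  []int32{0, 0, 0, 0, 0},
    }
    fmt.Println(f.Count()) // 3, the number of set mask bits
}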

type FeatureFactory

type FeatureFactory struct {
    Tokenizer VocabTokenizer
    SeqLen    int32
    // contains filtered or unexported fields
}

FeatureFactory will create features with the supplied tokenizer and sequence length

func (*FeatureFactory) Feature

func (ff *FeatureFactory) Feature(text string) Feature

Feature will create a single feature from the factory. ID creation is thread-safe and incremental.

func (*FeatureFactory) Features

func (ff *FeatureFactory) Features(texts ...string) []Feature

Features will create multiple features with incremental IDs
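
A sketch of driving the factory end to end. It assumes the vocab subpackage exposes a FromFile loader and that vocab.txt is the vocabulary file shipped with a pre-trained BERT model; check the vocab package for the actual API:

package main

import (
    "fmt"
    "log"

    "github.com/buckhx/gobert/tokenize"
    "github.com/buckhx/gobert/tokenize/vocab"
)

func main() {
    // vocab.FromFile is assumed here; see the vocab subpackage for the actual loader.
    voc, err := vocab.FromFile("vocab.txt")
    if err != nil {
        log.Fatal(err)
    }
    ff := &tokenize.FeatureFactory{Tokenizer: tokenize.NewTokenizer(voc), SeqLen: 32}
    for _, f := range ff.Features("the quick brown fox", "jumps over the lazy dog") {
        fmt.Println(f.ID, f.Count(), f.TokenIDs) // IDs increment across calls to the factory
    }
}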

type Full

type Full struct {
    Basic     Basic
    Wordpiece Wordpiece
}

Full is a FullTokenizer which comprises a Basic and a Wordpiece tokenizer.

func (Full) Tokenize

func (f Full) Tokenize(text string) []string

Tokenize will tokenize the input text. First Basic is applied, then Wordpiece is run on the tokens from Basic.

func (Full) Vocab

func (f Full) Vocab() vocab.Dict

Vocab returns the vocabulary used for this tokenizer.

type Option

type Option func(tkz Full) Full

Option alters the behavior of the tokenizer. TODO: add tests for these behavior changes.

func WithLower

func WithLower(lower bool) Option

WithLower will lowercase all input if set to true, or skip lowering if false. NOTE: a quirk of the reference implementation is that lowering also strips accents.

func WithMaxChars

func WithMaxChars(wc int) Option

WithMaxChars sets the maximum length of a token to be tokenized; longer tokens will be labeled as unknown.

func WithUnknownToken

func WithUnknownToken(unk string) Option

WithUnknownToken will alter the unknown token from the default [UNK].

type Tokenizer

type Tokenizer interface {
    Tokenize(text string) (tokens []string)
}

Tokenizer is an interface for chunking a string into its tokens as per the BERT implementation.

type VocabTokenizer

type VocabTokenizer interface {
    Tokenizer
    vocab.Provider
}

VocabTokenizer comprises a Tokenizer and a vocab.Provider.

func NewTokenizer

func NewTokenizer(voc vocab.Dict, opts ...Option) VocabTokenizer

NewTokenizer returns a new FullTokenizer. Use Options to modify the default behavior.
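
A sketch of constructing a tokenizer with non-default options. As above, the vocab.FromFile loader is an assumption about the vocab subpackage, not a call documented on this page:

package main

import (
    "fmt"
    "log"

    "github.com/buckhx/gobert/tokenize"
    "github.com/buckhx/gobert/tokenize/vocab"
)

func main() {
    voc, err := vocab.FromFile("vocab.txt") // loader name assumed; see the vocab subpackage
    if err != nil {
        log.Fatal(err)
    }
    tkz := tokenize.NewTokenizer(voc,
        tokenize.WithLower(false),          // keep case (and accents)
        tokenize.WithMaxChars(100),         // longer words become the unknown token
        tokenize.WithUnknownToken("[UNK]"), // explicit here, same as the default
    )
    fmt.Println(tkz.Tokenize("Representations from Transformers"))
}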

type Wordpiece

type Wordpiece struct {
    // contains filtered or unexported fields
}

Wordpiece is a tokenizer that breaks tokens into subword units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details.

func NewWordpiece

func NewWordpiece(voc vocab.Dict) Wordpiece

NewWordpiece returns a WordpieceTokenizer with the default settings. Generally it should be used in a FullTokenizer.

func (Wordpiece) SetMaxWordChars

func (wp Wordpiece) SetMaxWordChars(c int)

SetMaxWordChars will set the max chars for a word to be tokenized; generally this should be configured through the FullTokenizer.

func (Wordpiece) SetUnknownToken

func (wp Wordpiece) SetUnknownToken(tok string)

SetUnknownToken will set the unknown token; generally this should be configured through the FullTokenizer.

func (Wordpiece) Tokenize

func (wp Wordpiece) Tokenize(text string) []string

Tokenize will segment the text into subword tokens from the supplied vocabulary. NOTE: This implementation does not EXACTLY match the ref-impl and behaves slightly differently. See https://github.com/google-research/bert/issues/763
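
A sketch of the subword splitting: a word absent from the vocabulary is split greedily into the longest known pieces, with continuations prefixed by "##". The vocab.FromFile loader and the exact output are assumptions, not verified behavior:

package main

import (
    "fmt"
    "log"

    "github.com/buckhx/gobert/tokenize"
    "github.com/buckhx/gobert/tokenize/vocab"
)

func main() {
    voc, err := vocab.FromFile("vocab.txt") // loader name assumed; see the vocab subpackage
    if err != nil {
        log.Fatal(err)
    }
    wp := tokenize.NewWordpiece(voc)
    // e.g. "unaffable" -> [un ##aff ##able] with a typical English BERT vocabulary
    fmt.Println(wp.Tokenize("unaffable"))
}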

Directories

Path      Synopsis
vocab

Package tokenize imports 6 packages and is imported by 2 packages. Updated 2019-08-24.