tokenize

package v0.0.0-...-265756f

Published: Jul 31, 2019 License: MIT Imports: 6 Imported by: 4

Documentation

Overview

Package tokenize supplies tokenization operations for BERT. It ports the tokenization.py capabilities from the core BERT repo.

NOTE: All definitions are related to BERT and may vary from Unicode definitions; for example, BERT considers '$' punctuation, but Unicode does not.

Index

Constants

const (
	ClassToken        = "[CLS]"
	SeparatorToken    = "[SEP]"
	SequenceSeparator = " ||| "
)

Static tokens
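
As a hedged sketch of how these static tokens come into play: two sequences joined by SequenceSeparator are treated as a pair, mirroring the " ||| " convention from the reference extract_features.py, and the final token sequence is framed with ClassToken and SeparatorToken. The import path below is assumed for this module.

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
)

func main() {
	// Join a sequence pair with SequenceSeparator (" ||| ").
	text := "the cat sat" + tokenize.SequenceSeparator + "on the mat"
	fmt.Println(text) // the cat sat ||| on the mat

	// ClassToken ("[CLS]") opens the token sequence and SeparatorToken
	// ("[SEP]") closes each sequence when features are built.
	fmt.Println(tokenize.ClassToken, tokenize.SeparatorToken)
}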

const DefaultMaxWordChars = 200

DefaultMaxWordChars is the maximum length of a token for it to be tokenized; longer tokens are marked as unknown.

const DefaultUnknownToken = "[UNK]"

DefaultUnknownToken is the token used to signify an unknown token.

Variables

This section is empty.

Functions

This section is empty.

Types

type Basic

type Basic struct {
	// Lower will apply a lower case filter to input
	Lower bool
}

Basic is a BasicTokenizer that runs basic tokenization (punctuation splitting, lower casing, etc.).

func NewBasic

func NewBasic() Basic

NewBasic returns a basic tokenizer. The constructor is supplied to match the constructors of the other tokenizers.

func (Basic) Tokenize

func (bt Basic) Tokenize(text string) []string

Tokenize will segment a text into individual tokens. It follows the algorithm from the reference implementation: Clean, PadChinese, Whitespace Split, Lower (optional), SplitPunc, Whitespace Split.
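
A minimal usage sketch; the expected output is an assumption based on the splitting rules above, and the import path is assumed for this module:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
)

func main() {
	bt := tokenize.NewBasic()
	bt.Lower = true // apply the lower case filter

	// Punctuation is split into standalone tokens; recall that BERT
	// treats '$' as punctuation even though Unicode does not.
	fmt.Println(bt.Tokenize("It costs $5, I think."))
	// expected (assumption): [it costs $ 5 , i think .]
}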

type Feature

type Feature struct {
	ID       int32
	Text     string
	Tokens   []string
	TokenIDs []int32
	Mask     []int32 // short?
	TypeIDs  []int32 // sequence ids, short?
}

Feature is an input feature for a BERT model. It maps to extract_features.InputFeature in the reference implementation.

func (Feature) Count

func (f Feature) Count() int

Count will return the number of tokens in the feature by counting the mask bits
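
Since Count derives its answer from the mask bits, a hand-built Feature illustrates it. This is a sketch; the convention of 1 for real tokens and 0 for padding is assumed from the reference implementation, and the import path is assumed:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
)

func main() {
	f := tokenize.Feature{
		Tokens: []string{"[CLS]", "hello", "[SEP]"},
		Mask:   []int32{1, 1, 1, 0, 0}, // 1 marks a real token, 0 marks padding (assumption)
	}
	fmt.Println(f.Count()) // 3: the number of set mask bits
}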

type FeatureFactory

type FeatureFactory struct {
	Tokenizer VocabTokenizer
	SeqLen    int32
	// contains filtered or unexported fields
}

FeatureFactory will create features with the supplied tokenizer and sequence length

func (*FeatureFactory) Feature

func (ff *FeatureFactory) Feature(text string) Feature

Feature will create a single feature from the factory. ID creation is thread safe and incremental.

func (*FeatureFactory) Features

func (ff *FeatureFactory) Features(texts ...string) []Feature

Features will create multiple features with incremental IDs
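
A sketch of batch feature creation. vocab.FromFile is an assumed loader in the companion vocab package and vocab.txt is a placeholder path; substitute however your version builds a vocab.Dict. Import paths are assumed:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
	"github.com/buckhx/gobert/vocab"    // assumed import path
)

func main() {
	voc, err := vocab.FromFile("vocab.txt") // assumed loader, placeholder path
	if err != nil {
		panic(err)
	}

	ff := &tokenize.FeatureFactory{
		Tokenizer: tokenize.NewTokenizer(voc),
		SeqLen:    128,
	}
	for _, f := range ff.Features("the first text", "the second text") {
		fmt.Println(f.ID, f.Tokens) // IDs are assigned incrementally: 0, 1, ...
	}
}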

type Full

type Full struct {
	Basic     Basic
	Wordpiece Wordpiece
}

Full is a FullTokenizer that comprises a Basic and a Wordpiece tokenizer.

func (Full) Tokenize

func (f Full) Tokenize(text string) []string

Tokenize will tokenize the input text. Basic tokenization is applied first, then wordpiece tokenization is applied to the tokens from basic.
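
A sketch of the two stages, assuming a vocab.Dict named voc is already in scope; the intermediate and final outputs are assumptions based on the algorithm, not verified against this port:

f := tokenize.Full{
	Basic:     tokenize.NewBasic(),
	Wordpiece: tokenize.NewWordpiece(voc), // voc is an assumed vocab.Dict
}

// Basic first splits "unaffable!" into ["unaffable", "!"]; Wordpiece then
// subdivides each token against the vocabulary, e.g.
// ["un", "##aff", "##able", "!"] (assumption).
fmt.Println(f.Tokenize("unaffable!"))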

func (Full) Vocab

func (f Full) Vocab() vocab.Dict

Vocab returns the vocab used for this tokenizer.

type Option

type Option func(tkz Full) Full

Option alters the behavior of the tokenizer. TODO: add tests for these behavior changes.

func WithLower

func WithLower(lower bool) Option

WithLower will lowercase all input if set to true, or skip lowering if false. NOTE: a quirk inherited from the reference implementation is that lowering also strips accents.

func WithMaxChars

func WithMaxChars(wc int) Option

WithMaxChars sets the maximum length of a token to be tokenized; longer tokens will be labeled as unknown.

func WithUnknownToken

func WithUnknownToken(unk string) Option

WithUnknownToken will alter the unknown token from the default [UNK].

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) (tokens []string)
}

Tokenizer is an interface for chunking a string into its tokens as per the BERT implementation.
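
Both concrete tokenizers in this package declare Tokenize on value receivers, so compile-time assertions like the following should hold (a sketch):

// Compile-time checks that the concrete tokenizers satisfy Tokenizer.
var (
	_ tokenize.Tokenizer = tokenize.Basic{}
	_ tokenize.Tokenizer = tokenize.Full{}
)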

type VocabTokenizer

type VocabTokenizer interface {
	Tokenizer
	vocab.Provider
}

VocabTokenizer combines a Tokenizer and a vocab.Provider.

func NewTokenizer

func NewTokenizer(voc vocab.Dict, opts ...Option) VocabTokenizer

NewTokenizer returns a new FullTokenizer. Use Options to modify the default behavior.
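
A sketch of constructing a tokenizer with options; vocab.FromFile is again an assumed loader, the option values are illustrative, and the import paths are assumed:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
	"github.com/buckhx/gobert/vocab"    // assumed import path
)

func main() {
	voc, err := vocab.FromFile("vocab.txt") // assumed loader, placeholder path
	if err != nil {
		panic(err)
	}

	tkz := tokenize.NewTokenizer(voc,
		tokenize.WithLower(true),           // lowercase (and, per the note above, strip accents)
		tokenize.WithMaxChars(100),         // words longer than 100 chars become unknown
		tokenize.WithUnknownToken("[UNK]"), // explicit here; matches the default
	)
	fmt.Println(tkz.Tokenize("Hello, BERT!"))
}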

type Wordpiece

type Wordpiece struct {
	// contains filtered or unexported fields
}

Wordpiece is a tokenizer that breaks tokens into subword units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details.

func NewWordpiece

func NewWordpiece(voc vocab.Dict) Wordpiece

NewWordpiece returns a WordpieceTokenizer with the default settings. It should generally be used as part of a FullTokenizer.

func (Wordpiece) SetMaxWordChars

func (wp Wordpiece) SetMaxWordChars(c int)

SetMaxWordChars will set the max chars for a word to be tokenized; generally this should be configured through the FullTokenizer.

func (Wordpiece) SetUnknownToken

func (wp Wordpiece) SetUnknownToken(tok string)

SetUnknownToken will set the unknown token; generally this should be configured through the FullTokenizer.

func (Wordpiece) Tokenize

func (wp Wordpiece) Tokenize(text string) []string

Tokenize will segment the text into subword tokens from the supplied vocabulary. NOTE: this implementation does not EXACTLY match the reference implementation and behaves slightly differently; see https://github.com/google-research/bert/issues/763.
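
As a hedged illustration of the greedy longest-match-first behavior described in the paper, assume a vocab.Dict named voc built from entries such as un, ##aff, ##able, and [UNK]; the expected outputs are assumptions based on the algorithm, not verified against this port:

wp := tokenize.NewWordpiece(voc) // voc is an assumed vocab.Dict

// In-vocabulary subwords are matched greedily, longest first.
fmt.Println(wp.Tokenize("unaffable")) // expected (assumption): [un ##aff ##able]

// A word with no matching subwords collapses to the unknown token.
fmt.Println(wp.Tokenize("qzxv")) // expected (assumption): [[UNK]]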
