tokenize

package v0.0.0-...-265756f

Published: Jul 31, 2019 License: MIT Imports: 6 Imported by: 4

Documentation

Overview

Package tokenize supplies tokenization operations for BERT. It ports the tokenization.py capabilities from the core BERT repo.

NOTE: All definitions are related to BERT and may vary from Unicode definitions; for example, BERT considers '$' punctuation, but Unicode does not.

Index

Constants

const (
	ClassToken        = "[CLS]"
	SeparatorToken    = "[SEP]"
	SequenceSeparator = " ||| "
)

Static tokens
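
As a hedged sketch of how these static tokens come into play: two sequences joined by SequenceSeparator are treated as a pair, mirroring the " ||| " convention from the reference extract_features.py, and the final token sequence is framed with ClassToken and SeparatorToken. The import path below is assumed for this module.

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
)

func main() {
	// Join a sequence pair with SequenceSeparator (" ||| ").
	text := "the cat sat" + tokenize.SequenceSeparator + "on the mat"
	fmt.Println(text) // the cat sat ||| on the mat

	// ClassToken ("[CLS]") opens the token sequence and SeparatorToken
	// ("[SEP]") closes each sequence when features are built.
	fmt.Println(tokenize.ClassToken, tokenize.SeparatorToken)
}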

const DefaultMaxWordChars = 200

DefaultMaxWordChars is the maximum length of a token for it to be tokenized; longer tokens are marked as unknown.

const DefaultUnknownToken = "[UNK]"

DefaultUnknownToken is the token used to signify an unknown token.

Variables

This section is empty.

Functions

This section is empty.

Types

type Basic

type Basic struct {
	// Lower will apply a lower case filter to input
	Lower bool
}

Basic is a BasicTokenizer that runs basic tokenization (punctuation splitting, lower casing, etc.).

func NewBasic

func NewBasic() Basic

NewBasic returns a basic tokenizer. The constructor is supplied to match the constructors of the other tokenizers.

func (Basic) Tokenize

func (bt Basic) Tokenize(text string) []string

Tokenize will segment a text into individual tokens. It follows the algorithm from the reference implementation: Clean, PadChinese, Whitespace Split, Lower (optional), SplitPunc, Whitespace Split.
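
A minimal usage sketch; the expected output is an assumption based on the splitting rules above, and the import path is assumed for this module:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
)

func main() {
	bt := tokenize.NewBasic()
	bt.Lower = true // apply the lower case filter

	// Punctuation is split into standalone tokens; recall that BERT
	// treats '$' as punctuation even though Unicode does not.
	fmt.Println(bt.Tokenize("It costs $5, I think."))
	// expected (assumption): [it costs $ 5 , i think .]
}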

type Feature

type Feature struct {
	ID       int32
	Text     string
	Tokens   []string
	TokenIDs []int32
	Mask     []int32 // short?
	TypeIDs  []int32 // sequence ids, short?
}

Feature is an input feature for a BERT model. It maps to extract_features.InputFeature in the reference implementation.

func (Feature) Count

func (f Feature) Count() int

Count will return the number of tokens in the feature by counting the mask bits
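
Since Count derives its answer from the mask bits, a hand-built Feature illustrates it. This is a sketch; the convention of 1 for real tokens and 0 for padding is assumed from the reference implementation, and the import path is assumed:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
)

func main() {
	f := tokenize.Feature{
		Tokens: []string{"[CLS]", "hello", "[SEP]"},
		Mask:   []int32{1, 1, 1, 0, 0}, // 1 marks a real token, 0 marks padding (assumption)
	}
	fmt.Println(f.Count()) // 3: the number of set mask bits
}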

type FeatureFactory

type FeatureFactory struct {
	Tokenizer VocabTokenizer
	SeqLen    int32
	// contains filtered or unexported fields
}

FeatureFactory will create features with the supplied tokenizer and sequence length

func (*FeatureFactory) Feature

func (ff *FeatureFactory) Feature(text string) Feature

Feature will create a single feature from the factory. ID creation is thread safe and incremental.

func (*FeatureFactory) Features

func (ff *FeatureFactory) Features(texts ...string) []Feature

Features will create multiple features with incremental IDs
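
A sketch of batch feature creation. vocab.FromFile is an assumed loader in the companion vocab package and vocab.txt is a placeholder path; substitute however your version builds a vocab.Dict. Import paths are assumed:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
	"github.com/buckhx/gobert/vocab"    // assumed import path
)

func main() {
	voc, err := vocab.FromFile("vocab.txt") // assumed loader, placeholder path
	if err != nil {
		panic(err)
	}

	ff := &tokenize.FeatureFactory{
		Tokenizer: tokenize.NewTokenizer(voc),
		SeqLen:    128,
	}
	for _, f := range ff.Features("the first text", "the second text") {
		fmt.Println(f.ID, f.Tokens) // IDs are assigned incrementally: 0, 1, ...
	}
}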

type Full

type Full struct {
	Basic     Basic
	Wordpiece Wordpiece
}

Full is a FullTokenizer that comprises a Basic and a Wordpiece tokenizer.

func (Full) Tokenize

func (f Full) Tokenize(text string) []string

Tokenize will tokenize the input text. Basic tokenization is applied first, then wordpiece tokenization is applied to the tokens from basic.
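
A sketch of the two stages, assuming a vocab.Dict named voc is already in scope; the intermediate and final outputs are assumptions based on the algorithm, not verified against this port:

f := tokenize.Full{
	Basic:     tokenize.NewBasic(),
	Wordpiece: tokenize.NewWordpiece(voc), // voc is an assumed vocab.Dict
}

// Basic first splits "unaffable!" into ["unaffable", "!"]; Wordpiece then
// subdivides each token against the vocabulary, e.g.
// ["un", "##aff", "##able", "!"] (assumption).
fmt.Println(f.Tokenize("unaffable!"))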

func (Full) Vocab

func (f Full) Vocab() vocab.Dict

Vocab returns the vocab used for this tokenizer.

type Option

type Option func(tkz Full) Full

Option alters the behavior of the tokenizer. TODO: add tests for these behavior changes.

func WithLower

func WithLower(lower bool) Option

WithLower will lowercase all input if set to true, or skip lowering if false. NOTE: a quirk inherited from the reference implementation is that lowering also strips accents.

func WithMaxChars

func WithMaxChars(wc int) Option

WithMaxChars sets the maximum length of a token to be tokenized; longer tokens will be labeled as unknown.

func WithUnknownToken

func WithUnknownToken(unk string) Option

WithUnknownToken will alter the unknown token from the default [UNK].

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) (tokens []string)
}

Tokenizer is an interface for chunking a string into its tokens as per the BERT implementation.
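
Both concrete tokenizers in this package declare Tokenize on value receivers, so compile-time assertions like the following should hold (a sketch):

// Compile-time checks that the concrete tokenizers satisfy Tokenizer.
var (
	_ tokenize.Tokenizer = tokenize.Basic{}
	_ tokenize.Tokenizer = tokenize.Full{}
)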

type VocabTokenizer

type VocabTokenizer interface {
	Tokenizer
	vocab.Provider
}

VocabTokenizer combines a Tokenizer and a vocab.Provider.

func NewTokenizer

func NewTokenizer(voc vocab.Dict, opts ...Option) VocabTokenizer

NewTokenizer returns a new FullTokenizer. Use Options to modify the default behavior.
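
A sketch of constructing a tokenizer with options; vocab.FromFile is again an assumed loader, the option values are illustrative, and the import paths are assumed:

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // assumed import path
	"github.com/buckhx/gobert/vocab"    // assumed import path
)

func main() {
	voc, err := vocab.FromFile("vocab.txt") // assumed loader, placeholder path
	if err != nil {
		panic(err)
	}

	tkz := tokenize.NewTokenizer(voc,
		tokenize.WithLower(true),           // lowercase (and, per the note above, strip accents)
		tokenize.WithMaxChars(100),         // words longer than 100 chars become unknown
		tokenize.WithUnknownToken("[UNK]"), // explicit here; matches the default
	)
	fmt.Println(tkz.Tokenize("Hello, BERT!"))
}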

type Wordpiece

type Wordpiece struct {
	// contains filtered or unexported fields
}

Wordpiece is a tokenizer that breaks tokens into subword units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details.

func NewWordpiece

func NewWordpiece(voc vocab.Dict) Wordpiece

NewWordpiece returns a WordpieceTokenizer with the default settings. It should generally be used as part of a FullTokenizer.

func (Wordpiece) SetMaxWordChars

func (wp Wordpiece) SetMaxWordChars(c int)

SetMaxWordChars will set the max chars for a word to be tokenized; generally this should be configured through the FullTokenizer.

func (Wordpiece) SetUnknownToken

func (wp Wordpiece) SetUnknownToken(tok string)

SetUnknownToken will set the unknown token; generally this should be configured through the FullTokenizer.

func (Wordpiece) Tokenize

func (wp Wordpiece) Tokenize(text string) []string

Tokenize will segment the text into subword tokens from the supplied vocabulary. NOTE: this implementation does not EXACTLY match the reference implementation and behaves slightly differently; see https://github.com/google-research/bert/issues/763.
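
As a hedged illustration of the greedy longest-match-first behavior described in the paper, assume a vocab.Dict named voc built from entries such as un, ##aff, ##able, and [UNK]; the expected outputs are assumptions based on the algorithm, not verified against this port:

wp := tokenize.NewWordpiece(voc) // voc is an assumed vocab.Dict

// In-vocabulary subwords are matched greedily, longest first.
fmt.Println(wp.Tokenize("unaffable")) // expected (assumption): [un ##aff ##able]

// A word with no matching subwords collapses to the unknown token.
fmt.Println(wp.Tokenize("qzxv")) // expected (assumption): [[UNK]]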
