
Tokenizer

Tokenizer — parses any string, slice, or infinite buffer into tokens.

Use cases:

  • Parsing HTML, XML, JSON, YAML, and other text formats.
  • Parsing huge or infinite texts.
  • Parsing any programming languages.
  • Parsing templates.
  • Parsing formulas.

For example, parsing the SQL WHERE condition user_id = 119 and modified > "2020-01-01 00:00:00" or amount >= 122.34:

// define custom token keys
const (
	TEquality     = 1
	TDot          = 2
	TMath         = 3
	TDoubleQuoted = 4
)

// configure tokenizer
parser := tokenizer.New()
parser.DefineTokens(TEquality, []string{"<", "<=", "==", ">=", ">", "!="})
parser.DefineTokens(TDot, []string{"."})
parser.DefineTokens(TMath, []string{"+", "-", "/", "*", "%"})
parser.DefineStringToken(TDoubleQuoted, `"`, `"`).SetEscapeSymbol(tokenizer.BackSlash)

// create tokens stream
stream := parser.ParseString(`user_id = 119 and modified > "2020-01-01 00:00:00" or amount >= 122.34`)
defer stream.Close()

// iterate over each token
for stream.IsValid() {
	if stream.CurrentToken().Is(tokenizer.TokenKeyword) {
		field := stream.CurrentToken().ValueString()
		// ...
	}
	stream.GoNext()
}

stream tokens:

string:  user_id  =  119  and  modified  >  "2020-01-01 00:00:00"  or  amount  >=  122.34
tokens: |user_id| =| 119| and| modified| >| "2020-01-01 00:00:00"| or| amount| >=| 122.34|
        |   0   | 1|  2 |  3 |    4    | 5|            6         | 7 |    8  | 9 |    10 |

0:  {key: TokenKeyword, value: "user_id"}                token.Value()          == "user_id"
1:  {key: TEquality, value: "="}                         token.Value()          == "="
2:  {key: TokenInteger, value: "119"}                    token.ValueInt()       == 119
3:  {key: TokenKeyword, value: "and"}                    token.Value()          == "and"
4:  {key: TokenKeyword, value: "modified"}               token.Value()          == "modified"
5:  {key: TEquality, value: ">"}                         token.Value()          == ">"
6:  {key: TokenString, value: "\"2020-01-01 00:00:00\""} token.ValueUnescaped() == "2020-01-01 00:00:00"
7:  {key: TokenKeyword, value: "or"}                     token.Value()          == "or"
8:  {key: TokenKeyword, value: "amount"}                 token.Value()          == "amount"
9:  {key: TEquality, value: ">="}                        token.Value()          == ">="
10: {key: TokenFloat, value: "122.34"}                   token.ValueFloat()     == 122.34

Begin

Create and parse

import (
    "github.com/bzick/tokenizer"
)

parser := tokenizer.New()
parser.AllowKeywordUnderscore() // ... and other configuration code

There are two ways to parse a string or a slice:

  • parser.ParseString(str)
  • parser.ParseBytes(slice)

The package can also parse an endless stream of data into tokens. To do so, pass an io.Reader from which data will be read chunk-by-chunk:

fp, err := os.Open("data.json") // huge JSON file
// check err, configure tokenizer ...

stream := parser.ParseStream(fp, 4096).SetHistorySize(10)
defer stream.Close()
for stream.IsValid() { 
	// ...
	stream.Next()
}

Embedded tokens

  • tokenizer.TokenUnknown — unspecified token key.
  • tokenizer.TokenKeyword — keyword, any combination of letters, including unicode letters.
  • tokenizer.TokenInteger — integer value.
  • tokenizer.TokenFloat — float/double value.
  • tokenizer.TokenString — quoted string.
  • tokenizer.TokenStringFragment — fragment of a framed (quoted) string.

Unknown token — tokenizer.TokenUnknown

A token is marked as TokenUnknown if the parser detects an unknown token:

parser.ParseString(`one!`)
{
    {
        Key: tokenizer.TokenKeyword
        Value: "One"
    },
    {
        Key: tokenizer.TokenUnknown
        Value: "!"
    }
}

By default, TokenUnknown tokens are added to the stream. To exclude them from the stream, use the tokenizer.StopOnUndefinedToken() method:

{
    {
        Key: tokenizer.TokenKeyword
        Value: "one"
    }
}

Please note that if the tokenizer.StopOnUndefinedToken setting is enabled, the string may not be fully parsed. To find out whether the string was fully parsed, compare the parsed length stream.GetParsedLength() with the length of the original string.
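
For example, a minimal sketch of this check (the `?` in the input is just an arbitrary undefined token for illustration):

src := `one ? two`
parser := tokenizer.New()
parser.StopOnUndefinedToken()
stream := parser.ParseString(src)
defer stream.Close()
// ... iterate over the stream ...
if stream.GetParsedLength() < len(src) {
	// parsing stopped at an undefined token before the end of the input
}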

Keywords

Any word that is not a custom token is stored in a single token as tokenizer.TokenKeyword.

A keyword can contain unicode characters, numbers (see tokenizer.AllowNumbersInKeyword()), and underscores (see tokenizer.AllowKeywordUnderscore()); a configuration sketch follows the example below.

parser.ParseString(`one two четыре`)
tokens: {
    {
        Key: tokenizer.TokenKeyword
        Value: "one"
    },
    {
        Key: tokenizer.TokenKeyword
        Value: "two"
    },
    {
        Key: tokenizer.TokenKeyword
        Value: "четыре"
    }
}
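
Enabling both options might look like this (a sketch; the input `r2_d2` is illustrative):

parser := tokenizer.New()
parser.AllowKeywordUnderscore().AllowNumbersInKeyword()

stream := parser.ParseString(`r2_d2`)
defer stream.Close()
// "r2_d2" is stored as a single TokenKeyword token
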
Integer number

Any integer is stored as one token with key tokenizer.TokenInteger.

parser.ParseString(`223 999`)
tokens: {
    {
        Key: tokenizer.TokenInteger
        Value: "223"
    },
    {
        Key: tokenizer.TokenInteger
        Value: "999"
    },
}

To get int64 from the token value use token.ValueInt():

stream := parser.ParseString("123")
fmt.Printf("Token is %d", stream.CurrentToken().ValueInt())  // Token is 123
Float number

Any float number is stored as one token with key tokenizer.TokenFloat. A float number may:

  • have point, for example 1.2
  • have exponent, for example 1e6
  • have lower e or upper E letter in the exponent, for example 1E6, 1e6
  • have sign in the exponent, for example 1e-6, 1e6, 1e+6
parser.ParseString(`1.3e-8`):
{
    {
        Key: tokenizer.TokenFloat
        Value: "1.3e-8"
    },
}

To get float64 from the token value use token.ValueFloat():

stream := parser.ParseString("1.3e2")
fmt.Printf("Token is %v", stream.CurrentToken().ValueFloat())  // Token is 130
Framed string

Strings that are framed with tokens are called framed strings. An obvious example is a quoted string like "one two"; here the quotes are the edge tokens.

You can create and customize a framed string token through tokenizer.DefineStringToken():

const TokenDoubleQuotedString = 10
parser.DefineStringToken(TokenDoubleQuotedString, `"`, `"`).SetEscapeSymbol('\\')

stream := parser.ParseString(`"two \"three"`)
{
    {
        Key: tokenizer.TokenString
        Value: "\"two \\\"three\""
    },
}

To get a framed string without edge tokens and with escape sequences handled, use the token.ValueUnescaped() method:

v := stream.CurrentToken().ValueUnescaped() // result: two "three

The token.StringKey() method returns the string token key defined in DefineStringToken:

stream.CurrentToken().StringKey() == TokenDoubleQuotedString // true
Injection in framed string

Strings can contain expression substitutions that are parsed into tokens, for example "one {{two}} three". Fragments of the string before, between, and after the substitutions are stored as tokenizer.TokenStringFragment tokens.

const (
    TokenOpenInjection = 1
    TokenCloseInjection = 2
    TokenQuotedString = 3
)

parser := tokenizer.New()
parser.DefineTokens(TokenOpenInjection, []string{"{{"})
parser.DefineTokens(TokenCloseInjection, []string{"}}"})
parser.DefineStringToken(TokenQuotedString, `"`, `"`).AddInjection(TokenOpenInjection, TokenCloseInjection)

parser.ParseString(`"one {{ two }} three"`)

Tokens:

{
    {
        Key: tokenizer.TokenStringFragment,
        Value: "one"
    },
    {
        Key: TokenOpenInjection,
        Value: "{{"
    },
    {
        Key: tokenizer.TokenKeyword,
        Value: "two"
    },
    {
        Key: TokenCloseInjection,
        Value: "}}"
    },
    {
        Key: tokenizer.TokenStringFragment,
        Value: "three"
    },
}

Use cases:

  • parse templates
  • parse placeholders

User defined tokens

A new token can be defined via the DefineTokens method:


const (
	TokenCurlyOpen    = 1
	TokenCurlyClose   = 2
	TokenSquareOpen   = 3
	TokenSquareClose  = 4
	TokenColon        = 5
	TokenComma        = 6
	TokenDoubleQuoted = 7
)

// json parser
parser := tokenizer.New()
parser.
	DefineTokens(TokenCurlyOpen, []string{"{"}).
	DefineTokens(TokenCurlyClose, []string{"}"}).
	DefineTokens(TokenSquareOpen, []string{"["}).
	DefineTokens(TokenSquareClose, []string{"]"}).
	DefineTokens(TokenColon, []string{":"}).
	DefineTokens(TokenComma, []string{","}).
	DefineStringToken(TokenDoubleQuoted, `"`, `"`).SetSpecialSymbols(tokenizer.DefaultStringEscapes)

stream := parser.ParseString(`{"key": [1]}`)
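
A minimal sketch of walking the resulting stream (the handling logic here is illustrative, not part of the package):

defer stream.Close()
for stream.IsValid() {
	switch {
	case stream.CurrentToken().Is(TokenCurlyOpen):
		// start of a JSON object
	case stream.CurrentToken().IsString():
		// a double-quoted string; StringKey() == TokenDoubleQuoted
		name := stream.CurrentToken().ValueUnescapedString()
		_ = name
	case stream.CurrentToken().IsNumber():
		// an integer or float value, e.g. 1
	}
	stream.GoNext()
}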

Known issues

  • the zero byte \0 is ignored in the source string.

Benchmark

Parse string/bytes

pkg: tokenizer
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
BenchmarkParseBytes
    stream_test.go:251: Speed: 70 bytes string with 19.689µs: 3555284 byte/sec
    stream_test.go:251: Speed: 7000 bytes string with 848.163µs: 8253130 byte/sec
    stream_test.go:251: Speed: 700000 bytes string with 75.685945ms: 9248744 byte/sec
    stream_test.go:251: Speed: 11093670 bytes string with 1.16611538s: 9513355 byte/sec
BenchmarkParseBytes-8   	  158481	      7358 ns/op

Parse infinite stream

pkg: tokenizer
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
BenchmarkParseInfStream
    stream_test.go:226: Speed: 70 bytes at 33.826µs: 2069414 byte/sec
    stream_test.go:226: Speed: 7000 bytes at 627.357µs: 11157921 byte/sec
    stream_test.go:226: Speed: 700000 bytes at 27.675799ms: 25292856 byte/sec
    stream_test.go:226: Speed: 30316440 bytes at 1.18061702s: 25678471 byte/sec
BenchmarkParseInfStream-8   	  433092	      2726 ns/op
PASS

Documentation

Constants

const BackSlash = '\\'

BackSlash is just the backslash byte.

const DefaultChunkSize = 4096

DefaultChunkSize is the default chunk size for the reader.

Variables

var DefaultStringEscapes = map[byte]byte{
	'n':  '\n',
	'r':  '\r',
	't':  '\t',
	'\\': '\\',
}

DefaultStringEscapes is the default mapping of escaped symbols. These symbols are commonly used.

Functions

This section is empty.

Types

type QuoteInjectSettings

type QuoteInjectSettings struct {
	// Token type which opens the quoted string.
	StartKey TokenKey
	// Token type which closes the quoted string.
	EndKey TokenKey
}

QuoteInjectSettings describes the open injection token and the close injection token.

type Stream

type Stream struct {
	// contains filtered or unexported fields
}

Stream is an iterator over parsed tokens. If data is read from an infinite buffer, the iterator reads data from the reader chunk-by-chunk.

func NewInfStream

func NewInfStream(p *parsing) *Stream

NewInfStream creates new stream with active parser.

func NewStream

func NewStream(p *parsing) *Stream

NewStream creates new parsed stream of tokens.

func (*Stream) Close

func (s *Stream) Close()

Close releases all token objects to the pool.

func (*Stream) CurrentToken

func (s *Stream) CurrentToken() *Token

CurrentToken always returns a token. If the pointer is not valid (see IsValid), CurrentToken returns the TokenUndef token. Do not save the result (Token) into variables — the current token may change at any time.

func (*Stream) GetParsedLength

func (s *Stream) GetParsedLength() int

GetParsedLength returns the number of bytes parsed so far.

func (*Stream) GetSnippet

func (s *Stream) GetSnippet(before, after int) []Token

GetSnippet returns a slice of tokens. The slice is generated from the current token position and includes tokens before and after the current token.

func (*Stream) GetSnippetAsString

func (s *Stream) GetSnippetAsString(before, after, maxStringLength int) string

GetSnippetAsString returns tokens before and after the current token as a string. `maxStringLength` specifies the maximum length of each token string; zero means unlimited length. If a token string is longer than maxStringLength, the method removes some runes in the middle of the string.
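
For example, a sketch of building an error message around the current token (the message format is illustrative):

snippet := stream.GetSnippetAsString(2, 2, 30)
err := fmt.Errorf("unexpected token at line %d: ... %s ...",
	stream.CurrentToken().Line(), snippet)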

func (*Stream) GoNext

func (s *Stream) GoNext() *Stream

GoNext moves stream pointer to the next token. If there is no token, it initiates the parsing of the next chunk of data. If there is no data, the pointer will point to the TokenUndef token.

func (*Stream) GoNextIfNextIs

func (s *Stream) GoNextIfNextIs(key TokenKey, otherKeys ...TokenKey) bool

GoNextIfNextIs moves the stream pointer to the next token if the next token has one of the specified keys. If a key matches, the pointer is updated and the method returns true; otherwise it returns false.
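
For example (a sketch, with TEquality being a user-defined key as in the README example):

// current token: "user_id", next token: "="
if stream.GoNextIfNextIs(TEquality) {
	// the pointer now stands on the "=" token
}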

func (*Stream) GoPrev

func (s *Stream) GoPrev() *Stream

GoPrev moves the stream pointer to the previous token. The number of possible calls is limited if you specified SetHistorySize. If the beginning of the stream or the end of the history is reached, the pointer will point to the TokenUndef token.

func (*Stream) GoTo

func (s *Stream) GoTo(id int) *Stream

GoTo moves the stream pointer to a specific token. The search is done by token ID.

func (*Stream) HeadToken

func (s *Stream) HeadToken() *Token

HeadToken returns a pointer to the head token. The head token may change if a history size is set.

func (*Stream) IsAnyNextSequence

func (s *Stream) IsAnyNextSequence(keys ...[]TokenKey) bool

IsAnyNextSequence checks that at least one token from each group is contained in the sequence of next tokens.

func (*Stream) IsNextSequence

func (s *Stream) IsNextSequence(keys ...TokenKey) bool

IsNextSequence checks whether the next tokens appear in exactly the specified sequence.
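
For example (a sketch, with TEquality from the README example):

// current token: "modified"
if stream.IsNextSequence(TEquality, tokenizer.TokenString) {
	// the next two tokens are an operator followed by a quoted string,
	// e.g. > "2020-01-01 00:00:00"
}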

func (*Stream) IsValid

func (s *Stream) IsValid() bool

IsValid checks if stream is valid. This means that the pointer has not reached the end of the stream.

func (*Stream) NextToken

func (s *Stream) NextToken() *Token

NextToken returns the next token from the stream. If the next token doesn't exist, the method returns the TokenUndef token. Do not save the result (Token) into variables — the next token may change at any time.

func (*Stream) PrevToken

func (s *Stream) PrevToken() *Token

PrevToken returns the previous token from the stream. If the previous token doesn't exist, the method returns the TokenUndef token. Do not save the result (Token) into variables — the previous token may change at any time.

func (*Stream) SetHistorySize

func (s *Stream) SetHistorySize(size int) *Stream

SetHistorySize sets the number of tokens that should remain after the current token
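
For example (a sketch; fp is an io.Reader as in the stream example above):

stream := parser.ParseStream(fp, tokenizer.DefaultChunkSize).SetHistorySize(10)
defer stream.Close()
stream.GoNext()
stream.GoPrev() // possible because up to 10 previous tokens are kept in history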

func (*Stream) String

func (s *Stream) String() string

type StringSettings

type StringSettings struct {
	Key          TokenKey
	StartToken   []byte
	EndToken     []byte
	EscapeSymbol byte
	SpecSymbols  map[byte]byte
	Injects      []QuoteInjectSettings
}

StringSettings describes framed (quoted) string tokens, such as quoted strings.

func (*StringSettings) AddInjection

func (q *StringSettings) AddInjection(startTokenKey, endTokenKey TokenKey) *StringSettings

AddInjection configures an injection into the string. An injection is a parsable fragment of a framed (quoted) string, often used for parsing placeholders or template expressions inside framed strings.

func (*StringSettings) SetEscapeSymbol

func (q *StringSettings) SetEscapeSymbol(symbol byte) *StringSettings

SetEscapeSymbol sets the escape symbol for a framed (quoted) string. The escape symbol allows ignoring the close token of the framed string, and it also allows using special symbols in framed strings, like \n, \t.

func (*StringSettings) SetSpecialSymbols

func (q *StringSettings) SetSpecialSymbols(special map[byte]byte) *StringSettings

SetSpecialSymbols sets the mapping of all escapable symbols for the escape symbol, like \n, \t, \r.

type Token

type Token struct {
	// contains filtered or unexported fields
}

Token describes one token.

func (*Token) ID

func (t *Token) ID() int

ID returns the id of the token. The id is the sequence number of the token in the stream.

func (*Token) Indent

func (t *Token) Indent() []byte

Indent returns spaces before the token.

func (*Token) Is

func (t *Token) Is(key TokenKey, keys ...TokenKey) bool

Is checks if the token has any of these keys.

func (Token) IsFloat

func (t Token) IsFloat() bool

IsFloat checks if this token is float — the key is TokenFloat.

func (Token) IsInteger

func (t Token) IsInteger() bool

IsInteger checks if this token is integer — the key is TokenInteger.

func (Token) IsKeyword

func (t Token) IsKeyword() bool

IsKeyword checks if this token is a keyword — the key is TokenKeyword.

func (Token) IsNumber

func (t Token) IsNumber() bool

IsNumber checks if this token is integer or float — the key is TokenInteger or TokenFloat.

func (Token) IsString

func (t Token) IsString() bool

IsString checks if current token is a quoted string. Token key may be TokenString or TokenStringFragment.

func (*Token) IsValid

func (t *Token) IsValid() bool

IsValid checks if this token is valid — the key is not TokenUndef.

func (*Token) Key

func (t *Token) Key() TokenKey

Key returns the key of the token pointed to by the pointer. If the pointer is not valid (see IsValid), TokenUndef will be returned.

func (*Token) Line

func (t *Token) Line() int

Line returns the line number in the input string. Line numbers start from 1.

func (*Token) Offset

func (t *Token) Offset() int

Offset returns the byte position in input string (from start).

func (Token) String

func (t Token) String() string

String returns a multiline string with the token's information.

func (*Token) StringKey

func (t *Token) StringKey() TokenKey

StringKey returns the key of the string. If no key is defined for the string, TokenString will be returned.

func (*Token) StringSettings

func (t *Token) StringSettings() *StringSettings

StringSettings returns the StringSettings structure if the token is a framed string.

func (*Token) Value

func (t *Token) Value() []byte

Value returns the value of the current token as a slice of bytes from the source. If the current token is invalid, Value returns nil.

Do not change bytes in the slice. Copy slice before change.

func (*Token) ValueFloat

func (t *Token) ValueFloat() float64

ValueFloat returns the value as float64. If the token is not TokenInteger or TokenFloat, zero will be returned. The method doesn't use a cache; each call starts a number parser.

func (Token) ValueInt

func (t Token) ValueInt() int64

ValueInt returns the value as int64. If the token is a float, the result will be rounded by math rules. If the token is not TokenInteger or TokenFloat, zero will be returned. The method doesn't use a cache; each call starts a number parser.
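
For example (a sketch):

stream := parser.ParseString("1.6")
stream.CurrentToken().ValueInt() // 2 — the float value is rounded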

func (*Token) ValueString

func (t *Token) ValueString() string

ValueString returns the value of the token as a string. If the token is TokenUndef, the method returns an empty string.

func (*Token) ValueUnescaped

func (t *Token) ValueUnescaped() []byte

ValueUnescaped returns clear (unquoted) string

  • without edge-tokens (quotes)
  • with character escaping handling

For example quoted string

"one \"two\"\t three"

transforms to

one "two"		three

Method doesn't use cache. Each call starts a string parser.

func (*Token) ValueUnescapedString

func (t *Token) ValueUnescapedString() string

ValueUnescapedString is like ValueUnescaped but returns a string.

type TokenKey

type TokenKey int

TokenKey is a token type identifier.

const (
	// TokenUnknown means that this token is neither an embedded token nor a user-defined one.
	TokenUnknown TokenKey = -6
	// TokenStringFragment means that this is only a fragment of a quoted string with injections.
	// For example, "one {{ two }} three", where "one " and " three" — TokenStringFragment
	TokenStringFragment TokenKey = -5
	// TokenString means that this token is a quoted string.
	// For example, "one two"
	TokenString TokenKey = -4
	// TokenFloat means that this token is a float number with a point and/or an exponent.
	// For example, 1.2, 1e6, 1E-6
	TokenFloat TokenKey = -3
	// TokenInteger means that this token is an integer number.
	// For example, 3, 49983
	TokenInteger TokenKey = -2
	// TokenKeyword means that this token is a word.
	// For example, one, two, три
	TokenKeyword TokenKey = -1
	// TokenUndef means that the token doesn't exist.
	// When the stream is out of range of the token list, any getter or checker will return the TokenUndef token.
	TokenUndef TokenKey = 0
)

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer stores all token configuration and behavior.

func New

func New() *Tokenizer

New creates new tokenizer.

func (*Tokenizer) AllowAsteriskInKeyword

func (t *Tokenizer) AllowAsteriskInKeyword() *Tokenizer

AllowAsteriskInKeyword allows the '*' symbol in keywords. The keyword itself must not start with an asterisk, and there should be no spaces between letters and asterisks.

func (*Tokenizer) AllowAtInKeyword

func (t *Tokenizer) AllowAtInKeyword() *Tokenizer

AllowAtInKeyword allows the '@' symbol in keywords. The keyword itself must not start with an '@', and there should be no spaces between letters and '@' symbols.

func (*Tokenizer) AllowDotInKeyword

func (t *Tokenizer) AllowDotInKeyword() *Tokenizer

AllowDotInKeyword allows the '.' symbol in keywords. The keyword itself must not start with a dot, and there should be no spaces between letters and dots.

func (*Tokenizer) AllowKeywordUnderscore

func (t *Tokenizer) AllowKeywordUnderscore() *Tokenizer

AllowKeywordUnderscore allows underscore symbol in keywords, like `one_two` or `_three`

func (*Tokenizer) AllowNumbersInKeyword

func (t *Tokenizer) AllowNumbersInKeyword() *Tokenizer

AllowNumbersInKeyword allows numbers in keywords, like `one1` or `r2d2`. The keyword itself must not start with a number, and there should be no spaces between letters and numbers.

func (*Tokenizer) AllowSlashInKeyword

func (t *Tokenizer) AllowSlashInKeyword() *Tokenizer

AllowSlashInKeyword allows the '/' symbol in keywords. The keyword itself must not start with a slash, and there should be no spaces between letters and slashes.

func (*Tokenizer) DefineStringToken

func (t *Tokenizer) DefineStringToken(key TokenKey, startToken, endToken string) *StringSettings

DefineStringToken defines a framed string token. For example, a piece of data surrounded by quotes: "string in quotes" or 'string in single quotes'. The arguments startToken and endToken define the open and close "quotes".

  • t.DefineStringToken(10, "`", "`") — the string "one `two three`" will be parsed as [{key: TokenKeyword, value: "one"}, {key: TokenString, value: "`two three`"}]
  • t.DefineStringToken(11, "//", "\n") — the string "parse // like comment\n" will be parsed as [{key: TokenKeyword, value: "parse"}, {key: TokenString, value: "// like comment"}]

func (*Tokenizer) DefineTokens

func (t *Tokenizer) DefineTokens(key TokenKey, tokens []string) *Tokenizer

DefineTokens adds a custom token. Here `key` is the unique identifier of `tokens`, and `tokens` is a slice of token strings. If the key already exists, the tokens will be rewritten.

func (*Tokenizer) ParseBytes

func (t *Tokenizer) ParseBytes(str []byte) *Stream

ParseBytes parses the byte slice into tokens.

func (*Tokenizer) ParseStream

func (t *Tokenizer) ParseStream(r io.Reader, bufferSize uint) *Stream

ParseStream parses data from the reader into tokens.

func (*Tokenizer) ParseString

func (t *Tokenizer) ParseString(str string) *Stream

ParseString parses the string into tokens.

func (*Tokenizer) SetWhiteSpaces

func (t *Tokenizer) SetWhiteSpaces(ws []byte) *Tokenizer

SetWhiteSpaces sets custom whitespace symbols between tokens. By default: {' ', '\t', '\n', '\r'}
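
For example (a sketch):

parser := tokenizer.New()
// treat only space and tab as token separators
parser.SetWhiteSpaces([]byte{' ', '\t'})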

func (*Tokenizer) StopOnUndefinedToken

func (t *Tokenizer) StopOnUndefinedToken() *Tokenizer

StopOnUndefinedToken stops parsing if an unknown token is detected.
