tokenizer

package
v0.1.3
Published: Jan 21, 2023 License: MIT Imports: 8 Imported by: 0

README

tokenizer

The tokenizer package tokenizes a string buffer according to the Ego language rules, extending Go's standard scanner behavior. The extensions involve special tokens such as ">=", which would normally scan as two tokens but is grouped into a single token.

The package also contains utility functions for navigating a cursor in the token stream, used to peek at, read, or skip tokens during compilation. Finally, it contains routines for testing the nature of given tokens, such as determining whether a token can be used as a symbol.

Creating A Token Stream

Use the New() function to create a tokenizer: pass in a string containing the text to tokenize, along with a flag indicating whether the text is Ego code (which has some different tokenizing rules), and receive a pointer to a Tokenizer object. This object contains all the information about the tokenization of the string, and a cursor that can be moved through the stream.

src := "print 3+5*2"
tokens := tokenizer.New(src, true)

The resulting tokens object can be used to scan through the tokens, read the token strings, etc.
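
For example, a minimal sketch that prints every token in the stream, using functions described in the sections below (and assuming the fmt package is imported):

for !tokens.AtEnd() {
    t := tokens.Next()
    fmt.Println(t.String())
}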

Reading Tokens

You can read a token explicitly, which also advances the cursor one position. You can also peek ahead of or behind the cursor in the token stream to see tokens other than the current one. Finally, you can test the next token; if it matches, the cursor advances and the test returns true.

t := tokens.Next()

This reads the next token in the stream and advances the cursor. The token is returned as the function result; use NextText() to get its text value as a string instead.

t := tokens.Peek(1)

This peeks one token ahead of the cursor and reads that token. The current token position is not changed by this operation. Values greater than zero read ahead of the current position. A value of 0 re-reads the current position (the same value returned by the last Next() call, for example). A negative value reads previously-read tokens behind the current token position.
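
For example (a brief sketch; the variable names are illustrative):

current := tokens.Peek(0)   // re-reads the token returned by the last Next()
previous := tokens.Peek(-1) // reads the token just before that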

if tokens.IsNext(tokenizer.PrintToken) {
    // Handle print operations
}

The IsNext() function tests whether the next token matches the given token. If it matches, the cursor advances one position and the function returns true. If it does not match, the function returns false and the cursor is not changed. In the above example, when the conditional block runs, the "print" token will be behind the cursor, which is positioned at whatever token followed "print".

if tokens.AnyNext(tokenizer.NewIdentifierToken("this"), tokenizer.NewIdentifierToken("that")) {
    // Handle this or that stuff
}

This is very similar to IsNext(), but it compares the next token to each of the given tokens; if it matches any of them, the function returns true and the token cursor is advanced. Inside the body of the conditional statement in this example, the caller can use tokens.Peek(0) to see which token matched the list.
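
For example, inside the body of that conditional (a brief sketch, assuming the fmt package is imported):

matched := tokens.Peek(0)       // the token that just matched the list
fmt.Println(matched.Spelling()) // its text: "this" or "that"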

Token Position

The token cursor is normally moved only when a Next(), IsNext(), or AnyNext() call is made. However, the caller can manually manipulate the cursor position in a number of ways.

tokens.Advance(1)

The Advance() method moves the cursor by the amount given. If the value is positive, the cursor is moved ahead. If the value is zero, the cursor is unchanged. If the value is negative, the cursor is moved back that many positions. Note that you cannot move the cursor to before the first token or past the last token.
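
The package also defines the ToTheEnd constant (see Constants below), which can be passed to Advance() to skip directly to the end of the stream:

tokens.Advance(tokenizer.ToTheEnd)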

if tokens.AtEnd() {
    return
}

The AtEnd() function returns true if the cursor is at the end of the token stream. The cursor is not moved by this operation.

The caller can explicitly change the token position to an absolute position, or record the current position. This is useful if the tokenizer must parse ahead through a complex set of productions before determining that the compilation is invalid and the tokenizer should be reset to try another path.

tokens.Reset()
t := tokens.Mark()
tokens.Set(t)

The first call resets the token position to the start of the stream; this is the same position the cursor has just after the tokenizer is created with the New() function. The Mark() method returns the current token position, with the intention that the caller can mark the current location to return to it later. The Set() method sets a previously-collected cursor position as the new current cursor position.
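
A typical backtracking pattern is sketched below; tryParseComplexProduction stands in for a hypothetical caller-supplied parsing routine:

mark := tokens.Mark()
if !tryParseComplexProduction(tokens) { // hypothetical parse-ahead routine
    tokens.Set(mark) // restore the cursor and try another parse path
}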

Documentation

Index

Constants

const ToTheEnd = 999999

ToTheEnd means to advance the token stream to the end.

Variables

var (
	// "assert" token.
	AssertToken = NewReservedToken("assert")

	// "bool" token.
	BoolToken = NewTypeToken("bool")

	// "{" token.
	BlockBeginToken = NewSpecialToken("{")

	// "}" token.
	BlockEndToken = NewSpecialToken("}")

	// "break" token.
	BreakToken = NewReservedToken("break")

	// "byte" token.
	ByteToken = NewTypeToken("byte")

	// "call" token.
	CallToken = NewReservedToken("call")

	// "case" token.
	CaseToken = NewIdentifierToken("case")

	// "catch" token.
	CatchToken = NewReservedToken("catch")

	// "chan"  token.
	ChanToken = NewTypeToken("chan")

	// "clear" token.
	ClearToken = NewIdentifierToken("clear")

	// "const" token.
	ConstToken = NewReservedToken("const")

	// "continue"  token.
	ContinueToken = NewReservedToken("continue")

	// "{" token.
	DataBeginToken = NewSpecialToken("{")

	// "}" token.
	DataEndToken = NewSpecialToken("}")

	// "default" token.
	DefaultToken = NewIdentifierToken("default")

	// "defer" token.
	DeferToken = NewReservedToken("defer")

	// "@" token.
	DirectiveToken = NewSpecialToken("@")

	// "else" token.
	ElseToken = NewReservedToken("else")

	// "{}" token.
	EmptyBlockToken = NewSpecialToken("{}")

	// "{}" token.
	EmptyInitializerToken = NewSpecialToken("{}")

	// "interface{}" token.
	EmptyInterfaceToken = NewTypeToken("interface{}")

	// "error" token.
	ErrorToken = NewIdentifierToken("error")

	// "exit" token.
	ExitToken = NewReservedToken("exit")

	// "fallthrough" token.
	FallthroughToken = NewReservedToken("fallthrough")

	// "float32" token.
	Float32Token = NewTypeToken("float32")

	// "float64" token.
	Float64Token = NewTypeToken("float64")

	// "for" token.
	ForToken = NewReservedToken("for")

	// "func" token.
	FuncToken = NewReservedToken("func")

	// "go" token.
	GoToken = NewReservedToken("go")

	// "if" token.
	IfToken = NewReservedToken("if")

	// "int" token.
	IntToken = NewTypeToken("int")

	// "int32"  token.
	Int32Token = NewTypeToken("int32")

	// "int64" token.
	Int64Token = NewTypeToken("int64")

	// "interface"  token.
	InterfaceToken = NewIdentifierToken("interface")

	// "import" token.
	ImportToken = NewReservedToken("import")

	// "make" token.
	MakeToken = NewReservedToken("make")

	// "map" token.
	MapToken = NewTypeToken("map")

	// "nil" token.
	NilToken = NewReservedToken("nil")

	// "package" token.
	PackageToken = NewReservedToken("package")

	// "panic" token.
	PanicToken = NewReservedToken("panic")

	// "print" token.
	PrintToken = NewReservedToken("print")

	// "range" token.
	RangeToken = NewIdentifierToken("range")

	// "return" token.
	ReturnToken = NewReservedToken("return")

	// "string" token.
	StringToken = NewTypeToken("string")

	// "struct" token.
	StructToken = NewTypeToken("struct")

	// "switch" token.
	SwitchToken = NewReservedToken("switch")

	// "test" token.
	TestToken = NewIdentifierToken("test")

	// "type" token.
	TypeToken = NewReservedToken("type")

	// "try" token.
	TryToken = NewReservedToken("try")

	// "var" token.
	VarToken = NewReservedToken("var")

	// "when" token.
	WhenToken = NewIdentifierToken("when")

	// ";" token.
	SemicolonToken = NewSpecialToken(";")

	// ":" token.
	ColonToken = NewSpecialToken(":")

	// ":=" token.
	DefineToken = NewSpecialToken(":=")

	// "=" token.
	AssignToken = NewSpecialToken("=")

	// "," token.
	CommaToken = NewSpecialToken(",")

	// "==" token.
	EqualsToken = NewSpecialToken("==")

	// ">" token.
	GreaterThanToken = NewSpecialToken(">")

	// ">=" token.
	GreaterThanOrEqualsToken = NewSpecialToken(">=")

	// "<" token.
	LessThanToken = NewSpecialToken("<")

	// "<=" token.
	LessThanOrEqualsToken = NewSpecialToken("<=")

	// "<<" token.
	ShiftLeftToken = NewSpecialToken("<<")

	// ">>" token.
	ShiftRightToken = NewSpecialToken(">>")

	// "!" token.
	NotToken = NewSpecialToken("!")

	// "!=" token.
	NotEqualsToken = NewSpecialToken("!=")

	// "%" token.
	ModuloToken = NewSpecialToken("%")

	// "^" token.
	ExponentToken = NewSpecialToken("^")

	// "+" token.
	AddToken = NewSpecialToken("+")

	// "-" token.
	SubtractToken = NewSpecialToken("-")

	// "*" token.
	MultiplyToken = NewSpecialToken("*")

	// "/" token.
	DivideToken = NewSpecialToken("/")

	// "*" token.
	PointerToken = NewSpecialToken("*")

	// '&" token.
	AddressToken = NewSpecialToken("&")

	// "&" token.
	AndToken = NewSpecialToken("&")

	// "|" token.
	OrToken = NewSpecialToken("|")

	// "&&" token.
	BooleanAndToken = NewSpecialToken("&&")

	// "||" token.
	BooleanOrToken = NewSpecialToken("||")

	// "+=" token.
	AddAssignToken = NewSpecialToken("+=")

	// "-=" token.
	SubtractAssignToken = NewSpecialToken("-=")

	// "*=" token.
	MultiplyAssignToken = NewSpecialToken("*=")

	// "/=" token.
	DivideAssignToken = NewSpecialToken("/=")

	// "++" token.
	IncrementToken = NewSpecialToken("++")

	// "--" token.
	DecrementToken = NewSpecialToken("--")

	// "." token.
	DotToken = NewSpecialToken(".")

	// "..." token.
	VariadicToken = NewSpecialToken("...")

	// "<-" token.
	ChannelReceiveToken = NewSpecialToken("<-")

	// "(" token.
	StartOfListToken = NewSpecialToken("(")

	// ")"" token.
	EndOfListToken = NewSpecialToken(")")

	// "[" token.
	StartOfArrayToken = NewSpecialToken("[")

	// "]" token.
	EndOfArrayToken = NewSpecialToken("]")

	// "?" token.
	OptionalToken = NewSpecialToken("?")

	// Empty token.
	EmptyToken = NewSpecialToken("")

	// "-" token.
	NegateToken = NewSpecialToken("-")
)

Symbolic names for each string token value.

var EndOfTokens = Token{/* contains filtered or unexported fields */}

EndOfTokens is a reserved token that means the end of the buffer was reached.

ExtendedReservedWords are additional reserved words when running with language extensions enabled.

ReservedWords is the list of reserved words in the Ego language.

SpecialTokens is a list of tokens that are considered special semantic characters.

TypeTokens is a list of tokens that represent type names.

Functions

func InList

func InList(s Token, test ...Token) bool

InList is a support function that checks to see if a token matches any of a list of other tokens.
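
For example (a brief sketch):

t := tokens.Peek(1)
if tokenizer.InList(t, tokenizer.AddToken, tokenizer.SubtractToken) {
    // the next token is "+" or "-"
}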

func IsSymbol

func IsSymbol(s string) bool

IsSymbol is a utility function to determine if a string is a valid symbol name.

Types

type Token

type Token struct {
	// contains filtered or unexported fields
}

Token defines a single token from the lexical scanning operation.

func NewFloatToken

func NewFloatToken(spelling string) Token

func NewIdentifierToken

func NewIdentifierToken(spelling string) Token

func NewIntegerToken

func NewIntegerToken(spelling string) Token

func NewReservedToken

func NewReservedToken(spelling string) Token

func NewSpecialToken

func NewSpecialToken(spelling string) Token

func NewStringToken

func NewStringToken(spelling string) Token

func NewToken

func NewToken(class TokenClass, spelling string) Token

func NewTypeToken

func NewTypeToken(spelling string) Token

func NewValueToken

func NewValueToken(spelling string) Token

func (Token) Boolean

func (t Token) Boolean() bool

func (Token) Float

func (t Token) Float() float64

func (Token) Integer

func (t Token) Integer() int64

func (Token) IsClass

func (t Token) IsClass(class TokenClass) bool

func (Token) IsIdentifier

func (t Token) IsIdentifier() bool

func (Token) IsName

func (t Token) IsName() bool

func (Token) IsReserved

func (t Token) IsReserved(includeExtensions bool) bool

IsReserved indicates if a name is a reserved word.

func (Token) IsString

func (t Token) IsString() bool

func (Token) IsValue

func (t Token) IsValue() bool

IsValue returns true if the token contains a value (integer, string, or other).

func (Token) Spelling

func (t Token) Spelling() string

func (Token) String

func (t Token) String() string

type TokenClass

type TokenClass int
const (
	EndOfTokensClass TokenClass = iota
	IdentifierTokenClass
	TypeTokenClass
	StringTokenClass
	BooleanTokenClass
	IntegerTokenClass
	FloatTokenClass
	ReservedTokenClass
	SpecialTokenClass
	ValueTokenClass
)

func (TokenClass) String

func (c TokenClass) String() string

type Tokenizer

type Tokenizer struct {
	Source []string
	Tokens []Token
	TokenP int
	Line   []int
	Pos    []int
}

Tokenizer is an instance of a tokenized string.

func New

func New(src string, isCode bool) *Tokenizer

New creates a tokenizer instance and breaks the string up into an array of tokens. The isCode flag is used to indicate this is Ego code, which has some different tokenizing rules.

func (*Tokenizer) Advance

func (t *Tokenizer) Advance(p int)

Advance moves the pointer.

func (*Tokenizer) AnyNext

func (t *Tokenizer) AnyNext(test ...Token) bool

AnyNext tests to see if the next token is in the given list of tokens, and if so advances and returns true, else does not advance and returns false.

func (*Tokenizer) AtEnd

func (t *Tokenizer) AtEnd() bool

AtEnd indicates if the cursor is at the end of the token stream.

func (*Tokenizer) DumpTokens

func (t *Tokenizer) DumpTokens(before, after int)

func (*Tokenizer) GetLine

func (t *Tokenizer) GetLine(line int) string

GetLine returns a given line of text from the token stream. This actually refers to the original line splits done when the source was first received.

func (*Tokenizer) GetSource

func (t *Tokenizer) GetSource() string

GetSource returns the entire string of the tokenizer.

func (*Tokenizer) GetTokens

func (t *Tokenizer) GetTokens(pos1, pos2 int, spacing bool) string

GetTokens returns a string representing the tokens within the given range of tokens.

func (*Tokenizer) IsNext

func (t *Tokenizer) IsNext(test Token) bool

IsNext tests to see if the next token is the given token, and if so advances and returns true, else does not advance and returns false.

func (*Tokenizer) Mark

func (t *Tokenizer) Mark() int

Mark returns a location marker in the token stream.

func (*Tokenizer) Next

func (t *Tokenizer) Next() Token

Next gets the next token in the tokenizer.

func (*Tokenizer) NextText

func (t *Tokenizer) NextText() string

NextText gets the next token in the tokenizer and returns its text value as a string.

func (*Tokenizer) Peek

func (t *Tokenizer) Peek(offset int) Token

Peek looks ahead at the next token without advancing the pointer.

func (*Tokenizer) PeekText

func (t *Tokenizer) PeekText(offset int) string

PeekText looks ahead at a token without advancing the pointer, and returns its text value as a string.

func (*Tokenizer) Remainder

func (t *Tokenizer) Remainder() string

Remainder returns the rest of the source, as initially presented to the tokenizer, from the current token position. This allows the caller to get "the rest" of a command line or other element as needed. If the token position is invalid (past the end of the tokens, for example) then an empty string is returned.
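
A minimal sketch of that use, where the command text and the isCode=false choice are illustrative:

tokens := tokenizer.New(`set output "json"`, false)
tokens.Next()              // consume the first token, "set"
rest := tokens.Remainder() // the rest of the command line after "set"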

func (*Tokenizer) Reset

func (t *Tokenizer) Reset()

Reset sets the tokenizer back to the start of the token stream.

func (*Tokenizer) Set

func (t *Tokenizer) Set(mark int)

Set sets the next token to the given marker.

func (*Tokenizer) SetLineNumber

func (t *Tokenizer) SetLineNumber(line int) error

SetLineNumber resets the line numbering. This is done after a prolog that the user might not be aware of is injected, so that errors reported during compilation or at runtime reflect line numbers based on the @line specification rather than the actual literal line number.
