tokenizer

package
v0.1.3
Published: Jan 21, 2023 License: MIT Imports: 8 Imported by: 0

README

tokenizer

The tokenizer package tokenizes a string buffer according to the Ego language rules, extending Go's standard scanner behavior. The extensions involve special tokens such as ">=", which would normally scan as two tokens but is grouped into a single token.

The package also contains utility functions for navigating a cursor in the token stream, used to peek at, read, or skip tokens during compilation. Finally, it contains routines for testing the nature of given tokens, such as determining whether a token can be used as a symbol.

Creating A Token Stream

Use the New() function to create a tokenizer: pass in a string containing the text to tokenize, along with a flag indicating whether the text is Ego code (which has some different tokenizing rules), and receive a pointer to a Tokenizer object. This object contains all the information about the tokenization of the string, and a cursor that can be moved through the stream.

src := "print 3+5*2"
tokens := tokenizer.New(src, true)

The resulting tokens object can be used to scan through the tokens, read the token strings, etc.
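
For example, a minimal sketch that prints every token in the stream, using functions described in the sections below (and assuming the fmt package is imported):

for !tokens.AtEnd() {
    t := tokens.Next()
    fmt.Println(t.String())
}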

Reading Tokens

You can read a token explicitly, which also advances the cursor one position. You can also peek ahead of or behind the cursor in the token stream to see tokens other than the current one. Finally, you can test the next token; if it matches, the cursor advances and the test returns true.

t := tokens.Next()

This reads the next token in the stream and advances the cursor. The token is returned as the function result; use NextText() to get its text value as a string instead.

t := tokens.Peek(1)

This peeks one token ahead of the cursor and reads that token. The current token position is not changed by this operation. Values greater than zero read ahead of the current position. A value of 0 re-reads the current position (the same value returned by the last Next() call, for example). A negative value reads previously-read tokens behind the current token position.
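
For example (a brief sketch; the variable names are illustrative):

current := tokens.Peek(0)   // re-reads the token returned by the last Next()
previous := tokens.Peek(-1) // reads the token just before that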

if tokens.IsNext(tokenizer.PrintToken) {
    // Handle print operations
}

The IsNext() function tests whether the next token matches the given token. If it matches, the cursor advances one position and the function returns true. If it does not match, the function returns false and the cursor is not changed. In the above example, when the conditional block runs, the "print" token will be behind the cursor, which is positioned at whatever token followed "print".

if tokens.AnyNext(tokenizer.NewIdentifierToken("this"), tokenizer.NewIdentifierToken("that")) {
    // Handle this or that stuff
}

This is very similar to IsNext(), but it compares the next token to each of the given tokens; if it matches any of them, the function returns true and the token cursor is advanced. Inside the body of the conditional statement in this example, the caller can use tokens.Peek(0) to see which token matched the list.
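
For example, inside the body of that conditional (a brief sketch, assuming the fmt package is imported):

matched := tokens.Peek(0)       // the token that just matched the list
fmt.Println(matched.Spelling()) // its text: "this" or "that"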

Token Position

The token cursor is normally moved only when a Next(), IsNext(), or AnyNext() call is made. However, the caller can manually manipulate the cursor position in a number of ways.

tokens.Advance(1)

The Advance() method moves the cursor by the amount given. If the value is positive, the cursor is moved ahead. If the value is zero, the cursor is unchanged. If the value is negative, the cursor is moved back that many positions. Note that you cannot move the cursor to before the first token or past the last token.
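
The package also defines the ToTheEnd constant (see Constants below), which can be passed to Advance() to skip directly to the end of the stream:

tokens.Advance(tokenizer.ToTheEnd)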

if tokens.AtEnd() {
    return
}

The AtEnd() function returns true if the cursor is at the end of the token stream. The cursor is not moved by this operation.

The caller can explicitly change the token position to an absolute position, or record the current position. This is useful if the tokenizer must parse ahead through a complex set of productions before determining that the compilation is invalid and the tokenizer should be reset to try another path.

tokens.Reset()
t := tokens.Mark()
tokens.Set(t)

The first call resets the token position to the start of the stream; this is the same position the cursor has just after the tokenizer is created with the New() function. The Mark() method returns the current token position, with the intention that the caller can mark the current location to return to it later. The Set() method sets a previously-collected cursor position as the new current cursor position.
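
A typical backtracking pattern is sketched below; tryParseComplexProduction stands in for a hypothetical caller-supplied parsing routine:

mark := tokens.Mark()
if !tryParseComplexProduction(tokens) { // hypothetical parse-ahead routine
    tokens.Set(mark) // restore the cursor and try another parse path
}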

Documentation

Index

Constants

const ToTheEnd = 999999

ToTheEnd means to advance the token stream to the end.

Variables

var (
	// "assert" token.
	AssertToken = NewReservedToken("assert")

	// "bool" token.
	BoolToken = NewTypeToken("bool")

	// "{" token.
	BlockBeginToken = NewSpecialToken("{")

	// "}" token.
	BlockEndToken = NewSpecialToken("}")

	// "break" token.
	BreakToken = NewReservedToken("break")

	// "byte" token.
	ByteToken = NewTypeToken("byte")

	// "call" token.
	CallToken = NewReservedToken("call")

	// "case" token.
	CaseToken = NewIdentifierToken("case")

	// "catch" token.
	CatchToken = NewReservedToken("catch")

	// "chan"  token.
	ChanToken = NewTypeToken("chan")

	// "clear" token.
	ClearToken = NewIdentifierToken("clear")

	// "const" token.
	ConstToken = NewReservedToken("const")

	// "continue"  token.
	ContinueToken = NewReservedToken("continue")

	// "{" token.
	DataBeginToken = NewSpecialToken("{")

	// "}" token.
	DataEndToken = NewSpecialToken("}")

	// "default" token.
	DefaultToken = NewIdentifierToken("default")

	// "defer" token.
	DeferToken = NewReservedToken("defer")

	// "@" token.
	DirectiveToken = NewSpecialToken("@")

	// "else" token.
	ElseToken = NewReservedToken("else")

	// "{}" token.
	EmptyBlockToken = NewSpecialToken("{}")

	// "{}" token.
	EmptyInitializerToken = NewSpecialToken("{}")

	// "interface{}" token.
	EmptyInterfaceToken = NewTypeToken("interface{}")

	// "error" token.
	ErrorToken = NewIdentifierToken("error")

	// "exit" token.
	ExitToken = NewReservedToken("exit")

	// "fallthrough" token.
	FallthroughToken = NewReservedToken("fallthrough")

	// "float32" token.
	Float32Token = NewTypeToken("float32")

	// "float64" token.
	Float64Token = NewTypeToken("float64")

	// "for" token.
	ForToken = NewReservedToken("for")

	// "func" token.
	FuncToken = NewReservedToken("func")

	// "go" token.
	GoToken = NewReservedToken("go")

	// "if" token.
	IfToken = NewReservedToken("if")

	// "int" token.
	IntToken = NewTypeToken("int")

	// "int32"  token.
	Int32Token = NewTypeToken("int32")

	// "int64" token.
	Int64Token = NewTypeToken("int64")

	// "interface"  token.
	InterfaceToken = NewIdentifierToken("interface")

	// "import" token.
	ImportToken = NewReservedToken("import")

	// "make" token.
	MakeToken = NewReservedToken("make")

	// "map" token.
	MapToken = NewTypeToken("map")

	// "nil" token.
	NilToken = NewReservedToken("nil")

	// "package" token.
	PackageToken = NewReservedToken("package")

	// "panic" token.
	PanicToken = NewReservedToken("panic")

	// "print" token.
	PrintToken = NewReservedToken("print")

	// "range" token.
	RangeToken = NewIdentifierToken("range")

	// "return" token.
	ReturnToken = NewReservedToken("return")

	// "string" token.
	StringToken = NewTypeToken("string")

	// "struct" token.
	StructToken = NewTypeToken("struct")

	// "switch" token.
	SwitchToken = NewReservedToken("switch")

	// "test" token.
	TestToken = NewIdentifierToken("test")

	// "type" token.
	TypeToken = NewReservedToken("type")

	// "try" token.
	TryToken = NewReservedToken("try")

	// "var" token.
	VarToken = NewReservedToken("var")

	// "when" token.
	WhenToken = NewIdentifierToken("when")

	// ";" token.
	SemicolonToken = NewSpecialToken(";")

	// ":" token.
	ColonToken = NewSpecialToken(":")

	// ":=" token.
	DefineToken = NewSpecialToken(":=")

	// "=" token.
	AssignToken = NewSpecialToken("=")

	// "," token.
	CommaToken = NewSpecialToken(",")

	// "==" token.
	EqualsToken = NewSpecialToken("==")

	// ">" token.
	GreaterThanToken = NewSpecialToken(">")

	// ">=" token.
	GreaterThanOrEqualsToken = NewSpecialToken(">=")

	// "<" token.
	LessThanToken = NewSpecialToken("<")

	// "<=" token.
	LessThanOrEqualsToken = NewSpecialToken("<=")

	// "<<" token.
	ShiftLeftToken = NewSpecialToken("<<")

	// ">>" token.
	ShiftRightToken = NewSpecialToken(">>")

	// "!" token.
	NotToken = NewSpecialToken("!")

	// "!=" token.
	NotEqualsToken = NewSpecialToken("!=")

	// "%" token.
	ModuloToken = NewSpecialToken("%")

	// "^" token.
	ExponentToken = NewSpecialToken("^")

	// "+" token.
	AddToken = NewSpecialToken("+")

	// "-" token.
	SubtractToken = NewSpecialToken("-")

	// "*" token.
	MultiplyToken = NewSpecialToken("*")

	// "/" token.
	DivideToken = NewSpecialToken("/")

	// "*" token.
	PointerToken = NewSpecialToken("*")

	// '&" token.
	AddressToken = NewSpecialToken("&")

	// "&" token.
	AndToken = NewSpecialToken("&")

	// "|" token.
	OrToken = NewSpecialToken("|")

	// "&&" token.
	BooleanAndToken = NewSpecialToken("&&")

	// "||" token.
	BooleanOrToken = NewSpecialToken("||")

	// "+=" token.
	AddAssignToken = NewSpecialToken("+=")

	// "-=" token.
	SubtractAssignToken = NewSpecialToken("-=")

	// "*=" token.
	MultiplyAssignToken = NewSpecialToken("*=")

	// "/=" token.
	DivideAssignToken = NewSpecialToken("/=")

	// "++" token.
	IncrementToken = NewSpecialToken("++")

	// "--" token.
	DecrementToken = NewSpecialToken("--")

	// "." token.
	DotToken = NewSpecialToken(".")

	// "..." token.
	VariadicToken = NewSpecialToken("...")

	// "<-" token.
	ChannelReceiveToken = NewSpecialToken("<-")

	// "(" token.
	StartOfListToken = NewSpecialToken("(")

	// ")"" token.
	EndOfListToken = NewSpecialToken(")")

	// "[" token.
	StartOfArrayToken = NewSpecialToken("[")

	// "]" token.
	EndOfArrayToken = NewSpecialToken("]")

	// "?" token.
	OptionalToken = NewSpecialToken("?")

	// Empty token.
	EmptyToken = NewSpecialToken("")

	// "-" token.
	NegateToken = NewSpecialToken("-")
)

Symbolic names for each string token value.

var EndOfTokens = Token{/* contains filtered or unexported fields */}

EndOfTokens is a reserved token that means the end of the buffer was reached.

ExtendedReservedWords are additional reserved words when running with language extensions enabled.

ReservedWords is the list of reserved words in the Ego language.

SpecialTokens is a list of tokens that are considered special semantic characters.

TypeTokens is a list of tokens that represent type names.

Functions

func InList

func InList(s Token, test ...Token) bool

InList is a support function that checks to see if a token matches any of a list of other tokens.
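
For example (a brief sketch):

t := tokens.Peek(1)
if tokenizer.InList(t, tokenizer.AddToken, tokenizer.SubtractToken) {
    // the next token is "+" or "-"
}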

func IsSymbol

func IsSymbol(s string) bool

IsSymbol is a utility function to determine if a string is a valid symbol name.

Types

type Token

type Token struct {
	// contains filtered or unexported fields
}

Token defines a single token from the lexical scanning operation.

func NewFloatToken

func NewFloatToken(spelling string) Token

func NewIdentifierToken

func NewIdentifierToken(spelling string) Token

func NewIntegerToken

func NewIntegerToken(spelling string) Token

func NewReservedToken

func NewReservedToken(spelling string) Token

func NewSpecialToken

func NewSpecialToken(spelling string) Token

func NewStringToken

func NewStringToken(spelling string) Token

func NewToken

func NewToken(class TokenClass, spelling string) Token

func NewTypeToken

func NewTypeToken(spelling string) Token

func NewValueToken

func NewValueToken(spelling string) Token

func (Token) Boolean

func (t Token) Boolean() bool

func (Token) Float

func (t Token) Float() float64

func (Token) Integer

func (t Token) Integer() int64

func (Token) IsClass

func (t Token) IsClass(class TokenClass) bool

func (Token) IsIdentifier

func (t Token) IsIdentifier() bool

func (Token) IsName

func (t Token) IsName() bool

func (Token) IsReserved

func (t Token) IsReserved(includeExtensions bool) bool

IsReserved indicates if a name is a reserved word.

func (Token) IsString

func (t Token) IsString() bool

func (Token) IsValue

func (t Token) IsValue() bool

IsValue returns true if the token contains a value (integer, string, or other).

func (Token) Spelling

func (t Token) Spelling() string

func (Token) String

func (t Token) String() string

type TokenClass

type TokenClass int
const (
	EndOfTokensClass TokenClass = iota
	IdentifierTokenClass
	TypeTokenClass
	StringTokenClass
	BooleanTokenClass
	IntegerTokenClass
	FloatTokenClass
	ReservedTokenClass
	SpecialTokenClass
	ValueTokenClass
)

func (TokenClass) String

func (c TokenClass) String() string

type Tokenizer

type Tokenizer struct {
	Source []string
	Tokens []Token
	TokenP int
	Line   []int
	Pos    []int
}

Tokenizer is an instance of a tokenized string.

func New

func New(src string, isCode bool) *Tokenizer

New creates a tokenizer instance and breaks the string up into an array of tokens. The isCode flag is used to indicate this is Ego code, which has some different tokenizing rules.

func (*Tokenizer) Advance

func (t *Tokenizer) Advance(p int)

Advance moves the pointer.

func (*Tokenizer) AnyNext

func (t *Tokenizer) AnyNext(test ...Token) bool

AnyNext tests to see if the next token is in the given list of tokens, and if so advances and returns true, else does not advance and returns false.

func (*Tokenizer) AtEnd

func (t *Tokenizer) AtEnd() bool

AtEnd indicates if the cursor is at the end of the token stream.

func (*Tokenizer) DumpTokens

func (t *Tokenizer) DumpTokens(before, after int)

func (*Tokenizer) GetLine

func (t *Tokenizer) GetLine(line int) string

GetLine returns a given line of text from the token stream. This actually refers to the original line splits done when the source was first received.

func (*Tokenizer) GetSource

func (t *Tokenizer) GetSource() string

GetSource returns the entire string of the tokenizer.

func (*Tokenizer) GetTokens

func (t *Tokenizer) GetTokens(pos1, pos2 int, spacing bool) string

GetTokens returns a string representing the tokens within the given range of tokens.

func (*Tokenizer) IsNext

func (t *Tokenizer) IsNext(test Token) bool

IsNext tests to see if the next token is the given token, and if so advances and returns true, else does not advance and returns false.

func (*Tokenizer) Mark

func (t *Tokenizer) Mark() int

Mark returns a location marker in the token stream.

func (*Tokenizer) Next

func (t *Tokenizer) Next() Token

Next gets the next token in the tokenizer.

func (*Tokenizer) NextText

func (t *Tokenizer) NextText() string

NextText gets the next token in the tokenizer and returns its text value as a string.

func (*Tokenizer) Peek

func (t *Tokenizer) Peek(offset int) Token

Peek looks ahead at the next token without advancing the pointer.

func (*Tokenizer) PeekText

func (t *Tokenizer) PeekText(offset int) string

PeekText looks ahead at a token without advancing the pointer, and returns its text value as a string.

func (*Tokenizer) Remainder

func (t *Tokenizer) Remainder() string

Remainder returns the rest of the source, as initially presented to the tokenizer, from the current token position. This allows the caller to get "the rest" of a command line or other element as needed. If the token position is invalid (past the end of the tokens, for example) then an empty string is returned.
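
A minimal sketch of that use, where the command text and the isCode=false choice are illustrative:

tokens := tokenizer.New(`set output "json"`, false)
tokens.Next()              // consume the first token, "set"
rest := tokens.Remainder() // the rest of the command line after "set"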

func (*Tokenizer) Reset

func (t *Tokenizer) Reset()

Reset sets the tokenizer back to the start of the token stream.

func (*Tokenizer) Set

func (t *Tokenizer) Set(mark int)

Set sets the next token to the given marker.

func (*Tokenizer) SetLineNumber

func (t *Tokenizer) SetLineNumber(line int) error

SetLineNumber resets the line numbering. This is done after a prolog that the user might not be aware of is injected, so that errors reported during compilation or at runtime reflect line numbers based on the @line specification rather than the actual literal line number.
