lexer

A framework for building lexers. See examples/json for a working example of how to build a JSON lexer with this framework.

How to use

Go get it with

go get -u github.com/TimSatke/lexer

Define your own states as you can see in examples/json/lexer.go.

Why should you use this

  • You want to build a parser and need a lexer, but don't want to build and test the groundwork yourself; this framework has already done that part.

Will there be a parser framework?

Eventually, yes; at the moment, no. This is due to my lack of experience with parser architecture and design. Feel free to create one using this framework, though.

How does this lexer work?

First, you create a lexer and give it a starting state. This state is executed (a state is a function that can consume runes from the lexer's input) and must return the next state to transition into.
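
Conceptually, the lexer runs a loop like the following sketch (simplified; the real implementation also manages the token stream):

func run(l lexer.Lexer, start lexer.State) {
	for state := start; state != nil; {
		state = state(l) // execute the current state; it returns the next one
	}
}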

The base lexer implementation provided with this package works on a byte slice. It has two markers, start and pos: pos is the current position in the input byte slice, and start is the position at which the last token was emitted. Upon emit, start is set to pos, marking the start position of the next token.
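
For illustration, assume the input "abc " and a hypothetical token type TokenWord; start and pos both begin at 0:

l.Next()          // consumes 'a', pos = 1
l.Next()          // consumes 'b', pos = 2
l.Next()          // consumes 'c', pos = 3
l.Emit(TokenWord) // emits "abc" as a TokenWord, then sets start = 3
l.Next()          // consumes ' ', pos = 4
l.Ignore()        // discards " " without emitting a token, start = 4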

Whenever the lexer encounters unexpected runes, it is recommended to emit an error token and return nil as the next state. This stops the lexer from processing any further input; the sketch below shows one way to emit errors. The error tokens can then be processed by your parser, which consumes the lexer's token stream (retrieve it with lexer.TokenStream().Tokens()).
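
A sketch of a state that emits an error (lexRoot, digits, TokenError and TokenNumber are hypothetical names you would define yourself):

func lexNumber(l lexer.Lexer) lexer.State {
	if l.AcceptMultiple(digits) == 0 {
		l.EmitError(TokenError, "expected at least one digit")
		return nil // nil stops the lexer
	}
	l.Emit(TokenNumber)
	return lexRoot
}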

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func New

func New(input []byte, start State) *baseLexer

New returns a usable lexer implementation. It takes the input to be lexed and a start state, which will be executed until nil is returned as the next state.

Types

type CharacterClass

type CharacterClass interface {
	// Matches returns whether the given rune is matched by this character
	// class.
	Matches(rune) bool
	String() string
}

CharacterClass is an interface providing methods for matching runes.

Implementations are

lexer.StringCharacterClass
lexer.NotStringCharacterClass

Both of these can be used to define constant character classes. See their documentation for more information.
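
Since CharacterClass is a plain interface, you can also supply your own implementation. A minimal sketch of a custom class that matches ASCII digits:

type digitClass struct{}

// Matches reports whether r is an ASCII digit.
func (digitClass) Matches(r rune) bool { return r >= '0' && r <= '9' }

// String describes the class, e.g. for error messages.
func (digitClass) String() string { return "[0-9]" }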

type Lexer

type Lexer interface {
	// StartLexing will cause the lexer to start pushing tokens onto the token
	// stream. See the documentation of the implementing struct for information
	// on how to use this.
	StartLexing()

	// TokenStream returns the token stream that the lexer will push tokens
	// onto.
	TokenStream() token.Stream
	// Emit pushes a token of the given type, with its position and all
	// consumed runes, onto the token stream.
	Emit(token.Type)
	// EmitError emits an error token with the given error token type (that was
	// defined by you) and a given error message.
	EmitError(token.Type, string)
	// IsEOF determines whether the lexer has already reached the end of the
	// input.
	IsEOF() bool
	// Peek reads the next rune, but does not consume it. Peek does not advance
	// the lexer position in the input.
	Peek() rune
	// Next reads the next rune and consumes it. Next advances the lexer
	// position in the input by the byte-width of the read rune.
	Next() rune
	// Ignore discards all consumed runes. This behaves like Emit(...), except
	// it doesn't create/push a token onto the token stream.
	Ignore()
	// Backup unreads the last consumed rune.
	Backup()

	// Accept consumes the next rune, if and only if it is matched by the given
	// character class. Accept returns true if the next rune was matched and
	// consumed.
	Accept(CharacterClass) bool
	// AcceptMultiple consumes the next N runes that are matched by the given
	// character class. AcceptMultiple returns the number of runes that were
	// matched.
	AcceptMultiple(CharacterClass) uint
}

Lexer is an interface providing all necessary methods for lexing text. There is a default implementation for UTF-8 input, which can be used as follows.

func main() {
	// input ([]byte) and lexRoot (State) are assumed to be defined elsewhere.
	l := lexer.New(input, lexRoot)
	_ = l // call l.StartLexing() to begin pushing tokens onto the token stream
}

See the examples in the godoc for more information.

type NotStringCharacterClass

type NotStringCharacterClass string

NotStringCharacterClass is an implementation of lexer.CharacterClass, which matches runes that are NOT contained in the string used to define the character class.

const NonWhitespace = lexer.NotStringCharacterClass(" \t") // will match all runes that are neither ' ' nor '\t'

func (NotStringCharacterClass) Matches

func (s NotStringCharacterClass) Matches(r rune) bool

Matches returns true if the given rune is NOT contained inside the definition of this character class.
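
For example:

nonWS := lexer.NotStringCharacterClass(" \t")
nonWS.Matches('x')  // true: 'x' is not in " \t"
nonWS.Matches('\t') // false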

func (NotStringCharacterClass) String

func (s NotStringCharacterClass) String() string

type State

type State func(Lexer) State

State is a recursive type definition for lexer states; the states themselves are executed iteratively, not recursively.

See the following example. The goal is to lex input that matches exactly "ABC" and tokenize it into three tokens: TokenA, TokenB and TokenC. The example below shows how to define states to achieve this (without error handling, just to show the sequencing).

const (
	TokenA MyTokenType = iota
	TokenB
	TokenC
)
const (
	CCA = lexer.StringCharacterClass("A")
	CCB = lexer.StringCharacterClass("B")
	CCC = lexer.StringCharacterClass("C")
)
func lexABCString(l lexer.Lexer) lexer.State {
	return lexA
}
func lexA(l lexer.Lexer) lexer.State {
	l.Accept(CCA)
	l.Emit(TokenA)
	return lexB
}
func lexB(l lexer.Lexer) lexer.State {
	l.Accept(CCB)
	l.Emit(TokenB)
	return lexC
}
func lexC(l lexer.Lexer) lexer.State {
	l.Accept(CCC)
	l.Emit(TokenC)
	return nil
}

The lexer will start with lexABCString (assuming that this is the start State you passed when creating the lexer) and execute it. The lexer passed in is the lexer you are working with. lexABCString does nothing with the lexer and returns lexA as the next state, so the lexer executes lexA next, passing in the same lexer as before. lexA accepts an "A", emits a TokenA and returns lexB. The lexer now executes lexB, which does almost the same as lexA and returns lexC, so lexC is executed next. lexC returns nil as the next state, which tells the lexer that the state machine is done; it stops execution and closes the token stream.
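
A possible driver for this example. This is a sketch only: it assumes that StartLexing may run in its own goroutine and that Tokens() returns a channel that is closed once the lexer is done; consult the implementation's documentation to confirm both.

func main() {
	l := lexer.New([]byte("ABC"), lexABCString)
	go l.StartLexing()
	for tok := range l.TokenStream().Tokens() {
		fmt.Println(tok) // TokenA, then TokenB, then TokenC
	}
}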

type StringCharacterClass

type StringCharacterClass string

StringCharacterClass is an implementation of lexer.CharacterClass, which matches runes that are contained in the string used to define the character class.

const WhitespaceNoLinefeed = lexer.StringCharacterClass(" \t") // will match all runes that are either ' ' or '\t'

func (StringCharacterClass) Matches

func (s StringCharacterClass) Matches(r rune) bool

Matches returns true if the given rune is contained inside the definition of this character class.
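
For example:

ws := lexer.StringCharacterClass(" \t")
ws.Matches(' ') // true
ws.Matches('x') // false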

func (StringCharacterClass) String

func (s StringCharacterClass) String() string

Directories

examples
