lr

package
v0.1.0
Published: Nov 13, 2021 License: BSD-3-Clause Imports: 14 Imported by: 1

README

Experimental GLR Parser

This package contains an experimental parser which we hope to use some day to parse Markdown. GLR parsers are rare outside of academic research, and there is no easy-to-port implementation in another programming language, so we will muddle ahead and see where we can get.

GLR parsers rely on a special stack structure, called a graph-structured stack (GSS). A GSS can hold information about alternative parser states after a conflict (shift/reduce or reduce/reduce) has occurred.

For further information see, for example:

https://people.eecs.berkeley.edu/~necula/Papers/elkhound_cc04.pdf
https://cs.au.dk/~amoeller/papers/ambiguity/ambiguity.pdf

This is experimental software, currently not intended for production use in any way.

Documentation

Overview

Package lr implements prerequisites for LR parsing. It is mainly intended for Markdown parsing, but may be of use for other purposes, too.

Building a Grammar

Grammars are specified using a grammar builder object. Clients add rules, consisting of non-terminal symbols and terminals. Terminals carry a token value of type int. Grammars may contain epsilon-productions.

Example:

b := lr.NewGrammarBuilder("G")
b.LHS("S").N("A").T("a", 1).EOF()  // S  ->  A a EOF
b.LHS("A").N("B").N("D").End()     // A  ->  B D
b.LHS("B").T("b", 2).End()         // B  ->  b
b.LHS("B").Epsilon()               // B  ->
b.LHS("D").T("d", 3).End()         // D  ->  d
b.LHS("D").Epsilon()               // D  ->

This results in the following trivial grammar:

b.Grammar().Dump()

0: [S] ::= [A a #eof]
1: [A] ::= [B D]
2: [B] ::= [b]
3: [B] ::= []
4: [D] ::= [d]
5: [D] ::= []

Static Grammar Analysis

After the grammar is complete, it has to be analysed. To this end, the grammar is passed to an LRAnalysis object, which computes FIRST and FOLLOW sets for the grammar and determines all epsilon-derivable rules.

Although FIRST and FOLLOW sets are mainly used internally for constructing the parser tables, the methods for getting FIRST(N) and FOLLOW(N) of a non-terminal N are public.

ga := lr.Analysis(g)  // analyser for grammar above
ga.Grammar().EachNonTerminal(
    func(N *lr.Symbol) interface{} {                        // ad-hoc mapper function
        fmt.Printf("FIRST(%s) = %v\n", N.Name, ga.First(N)) // get FIRST-set for N
        return nil
    })

// Output:
FIRST(S) = [1 2 3]         // terminal token values as int, 1 = 'a'
FIRST(A) = [0 2 3]         // 0 = epsilon
FIRST(B) = [0 2]           // 2 = 'b'
FIRST(D) = [0 3]           // 3 = 'd'

Parser Construction

Using the grammar analysis as input, a bottom-up parser can be constructed. First a characteristic finite state machine (CFSM) is built from the grammar. The CFSM is then transformed into a GOTO table (LR(0) table) and an ACTION table for an SLR(1) parser. The CFSM will not be thrown away, but is made available to the client. This is intended for debugging purposes, but may be useful for error recovery, too. It can be exported to Graphviz's Dot format.

Example:

lrgen := NewLRTableGenerator(ga)  // ga is a GrammarAnalysis, see above
lrgen.CreateTables()              // construct LR parser tables

BSD License

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of this software nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Index

Constants

View Source
const (
	EpsilonType = 0
	EOFType     = -1    // pseudo terminal token for end of input
	NonTermType = -1000 // IDs of terminals MUST be in { -2 … -999 }
)

Symbols' values must adhere to these ranges.

View Source
const (
	ShiftAction  = -1
	AcceptAction = -2
)

Actions for parser action tables.

Variables

View Source
var NullItem = Item{nil, 0, 0}

NullItem is the invalid item.

Functions

func ActionTableAsHTML

func ActionTableAsHTML(lrgen *TableGenerator, w io.Writer)

ActionTableAsHTML exports the SLR(1) ACTION-table in HTML-format.

func Dump

func Dump(iset *iteratable.Set)

Dump an item set to the tracer.

func GotoTableAsHTML

func GotoTableAsHTML(lrgen *TableGenerator, w io.Writer)

GotoTableAsHTML exports a GOTO-table in HTML-format.

func StartItem

func StartItem(r *Rule) (Item, *Symbol)

StartItem returns an Earley item from a rule with the dot at position 0.

func T

func T() tracing.Trace

T traces to the global syntax tracer.

Types

type CFSM

type CFSM struct {
	S0 *CFSMState // start state
	// contains filtered or unexported fields
}

CFSM is the characteristic finite state machine for an LR grammar, i.e. the LR(0) state diagram. It will be constructed by a TableGenerator. Clients normally do not use it directly. Nevertheless, some methods are defined on it, e.g. for debugging purposes, or even to compute your own tables from it.

func (*CFSM) CFSM2GraphViz

func (c *CFSM) CFSM2GraphViz(filename string)

CFSM2GraphViz exports a CFSM to the Graphviz Dot format, given a filename.

type CFSMState

type CFSMState struct {
	ID int // serial ID of this state

	Accept bool // is this an accepting state?
	// contains filtered or unexported fields
}

CFSMState is a state within the CFSM for a grammar.

func (*CFSMState) Dump

func (s *CFSMState) Dump()

Dump is a debugging helper.

func (*CFSMState) String

func (s *CFSMState) String() string

type Grammar

type Grammar struct {
	Name string // a grammar has a name, for documentation only

	Epsilon *Symbol // a special symbol representing epsilon
	EOF     *Symbol // a special symbol representing end of input
	// contains filtered or unexported fields
}

Grammar is a type for a grammar. Usually created using a GrammarBuilder.

func (*Grammar) Dump

func (g *Grammar) Dump()

Dump is a debugging helper: dump symbols and rules to stdout.

func (*Grammar) EachNonTerminal

func (g *Grammar) EachNonTerminal(mapper func(sym *Symbol) interface{}) []interface{}

EachNonTerminal iterates over all non-terminal symbols of the grammar. Return values of the mapper function for all non-terminals are returned as an array.

func (*Grammar) EachSymbol

func (g *Grammar) EachSymbol(mapper func(sym *Symbol) interface{}) []interface{}

EachSymbol iterates over all symbols of the grammar. Return values of the mapper function are returned as an array.

func (*Grammar) EachTerminal

func (g *Grammar) EachTerminal(mapper func(sym *Symbol) interface{}) []interface{}

EachTerminal iterates over all terminals of the grammar. Return values of the mapper function for all terminals are returned as an array.

func (*Grammar) FindNonTermRules

func (g *Grammar) FindNonTermRules(sym *Symbol, includeEpsRules bool) *iteratable.Set

FindNonTermRules returns a set of Earley items, one for each rule with the given LHS symbol, each with the dot at position 0.

func (*Grammar) Rule

func (g *Grammar) Rule(no int) *Rule

Rule gets a grammar rule.

func (*Grammar) Size

func (g *Grammar) Size() int

Size returns the number of rules in the grammar.

func (*Grammar) SymbolByName

func (g *Grammar) SymbolByName(name string) *Symbol

SymbolByName gets a symbol for a given name, if found in the grammar.

func (*Grammar) Terminal

func (g *Grammar) Terminal(tokval int) *Symbol

Terminal returns the terminal symbol for a given token value, if it is defined in the grammar.

type GrammarBuilder

type GrammarBuilder struct {
	// contains filtered or unexported fields
}

A GrammarBuilder is used to construct a Grammar.

b := NewGrammarBuilder("G")
b.LHS("S").N("A").T("a", 1).End()  // S  ->  A a
b.LHS("A").N("B").N("D").End()     // A  ->  B D
b.LHS("B").T("b", 2).End()         // B  ->  b
b.LHS("B").Epsilon()               // B  ->
b.LHS("D").T("d", 3).End()         // D  ->  d
b.LHS("D").Epsilon()               // D  ->

This results in the following grammar: b.Grammar().Dump()

0: [S'] ::= [S]
1: [S] ::= [A a]
2: [A] ::= [B D]
3: [B] ::= [b]
4: [B] ::= []
5: [D] ::= [d]
6: [D] ::= []

A call to b.Grammar() returns the (completed) grammar.

func NewGrammarBuilder

func NewGrammarBuilder(gname string) *GrammarBuilder

NewGrammarBuilder gets a new grammar builder, given the name of the grammar to build.

func (*GrammarBuilder) Grammar

func (gb *GrammarBuilder) Grammar() (*Grammar, error)

Grammar returns the (completed) grammar.

func (*GrammarBuilder) LHS

func (gb *GrammarBuilder) LHS(s string) *RuleBuilder

LHS starts a rule given the left hand side symbol (non-terminal).

func (*GrammarBuilder) SetTokenizerHook

func (gb *GrammarBuilder) SetTokenizerHook(hook TokenizerHook)

SetTokenizerHook sets a tokenizer hook, which will be called by the grammar to produce terminal tokens.

type Item

type Item struct {
	Origin uint64 // start position in input
	// contains filtered or unexported fields
}

Item is an Earley item.

func (Item) Advance

func (i Item) Advance() Item

Advance advances the dot of an item over the next symbol. Returns NullItem if the dot is already past the last symbol.

func (Item) PeekSymbol

func (i Item) PeekSymbol() *Symbol

PeekSymbol returns the symbol after the dot, if any.

func (Item) Prefix

func (i Item) Prefix() []*Symbol

Prefix returns the symbols of the RHS before the dot. The result is a slice, so it should be treated as read-only.

func (Item) Rule

func (i Item) Rule() *Rule

Rule returns the grammar rule of this item.

func (Item) String

func (i Item) String() string

type LRAnalysis

type LRAnalysis struct {
	// contains filtered or unexported fields
}

LRAnalysis is an object for grammar analysis (compute FIRST and FOLLOW sets).

func Analysis

func Analysis(g *Grammar) *LRAnalysis

Analysis creates an analyser for a grammar. The analyser immediately starts its work and computes FIRST and FOLLOW sets.

func (*LRAnalysis) CleanUp

func (ga *LRAnalysis) CleanUp() error

CleanUp cleans a context free grammar by removing unproductive non-terminals and rules, and by removing unreachable non-terminals.

If the returned error is of kind ProductionsRemovedError, its sub-errors give details about the removals. Check with errors.Is(…).

func (*LRAnalysis) DerivesEpsilon

func (ga *LRAnalysis) DerivesEpsilon(sym *Symbol) bool

DerivesEpsilon returns true if there are rules in the grammar which let a given symbol sym be derived to epsilon.

func (*LRAnalysis) First

func (ga *LRAnalysis) First(sym *Symbol) *intsets.Sparse

First returns the FIRST set for a non-terminal. Returns a set of token values.

func (*LRAnalysis) Follow

func (ga *LRAnalysis) Follow(sym *Symbol) *intsets.Sparse

Follow returns the FOLLOW set for a non-terminal. Returns a set of token values.

func (*LRAnalysis) Grammar

func (ga *LRAnalysis) Grammar() *Grammar

Grammar returns the grammar this analyser operates on.

type Rule

type Rule struct {
	Serial int     // order number of this rule within a grammar
	LHS    *Symbol // symbol of the left hand side
	// contains filtered or unexported fields
}

A Rule is a type for rules of a grammar. Rules cannot be shared between grammars.

func (*Rule) IsEps

func (r *Rule) IsEps() bool

IsEps returns true if this is an epsilon-rule.

func (*Rule) RHS

func (r *Rule) RHS() []*Symbol

RHS gets the right hand side of a rule as a shallow copy. Clients should treat it as read-only.

func (*Rule) String

func (r *Rule) String() string

type RuleBuilder

type RuleBuilder struct {
	// contains filtered or unexported fields
}

RuleBuilder is a builder type for rules.

func (*RuleBuilder) AppendSymbol

func (rb *RuleBuilder) AppendSymbol(sym *Symbol) *RuleBuilder

AppendSymbol appends your own symbol objects to the builder to extend the RHS of a rule. Clients have to make sure that no two different symbols have the same ID and that no symbol ID equals a token value of a non-terminal. This restriction is necessary to help produce correct GOTO tables for LR-parsing.

func (*RuleBuilder) EOF

func (rb *RuleBuilder) EOF() *Rule

EOF appends EOF as a (terminal) symbol to a rule. This is usually not called by clients, but rather internally by the grammar builder. If you do call it yourself, be careful.

This completes the rule (no other builder calls should be made for this rule).

func (*RuleBuilder) End

func (rb *RuleBuilder) End() *Rule

End a rule. This completes the rule (no other builder calls should be made for this rule).

func (*RuleBuilder) Epsilon

func (rb *RuleBuilder) Epsilon() *Rule

Epsilon sets epsilon as the RHS of a production. It must be called directly after rb.LHS(...). It closes the rule, so no call to End() or EOF() should follow.

func (*RuleBuilder) L

func (rb *RuleBuilder) L(s string) *RuleBuilder

L appends a terminal/lexeme to the builder. This will create a symbol for a terminal with a token value > -1000. This is due to the convention of the stdlib package text/scanner, which uses token values > 0 for single-rune tokens and token values < 0 for common language elements like identifiers, strings, numbers, etc. (it is assumed that no symbol set will require more than 1000 such language elements). The method call will panic if this restriction is violated.

The token value will either be generated from an internal sequence, or – if a tokenizer-hook is set – by the hook.

func (*RuleBuilder) N

func (rb *RuleBuilder) N(s string) *RuleBuilder

N appends a non-terminal to the builder. The internal symbol created for the non-terminal will have an ID less than -1000.

func (*RuleBuilder) T

func (rb *RuleBuilder) T(s string, tokval int) *RuleBuilder

T appends a terminal to the builder. The symbol created for the terminal must have a token value > -1000 and must not have the value 0 or -1. This is due to the convention of the stdlib package text/scanner, which uses token values > 0 for single-rune tokens and token values < 0 for common language elements like identifiers, strings, numbers, etc. (it is assumed that no symbol set will require more than 1000 such language elements). The method call will panic if this restriction is violated.

type Span

type Span [2]uint64 // (x…y)

Span is a small type for capturing the extent of a run of input tokens. For every terminal and non-terminal, a parse tree/forest will track which input positions this symbol covers. A span denotes a start position and the position just behind the end.

func (*Span) From

func (s *Span) From() uint64

From returns the start value of a span.

func (*Span) Len

func (s *Span) Len() uint64

Len returns the length of (x…y).

func (*Span) String

func (s *Span) String() string

func (*Span) To

func (s *Span) To() uint64

To returns the end value of a span.

type Symbol

type Symbol struct {
	Name  string // visual representation, if any
	Value int    // ID or token value
}

Symbol is a symbol type used for grammars and grammar builders.

func (*Symbol) IsTerminal

func (lrsym *Symbol) IsTerminal() bool

IsTerminal returns true if this symbol represents a terminal.

func (*Symbol) String

func (lrsym *Symbol) String() string

func (*Symbol) Token

func (lrsym *Symbol) Token() int

Token is just an alias for lrsym.Value.

type TableGenerator

type TableGenerator struct {
	HasConflicts bool
	// contains filtered or unexported fields
}

TableGenerator is a generator object to construct LR parser tables. Clients usually create a Grammar G, then an LRAnalysis object for G, and then a table generator. TableGenerator.CreateTables() constructs the CFSM and the parser tables for an LR parser recognizing grammar G.

func NewTableGenerator

func NewTableGenerator(ga *LRAnalysis) *TableGenerator

NewTableGenerator creates a new TableGenerator for a (previously analysed) grammar.

func (*TableGenerator) AcceptingStates

func (lrgen *TableGenerator) AcceptingStates() []int

AcceptingStates returns all states of the CFSM which represent an accept action. Clients have to call CreateTables() first.

func (*TableGenerator) ActionTable

func (lrgen *TableGenerator) ActionTable() *sparse.IntMatrix

ActionTable returns the ACTION table for LR-parsing a grammar. The tables have to be built by calling CreateTables() previously (or by a separate call to BuildSLR1ActionTable(...)).

func (*TableGenerator) BuildGotoTable

func (lrgen *TableGenerator) BuildGotoTable() *sparse.IntMatrix

BuildGotoTable builds the GOTO table. This is normally not called directly, but rather via CreateTables().

func (*TableGenerator) BuildLR0ActionTable

func (lrgen *TableGenerator) BuildLR0ActionTable() (*sparse.IntMatrix, bool)

BuildLR0ActionTable constructs the LR(0) ACTION table. This method is not called by CreateTables(), as we normally use an SLR(1) parser and therefore an action table with lookahead included. This method is provided as an add-on.

func (*TableGenerator) BuildSLR1ActionTable

func (lrgen *TableGenerator) BuildSLR1ActionTable() (*sparse.IntMatrix, bool)

BuildSLR1ActionTable constructs the SLR(1) Action table. This method is normally not called by clients, but rather via CreateTables(). It builds an action table including lookahead (using the FOLLOW-set created by the grammar analyzer).

func (*TableGenerator) CFSM

func (lrgen *TableGenerator) CFSM() *CFSM

CFSM returns the characteristic finite state machine (CFSM) for a grammar. Usually clients call lrgen.CreateTables() beforehand, but it is possible to call lrgen.CFSM() directly. The CFSM will be created, if it has not been constructed previously.

func (*TableGenerator) CreateTables

func (lrgen *TableGenerator) CreateTables()

CreateTables creates the necessary data structures for an SLR parser.

func (*TableGenerator) GotoTable

func (lrgen *TableGenerator) GotoTable() *sparse.IntMatrix

GotoTable returns the GOTO table for LR-parsing a grammar. The tables have to be built by calling CreateTables() previously (or by a separate call to BuildGotoTable(...)).

type TokenizerHook

type TokenizerHook interface {
	NewToken(string) (string, int)
}

TokenizerHook is an interface for a hook function, which will produce properties for a valid terminal token for this grammar.

Directories

Path Synopsis
dss Package dss implements variants of a DAG-structured stack (DSS).
earley Package earley provides an Earley-Parser.
glr Package glr implements a small-scale GLR(1)-parser.
iteratable Package iteratable implements iteratable container data structures.
slr Package slr provides an SLR(1)-parser.
sparse Package sparse implements a simple type for sparse integer matrices.
sppf Package sppf implements a "Shared Packed Parse Forest".
