lexer

package module
Version: v0.0.0-...-e884d4b
Published: Nov 22, 2018 License: BSD-3-Clause Imports: 15 Imported by: 2

README

github.com/cznic/lexer has moved to modernc.org/lexer (vcs).

Please update your import paths to modernc.org/lexer.

This repo is now archived.

Documentation

Overview

Package lexer generates actionless scanners (lexeme recognizers) at run time.

Scanners are defined by regular expressions and/or lexical grammars, together with a mapping between those definitions and numeric token identifiers, and an optional set of starting id sets that provides functionality similar to switching start states in *nix LEX. The generated FSMs are Unicode rune based, and all unicode.Categories and unicode.Scripts are supported by the regexp syntax using the \p{name} construct.

Syntax supported by ParseRE (currently a very basic subset of RE2; the docs below are a modification of http://code.google.com/p/re2/wiki/Syntax, original docs license unclear):

Single characters:

.            any character, excluding newline
[xyz]        character class
[^xyz]       negated character class
\p{Greek}    Unicode character class
\P{Greek}    negated Unicode character class

Composites:

xy           x followed by y
x|y          x or y

Repetitions:

x*           zero or more x
x+           one or more x
x?           zero or one x

Grouping:

(re)         group

Empty strings:

^            at beginning of text or line
$            at end of text or line
\A           at beginning of text
\z           at end of text

Escape sequences:

\a           bell (≡ \007)
\b           backspace (≡ \010)
\f           form feed (≡ \014)
\n           newline (≡ \012)
\r           carriage return (≡ \015)
\t           horizontal tab (≡ \011)
\v           vertical tab character (≡ \013)
\M           M is one of metachars \.+*?()|[]^$
\xhh         rune \u00hh, h is a hex digit

Character class elements:

x            single Unicode character
A-Z          Unicode character range (inclusive)

Unicode character class names--general category:

Cc           control
Cf           format
Co           private use
Cs           surrogate
letter       Lu, Ll, Lt, Lm, or Lo
Ll           lowercase letter
Lm           modifier letter
Lo           other letter
Lt           titlecase letter
Lu           uppercase letter
Mc           spacing mark
Me           enclosing mark
Mn           non-spacing mark
Nd           decimal number
Nl           letter number
No           other number
Pc           connector punctuation
Pd           dash punctuation
Pe           close punctuation
Pf           final punctuation
Pi           initial punctuation
Po           other punctuation
Ps           open punctuation
Sc           currency symbol
Sk           modifier symbol
Sm           math symbol
So           other symbol
Zl           line separator
Zp           paragraph separator
Zs           space separator

Unicode character class names--scripts:

Arabic                 Arabic
Armenian               Armenian
Avestan                Avestan
Balinese               Balinese
Bamum                  Bamum
Bengali                Bengali
Bopomofo               Bopomofo
Braille                Braille
Buginese               Buginese
Buhid                  Buhid
Canadian_Aboriginal    Canadian Aboriginal
Carian                 Carian
Common                 Common
Coptic                 Coptic
Cuneiform              Cuneiform
Cypriot                Cypriot
Cyrillic               Cyrillic
Deseret                Deseret
Devanagari             Devanagari
Egyptian_Hieroglyphs   Egyptian Hieroglyphs
Ethiopic               Ethiopic
Georgian               Georgian
Glagolitic             Glagolitic
Gothic                 Gothic
Greek                  Greek
Gujarati               Gujarati
Gurmukhi               Gurmukhi
Hangul                 Hangul
Han                    Han
Hanunoo                Hanunoo
Hebrew                 Hebrew
Hiragana               Hiragana
Cham                   Cham
Cherokee               Cherokee
Imperial_Aramaic       Imperial Aramaic
Inherited              Inherited
Inscriptional_Pahlavi  Inscriptional Pahlavi
Inscriptional_Parthian Inscriptional Parthian
Javanese               Javanese
Kaithi                 Kaithi
Kannada                Kannada
Katakana               Katakana
Kayah_Li               Kayah Li
Kharoshthi             Kharoshthi
Khmer                  Khmer
Lao                    Lao
Latin                  Latin
Lepcha                 Lepcha
Limbu                  Limbu
Linear_B               Linear B
Lisu                   Lisu
Lycian                 Lycian
Lydian                 Lydian
Malayalam              Malayalam
Meetei_Mayek           Meetei Mayek
Mongolian              Mongolian
Myanmar                Myanmar
New_Tai_Lue            New Tai Lue
Nko                    Nko
Ogham                  Ogham
Old_Italic             Old Italic
Old_Persian            Old Persian
Old_South_Arabian      Old South Arabian
Old_Turkic             Old Turkic
Ol_Chiki               Ol Chiki
Oriya                  Oriya
Osmanya                Osmanya
Phags_Pa               Phags Pa
Phoenician             Phoenician
Rejang                 Rejang
Runic                  Runic
Samaritan              Samaritan
Saurashtra             Saurashtra
Shavian                Shavian
Sinhala                Sinhala
Sundanese              Sundanese
Syloti_Nagri           Syloti Nagri
Syriac                 Syriac
Tagalog                Tagalog
Tagbanwa               Tagbanwa
Tai_Le                 Tai Le
Tai_Tham               Tai Tham
Tai_Viet               Tai Viet
Tamil                  Tamil
Telugu                 Telugu
Thaana                 Thaana
Thai                   Thai
Tibetan                Tibetan
Tifinagh               Tifinagh
Ugaritic               Ugaritic
Vai                    Vai
Yi                     Yi
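
The class names listed above match the keys of the standard library tables unicode.Categories and unicode.Scripts. The following is a minimal sketch of how a \p{name} or \P{name} class could be resolved against those tables. The helpers lookupClass and inClass are hypothetical names invented for this illustration; they are not part of package lexer and do not claim to mirror its internals.

```go
package main

import (
	"fmt"
	"unicode"
)

// lookupClass resolves a class name against the standard library tables,
// trying general categories first and scripts second.
// Hypothetical helper, not part of package lexer.
func lookupClass(name string) (*unicode.RangeTable, bool) {
	if t, ok := unicode.Categories[name]; ok {
		return t, true
	}
	t, ok := unicode.Scripts[name]
	return t, ok
}

// inClass reports whether r belongs to the named class.
// negate mimics \P{name}. Unknown names never match in this sketch.
func inClass(name string, r rune, negate bool) bool {
	t, ok := lookupClass(name)
	if !ok {
		return false
	}
	return unicode.Is(t, r) != negate
}

func main() {
	fmt.Println(inClass("Greek", 'α', false)) // true
	fmt.Println(inClass("Greek", 'a', false)) // false
	fmt.Println(inClass("Lu", 'A', false))    // true
	fmt.Println(inClass("Nd", '7', true))     // false
}
```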

Types

type AssertEdge

type AssertEdge struct {
	EpsilonEdge
	Asserts EdgeAssert
}

AssertEdge is a non-consuming edge which asserts line/text start/end.

func NewAssertEdge

func NewAssertEdge(target *NfaState, asserts EdgeAssert) *AssertEdge

NewAssertEdge returns a new AssertEdge pointing to target, asserting asserts.

func (*AssertEdge) Accepts

func (e *AssertEdge) Accepts(s *ScannerSource) bool

Accepts is the AssertEdge implementation of the Edger interface.

func (*AssertEdge) String

func (e *AssertEdge) String() (s string)

type EOFReader

type EOFReader int

EOFReader implements a RuneReader always returning 0 (EOF).

func (EOFReader) ReadRune

func (r EOFReader) ReadRune() (arune rune, size int, err error)
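
A reader of this kind can be sketched in a few lines. The type eofReader below is a hypothetical stand-in written for illustration; it assumes that "returning 0 (EOF)" means a zero rune, a zero size and io.EOF as the error, which the documentation above does not spell out.

```go
package main

import (
	"fmt"
	"io"
)

// eofReader is a hypothetical stand-in for EOFReader.
type eofReader int

// ReadRune never yields data: it reports a zero rune, zero size and io.EOF.
// The exact return values of the real EOFReader are an assumption here.
func (eofReader) ReadRune() (r rune, size int, err error) {
	return 0, 0, io.EOF
}

// Compile-time check that eofReader satisfies io.RuneReader.
var _ io.RuneReader = eofReader(0)

func main() {
	r, size, err := eofReader(0).ReadRune()
	fmt.Println(r, size, err == io.EOF) // 0 0 true
}
```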

type EdgeAssert

type EdgeAssert int
const (
	TextStart EdgeAssert = iota
	TextEnd
	LineStart
	LineEnd
)
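
To illustrate what the four assertions mean, here is a minimal sketch that evaluates them from the previous and current rune. It relies on the conventions documented for ScannerSource (the previous rune is zero before the first Move, the current rune is zero at EOF) and on RE2-style semantics for line boundaries. The names edgeAssert and holds are hypothetical, and the real AssertEdge.Accepts may decide differently.

```go
package main

import "fmt"

// edgeAssert mirrors the four documented assertions under a hypothetical name.
type edgeAssert int

const (
	textStart edgeAssert = iota
	textEnd
	lineStart
	lineEnd
)

// holds reports whether assertion a is satisfied at the boundary between
// prev and cur. Assumptions: prev == 0 means nothing has been read yet,
// cur == 0 means EOF, and '\n' is the only line terminator.
func holds(a edgeAssert, prev, cur rune) bool {
	switch a {
	case textStart:
		return prev == 0
	case textEnd:
		return cur == 0
	case lineStart:
		return prev == 0 || prev == '\n'
	case lineEnd:
		return cur == 0 || cur == '\n'
	}
	return false
}

func main() {
	// Boundaries of the text "a\nb", written as (prev, cur) pairs.
	fmt.Println(holds(textStart, 0, 'a'))    // true
	fmt.Println(holds(lineEnd, 'a', '\n'))   // true
	fmt.Println(holds(lineStart, '\n', 'b')) // true
	fmt.Println(holds(textEnd, 'b', 0))      // true
	fmt.Println(holds(textStart, '\n', 'b')) // false
}
```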

type Edger

type Edger interface {
	Accepts(s *ScannerSource) bool // Accepts() returns whether an edge accepts the ScannerSource present state.
	Priority() int                 // Priority returns the priority tag of an edge (lower value wins).
	Target() *NfaState             // Target() returns the edge's target NFA state.
	String() string
	SetTarget(s *NfaState) *NfaState // SetTarget() assigns s as a new target and returns the original Target
}

Edger interface defines the method set for all NFA edge types.

type EpsilonEdge

type EpsilonEdge struct {
	Prio int
	Targ *NfaState
}

EpsilonEdge is a non-consuming, always accepting NFA edge.

func (*EpsilonEdge) Accepts

func (e *EpsilonEdge) Accepts(s *ScannerSource) bool

Accepts is the EpsilonEdge implementation of the Edger interface.

func (*EpsilonEdge) Priority

func (e *EpsilonEdge) Priority() int

Priority is the EpsilonEdge implementation of the Edger interface.

func (*EpsilonEdge) SetTarget

func (e *EpsilonEdge) SetTarget(s *NfaState) (old *NfaState)

func (*EpsilonEdge) String

func (e *EpsilonEdge) String() (s string)

func (*EpsilonEdge) Target

func (e *EpsilonEdge) Target() *NfaState

Target is the EpsilonEdge implementation of the Edger interface.

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

func CompileLexer

func CompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer, err error)

TODO:full docs

func MustCompileLexer

func MustCompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer)

MustCompileLexer is like CompileLexer but panics if the definitions cannot be compiled. It simplifies safe initialization of global variables holding compiled Lexers.

func (*Lexer) Scanner

func (lx *Lexer) Scanner(fname string, r io.RuneReader) *Scanner

Scanner returns a new Scanner which can run the Lexer FSM. A Scanner is not safe for concurrent access but many Scanners can safely share the same Lexer.

The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.

func (*Lexer) String

func (lx *Lexer) String() (s string)

type Nfa

type Nfa []*NfaState

Nfa is a set of NfaStates.

func (*Nfa) AddState

func (n *Nfa) AddState(s *NfaState) *NfaState

AddState adds an existing NfaState to Nfa. One NfaState should not appear in more than one Nfa because the NfaState Index property should always reflect its position in the owner Nfa.

func (*Nfa) NewState

func (n *Nfa) NewState() (s *NfaState)

NewState returns a newly created NfaState and adds it to the Nfa.

func (*Nfa) OneOrMore

func (n *Nfa) OneOrMore(in, out *NfaState) (from, to *NfaState)

OneOrMore converts a Nfa component C to C+

func (*Nfa) ParseRE

func (n *Nfa) ParseRE(name, re string) (in, out *NfaState, err error)

ParseRE compiles the regular expression re into the Nfa. It returns the starting and accepting states of the resulting component, or an error if any.

func (Nfa) String

func (n Nfa) String() (s string)

func (*Nfa) ZeroOrMore

func (n *Nfa) ZeroOrMore(in, out *NfaState) (from, to *NfaState)

ZeroOrMore converts a Nfa component C to C*

func (*Nfa) ZeroOrOne

func (n *Nfa) ZeroOrOne(in, out *NfaState) (from, to *NfaState)

ZeroOrOne converts a Nfa component C to C?
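
The three conversions above correspond to the classic Thompson-style wrapping of an NFA component with epsilon edges. The sketch below shows one textbook way to do it on a deliberately tiny NFA type. Everything in it (state, lit, match and the lowercase wrappers) is hypothetical and written for illustration; it does not use NfaState or Edger and may differ from how the package wires its edges.

```go
package main

import "fmt"

// state is a minimal NFA state: eps holds non-consuming edges,
// on holds consuming edges keyed by rune.
type state struct {
	eps []*state
	on  map[rune]*state
}

func newState() *state { return &state{on: map[rune]*state{}} }

// lit builds the component that accepts the single rune r.
func lit(r rune) (in, out *state) {
	in, out = newState(), newState()
	in.on[r] = out
	return
}

// zeroOrOne wraps component C = (in, out) into C?.
func zeroOrOne(in, out *state) (from, to *state) {
	from, to = newState(), newState()
	from.eps = append(from.eps, in, to) // enter C or skip it
	out.eps = append(out.eps, to)
	return
}

// oneOrMore wraps component C = (in, out) into C+.
func oneOrMore(in, out *state) (from, to *state) {
	from, to = newState(), newState()
	from.eps = append(from.eps, in)
	out.eps = append(out.eps, in, to) // loop back or leave
	return
}

// zeroOrMore builds C* as (C+)?.
func zeroOrMore(in, out *state) (from, to *state) {
	return zeroOrOne(oneOrMore(in, out))
}

// closure adds s and everything reachable over epsilon edges to set.
func closure(s *state, set map[*state]bool) {
	if set[s] {
		return
	}
	set[s] = true
	for _, t := range s.eps {
		closure(t, set)
	}
}

// match reports whether the component (from, to) accepts the whole input.
func match(from, to *state, input string) bool {
	cur := map[*state]bool{}
	closure(from, cur)
	for _, r := range input {
		next := map[*state]bool{}
		for s := range cur {
			if t, ok := s.on[r]; ok {
				closure(t, next)
			}
		}
		cur = next
	}
	return cur[to]
}

func main() {
	pf, pt := oneOrMore(lit('a'))
	fmt.Println(match(pf, pt, ""), match(pf, pt, "aaa")) // false true
	qf, qt := zeroOrOne(lit('a'))
	fmt.Println(match(qf, qt, ""), match(qf, qt, "aa")) // true false
	sf, st := zeroOrMore(lit('a'))
	fmt.Println(match(sf, st, ""), match(sf, st, "aaaa")) // true true
}
```

Each wrapper mutates the component it receives, so the sketch builds a fresh lit('a') for every example rather than reusing one.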

type NfaState

type NfaState struct {
	Index        uint    // Index of this state in its owning NFA.
	Consuming    []Edger // The NFA state consuming edge set.
	NonConsuming []Edger // The NFA state non-consuming edge set.
}

NfaState describes a single NFA state.

func (*NfaState) AddConsuming

func (n *NfaState) AddConsuming(edge Edger) Edger

AddConsuming adds an Edger to the state's consuming edge set and returns the Edger. No checks are made if the edge really is a consuming edge.

func (*NfaState) AddNonConsuming

func (n *NfaState) AddNonConsuming(edge Edger) Edger

AddNonConsuming adds an Edger to the state's non-consuming edge set and returns the Edger. No checks are made if the edge really is a non-consuming edge.

func (*NfaState) String

func (n *NfaState) String() (s string)

type RangesEdge

type RangesEdge struct {
	EpsilonEdge
	Invert bool                // Accepts all but Ranges as in [^exp]
	Ranges *unicode.RangeTable // Accepted arune set
}

RangesEdge is a consuming edge which accepts rune ranges, except U+0000.

func NewRangesEdge

func NewRangesEdge(target *NfaState, invert bool, ranges *unicode.RangeTable) *RangesEdge

NewRangesEdge returns a new RangesEdge pointing to target which accepts ranges.

func (*RangesEdge) Accepts

func (e *RangesEdge) Accepts(s *ScannerSource) bool

Accepts is the RangesEdge implementation of the Edger interface.

func (*RangesEdge) String

func (e *RangesEdge) String() (s string)

type RuneEdge

type RuneEdge struct {
	EpsilonEdge
	Rune rune
}

RuneEdge is a consuming edge which accepts a single rune.

func NewRuneEdge

func NewRuneEdge(target *NfaState, arune rune) *RuneEdge

NewRuneEdge returns a new RuneEdge pointing to target which accepts arune.

func (*RuneEdge) Accepts

func (e *RuneEdge) Accepts(s *ScannerSource) bool

Accepts is the RuneEdge implementation of the Edger interface.

func (*RuneEdge) String

func (e *RuneEdge) String() string

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

func (*Scanner) Begin

func (s *Scanner) Begin(state StartSetID)

Begin switches the Scanner's start state (start set).

func (*Scanner) Include

func (s *Scanner) Include(fname string, r io.RuneReader)

Include includes a RuneReader having fname. Recursive including is not checked. Include discards the one-rune lookahead data, if any. Lookahead data exists iff Next() has been called and Move() has not yet been called afterwards.

func (*Scanner) PopState

func (s *Scanner) PopState()

PopState pops the top of the stack and switches to it via Begin().

func (*Scanner) Position

func (s *Scanner) Position() token.Position

Position returns the current Scanner position, i.e. after a Scan() it returns the position after the current token.

func (*Scanner) PushState

func (s *Scanner) PushState(newState StartSetID)

PushState pushes the current start condition onto the top of the start condition stack and switches to newState as though you had used Begin(newState).

func (*Scanner) Scan

func (s *Scanner) Scan() (arune rune, ok bool)

Scan scans the Scanner source and consumes runes as long as there is a chance to recognize a token (i.e. until the Scanner FSM stops).

If the scanner is starting a Scan at EOF:
    Return 0, false.

If a valid token was recognized:
    If the token's numeric id is >= 0:
        Return id, true.
    If the id is < 0:
        If the Scan has consumed at least one rune:
            Scan restarts, discarding any consumed runes.
        If the Scan has not consumed any rune:
            Scanner is stalled¹. Move on by one rune, return unicode.ReplacementChar, false.

If a valid token was not recognized:
    If the Scanner has not consumed any rune:
        Return the current rune, false.² Move on by one rune.
    If the Scanner has moved by exactly one rune:
        Return that rune, false.²
    If the Scanner has consumed more than one rune:
        Return unicode.ReplacementChar, false.

The actual runes consumed by the last Scan can be retrieved by Token.

If the assigned token ids do not overlap with the otherwise expected runes, e.g. because the ids are in the Unicode private use area, then it is possible to ignore the returned ok value and drive a parser only by the rune/token id value, because any other unsuccessful scan returns either zero (EOF) or unicode.ReplacementChar. This is presumably the easier way for e.g. goyacc.

¹The FSM has stopped in an accepting state without consuming any runes. This is caused by using (re)* or (re)? for tokens with a negative numeric id (i.e. ignored tokens) and is best avoided.

²Intended for processing single rune tokens (e.g. a semicolon) without defining a regexp and token id for them. Examples of such usage can be found in many .y files.
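
The return-value rules listed above can be written down as a single decision function. The sketch below does only that: it takes a description of what one FSM run produced and maps it to the documented result. The outcome struct, its fields and the decide function are hypothetical names invented for this illustration; the real Scan works on the Scanner's internal state and also performs the moves and restarts itself.

```go
package main

import (
	"fmt"
	"unicode"
)

// outcome describes what one FSM run produced. All fields are hypothetical.
type outcome struct {
	atEOF      bool   // the run started at EOF
	recognized bool   // the FSM stopped in an accepting state
	id         int    // token id, meaningful only if recognized
	consumed   []rune // runes consumed by the run
	current    rune   // rune under the cursor when nothing was consumed
}

// decide maps an outcome to the documented Scan result. restart is true
// when an ignored token (id < 0) was consumed and scanning should begin again.
func decide(o outcome) (r rune, ok, restart bool) {
	switch {
	case o.atEOF:
		return 0, false, false
	case o.recognized && o.id >= 0:
		return rune(o.id), true, false
	case o.recognized && len(o.consumed) > 0:
		return 0, false, true // ignored token: discard and rescan
	case o.recognized:
		return unicode.ReplacementChar, false, false // stalled
	case len(o.consumed) == 0:
		return o.current, false, false
	case len(o.consumed) == 1:
		return o.consumed[0], false, false
	default:
		return unicode.ReplacementChar, false, false
	}
}

func main() {
	// A token whose id lies in the private use area, as suggested above.
	r, ok, _ := decide(outcome{recognized: true, id: 0xE000, consumed: []rune("if")})
	fmt.Printf("%U %v\n", r, ok) // U+E000 true

	// An unrecognized single rune such as a semicolon is passed through.
	r, ok, _ = decide(outcome{consumed: []rune{';'}})
	fmt.Printf("%q %v\n", r, ok) // ';' false
}
```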

func (*Scanner) Token

func (s *Scanner) Token() []rune

Token returns the runes consumed by the last Scan. Repeated Scans for ignored tokens (id < 0) are discarded.

func (*Scanner) TokenStart

func (s *Scanner) TokenStart() token.Position

TokenStart returns the starting position of the token returned by last Scan.

func (*Scanner) TopState

func (s *Scanner) TopState() StartSetID

TopState returns the top of the stack without altering the stack's contents.
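
The interplay of Begin, PushState, PopState and TopState can be sketched with a plain slice used as a stack. The type startStack and its lowercase methods are hypothetical and only model the behavior described above; in particular, what the real Scanner does when PopState or TopState is called on an empty stack is not documented here, and this sketch simply panics in that case.

```go
package main

import "fmt"

// startSetID mirrors StartSetID under a hypothetical name.
type startSetID int

// startStack models the current start set plus the stack of saved ones.
type startStack struct {
	current startSetID
	stack   []startSetID
}

// begin switches the current start set.
func (s *startStack) begin(state startSetID) { s.current = state }

// pushState saves the current start set and switches to newState.
func (s *startStack) pushState(newState startSetID) {
	s.stack = append(s.stack, s.current)
	s.begin(newState)
}

// popState removes the top of the stack and switches to it via begin.
// It panics on an empty stack (a simplification of this sketch).
func (s *startStack) popState() {
	n := len(s.stack) - 1
	s.begin(s.stack[n])
	s.stack = s.stack[:n]
}

// topState returns the top of the stack without altering it.
func (s *startStack) topState() startSetID { return s.stack[len(s.stack)-1] }

func main() {
	var s startStack // starts in start set 0
	s.pushState(1)   // e.g. entering a string literal
	s.pushState(2)   // e.g. entering an escape sequence
	fmt.Println(s.current, s.topState()) // 2 1
	s.popState()
	fmt.Println(s.current) // 1
	s.popState()
	fmt.Println(s.current) // 0
}
```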

type ScannerRune

type ScannerRune struct {
	Position token.Position // Starting position of Rune
	Rune     rune           // Rune value
	Size     int            // Rune size
	Err      error          // io.EOF or nil. Any other value invalidates all other fields of a ScannerRune.
}

ScannerRune is a struct holding info about a rune and its origin.

type ScannerSource

type ScannerSource struct {
	// contains filtered or unexported fields
}

ScannerSource is a Source with one ScannerRune look behind and an on demand one ScannerRune lookahead.

func NewScannerSource

func NewScannerSource(fname string, r io.RuneReader) *ScannerSource

NewScannerSource returns a new ScannerSource from a RuneReader having fname. The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.

func (*ScannerSource) Accept

func (s *ScannerSource) Accept(arune rune) bool

Accept checks whether arune matches Current and, if so, calls Move.

func (*ScannerSource) Collect

func (s *ScannerSource) Collect() (arunes []rune)

Collect returns all runes seen by the ScannerSource since the last Collect or CollectString. Either Collect or CollectString can be called, but only one of them, because both clear the collector.

func (*ScannerSource) CollectString

func (s *ScannerSource) CollectString() string

CollectString returns all runes seen by the ScannerSource since the last CollectString or Collect as a string. Either Collect or CollectString can be called, but only one of them, because both clear the collector.

func (*ScannerSource) Current

func (s *ScannerSource) Current() rune

Current returns the current ScannerSource rune. At EOF it's zero.

func (*ScannerSource) CurrentRune

func (s *ScannerSource) CurrentRune() ScannerRune

CurrentRune returns the current ScannerSource ScannerRune.

func (*ScannerSource) Include

func (s *ScannerSource) Include(fname string, r io.RuneReader)

Include includes a RuneReader having fname. Recursive including is not checked. Include discards the one-rune lookahead data, if any. Lookahead data exists iff Next() has been called and Move() has not yet been called afterwards.

func (*ScannerSource) Move

func (s *ScannerSource) Move()

Move moves ScannerSource one rune ahead.

func (*ScannerSource) Next

func (s *ScannerSource) Next() rune

Next returns the ScannerSource next (lookahead) rune. It is zero if next is EOF.

func (*ScannerSource) NextRune

func (s *ScannerSource) NextRune() ScannerRune

NextRune returns the ScannerSource next (lookahead) ScannerRune. Its Rune is zero if next is EOF.

func (*ScannerSource) Position

func (s *ScannerSource) Position() token.Position

Position returns the current ScannerSource position, i.e. after a Move() it returns the position after CurrentRune.

func (*ScannerSource) Prev

func (s *ScannerSource) Prev() rune

Prev returns the previous (look behind) rune. Before the first Move() it is zero.

func (*ScannerSource) PrevRune

func (s *ScannerSource) PrevRune() ScannerRune

PrevRune returns the previous (look behind) ScannerRune. Before the first Move() its Rune is zero and Position.IsValid() == false.
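
The look behind and on-demand lookahead described for ScannerSource can be sketched over any io.RuneReader. The type peekSource and the helper fromString are hypothetical; the sketch tracks plain runes only (no positions, no Include, no collector) and treats any read error as EOF, which is a simplification rather than the package's documented behavior.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// peekSource is a hypothetical, simplified model of ScannerSource.
type peekSource struct {
	r         io.RuneReader
	prev, cur rune
	next      rune
	hasNext   bool // true once the lookahead has been read and not yet consumed
}

// read returns the next rune, or zero on EOF or any other error.
func read(r io.RuneReader) rune {
	c, _, err := r.ReadRune()
	if err != nil {
		return 0
	}
	return c
}

func newPeekSource(r io.RuneReader) *peekSource {
	return &peekSource{r: r, cur: read(r)}
}

// fromString is a convenience constructor for examples.
func fromString(s string) *peekSource { return newPeekSource(strings.NewReader(s)) }

func (s *peekSource) Current() rune { return s.cur }
func (s *peekSource) Prev() rune    { return s.prev }

// Next reads the lookahead on demand and caches it until Move.
func (s *peekSource) Next() rune {
	if !s.hasNext {
		s.next, s.hasNext = read(s.r), true
	}
	return s.next
}

// Move advances by one rune, reusing a cached lookahead if there is one.
func (s *peekSource) Move() {
	s.prev = s.cur
	if s.hasNext {
		s.cur, s.hasNext = s.next, false
		return
	}
	s.cur = read(s.r)
}

// Accept moves past r if it is the current rune.
func (s *peekSource) Accept(r rune) bool {
	if s.cur != r {
		return false
	}
	s.Move()
	return true
}

func main() {
	s := fromString("ab")
	fmt.Printf("%q %q %q\n", s.Prev(), s.Current(), s.Next()) // '\x00' 'a' 'b'
	s.Move()
	fmt.Printf("%q %q %q\n", s.Prev(), s.Current(), s.Next()) // 'a' 'b' '\x00'
}
```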

type Source

type Source struct {
	// contains filtered or unexported fields
}

Source provides a stack of rune streams with position information.

func NewSource

func NewSource(fname string, r io.RuneReader) *Source

NewSource returns a new Source from a RuneReader having fname. The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.

func (*Source) Include

func (s *Source) Include(fname string, r io.RuneReader)

Include includes a RuneReader having fname. Recursive including is not checked.

func (*Source) Position

func (s *Source) Position() token.Position

Position returns the position of the next Read.

func (*Source) Read

func (s *Source) Read() (r ScannerRune)

Read returns the next Source ScannerRune.
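
A stack of rune streams with Include and Read can be modeled as follows. This is a sketch under two assumptions that the documentation above does not state explicitly: that an included stream is read to its end before the including stream resumes, and that an exhausted stream is simply dropped. The type runeStack and its methods are hypothetical, and position tracking is omitted.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// runeStack is a hypothetical model of Source without position information.
type runeStack struct {
	readers []io.RuneReader
}

// include pushes r so that subsequent reads come from it first.
func (s *runeStack) include(r io.RuneReader) { s.readers = append(s.readers, r) }

// includeString is a convenience wrapper for examples.
func (s *runeStack) includeString(text string) { s.include(strings.NewReader(text)) }

// read returns the next rune of the innermost stream. When that stream
// ends (or fails), it is dropped and the including stream resumes.
// Zero means every stream is exhausted.
func (s *runeStack) read() rune {
	for len(s.readers) > 0 {
		top := s.readers[len(s.readers)-1]
		r, _, err := top.ReadRune()
		if err == nil {
			return r
		}
		s.readers = s.readers[:len(s.readers)-1]
	}
	return 0
}

// rest drains the stack and returns everything that was left.
func (s *runeStack) rest() string {
	var b strings.Builder
	for r := s.read(); r != 0; r = s.read() {
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	s := &runeStack{}
	s.includeString("ab")
	first := s.read()     // 'a' from the outer stream
	s.includeString("XY") // included in the middle of the outer stream
	fmt.Println(string(first) + s.rest()) // aXYb
}
```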

type StartSetID

type StartSetID int

StartSetID is the type of a lexer start set identifier. It is used by Begin and PushState.
