goparsify

v0.3.2 Published: Feb 1, 2021 License: MIT Imports: 9 Imported by: 0

Repository

github.com/db48x/goparsify

README ¶

goparsify

A parser-combinator library for building easy to test, read and maintain parsers using functional composition.

Everything should be Unicode-safe by default, but you can opt out of Unicode whitespace handling for a performance boost of roughly 20%:

Run(parser, input, ASCIIWhitespace)
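
To make the trade-off concrete, here is a minimal standard-library sketch of the two strategies. It is not goparsify's implementation: skipASCII and skipUnicode are hypothetical names, and the exact set of ASCII bytes treated as whitespace is an assumption. The point is only that the ASCII version inspects single bytes while the Unicode version has to decode a rune each step.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// skipASCII (hypothetical) returns the offset of the first byte that is not
// one of four common ASCII whitespace bytes. No rune decoding is needed.
func skipASCII(s string, pos int) int {
	for pos < len(s) {
		switch s[pos] {
		case ' ', '\t', '\n', '\r':
			pos++
		default:
			return pos
		}
	}
	return pos
}

// skipUnicode (hypothetical) decodes one rune at a time and skips every rune
// that unicode.IsSpace accepts.
func skipUnicode(s string, pos int) int {
	for pos < len(s) {
		r, size := utf8.DecodeRuneInString(s[pos:])
		if !unicode.IsSpace(r) {
			return pos
		}
		pos += size
	}
	return pos
}

func main() {
	input := " \t\u00a0x" // space, tab, no-break space, then 'x'
	fmt.Println(skipASCII(input, 0))   // stops at the no-break space
	fmt.Println(skipUnicode(input, 0)) // skips the no-break space as well
}
```

If the input is known to contain only ASCII whitespace, both functions stop at the same offset, which is why opting out is safe in that case.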

benchmarks

I don't have many benchmarks set up yet, but it's pretty quick:

$ go test -benchmem -bench=. ./json
BenchmarkUnmarshalParsec-8         20000             74880 ns/op           50846 B/op       1318 allocs/op
BenchmarkUnmarshalParsify-8        30000             50631 ns/op           45055 B/op        233 allocs/op
BenchmarkUnmarshalStdlib-8         30000             46989 ns/op           14210 B/op        260 allocs/op
PASS
ok      github.com/vektah/goparsify/json        6.124s

Most of the remaining small allocs are from putting things in interface{} and are pretty unavoidable. https://www.darkcoding.net/software/go-the-price-of-interface/ is a good read.

debugging parsers

When a parser isn't working as you intended, you can build with debugging and enable logging to get a detailed log of exactly what the parser is doing.

First, build with debug using -tags debug
Then enable logging by calling EnableLogging(os.Stdout) in your code

This works great with tests, e.g. in the goparsify source tree:

adam:goparsify(master)$ go test -tags debug ./html -v
=== RUN   TestParse
html.go:48 | <body>hello <p  | tag {
html.go:43 | <body>hello <p  |   tstart {
html.go:43 | body>hello <p c |     < found <
html.go:20 | >hello <p color |     identifier found body
html.go:33 | >hello <p color |     attrs {
html.go:32 | >hello <p color |       attr {
html.go:20 | >hello <p color |         identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >hello <p color |       } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >hello <p color |     } found
html.go:43 | hello <p color= |     > found >
html.go:43 | hello <p color= |   } found [<,body,,map[string]string{},>]
html.go:24 | hello <p color= |   elements {
html.go:23 | hello <p color= |     element {
html.go:21 | <p color="blue" |       text found hello
html.go:23 | <p color="blue" |     } found "hello "
html.go:23 | <p color="blue" |     element {
html.go:21 | <p color="blue" |       text did not find <>
html.go:48 | <p color="blue" |       tag {
html.go:43 | <p color="blue" |         tstart {
html.go:43 | p color="blue"> |           < found <
html.go:20 |  color="blue">w |           identifier found p
html.go:33 |  color="blue">w |           attrs {
html.go:32 |  color="blue">w |             attr {
html.go:20 | ="blue">world</ |               identifier found color
html.go:32 | "blue">world</p |               = found =
html.go:32 | >world</p></bod |               string literal found "blue"
html.go:32 | >world</p></bod |             } found [color,=,"blue"]
html.go:32 | >world</p></bod |             attr {
html.go:20 | >world</p></bod |               identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >world</p></bod |             } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >world</p></bod |           } found [[color,=,"blue"]]
html.go:43 | world</p></body |           > found >
html.go:43 | world</p></body |         } found [<,p,,map[string]string{"color":"blue"},>]
html.go:24 | world</p></body |         elements {
html.go:23 | world</p></body |           element {
html.go:21 | </p></body>     |             text found world
html.go:23 | </p></body>     |           } found "world"
html.go:23 | </p></body>     |           element {
html.go:21 | </p></body>     |             text did not find <>
html.go:48 | </p></body>     |             tag {
html.go:43 | </p></body>     |               tstart {
html.go:43 | /p></body>      |                 < found <
html.go:20 | /p></body>      |                 identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </p></body>     |               } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </p></body>     |             } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:23 | </p></body>     |           } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </p></body>     |         } found ["world"]
html.go:44 | </p></body>     |         tend {
html.go:44 | p></body>       |           </ found </
html.go:20 | ></body>        |           identifier found p
html.go:44 | </body>         |           > found >
html.go:44 | </body>         |         } found [</,,p,>]
html.go:48 | </body>         |       } found "hello "
html.go:23 | </body>         |     } found html.htmlTag{Name:"p", Attributes:map[string]string{"color":"blue"}, Body:[]interface {}{"world"}}
html.go:23 | </body>         |     element {
html.go:48 | </body>         |       tag {
html.go:43 | </body>         |         tstart {
html.go:43 | /body>          |           < found <
html.go:20 | /body>          |           identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </body>         |         } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </body>         |       } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:21 | </body>         |       text did not find <>
html.go:23 | </body>         |     } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </body>         |   } found ["hello ",html.htmlTag{Name:"p", Attributes:map[string]string{"color":"blue"}, Body:[]interface {}{"world"}}]
html.go:44 | </body>         |   tend {
html.go:44 | body>           |     </ found </
html.go:20 | >               |     identifier found body
html.go:44 |                 |     > found >
html.go:44 |                 |   } found [</,,body,>]
html.go:48 |                 | } found [[<,body,,map[string]string{},>],,[]interface {}{"hello ", html.htmlTag{Name:"p", Attributes:map[string]string{"color":"blue"}, Body:[]interface {}{"world"}}},[</,,body,>]]
--- PASS: TestParse (0.00s)
PASS
ok      github.com/vektah/goparsify/html        0.117s

debugging performance

If you build the parser with -tags debug it will instrument each parser and a call to DumpDebugStats() will show stats:

var name	matches	total time	self time	calls	errors	location
_value	Any()	5.0685431s	34.0131ms	878801	0	json.go:36
_object	Seq()	3.7513821s	10.5038ms	161616	40403	json.go:24
_properties	Some()	3.6863512s	5.5028ms	121213	0	json.go:14
_properties	Seq()	3.4912614s	46.0229ms	818185	0	json.go:14
_array	Seq()	931.4679ms	3.5014ms	65660	55558	json.go:16
_array	Some()	911.4597ms	0s	10102	0	json.go:16
_properties	string literal	126.0662ms	44.5201ms	818185	0	json.go:14
_string	string literal	67.033ms	26.0126ms	671723	136369	json.go:12
_properties	:	50.0238ms	45.0205ms	818185	0	json.go:14
_properties	,	48.5189ms	36.0146ms	818185	121213	json.go:14
_number	number literal	28.5159ms	10.5062ms	287886	106066	json.go:13
_true	true	17.5086ms	12.5069ms	252537	232332	json.go:10
_null	null	14.5082ms	11.007ms	252538	252535	json.go:9
_object	}	10.5051ms	10.5033ms	121213	0	json.go:24
_false	false	10.5049ms	5.0019ms	232333	222229	json.go:11
_object	{	10.0046ms	5.0052ms	161616	40403	json.go:24
_array	,	4.5024ms	4.0018ms	50509	10102	json.go:16
_array	[	4.5014ms	2.0006ms	65660	55558	json.go:16
_array	]	0s	0s	10102	0	json.go:16

All times are cumulative; it would be nice to break this down into a parse tree with relative times. This is a nice addition to pprof, as it breaks down the parsers based on where they are used instead of grouping them all by type.

The instrumentation costs nothing when the debug tag isn't used.

example calculator

Let's say we wanted to build a calculator that could take an expression and calculate the result.

Let's start with a test:

func TestNumbers(t *testing.T) {
	result, err := Calc(`1`)
	require.NoError(t, err)
	require.EqualValues(t, 1, result)
}

Then define a parser for numbers:

var number = NumberLit().Map(func(n *Result) {
	// NumberLit yields int64 or float64; normalise both to float64.
	switch i := n.Result.(type) {
	case int64:
		n.Result = float64(i)
	case float64:
		n.Result = i
	default:
		panic(fmt.Errorf("unknown value %#v", i))
	}
})

func Calc(input string) (float64, error) {
	result, err := Run(number, input)
	if err != nil {
		return 0, err
	}

	return result.(float64), nil
}

NumberLit returns numbers as either int64 or float64 depending on the literal. For this calculator we only want floats, so we Map the result and convert it to float64.

Run the tests and make sure everything is ok.

Time to add addition

func TestAddition(t *testing.T) {
	result, err := Calc(`1+1`)
	require.NoError(t, err)
	require.EqualValues(t, 2, result)
}


var sumOp = Chars("+-", 1, 1)

var sum = Seq(number, Some(Seq(sumOp, number))).Map(func(n *Result) {
	// Start from the first number, then apply each (operator, number) pair.
	i := n.Child[0].Result.(float64)

	for _, op := range n.Child[1].Child {
		switch op.Child[0].Token {
		case "+":
			i += op.Child[1].Result.(float64)
		case "-":
			i -= op.Child[1].Result.(float64)
		}
	}

	n.Result = i
})

// and update Calc to point to the new root parser -> `result, err := Run(sum, input)`

This parser will match number ([+-] number)*, then map its result to be the sum. See how the Child entries map directly to the positions in the parsers? n is the result of the outer Seq, n.Child[0] is its first argument, n.Child[1] is the result of the Some parser, n.Child[1].Child[0] is the result of the first inner Seq, and so forth. Given how closely tied the parser and the Map are, it is good to keep the two together.
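
The positional relationship between the grammar and the result tree can be tried out without the library. The sketch below is an illustration under assumptions: it declares a local Result type with only the Token, Child and Result fields, builds by hand the tree that a number followed by operator/number pairs would plausibly produce for 1+2-3, and folds it with the same indexing as the Map callback. The helpers num, opPair and foldSum are hypothetical.

```go
package main

import "fmt"

// Result is a local stand-in with the three fields the calculator uses.
type Result struct {
	Token  string
	Child  []Result
	Result interface{}
}

// num builds a leaf that carries a float64 value.
func num(v float64) Result { return Result{Result: v} }

// opPair builds the node for one (operator, number) pair.
func opPair(tok string, v float64) Result {
	return Result{Child: []Result{{Token: tok}, num(v)}}
}

// foldSum indexes the tree positionally: Child[0] is the first number,
// Child[1].Child holds the repeated (operator, number) pairs.
func foldSum(n Result) float64 {
	total := n.Child[0].Result.(float64)
	for _, op := range n.Child[1].Child {
		switch op.Child[0].Token {
		case "+":
			total += op.Child[1].Result.(float64)
		case "-":
			total -= op.Child[1].Result.(float64)
		}
	}
	return total
}

func main() {
	// Hand-built tree for the input 1+2-3.
	tree := Result{Child: []Result{
		num(1),
		{Child: []Result{opPair("+", 2), opPair("-", 3)}},
	}}
	fmt.Println(foldSum(tree))
}
```

Reordering the arguments of the sequence would silently change what each Child index means, which is the practical reason for keeping the parser and its Map side by side.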

You can continue like this and add multiplication and parentheses fairly easily. Eventually, if you keep adding parsers, you will end up with a loop, and Go will give you a handy error message like:

typechecking loop involving value = goparsify.Any(number, groupExpr)

We need to break the loop by referring to the parser through a pointer, then setting its value in init:

var (
	value Parser
	prod  = Seq(&value, Some(Seq(prodOp, &value)))
)

func init() {
	value = Any(number, groupExpr)
}
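
The same trick works in plain Go, independent of this library. The following self-contained sketch uses a hypothetical miniature Parser type (a function returning the number of bytes matched, or -1) and hypothetical helpers lit, seq, maybe and ref. The recursive variable is declared without an initializer, other parsers capture its address, and init assigns it once everything else exists.

```go
package main

import (
	"fmt"
	"strings"
)

// Parser (hypothetical) returns the number of leading bytes matched, or -1.
type Parser func(input string) int

// lit matches a literal prefix.
func lit(s string) Parser {
	return func(in string) int {
		if strings.HasPrefix(in, s) {
			return len(s)
		}
		return -1
	}
}

// ref looks up *p at parse time, so p may be assigned after ref is called.
func ref(p *Parser) Parser {
	return func(in string) int { return (*p)(in) }
}

// seq runs the parsers one after another and fails if any of them fails.
func seq(ps ...Parser) Parser {
	return func(in string) int {
		n := 0
		for _, p := range ps {
			m := p(in[n:])
			if m < 0 {
				return -1
			}
			n += m
		}
		return n
	}
}

// maybe matches zero or one occurrence of p.
func maybe(p Parser) Parser {
	return func(in string) int {
		if m := p(in); m >= 0 {
			return m
		}
		return 0
	}
}

var (
	group  Parser // no initializer, so there is no initialization cycle
	nested = seq(lit("("), maybe(ref(&group)), lit(")"))
)

func init() {
	group = nested // close the loop once all package variables exist
}

func main() {
	fmt.Println(group("(())"), group("(()"))
}
```

Writing group directly inside its own initializer would be rejected by the compiler, because the variable would depend on itself.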

Take a look at calc for a full example.

preventing backtracking with cuts

A cut is a marker that prevents backtracking past the point it was set. This greatly improves error messages when used correctly:

alpha := Chars("a-z")

// without a cut if the close tag is left out the parser will backtrack and ignore the rest of the string
nocut := Many(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: left unparsed: <foo

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := Many(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: offset 9: expected >
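
For readers curious how a cut can work mechanically, below is a small standard-library sketch. It is an assumption-laden illustration, not goparsify's code: the state type and the helpers lit, lower, cut, seq, anyOf, many and run are all hypothetical, there is no automatic whitespace skipping (so the input has no space), and many accepts zero repetitions. The idea shown is that a choice or repetition only moves on after a failure if no cut was recorded past its starting offset.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// state is a hypothetical, stripped-down parse state.
type state struct {
	in   string
	pos  int
	cut  int    // no backtracking to before this offset
	fail string // non-empty when the last parser failed
}

type parser func(*state)

func lit(s string) parser {
	return func(st *state) {
		if strings.HasPrefix(st.in[st.pos:], s) {
			st.pos += len(s)
			return
		}
		st.fail = fmt.Sprintf("offset %d: expected %s", st.pos, s)
	}
}

// lower matches one or more bytes in a-z.
func lower() parser {
	return func(st *state) {
		n := 0
		for st.pos+n < len(st.in) && st.in[st.pos+n] >= 'a' && st.in[st.pos+n] <= 'z' {
			n++
		}
		if n == 0 {
			st.fail = fmt.Sprintf("offset %d: expected a-z", st.pos)
			return
		}
		st.pos += n
	}
}

// cut records the current offset as the point of no return.
func cut() parser { return func(st *state) { st.cut = st.pos } }

func seq(ps ...parser) parser {
	return func(st *state) {
		start := st.pos
		for _, p := range ps {
			p(st)
			if st.fail != "" {
				st.pos = start // a failing parser must not move pos
				return
			}
		}
	}
}

// anyOf tries alternatives in order, unless a cut forbids backtracking.
func anyOf(ps ...parser) parser {
	return func(st *state) {
		start := st.pos
		for _, p := range ps {
			st.fail = ""
			p(st)
			if st.fail == "" || st.cut > start {
				return
			}
		}
	}
}

// many repeats p until it fails; a failure after a cut is propagated.
func many(p parser) parser {
	return func(st *state) {
		for {
			start := st.pos
			p(st)
			if st.fail != "" {
				if st.cut <= start {
					st.fail = "" // ordinary end of the repetition
				}
				return
			}
			if st.pos == start {
				return // guard against parsers that consume nothing
			}
		}
	}
}

func run(p parser, in string) error {
	st := &state{in: in}
	p(st)
	if st.fail != "" {
		return errors.New(st.fail)
	}
	if st.pos < len(in) {
		return fmt.Errorf("left unparsed: %s", in[st.pos:])
	}
	return nil
}

func main() {
	nocut := many(anyOf(seq(lit("<"), lower(), lit(">")), lower()))
	fmt.Println(run(nocut, "asdf<foo")) // left unparsed: <foo

	withCut := many(anyOf(seq(lit("<"), cut(), lower(), lit(">")), lower()))
	fmt.Println(run(withCut, "asdf<foo")) // offset 8: expected >
}
```

The offset differs from the library example only because this sketch's input omits the space.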

prior art

Inspired by https://github.com/prataprc/goparsec

Documentation ¶

Index ¶

Variables
func ASCIIWhitespace(s *State)
func DisableLogging()
func DumpDebugStats()
func EnableLogging(w io.Writer)
func IsValidRegexpDelimiter(r rune) (bool, rune)
func NoWhitespace(s *State)
func Run(parser Parserish, input string, ws ...VoidParser) (result interface{}, err error)
func UnicodeWhitespace(s *State)
type Error
- func (e *Error) Error() string
- func (e *Error) LocateError(s string) string
- func (e *Error) Pos() int
type Parser
- func Any(parsers ...Parserish) Parser
- func Bind(parser Parserish, val interface{}) Parser
- func Chars(matcher string, repetition ...int) Parser
- func CustomRegexpMatchLiteral(isValid func(rune) (bool, rune), escapes map[rune]rune) Parser
- func CustomRegexpReplaceLiteral(isValid func(rune) (bool, rune), escapes map[rune]rune) Parser
- func CustomStringLiteral(isValid func(rune) (bool, rune), escapes map[rune]rune) Parser
- func Cut() Parser
- func Exact(match string) Parser
- func Many(parser Parserish, separator ...Parserish) Parser
- func Map(parser Parserish, f func(n *Result)) Parser
- func Maybe(parser Parserish) Parser
- func Merge(parser Parserish) Parser
- func NewParser(description string, p Parser) Parser
- func NoAutoWS(parser Parserish) Parser
- func NotChars(matcher string, repetition ...int) Parser
- func NumberLit() Parser
- func Parsify(p Parserish) Parser
- func ParsifyAll(parsers ...Parserish) []Parser
- func Regex(pattern string) Parser
- func Seq(parsers ...Parserish) Parser
- func Some(parser Parserish, separator ...Parserish) Parser
- func StringLit(allowedQuotes string) Parser
- func UnicodeRegexpMatchLiteral() Parser
- func UnicodeRegexpReplaceLiteral() Parser
- func UnicodeStringLiteral() Parser
- func Until(terminators ...string) Parser
- func (p Parser) Map(f func(n *Result)) Parser
type Parserish
type Result
- func (r Result) String() string
type State
- func NewState(input string) *State
- func (s *State) Advance(i int)
- func (s *State) ErrorHere(expected string)
- func (s *State) Errored() bool
- func (s *State) Get() string
- func (s *State) Preview(x int) string
- func (s *State) Recover()
type UnparsedInputError
- func (e UnparsedInputError) Error() string
type VoidParser

Examples ¶

Cut

Variables ¶

var TrashResult = &Result{}

TrashResult is used in places where the result isn't wanted, but something needs to be passed in to satisfy the interface.

Functions ¶

func ASCIIWhitespace ¶

func ASCIIWhitespace(s *State)

ASCIIWhitespace matches any of the standard whitespace characters. It is faster than the UnicodeWhitespace parser as it does not need to decode unicode runes.

func DisableLogging ¶

func DisableLogging()

DisableLogging will stop writing logs

func DumpDebugStats ¶

func DumpDebugStats()

DumpDebugStats will print out the current timings for each parser if built with -tags debug.

func EnableLogging ¶

func EnableLogging(w io.Writer)

EnableLogging will write logs to the given writer as the next parse happens

func IsValidRegexpDelimiter ¶

func IsValidRegexpDelimiter(r rune) (bool, rune)

IsValidRegexpDelimiter accepts quote characters taken from the set of Unicode punctuation characters, plus angle brackets (which are actually math symbols). It ensures that the closing quote character is symmetrically paired with the opening character where possible.

func NoWhitespace ¶

func NoWhitespace(s *State)

NoWhitespace disables automatic whitespace matching

func Run ¶

func Run(parser Parserish, input string, ws ...VoidParser) (result interface{}, err error)

Run applies some input to a parser and returns the result, failing if the input isn't fully consumed. It is a convenience function for the most common way to invoke a parser.

func UnicodeWhitespace ¶

func UnicodeWhitespace(s *State)

UnicodeWhitespace matches any Unicode space character. It's a little slower than the ASCII parser because it matches a rune at a time.

Types ¶

type Error ¶

type Error struct {
	// contains filtered or unexported fields
}

Error represents a parse error. Errors are set often during a normal parse: the parser will back up a little and find another viable path. In general, when combining errors, the longest error should be returned.

func (*Error) Error ¶

func (e *Error) Error() string

Error satisfies the golang error interface

func (*Error) LocateError ¶

func (e *Error) LocateError(s string) string

LocateError locates the error position in the input string s and returns the error description along with a cursor to the input.

func (*Error) Pos ¶

func (e *Error) Pos() int

Pos is the offset into the document at which the error was found.

type Parser ¶

type Parser func(*State, *Result)

Parser is the workhorse of parsify. A parser takes a State and returns a result, consuming some of the State in the process. Because the State is shared, there are a few rules that should be followed:

A parser that errors must set state.Error
A parser that errors must not change state.Pos
A parser that consumed some input should advance state.Pos
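
The three rules can be illustrated with a hand-written parser. The sketch below is self-contained and uses local stand-in types rather than the library's: State here has only Input, Pos and a plain string Err (an assumption standing in for state.Error), and Result has only Token. The digits parser is hypothetical.

```go
package main

import "fmt"

// State is a local stand-in; Err plays the role of state.Error.
type State struct {
	Input string
	Pos   int
	Err   string
}

// Result is a local stand-in with just a Token field.
type Result struct {
	Token string
}

// digits matches one or more ASCII digits and follows the three rules.
func digits(s *State, r *Result) {
	end := s.Pos
	for end < len(s.Input) && s.Input[end] >= '0' && s.Input[end] <= '9' {
		end++
	}
	if end == s.Pos {
		// Rule 1: record the error. Rule 2: leave Pos untouched.
		s.Err = "expected digit"
		return
	}
	r.Token = s.Input[s.Pos:end]
	s.Pos = end // Rule 3: advance past the consumed input.
}

func main() {
	for _, in := range []string{"42abc", "abc"} {
		s, r := &State{Input: in}, &Result{}
		digits(s, r)
		fmt.Printf("token=%q pos=%d err=%q\n", r.Token, s.Pos, s.Err)
	}
}
```

Leaving Pos untouched on failure is what lets an enclosing combinator try another alternative from the same offset without any bookkeeping of its own.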

func Any ¶

func Any(parsers ...Parserish) Parser

Any matches the first successful parser and returns its result

func Bind ¶

func Bind(parser Parserish, val interface{}) Parser

Bind will set the node .Result when the given parser matches. This is useful for giving a value to keywords and constant literals like true and false. See the json parser for an example.

func Chars ¶

func Chars(matcher string, repetition ...int) Parser

Chars is the swiss army knife of character matches. It can match:

ranges: Chars("a-z") will match one or more lowercase letters
alphabets: Chars("abcd") will match one or more of the letters abcd in any order
min and max: Chars("a-z0-9", 4, 6) will match 4-6 lowercase alphanumeric characters

The above can be combined in any order.
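
To show how a matcher string such as "a-z0-9" can be interpreted, here is a standard-library sketch. It is a guess at the semantics described above, not the library's implementation: expand and matchChars are hypothetical, and passing -1 as the upper bound to mean "unbounded" is a convention invented for this example.

```go
package main

import "fmt"

// expand turns a matcher like "a-c_" into the set {a, b, c, _}.
// A '-' counts as a range only when it sits between two characters.
func expand(matcher string) map[rune]bool {
	rs := []rune(matcher)
	set := map[rune]bool{}
	for i := 0; i < len(rs); i++ {
		if i+2 < len(rs) && rs[i+1] == '-' {
			for r := rs[i]; r <= rs[i+2]; r++ {
				set[r] = true
			}
			i += 2
			continue
		}
		set[rs[i]] = true
	}
	return set
}

// matchChars returns the longest prefix of input made of characters from
// matcher, taking at most hi characters (hi < 0 means unbounded) and
// failing when fewer than lo characters match.
func matchChars(input, matcher string, lo, hi int) (string, bool) {
	set := expand(matcher)
	n, end := 0, len(input)
	for i, r := range input {
		if n == hi || !set[r] {
			end = i
			break
		}
		n++
	}
	if n < lo {
		return "", false
	}
	return input[:end], true
}

func main() {
	fmt.Println(matchChars("abc123!", "a-z0-9", 4, 6))
	fmt.Println(matchChars("ab", "a-z", 4, 6))
	fmt.Println(matchChars("+1", "+-", 1, 1))
}
```

Note that in this sketch a trailing '-' as in "+-" is treated as a literal character, which is what the calculator's sumOp relies on.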

func CustomRegexpMatchLiteral ¶

func CustomRegexpMatchLiteral(isValid func(rune) (bool, rune), escapes map[rune]rune) Parser

func CustomRegexpReplaceLiteral ¶

func CustomRegexpReplaceLiteral(isValid func(rune) (bool, rune), escapes map[rune]rune) Parser

func CustomStringLiteral ¶

func CustomStringLiteral(isValid func(rune) (bool, rune), escapes map[rune]rune) Parser

CustomStringLiteral matches a quoted string and returns it in .Token. It may contain:

unicode
escaped characters, eg \", \n, \t
unicode sequences, eg \uBEEF

The opening and closing quotes are validated by the isValid function you pass in. This function should return true if its argument is a valid opening quote character, plus the correct closing quote character that ends the string. See IsValidRegexpDelimiter.

The only valid escape characters are those defined in the escapes argument, plus one for the closer returned by isValid.

func Cut ¶

func Cut() Parser

Cut prevents backtracking beyond this point. Usually used after keywords when you are sure this is the correct path. Improves performance and error reporting.

Example ¶

// without a cut if the close tag is left out the parser will backtrack and ignore the rest of the string
alpha := Chars("a-z")
nocut := Many(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := Many(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())

Output:

left unparsed: <foo
offset 9: expected >

func Exact ¶

func Exact(match string) Parser

Exact will fully match the exact string supplied, or error. The match will be stored in .Token

func Many ¶

func Many(parser Parserish, separator ...Parserish) Parser

Many matches one or more parsers and returns the values as .Child[n]. An optional separator can be provided; it will be consumed but not returned. Only one separator can be provided.

func Map ¶

func Map(parser Parserish, f func(n *Result)) Parser

Map applies the callback if the parser matches. This is used to set the Result based on the matched result.

func Maybe ¶

func Maybe(parser Parserish) Parser

Maybe will match 0 or 1 of the parser.

func Merge ¶

func Merge(parser Parserish) Parser

Merge all child Tokens together recursively

func NewParser ¶

func NewParser(description string, p Parser) Parser

NewParser should be called around the creation of every Parser. It does nothing normally and should incur no runtime overhead, but when building with -tags debug it will instrument every parser to collect valuable timing information displayable with DumpDebugStats.
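
The general idea of wrapping a parser to collect statistics can be sketched with the standard library alone. Everything below is hypothetical (the simplified Parser type, stats, registry and instrument), and it always instruments, whereas the library is described as doing so only under the debug build tag.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Parser is a simplified stand-in that reports whether the input matched.
type Parser func(input string) bool

// stats holds the counters kept for one instrumented parser.
type stats struct {
	calls, errors int
	total         time.Duration
}

// registry maps a description to the stats collected for that parser.
var registry = map[string]*stats{}

// instrument wraps p so that every call updates its stats entry.
func instrument(description string, p Parser) Parser {
	st := &stats{}
	registry[description] = st
	return func(in string) bool {
		start := time.Now()
		ok := p(in)
		st.total += time.Since(start)
		st.calls++
		if !ok {
			st.errors++
		}
		return ok
	}
}

func main() {
	isTrue := instrument("true", func(in string) bool {
		return strings.HasPrefix(in, "true")
	})
	isTrue("true")
	isTrue("false")
	st := registry["true"]
	fmt.Println(st.calls, st.errors) // 2 1
}
```

Keying the statistics by description rather than by parser type is what allows a report to be broken down by where each parser is used.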

func NoAutoWS ¶

func NoAutoWS(parser Parserish) Parser

NoAutoWS disables automatically ignoring whitespace between tokens for all parsers underneath

func NotChars ¶

func NotChars(matcher string, repetition ...int) Parser

NotChars accepts the full range of input from Chars, but it will stop when any character matches. If you need to match until you see a sequence use Until instead

func NumberLit ¶

func NumberLit() Parser

NumberLit matches a floating point or integer number and returns it as a int64 or float64 in .Result

func Parsify ¶

func Parsify(p Parserish) Parser

Parsify takes a Parserish and makes a Parser out of it. It should be called by any Parser that accepts a Parser as an argument. It should never be called during parsing; instead, call it during parser creation so there is no runtime cost.

See Parserish for details.

func ParsifyAll ¶

func ParsifyAll(parsers ...Parserish) []Parser

ParsifyAll calls Parsify on all parsers

func Regex ¶

func Regex(pattern string) Parser

Regex returns a match if the regex successfully matches

func Seq ¶

func Seq(parsers ...Parserish) Parser

Seq matches all of the given parsers in order and returns their result as .Child[n]

func Some ¶

func Some(parser Parserish, separator ...Parserish) Parser

Some matches zero or more parsers and returns the values as .Child[n]. An optional separator can be provided; it will be consumed but not returned. Only one separator can be provided.

func StringLit ¶

func StringLit(allowedQuotes string) Parser

StringLit matches a quoted string and returns it in .Token. It may contain:

unicode
escaped characters, eg \", \n, \t
unicode sequences, eg \uBEEF

allowedQuotes is the list of allowed quote characters; both the opening and closing quotes will be the same character from this string

func UnicodeRegexpMatchLiteral ¶

func UnicodeRegexpMatchLiteral() Parser

func UnicodeRegexpReplaceLiteral ¶

func UnicodeRegexpReplaceLiteral() Parser

func UnicodeStringLiteral ¶

func UnicodeStringLiteral() Parser

UnicodeStringLiteral matches a quoted string and returns it in .Token. It may contain:

unicode
escaped characters, eg \", \n, \t
unicode sequences, eg \uBEEF

The opening and closing quote characters may be any matched pair of Unicode characters from the Pi/Pf categories or from the Ps/Pe categories, plus angle brackets. They may also be any other punctuation character, as long as the opening and closing quotes are the same character.

func Until ¶

func Until(terminators ...string) Parser

Until will consume all input until one of the given terminator sequences is found. If you want to stop when seeing single characters see NotChars instead
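
The difference between stopping at a sequence and stopping at a single character can be shown with the strings package. This is only an analogy under assumptions: until and notChars below are hypothetical helpers, and what they do when no terminator is present (return the whole input) is a choice made for this sketch, not a statement about the library.

```go
package main

import (
	"fmt"
	"strings"
)

// until returns the input up to the earliest occurrence of any terminator
// sequence, or the whole input if none occurs.
func until(input string, terminators ...string) string {
	end := len(input)
	for _, t := range terminators {
		if i := strings.Index(input, t); i >= 0 && i < end {
			end = i
		}
	}
	return input[:end]
}

// notChars returns the input up to the first occurrence of any single
// character in chars, or the whole input if none occurs.
func notChars(input, chars string) string {
	if i := strings.IndexAny(input, chars); i >= 0 {
		return input[:i]
	}
	return input
}

func main() {
	in := "a - b --> c"
	fmt.Printf("%q\n", until(in, "-->"))   // stops only at the full arrow
	fmt.Printf("%q\n", notChars(in, "->")) // stops at the first '-' or '>'
}
```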

func (Parser) Map ¶

func (p Parser) Map(f func(n *Result)) Parser

Map is shorthand for Map(p, f).

type Parserish ¶

type Parserish interface{}

Parserish types are any type that can be turned into a Parser by Parsify. These currently include *Parser and string literals.

This makes recursive grammars cleaner and allows string literals to be used directly in most contexts, e.g. matching balanced parentheses:

var group Parser
group = Seq("(", Maybe(&group), ")")

vs

var group ParserPtr
group.P = Seq(Exact("("), Maybe(group.Parse), Exact(")"))

type Result ¶

type Result struct {
	Token  string
	Child  []Result
	Result interface{}
	Input  string
	Start  int
	End    int
}

Result is the output of a parser. Usually only one of its fields will be set, so it should be thought of more as a union type. Having it avoids interface{} being littered all through the parsing code and makes it easy to do the two most common operations: getting a token and finding a child.

func (Result) String ¶

func (r Result) String() string

String stringifies a node. This is only called from debug code.

type State ¶

type State struct {
	// The full input string
	Input string
	// An offset into the string, pointing to the current tip
	Pos int
	// Do not backtrack past this point
	Cut int
	// Error is a secondary return channel from parsers, but used so heavily
	// in backtracking that it has been inlined to avoid allocations.
	Error Error
	// Called to determine what to ignore when WS is called, or when WS fires
	WS VoidParser
}

State is the current parse state. It is entirely public because parsers are expected to mutate it during the parse.

func NewState ¶

func NewState(input string) *State

NewState creates a new State from a string

func (*State) Advance ¶

func (s *State) Advance(i int)

Advance the Pos along by i bytes

func (*State) ErrorHere ¶

func (s *State) ErrorHere(expected string)

ErrorHere raises an error at the current position.

func (*State) Errored ¶

func (s *State) Errored() bool

Errored returns true if the current parser has failed.

func (*State) Get ¶

func (s *State) Get() string

Get the remaining input.

func (*State) Preview ¶

func (s *State) Preview(x int) string

Preview returns the next x characters.

func (*State) Recover ¶

func (s *State) Recover()

Recover from the current error. Often called by combinators that can match when one of their children succeed, but others have failed.

type UnparsedInputError ¶

type UnparsedInputError struct {
	Remaining string
}

UnparsedInputError is returned by Run when not all of the input was consumed. There may still be a valid result.

func (UnparsedInputError) Error ¶

func (e UnparsedInputError) Error() string

Error satisfies the golang error interface

type VoidParser ¶

type VoidParser func(*State)

VoidParser is a special type of parser that never returns anything but can still consume input

Directories ¶

Path	Synopsis
calc
debug
html
json
profile
