goparsify

package module

v0.5.0 Latest Latest Go to latest Published: Apr 12, 2023 License: MIT Imports: 9 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/ajitid/goparsify

Links

Open Source Insights

README ¶

Goparsify

Here's what this fork does:

Modernizes dep and build management by pulling in changes from https://github.com/damienstanton/goparsify
Fixes few issues by pulling in changes from https://github.com/oec/goparsify
Renames Many/Some to comprehensible terms: ZeroOrMore and OneOrMore
Introduces a new combinator called Chain

A parser-combinator library for building easy to test, read and maintain parsers using functional composition.

Everything should be unicode safe by default, but you can opt out of unicode whitespace for a decent ~20% performance boost.

Run(parser, input, ASCIIWhitespace)

Benchmarks

$ make bench
goos: linux
goarch: amd64
pkg: github.com/ajitid/goparsify/json
BenchmarkUnmarshalParsec-16     	  10000	   133884 ns/op	  49378 B/op	   1307 allocs/op
BenchmarkUnmarshalParsify-16    	  10000	   100330 ns/op	  68972 B/op	    480 allocs/op
BenchmarkUnmarshalStdlib-16     	  14707	    81414 ns/op	  13427 B/op	    251 allocs/op
PASS
ok  	github.com/ajitid/goparsify/json	4.389s

Most of the remaining small allocs are from putting things in interface{} and are pretty unavoidable. https://www.darkcoding.net/software/go-the-price-of-interface/ is a good read.

Debugging parsers

When a parser isnt working as you intended you can build with debugging and enable logging to get a detailed log of exactly what the parser is doing.

First build with debug using -tags debug:

$ go build -tags debug

Then, enable logging by calling EnableLogging(os.Stdout) in your code. This works great with tests, eg in the goparsify source tree:

$ go test -tags debug ./html -v
=== RUN   TestParse
html.go:48 | <body>hello <p  | tag {
html.go:43 | <body>hello <p  |   tstart {
html.go:43 | body>hello <p c |     < found "<"
html.go:20 | >hello <p color |     identifier found "body"
html.go:33 | >hello <p color |     attrs {
html.go:32 | >hello <p color |       attr {
html.go:20 | >hello <p color |         identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >hello <p color |       } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >hello <p color |     } found ""
html.go:43 | hello <p color= |     > found ">"
html.go:43 | hello <p color= |   } found "[<,body,,map[string
html.go:24 | hello <p color= |   elements {
html.go:23 | hello <p color= |     element {
html.go:21 | <p color=\"blue |       text found "hello "
html.go:23 | <p color=\"blue |     } found "\"hello \""
html.go:23 | <p color=\"blue |     element {
html.go:21 | <p color=\"blue |       text did not find <>
html.go:48 | <p color=\"blue |       tag {
html.go:43 | <p color=\"blue |         tstart {
html.go:43 | p color=\"blue\ |           < found "<"
html.go:20 |  color=\"blue\" |           identifier found "p"
html.go:33 |  color=\"blue\" |           attrs {
html.go:32 |  color=\"blue\" |             attr {
html.go:20 | =\"blue\">world |               identifier found "color"
html.go:32 | \"blue\">world< |               = found "="
html.go:32 | >world</p></bod |               string literal found "blue"
html.go:32 | >world</p></bod |             } found "[color,=,blue]"
html.go:32 | >world</p></bod |             attr {
html.go:20 | >world</p></bod |               identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >world</p></bod |             } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >world</p></bod |           } found "[[color,=,blue]]"
html.go:43 | world</p></body |           > found ">"
html.go:43 | world</p></body |         } found "[<,p,,map[string]st
html.go:24 | world</p></body |         elements {
html.go:23 | world</p></body |           element {
html.go:21 | </p></body>     |             text found "world"
html.go:23 | </p></body>     |           } found "\"world\""
html.go:23 | </p></body>     |           element {
html.go:21 | </p></body>     |             text did not find <>
html.go:48 | </p></body>     |             tag {
html.go:43 | </p></body>     |               tstart {
html.go:43 | /p></body>      |                 < found "<"
html.go:20 | /p></body>      |                 identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </p></body>     |               } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </p></body>     |             } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:23 | </p></body>     |           } did not find <> or [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </p></body>     |         } found "[\"world\"]"
html.go:44 | </p></body>     |         tend {
html.go:44 | p></body>       |           </ found "</"
html.go:20 | ></body>        |           identifier found "p"
html.go:44 | </body>         |           > found ">"
html.go:44 | </body>         |         } found "[</,,p,>]"
html.go:48 | </body>         |       } found "[[<,p,,map[string]s
html.go:23 | </body>         |     } found "html.htmlTag{Name:\
html.go:23 | </body>         |     element {
html.go:21 | </body>         |       text did not find <>
html.go:48 | </body>         |       tag {
html.go:43 | </body>         |         tstart {
html.go:43 | /body>          |           < found "<"
html.go:20 | /body>          |           identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </body>         |         } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </body>         |       } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:23 | </body>         |     } did not find <> or [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </body>         |   } found "[\"hello \",html.ht
html.go:44 | </body>         |   tend {
html.go:44 | body>           |     </ found "</"
html.go:20 | >               |     identifier found "body"
html.go:44 |                 |     > found ">"
html.go:44 |                 |   } found "[</,,body,>]"
html.go:48 |                 | } found "[[<,body,,map[strin
--- PASS: TestParse (0.00s)
PASS
ok  	github.com/ajitid/goparsify/html	(cached)

Debugging performance

If you build the parser with -tags debug it will instrument each parser and a call to DumpDebugStats() will show stats:

var name	matches	total time	self time	calls	errors	location
_value	Any()	5.0685431s	34.0131ms	878801	0	json.go:36
_object	Seq()	3.7513821s	10.5038ms	161616	40403	json.go:24
_properties	ZeroOrMore()	3.6863512s	5.5028ms	121213	0	json.go:14
_properties	Seq()	3.4912614s	46.0229ms	818185	0	json.go:14
_array	Seq()	931.4679ms	3.5014ms	65660	55558	json.go:16
_array	ZeroOrMore()	911.4597ms	0s	10102	0	json.go:16
_properties	string literal	126.0662ms	44.5201ms	818185	0	json.go:14
_string	string literal	67.033ms	26.0126ms	671723	136369	json.go:12
_properties	:	50.0238ms	45.0205ms	818185	0	json.go:14
_properties	,	48.5189ms	36.0146ms	818185	121213	json.go:14
_number	number literal	28.5159ms	10.5062ms	287886	106066	json.go:13
_true	true	17.5086ms	12.5069ms	252537	232332	json.go:10
_null	null	14.5082ms	11.007ms	252538	252535	json.go:9
_object	}	10.5051ms	10.5033ms	121213	0	json.go:24
_false	false	10.5049ms	5.0019ms	232333	222229	json.go:11
_object	{	10.0046ms	5.0052ms	161616	40403	json.go:24
_array	,	4.5024ms	4.0018ms	50509	10102	json.go:16
_array	[	4.5014ms	2.0006ms	65660	55558	json.go:16
_array	]	0s	0s	10102	0	json.go:16

All times are cumulative, it would be nice to break this down into a parse tree with relative times. This is a nice addition to pprof as it will break down the parsers based on where they are used instead of grouping them all by type.

This is free when the debug tag isnt used.

Example calculator

Lets say we wanted to build a calculator that could take an expression and calculate the result.

Lets start with test:

func TestNumbers(t *testing.T) {
    result, err := Calc(`1`)
    require.NoError(t, err)
    require.EqualValues(t, 1, result)
}

Then define a parser for numbers

var number = NumberLit().Map(func(n Result) Result {
    switch i := n.Result.(type) {
    case int64:
        return Result{Result: float64(i)}
    case float64:
        return Result{Result: i}
    default:
        panic(fmt.Errorf("unknown value %#v", i))
    }
})

func Calc(input string) (float64, error) {
    result, err := Run(y, input)
    if err != nil {
        return 0, err
    }
    return result.(float64), nil
}

This parser will return numbers either as float64 or int depending on the literal, for this calculator we only want floats so we Map the results and type cast.

Run the tests and make sure everything is ok.

Time to add addition

func TestAddition(t *testing.T) {
    result, err := Calc(`1+1`)
    require.NoError(t, err)
    require.EqualValues(t, 2, result)
}


var sumOp  = Chars("+-", 1, 1)

sum = Seq(number, ZeroOrMore(And(sumOp, number))).Map(func(n Result) Result {
    i := n.Child[0].Result.(float64)

    for _, op := range n.Child[1].Child {
        switch op.Child[0].Token {
        case "+":
            i += op.Child[1].Result.(float64)
        case "-":
            i -= op.Child[1].Result.(float64)
        }
    }

    return Result{Result: i}
})

// and update Calc to point to the new root parser -> `result, err := ParseString(sum, input)`

This parser will match number ([+-] number)+, then map its to be the sum. See how the Child map directly to the positions in the parsers? n is the result of the and, n.Child[0] is its first argument, n.Child[1] is the result of the ZeroOrMore parser, n.Child[1].Child[0] is the result of the first And and so fourth. Given how closely tied the parser and the Map are it is good to keep the two together.

You can continue like this and add multiplication and parenthesis fairly easily. Eventually if you keep adding parsers you will end up with a loop, and go will give you a handy error message like:

typechecking loop involving value = goparsify.Any(number, groupExpr)

we need to break the loop using a pointer, then set its value in init

var (
    value Parser
    prod = Seq(&value, ZeroOrMore(And(prodOp, &value)))
)

func init() {
    value = Any(number, groupExpr)
}

Take a look at calc for a full example.

Preventing backtracking with cuts

A cut is a marker that prevents backtracking past the point it was set. This greatly improves error messages when used correctly:

alpha := Chars("a-z")

// without a cut if the close tag is left out the parser will backtrack and ignore the rest of the string
nocut := OneOrMore(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: left unparsed: <foo

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := OneOrMore(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: offset 9: expected >

prior art

Inspired by https://github.com/prataprc/goparsec

Documentation ¶

Index ¶

Variables
func ASCIIWhitespace(s *State)
func DisableLogging()
func DumpDebugStats()
func EnableLogging(w io.Writer)
func NoWhitespace(s *State)
func Run(parser Parserish, input string, ws ...VoidParser) (result interface{}, err error)
func UnicodeWhitespace(s *State)
type Error
- func (e *Error) Error() string
- func (e *Error) LocateError(s string) string
- func (e *Error) Pos() int
type Parser
- func Any(parsers ...Parserish) Parser
- func Bind(parser Parserish, val interface{}) Parser
- func Chain(parser Parserish, getNextParser func(prevN *Result) Parserish) Parser
- func Chars(matcher string, repetition ...int) Parser
- func Cut() Parser
- func Exact(match string) Parser
- func Map(parser Parserish, f func(n *Result)) Parser
- func Maybe(parser Parserish) Parser
- func Merge(parser Parserish) Parser
- func NewParser(description string, p Parser) Parser
- func NoAutoWS(parser Parserish) Parser
- func Noop() Parser
- func NotChars(matcher string, repetition ...int) Parser
- func NumberLit() Parser
- func OneOrMore(parser Parserish, separator ...Parserish) Parser
- func Parsify(p Parserish) Parser
- func ParsifyAll(parsers ...Parserish) []Parser
- func Regex(pattern string) Parser
- func Seq(parsers ...Parserish) Parser
- func StringLit(allowedQuotes string) Parser
- func Until(terminators ...string) Parser
- func ZeroOrMore(parser Parserish, separator ...Parserish) Parser
- func (p Parser) Chain(getNextParser func(prevN *Result) Parserish) Parser
- func (p Parser) Map(f func(n *Result)) Parser
type Parserish
type Result
- func NewResult(input string) *Result
- func (r Result) String() string
type State
- func NewState(input string) *State
- func (s *State) Advance(i int)
- func (s *State) ErrorHere(expected string)
- func (s *State) Errored() bool
- func (s *State) Get() string
- func (s *State) Preview(x int) string
- func (s *State) Recover()
type UnparsedInputError
- func (e UnparsedInputError) Error() string
type VoidParser

Examples ¶

Cut

Constants ¶

This section is empty.

Variables ¶

View Source

var TrashResult = &Result{}

TrashResult is used in places where the result isnt wanted, but something needs to be passed in to satisfy the interface.

Functions ¶

func ASCIIWhitespace ¶

func ASCIIWhitespace(s *State)

ASCIIWhitespace matches any of the standard whitespace characters. It is faster than the UnicodeWhitespace parser as it does not need to decode unicode runes.

func DisableLogging ¶

func DisableLogging()

DisableLogging will stop writing logs

func DumpDebugStats ¶

func DumpDebugStats()

DumpDebugStats will print out the curring timings for each parser if built with -tags debug

func EnableLogging ¶

func EnableLogging(w io.Writer)

EnableLogging will write logs to the given writer as the next parse happens

func NoWhitespace ¶

func NoWhitespace(s *State)

NoWhitespace disables automatic whitespace matching

func Run ¶

func Run(parser Parserish, input string, ws ...VoidParser) (result interface{}, err error)

Run applies some input to a parser and returns the result, failing if the input isnt fully consumed. It is a convenience method for the most common way to invoke a parser.

func UnicodeWhitespace ¶

func UnicodeWhitespace(s *State)

UnicodeWhitespace matches any unicode space character. Its a little slower than the ascii parser because it matches a rune at a time.

Types ¶

type Error ¶

type Error struct {
	// contains filtered or unexported fields
}

Error represents a parse error. These will often be set, the parser will back up a little and find another viable path. In general when combining errors the longest error should be returned.

func (*Error) Error ¶

func (e *Error) Error() string

Error satisfies the golang error interface

func (*Error) LocateError ¶

func (e *Error) LocateError(s string) string

LocalError locates the error position in the input string s and returns the error description along with a cursor to the input.

func (*Error) Pos ¶

func (e *Error) Pos() int

Pos is the offset into the document the error was found

type Parser ¶

type Parser func(*State, *Result)

Parser is the workhorse of parsify. A parser takes a State and returns a result, consuming some of the State in the process. Given state is shared there are a few rules that should be followed:

A parser that errors must set state.Error
A parser that errors must not change state.Pos
A parser that consumed some input should advance state.Pos

func Any ¶

func Any(parsers ...Parserish) Parser

Any matches the first successful parser and returns its result

func Bind ¶

func Bind(parser Parserish, val interface{}) Parser

Bind will set the node .Result when the given parser matches This is useful for giving a value to keywords and constant literals like true and false. See the json parser for an example.

func Chain ¶ added in v0.3.1

func Chain(parser Parserish, getNextParser func(prevN *Result) Parserish) Parser

Chain lets you choose which parser to call on the basis of the result of previous parser.

Chain's second argument is a function that takes in the result of the first parser, and with this knowledge, lets you return a successive parser.

Result of this successive parser is considered as the result of the Chain.

func Chars ¶

func Chars(matcher string, repetition ...int) Parser

Chars is the swiss army knife of character matches. It can match:

ranges: Chars("a-z") will match one or more lowercase letter
alphabets: Chars("abcd") will match one or more of the letters abcd in any order
min and max: Chars("a-z0-9", 4, 6) will match 4-6 lowercase alphanumeric characters

the above can be combined in any order

func Cut ¶

func Cut() Parser

Cut prevents backtracking beyond this point. Usually used after keywords when you are sure this is the correct path. Improves performance and error reporting.

Example ¶

// without a cut if the close tag is left out the parser will backtrack and ignore the rest of the string
alpha := Chars("a-z")
nocut := OneOrMore(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := OneOrMore(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())

Output:

left unparsed: <foo
offset 9: expected >

func Exact ¶

func Exact(match string) Parser

Exact will fully match the exact string supplied, or error. The match will be stored in .Token

func Map ¶

func Map(parser Parserish, f func(n *Result)) Parser

Map applies the callback if the parser matches. This is used to set the Result based on the matched result.

func Maybe ¶

func Maybe(parser Parserish) Parser

Maybe will 0 or 1 of the parser

func Merge ¶

func Merge(parser Parserish) Parser

Merge all child Tokens together recursively

func NewParser ¶

func NewParser(description string, p Parser) Parser

NewParser should be called around the creation of every Parser. It does nothing normally and should incur no runtime overhead, but when building with -tags debug it will instrument every parser to collect valuable timing information displayable with DumpDebugStats.

func NoAutoWS ¶

func NoAutoWS(parser Parserish) Parser

NoAutoWS disables automatically ignoring whitespace between tokens for all parsers underneath

func Noop ¶ added in v0.4.4

func Noop() Parser

Noop gives a no-op parser i.e. a parser that does nothing

func NotChars ¶

func NotChars(matcher string, repetition ...int) Parser

NotChars accepts the full range of input from Chars, but it will stop when any character matches. If you need to match until you see a sequence use Until instead

func NumberLit ¶

func NumberLit() Parser

NumberLit matches a floating point or integer number and returns it as a int64 or float64 in .Result

func OneOrMore ¶

func OneOrMore(parser Parserish, separator ...Parserish) Parser

OneOrMore matches one or more parsers and returns the value as .Child[n] an optional separator can be provided and that value will be consumed but not returned. Only one separator can be provided.

func Parsify ¶

func Parsify(p Parserish) Parser

Parsify takes a Parserish and makes a Parser out of it. It should be called by any Parser that accepts a Parser as an argument. It should never be called during instead call it during parser creation so there is no runtime cost.

See Parserish for details.

func ParsifyAll ¶

func ParsifyAll(parsers ...Parserish) []Parser

ParsifyAll calls Parsify on all parsers

func Regex ¶

func Regex(pattern string) Parser

Regex returns a match if the regex successfully matches

func Seq ¶

func Seq(parsers ...Parserish) Parser

Seq matches all of the given parsers in order and returns their result as .Child[n]

func StringLit ¶

func StringLit(allowedQuotes string) Parser

StringLit matches a quoted string and returns it in .Token. It may contain:

unicode
escaped characters, eg \" or \n
unicode sequences, eg \uBEEF

func Until ¶

func Until(terminators ...string) Parser

Until will consume all input until one of the given terminator sequences is found. If you want to stop when seeing single characters see NotChars instead

func ZeroOrMore ¶

func ZeroOrMore(parser Parserish, separator ...Parserish) Parser

ZeroOrMore matches zero or more parsers and returns the value as .Child[n] an optional separator can be provided and that value will be consumed but not returned. Only one separator can be provided.

func (Parser) Chain ¶ added in v0.3.1

func (p Parser) Chain(getNextParser func(prevN *Result) Parserish) Parser

Chain shorthand for Chain(p, func())

func (Parser) Map ¶

func (p Parser) Map(f func(n *Result)) Parser

Map shorthand for Map(p, func())

type Parserish ¶

type Parserish interface{}

Parserish types are any type that can be turned into a Parser by Parsify These currently include *Parser and string literals.

This makes recursive grammars cleaner and allows string literals to be used directly in most contexts. eg, matching balanced paren:

var group Parser
group = Seq("(", Maybe(&group), ")")

vs

var group ParserPtr{}
group.P = Seq(Exact("("), Maybe(group.Parse), Exact(")"))

type Result ¶

type Result struct {
	Token  string
	Child  []Result
	Result interface{}
	Input  string
	Start  int
	End    int
}

Result is the output of a parser. Usually only one of its fields will be set and should be though of more as a union type. having it avoids interface{} littered all through the parsing code and makes the it easy to do the two most common operations, getting a token and finding a child.

func NewResult ¶ added in v0.4.5

func NewResult(input string) *Result

func (Result) String ¶

func (r Result) String() string

String stringifies a node. This is only called from debug code.

type State ¶

type State struct {
	// The full input string
	Input string
	// An offset into the string, pointing to the current tip
	Pos int
	// Do not backtrack past this point
	Cut int
	// Error is a secondary return channel from parsers, but used so heavily
	// in backtracking that it has been inlined to avoid allocations.
	Error Error
	// Called to determine what to ignore when WS is called, or when WS fires
	WS VoidParser
}

State is the current parse state. It is entirely public because parsers are expected to mutate it during the parse.

func NewState ¶

func NewState(input string) *State

NewState creates a new State from a string

func (*State) Advance ¶

func (s *State) Advance(i int)

Advance the Pos along by i bytes

func (*State) ErrorHere ¶

func (s *State) ErrorHere(expected string)

ErrorHere raises an error at the current position.

func (*State) Errored ¶

func (s *State) Errored() bool

Errored returns true if the current parser has failed.

func (*State) Get ¶

func (s *State) Get() string

Get the remaining input.

func (*State) Preview ¶

func (s *State) Preview(x int) string

Preview of the the next x characters

func (*State) Recover ¶

func (s *State) Recover()

Recover from the current error. Often called by combinators that can match when one of their children succeed, but others have failed.

type UnparsedInputError ¶

type UnparsedInputError struct {
	Remaining string
}

UnparsedInputError is returned by Run when not all of the input was consumed. There may still be a valid result

func (UnparsedInputError) Error ¶

func (e UnparsedInputError) Error() string

Error satisfies the golang error interface

type VoidParser ¶

type VoidParser func(*State)

VoidParser is a special type of parser that never returns anything but can still consume input

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
calc
debug
html
json
profile

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL