goparsify

package module
v0.5.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 12, 2023 License: MIT Imports: 9 Imported by: 0

README

Goparsify

Go


Here's what this fork does:


A parser-combinator library for building easy to test, read and maintain parsers using functional composition.

Everything should be unicode safe by default, but you can opt out of unicode whitespace for a decent ~20% performance boost.

Run(parser, input, ASCIIWhitespace)

Benchmarks

$ make bench
goos: linux
goarch: amd64
pkg: github.com/ajitid/goparsify/json
BenchmarkUnmarshalParsec-16     	  10000	   133884 ns/op	  49378 B/op	   1307 allocs/op
BenchmarkUnmarshalParsify-16    	  10000	   100330 ns/op	  68972 B/op	    480 allocs/op
BenchmarkUnmarshalStdlib-16     	  14707	    81414 ns/op	  13427 B/op	    251 allocs/op
PASS
ok  	github.com/ajitid/goparsify/json	4.389s

Most of the remaining small allocs are from putting things in interface{} and are pretty unavoidable. https://www.darkcoding.net/software/go-the-price-of-interface/ is a good read.

Debugging parsers

When a parser isnt working as you intended you can build with debugging and enable logging to get a detailed log of exactly what the parser is doing.

First build with debug using -tags debug:

$ go build -tags debug

Then, enable logging by calling EnableLogging(os.Stdout) in your code. This works great with tests, eg in the goparsify source tree:

$ go test -tags debug ./html -v
=== RUN   TestParse
html.go:48 | <body>hello <p  | tag {
html.go:43 | <body>hello <p  |   tstart {
html.go:43 | body>hello <p c |     < found "<"
html.go:20 | >hello <p color |     identifier found "body"
html.go:33 | >hello <p color |     attrs {
html.go:32 | >hello <p color |       attr {
html.go:20 | >hello <p color |         identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >hello <p color |       } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >hello <p color |     } found ""
html.go:43 | hello <p color= |     > found ">"
html.go:43 | hello <p color= |   } found "[<,body,,map[string
html.go:24 | hello <p color= |   elements {
html.go:23 | hello <p color= |     element {
html.go:21 | <p color=\"blue |       text found "hello "
html.go:23 | <p color=\"blue |     } found "\"hello \""
html.go:23 | <p color=\"blue |     element {
html.go:21 | <p color=\"blue |       text did not find <>
html.go:48 | <p color=\"blue |       tag {
html.go:43 | <p color=\"blue |         tstart {
html.go:43 | p color=\"blue\ |           < found "<"
html.go:20 |  color=\"blue\" |           identifier found "p"
html.go:33 |  color=\"blue\" |           attrs {
html.go:32 |  color=\"blue\" |             attr {
html.go:20 | =\"blue\">world |               identifier found "color"
html.go:32 | \"blue\">world< |               = found "="
html.go:32 | >world</p></bod |               string literal found "blue"
html.go:32 | >world</p></bod |             } found "[color,=,blue]"
html.go:32 | >world</p></bod |             attr {
html.go:20 | >world</p></bod |               identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >world</p></bod |             } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >world</p></bod |           } found "[[color,=,blue]]"
html.go:43 | world</p></body |           > found ">"
html.go:43 | world</p></body |         } found "[<,p,,map[string]st
html.go:24 | world</p></body |         elements {
html.go:23 | world</p></body |           element {
html.go:21 | </p></body>     |             text found "world"
html.go:23 | </p></body>     |           } found "\"world\""
html.go:23 | </p></body>     |           element {
html.go:21 | </p></body>     |             text did not find <>
html.go:48 | </p></body>     |             tag {
html.go:43 | </p></body>     |               tstart {
html.go:43 | /p></body>      |                 < found "<"
html.go:20 | /p></body>      |                 identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </p></body>     |               } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </p></body>     |             } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:23 | </p></body>     |           } did not find <> or [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </p></body>     |         } found "[\"world\"]"
html.go:44 | </p></body>     |         tend {
html.go:44 | p></body>       |           </ found "</"
html.go:20 | ></body>        |           identifier found "p"
html.go:44 | </body>         |           > found ">"
html.go:44 | </body>         |         } found "[</,,p,>]"
html.go:48 | </body>         |       } found "[[<,p,,map[string]s
html.go:23 | </body>         |     } found "html.htmlTag{Name:\
html.go:23 | </body>         |     element {
html.go:21 | </body>         |       text did not find <>
html.go:48 | </body>         |       tag {
html.go:43 | </body>         |         tstart {
html.go:43 | /body>          |           < found "<"
html.go:20 | /body>          |           identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </body>         |         } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </body>         |       } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:23 | </body>         |     } did not find <> or [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </body>         |   } found "[\"hello \",html.ht
html.go:44 | </body>         |   tend {
html.go:44 | body>           |     </ found "</"
html.go:20 | >               |     identifier found "body"
html.go:44 |                 |     > found ">"
html.go:44 |                 |   } found "[</,,body,>]"
html.go:48 |                 | } found "[[<,body,,map[strin
--- PASS: TestParse (0.00s)
PASS
ok  	github.com/ajitid/goparsify/html	(cached)

Debugging performance

If you build the parser with -tags debug it will instrument each parser and a call to DumpDebugStats() will show stats:

var name matches total time self time calls errors location
_value Any() 5.0685431s 34.0131ms 878801 0 json.go:36
_object Seq() 3.7513821s 10.5038ms 161616 40403 json.go:24
_properties ZeroOrMore() 3.6863512s 5.5028ms 121213 0 json.go:14
_properties Seq() 3.4912614s 46.0229ms 818185 0 json.go:14
_array Seq() 931.4679ms 3.5014ms 65660 55558 json.go:16
_array ZeroOrMore() 911.4597ms 0s 10102 0 json.go:16
_properties string literal 126.0662ms 44.5201ms 818185 0 json.go:14
_string string literal 67.033ms 26.0126ms 671723 136369 json.go:12
_properties : 50.0238ms 45.0205ms 818185 0 json.go:14
_properties , 48.5189ms 36.0146ms 818185 121213 json.go:14
_number number literal 28.5159ms 10.5062ms 287886 106066 json.go:13
_true true 17.5086ms 12.5069ms 252537 232332 json.go:10
_null null 14.5082ms 11.007ms 252538 252535 json.go:9
_object } 10.5051ms 10.5033ms 121213 0 json.go:24
_false false 10.5049ms 5.0019ms 232333 222229 json.go:11
_object { 10.0046ms 5.0052ms 161616 40403 json.go:24
_array , 4.5024ms 4.0018ms 50509 10102 json.go:16
_array [ 4.5014ms 2.0006ms 65660 55558 json.go:16
_array ] 0s 0s 10102 0 json.go:16

All times are cumulative, it would be nice to break this down into a parse tree with relative times. This is a nice addition to pprof as it will break down the parsers based on where they are used instead of grouping them all by type.

This is free when the debug tag isnt used.

Example calculator

Lets say we wanted to build a calculator that could take an expression and calculate the result.

Lets start with test:

func TestNumbers(t *testing.T) {
    result, err := Calc(`1`)
    require.NoError(t, err)
    require.EqualValues(t, 1, result)
}

Then define a parser for numbers

var number = NumberLit().Map(func(n Result) Result {
    switch i := n.Result.(type) {
    case int64:
        return Result{Result: float64(i)}
    case float64:
        return Result{Result: i}
    default:
        panic(fmt.Errorf("unknown value %#v", i))
    }
})

func Calc(input string) (float64, error) {
    result, err := Run(y, input)
    if err != nil {
        return 0, err
    }
    return result.(float64), nil
}

This parser will return numbers either as float64 or int depending on the literal, for this calculator we only want floats so we Map the results and type cast.

Run the tests and make sure everything is ok.

Time to add addition

func TestAddition(t *testing.T) {
    result, err := Calc(`1+1`)
    require.NoError(t, err)
    require.EqualValues(t, 2, result)
}


var sumOp  = Chars("+-", 1, 1)

sum = Seq(number, ZeroOrMore(And(sumOp, number))).Map(func(n Result) Result {
    i := n.Child[0].Result.(float64)

    for _, op := range n.Child[1].Child {
        switch op.Child[0].Token {
        case "+":
            i += op.Child[1].Result.(float64)
        case "-":
            i -= op.Child[1].Result.(float64)
        }
    }

    return Result{Result: i}
})

// and update Calc to point to the new root parser -> `result, err := ParseString(sum, input)`

This parser will match number ([+-] number)+, then map its to be the sum. See how the Child map directly to the positions in the parsers? n is the result of the and, n.Child[0] is its first argument, n.Child[1] is the result of the ZeroOrMore parser, n.Child[1].Child[0] is the result of the first And and so fourth. Given how closely tied the parser and the Map are it is good to keep the two together.

You can continue like this and add multiplication and parenthesis fairly easily. Eventually if you keep adding parsers you will end up with a loop, and go will give you a handy error message like:

typechecking loop involving value = goparsify.Any(number, groupExpr)

we need to break the loop using a pointer, then set its value in init

var (
    value Parser
    prod = Seq(&value, ZeroOrMore(And(prodOp, &value)))
)

func init() {
    value = Any(number, groupExpr)
}

Take a look at calc for a full example.

Preventing backtracking with cuts

A cut is a marker that prevents backtracking past the point it was set. This greatly improves error messages when used correctly:

alpha := Chars("a-z")

// without a cut if the close tag is left out the parser will backtrack and ignore the rest of the string
nocut := OneOrMore(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: left unparsed: <foo

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := OneOrMore(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: offset 9: expected >
prior art

Inspired by https://github.com/prataprc/goparsec

Documentation

Index

Examples

Constants

This section is empty.

Variables

View Source
var TrashResult = &Result{}

TrashResult is used in places where the result isnt wanted, but something needs to be passed in to satisfy the interface.

Functions

func ASCIIWhitespace

func ASCIIWhitespace(s *State)

ASCIIWhitespace matches any of the standard whitespace characters. It is faster than the UnicodeWhitespace parser as it does not need to decode unicode runes.

func DisableLogging

func DisableLogging()

DisableLogging will stop writing logs

func DumpDebugStats

func DumpDebugStats()

DumpDebugStats will print out the curring timings for each parser if built with -tags debug

func EnableLogging

func EnableLogging(w io.Writer)

EnableLogging will write logs to the given writer as the next parse happens

func NoWhitespace

func NoWhitespace(s *State)

NoWhitespace disables automatic whitespace matching

func Run

func Run(parser Parserish, input string, ws ...VoidParser) (result interface{}, err error)

Run applies some input to a parser and returns the result, failing if the input isnt fully consumed. It is a convenience method for the most common way to invoke a parser.

func UnicodeWhitespace

func UnicodeWhitespace(s *State)

UnicodeWhitespace matches any unicode space character. Its a little slower than the ascii parser because it matches a rune at a time.

Types

type Error

type Error struct {
	// contains filtered or unexported fields
}

Error represents a parse error. These will often be set, the parser will back up a little and find another viable path. In general when combining errors the longest error should be returned.

func (*Error) Error

func (e *Error) Error() string

Error satisfies the golang error interface

func (*Error) LocateError

func (e *Error) LocateError(s string) string

LocalError locates the error position in the input string s and returns the error description along with a cursor to the input.

func (*Error) Pos

func (e *Error) Pos() int

Pos is the offset into the document the error was found

type Parser

type Parser func(*State, *Result)

Parser is the workhorse of parsify. A parser takes a State and returns a result, consuming some of the State in the process. Given state is shared there are a few rules that should be followed:

  • A parser that errors must set state.Error
  • A parser that errors must not change state.Pos
  • A parser that consumed some input should advance state.Pos

func Any

func Any(parsers ...Parserish) Parser

Any matches the first successful parser and returns its result

func Bind

func Bind(parser Parserish, val interface{}) Parser

Bind will set the node .Result when the given parser matches This is useful for giving a value to keywords and constant literals like true and false. See the json parser for an example.

func Chain added in v0.3.1

func Chain(parser Parserish, getNextParser func(prevN *Result) Parserish) Parser

Chain lets you choose which parser to call on the basis of the result of previous parser.

Chain's second argument is a function that takes in the result of the first parser, and with this knowledge, lets you return a successive parser.

Result of this successive parser is considered as the result of the Chain.

func Chars

func Chars(matcher string, repetition ...int) Parser

Chars is the swiss army knife of character matches. It can match:

  • ranges: Chars("a-z") will match one or more lowercase letter
  • alphabets: Chars("abcd") will match one or more of the letters abcd in any order
  • min and max: Chars("a-z0-9", 4, 6) will match 4-6 lowercase alphanumeric characters

the above can be combined in any order

func Cut

func Cut() Parser

Cut prevents backtracking beyond this point. Usually used after keywords when you are sure this is the correct path. Improves performance and error reporting.

Example
// without a cut if the close tag is left out the parser will backtrack and ignore the rest of the string
alpha := Chars("a-z")
nocut := OneOrMore(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := OneOrMore(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())
Output:

left unparsed: <foo
offset 9: expected >

func Exact

func Exact(match string) Parser

Exact will fully match the exact string supplied, or error. The match will be stored in .Token

func Map

func Map(parser Parserish, f func(n *Result)) Parser

Map applies the callback if the parser matches. This is used to set the Result based on the matched result.

func Maybe

func Maybe(parser Parserish) Parser

Maybe will 0 or 1 of the parser

func Merge

func Merge(parser Parserish) Parser

Merge all child Tokens together recursively

func NewParser

func NewParser(description string, p Parser) Parser

NewParser should be called around the creation of every Parser. It does nothing normally and should incur no runtime overhead, but when building with -tags debug it will instrument every parser to collect valuable timing information displayable with DumpDebugStats.

func NoAutoWS

func NoAutoWS(parser Parserish) Parser

NoAutoWS disables automatically ignoring whitespace between tokens for all parsers underneath

func Noop added in v0.4.4

func Noop() Parser

Noop gives a no-op parser i.e. a parser that does nothing

func NotChars

func NotChars(matcher string, repetition ...int) Parser

NotChars accepts the full range of input from Chars, but it will stop when any character matches. If you need to match until you see a sequence use Until instead

func NumberLit

func NumberLit() Parser

NumberLit matches a floating point or integer number and returns it as a int64 or float64 in .Result

func OneOrMore

func OneOrMore(parser Parserish, separator ...Parserish) Parser

OneOrMore matches one or more parsers and returns the value as .Child[n] an optional separator can be provided and that value will be consumed but not returned. Only one separator can be provided.

func Parsify

func Parsify(p Parserish) Parser

Parsify takes a Parserish and makes a Parser out of it. It should be called by any Parser that accepts a Parser as an argument. It should never be called during instead call it during parser creation so there is no runtime cost.

See Parserish for details.

func ParsifyAll

func ParsifyAll(parsers ...Parserish) []Parser

ParsifyAll calls Parsify on all parsers

func Regex

func Regex(pattern string) Parser

Regex returns a match if the regex successfully matches

func Seq

func Seq(parsers ...Parserish) Parser

Seq matches all of the given parsers in order and returns their result as .Child[n]

func StringLit

func StringLit(allowedQuotes string) Parser

StringLit matches a quoted string and returns it in .Token. It may contain:

  • unicode
  • escaped characters, eg \" or \n
  • unicode sequences, eg \uBEEF

func Until

func Until(terminators ...string) Parser

Until will consume all input until one of the given terminator sequences is found. If you want to stop when seeing single characters see NotChars instead

func ZeroOrMore

func ZeroOrMore(parser Parserish, separator ...Parserish) Parser

ZeroOrMore matches zero or more parsers and returns the value as .Child[n] an optional separator can be provided and that value will be consumed but not returned. Only one separator can be provided.

func (Parser) Chain added in v0.3.1

func (p Parser) Chain(getNextParser func(prevN *Result) Parserish) Parser

Chain shorthand for Chain(p, func())

func (Parser) Map

func (p Parser) Map(f func(n *Result)) Parser

Map shorthand for Map(p, func())

type Parserish

type Parserish interface{}

Parserish types are any type that can be turned into a Parser by Parsify These currently include *Parser and string literals.

This makes recursive grammars cleaner and allows string literals to be used directly in most contexts. eg, matching balanced paren:

var group Parser
group = Seq("(", Maybe(&group), ")")

vs

var group ParserPtr{}
group.P = Seq(Exact("("), Maybe(group.Parse), Exact(")"))

type Result

type Result struct {
	Token  string
	Child  []Result
	Result interface{}
	Input  string
	Start  int
	End    int
}

Result is the output of a parser. Usually only one of its fields will be set and should be though of more as a union type. having it avoids interface{} littered all through the parsing code and makes the it easy to do the two most common operations, getting a token and finding a child.

func NewResult added in v0.4.5

func NewResult(input string) *Result

func (Result) String

func (r Result) String() string

String stringifies a node. This is only called from debug code.

type State

type State struct {
	// The full input string
	Input string
	// An offset into the string, pointing to the current tip
	Pos int
	// Do not backtrack past this point
	Cut int
	// Error is a secondary return channel from parsers, but used so heavily
	// in backtracking that it has been inlined to avoid allocations.
	Error Error
	// Called to determine what to ignore when WS is called, or when WS fires
	WS VoidParser
}

State is the current parse state. It is entirely public because parsers are expected to mutate it during the parse.

func NewState

func NewState(input string) *State

NewState creates a new State from a string

func (*State) Advance

func (s *State) Advance(i int)

Advance the Pos along by i bytes

func (*State) ErrorHere

func (s *State) ErrorHere(expected string)

ErrorHere raises an error at the current position.

func (*State) Errored

func (s *State) Errored() bool

Errored returns true if the current parser has failed.

func (*State) Get

func (s *State) Get() string

Get the remaining input.

func (*State) Preview

func (s *State) Preview(x int) string

Preview of the the next x characters

func (*State) Recover

func (s *State) Recover()

Recover from the current error. Often called by combinators that can match when one of their children succeed, but others have failed.

type UnparsedInputError

type UnparsedInputError struct {
	Remaining string
}

UnparsedInputError is returned by Run when not all of the input was consumed. There may still be a valid result

func (UnparsedInputError) Error

func (e UnparsedInputError) Error() string

Error satisfies the golang error interface

type VoidParser

type VoidParser func(*State)

VoidParser is a special type of parser that never returns anything but can still consume input

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL