GoJSONLex

gojsonlex is a drop-in replacement for the encoding/json lexer, optimised for efficiency. It is 2-3 times faster than encoding/json and requires only enough memory to buffer the longest token in the input. Currently gojsonlex skips all delimiters (this behaviour will change).

API Documentation

https://pkg.go.dev/github.com/gibsn/gojsonlex

Motivation

Let's consider a case where you want to parse the output of some tool that encodes its data as one huge JSON dict:

{
  "bands": [
    {
      "name": "Metallica",
      "origin": "USA",
      "albums": [
        ...
      ]
    },
    ...
    {
      "name": "Enter Shikari",
      "origin": "England"
      "albums": [
        ...
      ]
    }
  ]
}

Let's say "albums" can be arbitrary long, the whole JSON is 10GB, but you actually want to print out all "origin" values and don't care about the rest. You do not want to decode the whole JSON into one struct (like most JSON parsers do) since it can be huge. Luckily in this case you do not actually need to parse any arbitrary JSON, you are ok with a more narrow grammar. A parser for such a grammar could look like this:

for {
	currToken, err := lexer.Token()
	if err != nil {
		// ...
	}

	switch state {
	case searchingForOriginKey:
		if currToken == "origin" {
			state = pendingOriginValue
		}
	case pendingOriginValue:
		fmt.Println(currToken)
		state = searchingForOriginKey
	}
}

Ok, so now you need a JSON lexer. Some lexers that I checked buffered a large portion of the input in order to parse a composite type (which is bad, since "albums" can be huge). The only lexer that did not require that much memory was the standard encoding/json one; however, it could be optimised to consume less CPU. That's how gojsonlex was born.

Overview

The example from the previous section could be implemented with gojsonlex like this:

lexer, err := gojsonlex.NewJSONLexer(r)
if err != nil {
	// ...
}

state := stateSearchingForOriginKey

for {
	currToken, err := lexer.Token()
	if err != nil {
		// ...
	}
	
	s, ok := currToken.(string)
	if !ok {
		continue
	}

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		fmt.Println(s)
		state = stateSearchingForOriginKey
	}
}

In order to maintain zero allocations, Token() always returns an unsafe string that is valid only until the next Token() call. You must make a deep copy (using StringDeepCopy()) of that string if you need it after the next Token() call.
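
For instance, retaining token strings past the next Token() call requires a deep copy. A minimal sketch (error handling elided):

var saved []string

for {
	currToken, err := lexer.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		// ...
	}

	s, ok := currToken.(string)
	if !ok {
		continue
	}

	// s points into the lexer's internal buffer and will be
	// invalidated by the next Token() call, so deep copy it
	saved = append(saved, gojsonlex.StringDeepCopy(s))
}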

Though gojsonlex.Token() is faster than its encoding/json counterpart, it sacrifices some performance in order to match the standard interface. You may want to consider using TokenFast() to achieve the best performance (in exchange for more coding):

for {
	currToken, err := lexer.TokenFast()
	if err != nil {
		// ...
	}
	
	if currToken.Type() != gojsonlex.LexerTokenTypeString {
		continue
	}
	
	s := currToken.String()

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		fmt.Println(s)
		state = stateSearchingForOriginKey
	}
}

Examples

Please refer to the 'examples' directory for examples of gojsonlex usage. Run make examples to build all of them.

stdinparser

stdinparser is a simple utility that reads JSON from stdin and dumps JSON tokens to stdout.

Benchmarks

BenchmarkEncodingJSON-8    	     576	   1973465 ns/op	  432581 B/op	   26706 allocs/op
BenchmarkJSONLexer-8       	    1212	    959528 ns/op	   99200 B/op	    6300 allocs/op
BenchmarkJSONLexerFast-8   	    1532	    771233 ns/op	       0 B/op	       0 allocs/op

Status

In development

Documentation


Functions

func CanAppearInNumber added in v0.2.2

func CanAppearInNumber(c rune) bool

CanAppearInNumber reports whether the given rune can appear in a JSON number

func HexBytesToUint added in v0.2.2

func HexBytesToUint(in []byte) (result uint64, err error)

HexBytesToUint converts the given hexadecimal digits into the unsigned integer they represent

func IsDelim

func IsDelim(c rune) bool

IsDelim reports whether the given rune is a JSON delimiter

func IsHexDigit added in v0.2.0

func IsHexDigit(c rune) bool

IsHexDigit reports whether the given rune is a valid hex digit

func IsValidEscapedSymbol added in v0.2.0

func IsValidEscapedSymbol(c rune) bool

IsValidEscapedSymbol reports whether the given rune is one of the special symbols permitted in JSON
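
A small sketch exercising the helpers above; the expected results are inferred from the documented semantics, and the sample inputs are illustrative:

fmt.Println(gojsonlex.IsDelim('{'))              // true: '{' is a JSON delimiter
fmt.Println(gojsonlex.IsHexDigit('f'))           // true: 'f' is a valid hex digit
fmt.Println(gojsonlex.IsValidEscapedSymbol('n')) // true: "\n" is a valid JSON escape
fmt.Println(gojsonlex.CanAppearInNumber('e'))    // true: exponent marker in a JSON number

v, err := gojsonlex.HexBytesToUint([]byte("1f"))
if err != nil {
	// ...
}
fmt.Println(v) // 31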

func StringDeepCopy added in v0.2.0

func StringDeepCopy(s string) string

StringDeepCopy creates a copy of the given string with its own underlying byte array. Use this function to make a copy of a string returned by Token()

func UnescapeBytesInplace added in v0.2.2

func UnescapeBytesInplace(input []byte) ([]byte, error)

UnescapeBytesInplace iterates over the given byte slice, unescaping all escaped symbols in place. Since unescaped symbols take less space, the shrunken slice is returned
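
A minimal sketch of unescaping in place; the input bytes are illustrative:

buf := []byte(`Enter\tShikari`)

unescaped, err := gojsonlex.UnescapeBytesInplace(buf)
if err != nil {
	// ...
}

// unescaped reuses buf's memory but may be shorter, since
// unescaped symbols take less space than their escaped forms
fmt.Printf("%s\n", unescaped)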

Types

type JSONLexer

type JSONLexer struct {
	// contains filtered or unexported fields
}

JSONLexer is a JSON lexical analyzer with a streaming API, where the stream is a sequence of JSON tokens. JSONLexer does its own IO buffering, so prefer low-level readers if you want to minimize the memory footprint.

JSONLexer uses a ring buffer for parsing tokens; every token must fit into it, otherwise the buffer is grown automatically. The initial buffer size is 4096 bytes, but you can tweak it with SetBufSize() in case you know that most tokens are going to be long.

JSONLexer uses unsafe pointers into the underlying buf to minimize allocations, see Token() for the provided guarantees.

func NewJSONLexer

func NewJSONLexer(r io.Reader) (*JSONLexer, error)

NewJSONLexer creates a new JSONLexer with the given reader.

func (*JSONLexer) SetBufSize

func (l *JSONLexer) SetBufSize(bufSize int)

SetBufSize creates a new buffer of the given size. MUST be called before parsing starts.
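
A minimal sketch; the 64 KiB size is illustrative:

lexer, err := gojsonlex.NewJSONLexer(r)
if err != nil {
	// ...
}

// must be called before the first Token()/TokenFast() call
lexer.SetBufSize(64 * 1024)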

func (*JSONLexer) SetDebug added in v0.2.0

func (l *JSONLexer) SetDebug(debug bool)

SetDebug enables debug logging

func (*JSONLexer) SetSkipDelims

func (l *JSONLexer) SetSkipDelims(mustSkip bool)

SetSkipDelims tells JSONLexer to skip delimiters and return only keys and values. This can be useful in case you want to simply match the input to some specific grammar and have no intention of doing full syntax analysis.
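
For example, when matching input against a narrow grammar like the one in the overview above (a minimal sketch):

lexer.SetSkipDelims(true) // delimiters are skipped; only keys and values are returned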

func (*JSONLexer) Token

func (l *JSONLexer) Token() (json.Token, error)

Token returns the next JSON token; all delimiters are skipped. Token returns io.EOF when all input has been exhausted. All strings returned by Token are valid only until the next Token call; if you need a string after that, you MUST make a deep copy.

func (*JSONLexer) TokenFast added in v0.2.0

func (l *JSONLexer) TokenFast() (TokenGeneric, error)

TokenFast is a more efficient version of Token(). All strings returned by TokenFast are valid only until the next TokenFast call; if you need a string after that, you MUST make a deep copy.

type TokenGeneric added in v0.2.0

type TokenGeneric struct {
	// contains filtered or unexported fields
}

TokenGeneric is a generic struct used to represent any possible JSON token

func (*TokenGeneric) Bool added in v0.2.0

func (t *TokenGeneric) Bool() bool

func (*TokenGeneric) Delim added in v0.2.0

func (t *TokenGeneric) Delim() byte

func (*TokenGeneric) IsNull added in v0.2.0

func (t *TokenGeneric) IsNull() bool

func (*TokenGeneric) Number added in v0.2.0

func (t *TokenGeneric) Number() float64

func (*TokenGeneric) String added in v0.2.0

func (t *TokenGeneric) String() string

String returns a string that points into the internal lexer buffer and is valid only until the next Token call; if you need it after that, you MUST make a deep copy

func (*TokenGeneric) StringCopy added in v0.2.0

func (t *TokenGeneric) StringCopy() string

StringCopy returns a deep copy of the token's string

func (*TokenGeneric) Type added in v0.2.0

func (t *TokenGeneric) Type() TokenType

Type returns the type of the token

type TokenType

type TokenType byte

const (
	LexerTokenTypeDelim TokenType = iota
	LexerTokenTypeString
	LexerTokenTypeNumber
	LexerTokenTypeBool
	LexerTokenTypeNull
)
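
A sketch of dispatching on token types returned by TokenFast(); error handling is elided and the lexer is assumed to return delimiters:

for {
	t, err := lexer.TokenFast()
	if err != nil {
		// ...
	}

	switch t.Type() {
	case gojsonlex.LexerTokenTypeDelim:
		fmt.Printf("delim: %c\n", t.Delim())
	case gojsonlex.LexerTokenTypeString:
		fmt.Printf("string: %s\n", t.String())
	case gojsonlex.LexerTokenTypeNumber:
		fmt.Printf("number: %g\n", t.Number())
	case gojsonlex.LexerTokenTypeBool:
		fmt.Printf("bool: %t\n", t.Bool())
	case gojsonlex.LexerTokenTypeNull:
		fmt.Println("null")
	}
}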

func (TokenType) String added in v0.2.0

func (t TokenType) String() string

Directories

examples
