GoJSONLex

gojsonlex is a drop-in replacement for the encoding/json lexer, optimised for efficiency. It is 2-3 times faster than encoding/json and requires only enough memory to buffer the longest token in the input. Currently gojsonlex skips all delimiters (this behaviour will change).

API Documentation

https://pkg.go.dev/github.com/gibsn/gojsonlex

Motivation

Let's consider a case where you want to parse the output of some tool that encodes its data as one huge JSON dict:

{
  "bands": [
    {
      "name": "Metallica",
      "origin": "USA",
      "albums": [
        ...
      ]
    },
    ...
    {
      "name": "Enter Shikari",
      "origin": "England"
      "albums": [
        ...
      ]
    }
  ]
}

Let's say "albums" can be arbitrary long, the whole JSON is 10GB, but you actually want to print out all "origin" values and don't care about the rest. You do not want to decode the whole JSON into one struct (like most JSON parsers do) since it can be huge. Luckily in this case you do not actually need to parse any arbitrary JSON, you are ok with a more narrow grammar. A parser for such a grammar could look like this:

for {
	currToken, err := lexer.Token()
	if err != nil {
		// ...
	}

	switch state {
	case searchingForOriginKey:
		if currToken == "origin" {
			state = pendingOriginValue
		}
	case pendingOriginValue:
		fmt.Println(currToken)
		state = searchingForOriginKey
	}
}

Ok, so now you need a JSON lexer. Some lexers that I checked buffered a large portion of the input in order to parse a composite type (which is bad, since "albums" can be huge). The only lexer that did not require that much memory was the standard encoding/json one; however, it could be optimised to consume less CPU. That's how gojsonlex was born.

Overview

The example from the previous section could be implemented with gojsonlex like this:

lexer, err := gojsonlex.NewJSONLexer(r)
if err != nil {
	// ...
}

state := stateSearchingForOriginKey

for {
	currToken, err := lexer.Token()
	if err != nil {
		// ...
	}
	
	s, ok := currToken.(string)
	if !ok {
		continue
	}

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		fmt.Println(s)
		state = stateSearchingForOriginKey
	}
}

In order to maintain zero allocations, Token() always returns an unsafe string that is valid only until the next Token() call. You must make a deep copy (using StringDeepCopy()) of that string if you need it after the next Token() call.
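
For instance, retaining token strings past the next Token() call requires a deep copy. A minimal sketch (error handling elided):

var saved []string

for {
	currToken, err := lexer.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		// ...
	}

	s, ok := currToken.(string)
	if !ok {
		continue
	}

	// s points into the lexer's internal buffer and will be
	// invalidated by the next Token() call, so deep copy it
	saved = append(saved, gojsonlex.StringDeepCopy(s))
}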

Though gojsonlex.Token() is faster than its encoding/json counterpart, it sacrifices some performance in order to match the standard interface. You may want to consider using TokenFast() to achieve the best performance (in exchange for more coding):

for {
	currToken, err := lexer.TokenFast()
	if err != nil {
		// ...
	}
	
	if currToken.Type() != gojsonlex.LexerTokenTypeString {
		continue
	}
	
	s := currToken.String()

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		fmt.Println(s)
		state = stateSearchingForOriginKey
	}
}

Examples

Please refer to the 'examples' directory for examples of gojsonlex usage. Run make examples to build all of them.

stdinparser

stdinparser is a simple utility that reads JSON from stdin and dumps JSON tokens to stdout.

Benchmarks

BenchmarkEncodingJSON-8    	     576	   1973465 ns/op	  432581 B/op	   26706 allocs/op
BenchmarkJSONLexer-8       	    1212	    959528 ns/op	   99200 B/op	    6300 allocs/op
BenchmarkJSONLexerFast-8   	    1532	    771233 ns/op	       0 B/op	       0 allocs/op

Status

In development

Documentation


Functions

func CanAppearInNumber added in v0.2.2

func CanAppearInNumber(c rune) bool

CanAppearInNumber reports whether the given rune can appear in a JSON number

func HexBytesToUint added in v0.2.2

func HexBytesToUint(in []byte) (result uint64, err error)

HexBytesToUint converts the given hexadecimal digits into the unsigned integer they represent

func IsDelim

func IsDelim(c rune) bool

IsDelim reports whether the given rune is a JSON delimiter

func IsHexDigit added in v0.2.0

func IsHexDigit(c rune) bool

IsHexDigit reports whether the given rune is a valid hex digit

func IsValidEscapedSymbol added in v0.2.0

func IsValidEscapedSymbol(c rune) bool

IsValidEscapedSymbol reports whether the given rune is one of the special symbols permitted in JSON
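
A small sketch exercising the helpers above; the expected results are inferred from the documented semantics, and the sample inputs are illustrative:

fmt.Println(gojsonlex.IsDelim('{'))              // true: '{' is a JSON delimiter
fmt.Println(gojsonlex.IsHexDigit('f'))           // true: 'f' is a valid hex digit
fmt.Println(gojsonlex.IsValidEscapedSymbol('n')) // true: "\n" is a valid JSON escape
fmt.Println(gojsonlex.CanAppearInNumber('e'))    // true: exponent marker in a JSON number

v, err := gojsonlex.HexBytesToUint([]byte("1f"))
if err != nil {
	// ...
}
fmt.Println(v) // 31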

func StringDeepCopy added in v0.2.0

func StringDeepCopy(s string) string

StringDeepCopy creates a copy of the given string with its own underlying byte array. Use this function to make a copy of a string returned by Token()

func UnescapeBytesInplace added in v0.2.2

func UnescapeBytesInplace(input []byte) ([]byte, error)

UnescapeBytesInplace iterates over the given byte slice, unescaping all escaped symbols in place. Since unescaped symbols take less space, the shrunken slice is returned
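
A minimal sketch of unescaping in place; the input bytes are illustrative:

buf := []byte(`Enter\tShikari`)

unescaped, err := gojsonlex.UnescapeBytesInplace(buf)
if err != nil {
	// ...
}

// unescaped reuses buf's memory but may be shorter, since
// unescaped symbols take less space than their escaped forms
fmt.Printf("%s\n", unescaped)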

Types

type JSONLexer

type JSONLexer struct {
	// contains filtered or unexported fields
}

JSONLexer is a JSON lexical analyzer with a streaming API, where the stream is a sequence of JSON tokens. JSONLexer does its own IO buffering, so prefer low-level readers if you want to minimize the memory footprint.

JSONLexer uses a ring buffer for parsing tokens; every token must fit into it, otherwise the buffer is grown automatically. The initial buffer size is 4096 bytes, but you can tweak it with SetBufSize() in case you know that most tokens are going to be long.

JSONLexer uses unsafe pointers into the underlying buf to minimize allocations, see Token() for the provided guarantees.

func NewJSONLexer

func NewJSONLexer(r io.Reader) (*JSONLexer, error)

NewJSONLexer creates a new JSONLexer with the given reader.

func (*JSONLexer) SetBufSize

func (l *JSONLexer) SetBufSize(bufSize int)

SetBufSize creates a new buffer of the given size. MUST be called before parsing starts.
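
A minimal sketch; the 64 KiB size is illustrative:

lexer, err := gojsonlex.NewJSONLexer(r)
if err != nil {
	// ...
}

// must be called before the first Token()/TokenFast() call
lexer.SetBufSize(64 * 1024)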

func (*JSONLexer) SetDebug added in v0.2.0

func (l *JSONLexer) SetDebug(debug bool)

SetDebug enables debug logging

func (*JSONLexer) SetSkipDelims

func (l *JSONLexer) SetSkipDelims(mustSkip bool)

SetSkipDelims tells JSONLexer to skip delimiters and return only keys and values. This can be useful in case you want to simply match the input to some specific grammar and have no intention of doing full syntax analysis.
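
For example, when matching input against a narrow grammar like the one in the overview above (a minimal sketch):

lexer.SetSkipDelims(true) // delimiters are skipped; only keys and values are returned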

func (*JSONLexer) Token

func (l *JSONLexer) Token() (json.Token, error)

Token returns the next JSON token; all delimiters are skipped. Token returns io.EOF when all input has been exhausted. All strings returned by Token are valid only until the next Token call; if you need a string after that, you MUST make a deep copy.

func (*JSONLexer) TokenFast added in v0.2.0

func (l *JSONLexer) TokenFast() (TokenGeneric, error)

TokenFast is a more efficient version of Token(). All strings returned by TokenFast are valid only until the next TokenFast call; if you need a string after that, you MUST make a deep copy.

type TokenGeneric added in v0.2.0

type TokenGeneric struct {
	// contains filtered or unexported fields
}

TokenGeneric is a generic struct used to represent any possible JSON token

func (*TokenGeneric) Bool added in v0.2.0

func (t *TokenGeneric) Bool() bool

func (*TokenGeneric) Delim added in v0.2.0

func (t *TokenGeneric) Delim() byte

func (*TokenGeneric) IsNull added in v0.2.0

func (t *TokenGeneric) IsNull() bool

func (*TokenGeneric) Number added in v0.2.0

func (t *TokenGeneric) Number() float64

func (*TokenGeneric) String added in v0.2.0

func (t *TokenGeneric) String() string

String returns a string that points into the internal lexer buffer and is valid only until the next Token call; if you need it after that, you MUST make a deep copy

func (*TokenGeneric) StringCopy added in v0.2.0

func (t *TokenGeneric) StringCopy() string

StringCopy returns a deep copy of the token's string

func (*TokenGeneric) Type added in v0.2.0

func (t *TokenGeneric) Type() TokenType

Type returns the type of the token

type TokenType

type TokenType byte

const (
	LexerTokenTypeDelim TokenType = iota
	LexerTokenTypeString
	LexerTokenTypeNumber
	LexerTokenTypeBool
	LexerTokenTypeNull
)
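
A sketch of dispatching on token types returned by TokenFast(); error handling is elided and the lexer is assumed to return delimiters:

for {
	t, err := lexer.TokenFast()
	if err != nil {
		// ...
	}

	switch t.Type() {
	case gojsonlex.LexerTokenTypeDelim:
		fmt.Printf("delim: %c\n", t.Delim())
	case gojsonlex.LexerTokenTypeString:
		fmt.Printf("string: %s\n", t.String())
	case gojsonlex.LexerTokenTypeNumber:
		fmt.Printf("number: %g\n", t.Number())
	case gojsonlex.LexerTokenTypeBool:
		fmt.Printf("bool: %t\n", t.Bool())
	case gojsonlex.LexerTokenTypeNull:
		fmt.Println("null")
	}
}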

func (TokenType) String added in v0.2.0

func (t TokenType) String() string

Directories

examples
