codec

Version: v0.4.1
Published: Feb 15, 2021 License: BSD-3-Clause Imports: 13 Imported by: 0

README

Fast Encoding of Go Values

This project is an enhanced version of the package pkgsite/internal/godoc/codec.

The original motivation was fast decoding of parsed Go files, of type go/ast.File. The pkg.go.dev site saves these when processing a module, and decodes them on the serving path to render documentation. So decoding had to be fast, and had to handle the cycles that these structures contain. It also had to work with existing types that we did not control. We couldn't find any existing encoders with these properties, so we wrote our own.

For usage, see the package documentation.

Encoding Scheme

Go values are converted to byte sequences by mapping them to a low-level wire protocol.

Wire Protocol

The wire protocol is a virtual machine in which every encoded value begins with a 1-byte code that describes what (if anything) follows. The encoding does not preserve type information (for instance, the value 1 could be an int or a bool), but it does have enough information to skip values, since the decoder must be able to do that if it encounters a struct field it doesn't know.

Most of the possible values of the initial byte are devoted to small unsigned integers. For example, the number 17 is represented by the single byte 17. Only a few byte values have special meaning, as described below.

The nil code indicates that the value is nil. (We don't absolutely need this: we could always represent the nil value for a type as something that couldn't be mistaken for an encoded value of that type. For instance, we could use 0 for nil in the case of slices (which always begin with the nValues code), and for pointers to numbers like *int, we could use something like "nBytes 0". But it is simpler to have a reserved value for nil.)

The nBytes code indicates that an unsigned integer N is encoded next, followed by N bytes of data. There are optimized codes bytes0, bytes1, etc. for values of N from 0 to 4. These are used to represent strings and byte slices, as well as numbers too big to fit into the initial byte.

The nValues code is for sequences of values whose size is known beforehand, like a Go slice or array.

The ptr and refPtr codes indicate a pointer to the encoded value. The latter signals to the decoder that it should remember the pointer because it will be referred to later in the stream.

The ref code is used to refer to an earlier encoded pointer. It is followed by a uint denoting the relative offset to the position of the corresponding refPtr code.

The start and end codes delimit a value whose length is unknown beforehand. They are used for structs.

Encoding Values

Small unsigned integers are encoded in a single byte, as described above. Those that can't fit into the initial byte are encoded as byte sequences of length 1, 2, 4 or 8, holding big-endian values. For example, 255 is encoded as bytes1 255.
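As a rough illustration of the length selection described above, the following sketch picks the shortest of the four payload sizes and writes the value big-endian. The function name minimalBigEndian is hypothetical and is not part of this package. The sketch produces only the payload bytes; the code byte that precedes them on the wire (bytes1, bytes2 and so on) is left out because its numeric value is not stated here, and the cutoff below which a value fits in the initial byte is not modeled either.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// minimalBigEndian returns u as a big-endian byte sequence of the
// smallest length among 1, 2, 4 and 8 that can hold it.
// Hypothetical helper: it covers the payload only, not the code byte
// that precedes the payload in the wire format.
func minimalBigEndian(u uint64) []byte {
	switch {
	case u <= 0xFF:
		return []byte{byte(u)}
	case u <= 0xFFFF:
		b := make([]byte, 2)
		binary.BigEndian.PutUint16(b, uint16(u))
		return b
	case u <= 0xFFFFFFFF:
		b := make([]byte, 4)
		binary.BigEndian.PutUint32(b, uint32(u))
		return b
	default:
		b := make([]byte, 8)
		binary.BigEndian.PutUint64(b, u)
		return b
	}
}

func main() {
	for _, u := range []uint64{255, 256, 70000, 1 << 40} {
		fmt.Printf("%d -> % x\n", u, minimalBigEndian(u))
	}
}
```

Note that a value needing 3 bytes is widened to 4, and one needing 5 to 7 bytes is widened to 8, which matches the set of lengths listed above.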

Signed integers are encoded as unsigned integers using zig-zag encoding: non-negative numbers are encoded as twice their value, and negative numbers are encoded as twice their negated value minus 1. This maps values of small magnitude, whether negative or positive, to small unsigned numbers, since they tend to occur more frequently than large values of either sign.
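The zig-zag rule can be written in a few lines. This is a sketch of the mapping as stated above, not the package's own code; the names zigzag and unzigzag are hypothetical. The negative branch uses the identity 2(-n) - 1 = 2(-n - 1) + 1 so that the most negative int64 does not overflow when negated.

```go
package main

import "fmt"

// zigzag maps a signed integer to an unsigned one:
// n >= 0 becomes 2n, and n < 0 becomes 2(-n) - 1.
func zigzag(n int64) uint64 {
	if n >= 0 {
		return uint64(n) << 1
	}
	// ^n equals -n - 1, so this computes 2(-n - 1) + 1 = 2(-n) - 1.
	return uint64(^n)<<1 | 1
}

// unzigzag is the inverse of zigzag.
func unzigzag(u uint64) int64 {
	if u&1 == 0 {
		return int64(u >> 1)
	}
	return ^int64(u >> 1)
}

func main() {
	for _, n := range []int64{0, -1, 1, -2, 2} {
		u := zigzag(n)
		fmt.Printf("%d -> %d -> %d\n", n, u, unzigzag(u))
	}
}
```

With this mapping 0, -1, 1, -2, 2 become 0, 1, 2, 3, 4, so small values of either sign still fit in the initial byte.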

A boolean true is encoded as 1, false as 0.

Strings, byte slices and byte arrays are encoded as sequences of bytes. For example, the string "hello" is represented as nBytes 5 'h' 'e' 'l' 'l' 'o'.

Floating-point values are encoded as unsigned integers, after reversing the bits. Reversing makes small integer-valued floats take less space.
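To see why reversal helps, consider the IEEE 754 bit pattern of an integer-valued float: the sign and exponent occupy the high bits and the low mantissa bits are mostly zero. Reversing moves the zeros to the top, leaving a small unsigned number. The sketch below assumes a full bit reversal with math/bits.Reverse64, which is one reading of the sentence above; the package itself may reverse at a different granularity. The names floatToUint and uintToFloat are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// floatToUint reverses the bits of f's IEEE 754 representation.
// Assumption: bit-level reversal, as the text above suggests.
func floatToUint(f float64) uint64 {
	return bits.Reverse64(math.Float64bits(f))
}

// uintToFloat undoes floatToUint.
func uintToFloat(u uint64) float64 {
	return math.Float64frombits(bits.Reverse64(u))
}

func main() {
	for _, f := range []float64{1, 2, 100, 0.1} {
		fmt.Printf("%v -> %#x\n", f, floatToUint(f))
	}
}
```

Under this assumption 1.0 becomes 0xffc and 2.0 becomes 0x2, while a value such as 0.1, whose mantissa has no trailing zeros, still needs all eight bytes.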

Complex values are encoded as two-element lists of floats, using the nValues code.

Nil values are of course encoded with the nil code.

Arrays and slices of type other than byte are encoded with nValues. For example, the slice []string{"hi", "bye"} is encoded as

nValues 2 bytes2 'h' 'i' bytes3 'b' 'y' 'e'

Non-nil pointers are initially encoded with ptr followed by the encoding of the value. For instance, the encoding of p in

i := 3
p := &i

is ptr 3. If pointer tracking is enabled and the pointer is encountered again, then it is encoded with ref and the ptr code is backpatched to refPtr.

Interface values are encoded as a pair of a type number and the value. The type numbers are assigned during encoding and stored at the beginning of the output, so the decoder can set up the mapping before it begins.

To encode structs, the generator assigns a unique number to each field. An encoded struct begins with the start code and ends with end. Each non-zero field is encoded as its number followed by its value.

If a struct's fields are changed, the numbers can change. The encoder saves the numbers assigned to each field name in the encoded data, and the decoder maps those numbers to the numbers assigned by the generated code. For example, if F is the second field of a struct, it will be assigned 1. A struct value is encoded, along with the association of "F" to 1. Then the struct is re-arranged and F becomes the third field, where it is assigned 2. When the encoded data is decoded, the decoder will map an encoded field value of 1 to 2.

The encoder recognizes types that implement encoding.BinaryMarshaler and encoding.TextMarshaler, and uses those methods.

Comparison with Other Encoders

This encoder uses code generation instead of reflection, so it is usually faster than reflection-based encoders like encoding/gob and encoding/json. It is also faster than github.com/ugorji/go/codec, even when that uses code generation. See internal/benchmarks for comparison with the gob and ugorji codecs on a suite of benchmarks.

Those benchmarks turn off this codec's ability to handle pointer sharing. Turning on that feature slows it down noticeably. But the other encoders can't handle sharing at all.

The gvisor project has its own encoder, which does handle sharing. See https://pkg.go.dev/gvisor.dev/gvisor/pkg/state. I haven't benchmarked it. From reading the code:

  • It does have some features this encoder lacks, like some custom hooks.

  • It doesn't seem to provide for skipping unknown struct fields.

  • It seems that only types under programmer control can be encoded, because they have to implement certain methods.

  • It appears that values must be converted to and from a set of types used by the wire protocol (in state/wire). For example, a []float64 must be first transformed into a wire.Slice that holds a []wire.Float64 before encoding, and the reverse must happen on decoding. It seems this is done cheaply with some unsafe code (pkg/state/encode_unsafe.go). But it's not clear why the extra level of abstraction is necessary.

  • It handles more forms of sharing than this encoder. For example, consider a pointer to a struct field like

    type S struct { X int }
    s := &S{X: 1}
    p := &s.X
    x := []interface{}{s, p}
    

    Encoding x with gvisor.dev/gvisor/pkg/state will maintain the relationship between p and s.X. This encoder will not; it only recognizes sharing when the pointers are explicit, like

    s := &S{X: 1}
    p := s
    

Documentation

Overview

Package codec implements an encoder for Go values. It relies on code generation rather than reflection, so it is significantly faster than reflection-based encoders like gob. It can also preserve sharing among pointers (but not other forms of sharing, like sub-slices).

Encodings with maps are not deterministic, due to the non-deterministic order of map iteration.

Generating Code

The package supports Go built-in types (int, string and so on) out of the box, but for any other type you must generate code by calling GenerateFile. This can be done with a small program in your project's directory:

	// file generate.go
	//+build ignore

	package main

	import (
		"log"

		"mypkg"
		"github.com/jba/codec"
	)

	func main() {
		// Generate encoders and decoders for the listed types and
		// for every type they contain.
		err := codec.GenerateFile("types.gen.go", "mypkg", nil,
			[]mypkg.Type1{}, &mypkg.Type2{})
		if err != nil {
			log.Fatal(err)
		}
	}

Code will be generated for each type listed and for all types they contain. So this program will generate code for []mypkg.Type1, mypkg.Type1, *mypkg.Type2, and mypkg.Type2.

The "//+build ignore" tag prevents the program from being compiled as part of your package. Instead, invoke it directly with "go run". Use "go generate" to do so if you like:

//go:generate go run generate.go

On subsequent runs, the generator reads the generated file to get the names and order of all struct fields. It uses this information to generate correct code when fields are moved or added. Make sure the old generated files remain available to the generator, or changes to your structs may result in existing encoded data being decoded incorrectly.

Encoding and Decoding

Create an Encoder by passing it an io.Writer:

var buf bytes.Buffer
e := codec.NewEncoder(&buf, nil)

Then use it to encode one or more values:

if err := e.Encode(x); err != nil { ... }

To decode, pass an io.Reader to NewDecoder, and call Decode:

f, err := os.Open(filename)
...
d := codec.NewDecoder(f, nil)
var value interface{}
err = d.Decode(&value) // err is already declared above, so assign with =
...

Sharing and Cycles

By default, if two pointers point to the same value, that value will be duplicated upon decoding. If there is a cycle, where a value directly or indirectly points to itself, then the encoder will crash by exceeding available stack space. This is the same behavior as encoding/gob and many other encoders.
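Since the default behavior is described as matching encoding/gob, the duplication of shared pointers can be observed with the standard library alone. The sketch below uses gob rather than this package so that it runs without generated code; the Pair type and the roundTrip helper are made up for the illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

// Pair holds two pointers that may refer to the same int.
type Pair struct {
	A, B *int
}

// roundTrip encodes p with encoding/gob and decodes it into a new Pair.
func roundTrip(p Pair) (Pair, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(p); err != nil {
		return Pair{}, err
	}
	var out Pair
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	n := 7
	in := Pair{A: &n, B: &n} // both fields point to the same int
	out, err := roundTrip(in)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("shared before:", in.A == in.B)
	fmt.Println("shared after: ", out.A == out.B)
	fmt.Println("values:", *out.A, *out.B)
}
```

After the round trip both fields still hold 7, but they point to two separate ints, so a write through one is no longer visible through the other.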

Set EncodeOptions.TrackPointers to true to preserve pointer sharing and cycles, at the cost of slower encoding.

Other forms of memory sharing are not preserved. For example, if two slices refer to the same underlying array during encoding, they will refer to separate arrays after decoding.

Struct Tags

Struct tags in the style of encoding/json are supported, under the name "codec". You can easily generate code for structs designed for the encoding/json package by changing the name to "json" in an option to GenerateFile.

An example:

type T struct {
    A int `codec:"B"`
    C int `codec:"-"`
}

Here, field A will use the name "B" and field C will be omitted. There is no need for the omitempty option because the encoder always omits zero values.

Since the encoding uses numbers for fields instead of names, renaming a field doesn't actually affect the encoding. It does matter if subsequent changes are made to the struct, however. For example, say that originally T was

type T struct {
    A int
}

but you rename the field to "B":

type T struct {
    B int
}

The generator will treat "B" as a new field. Data encoded with "A" will not be decoded into "B". So you should use a tag to express that it is a renaming:

type T struct {
    B int `codec:"A"`
}

Example

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/jba/codec"
)

func main() {
	var buf bytes.Buffer
	e := codec.NewEncoder(&buf, nil)
	for _, x := range []interface{}{1, "hello", true} {
		if err := e.Encode(x); err != nil {
			log.Fatal(err)
		}
	}

	d := codec.NewDecoder(bytes.NewReader(buf.Bytes()), nil)
	for i := 0; i < 3; i++ {
		var got interface{}
		err := d.Decode(&got)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(got)
	}
}

Output:

1
hello
true

Constants

This section is empty.

Variables

This section is empty.

Functions

func GenerateFile

func GenerateFile(filename, packagePath string, opts *GenerateOptions, values ...interface{}) error

GenerateFile writes encoders and decoders to filename. It generates code for the type of each given value, as well as any types they depend on. packagePath is the output package path.

Example

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/jba/codec"
)

func main() {
	err := codec.GenerateFile("types.gen.go", "mypkg", nil, []int{}, map[string]bool{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(err)
	os.Remove("types.gen.go")
}

Output:

<nil>

Types

type DecodeOptions

type DecodeOptions struct {
	// DisallowUnknownFields configures whether unknown struct fields are skipped
	// (the default) or cause decoding to fail immediately.
	DisallowUnknownFields bool
}

DecodeOptions holds options for Decoding.

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

A Decoder decodes a Go value encoded by an Encoder. To use a Decoder:

  • Pass NewDecoder an io.Reader over the bytes written by the Encoder.

  • Call the Decode method once for each call to Encoder.Encode.

func NewDecoder

func NewDecoder(r io.Reader, opts *DecodeOptions) *Decoder

NewDecoder creates a Decoder that reads from r.

func (*Decoder) Decode

func (d *Decoder) Decode(p interface{}) error

Decode decodes a value encoded with Encoder.Encode and stores the result in the value pointed to by p. The decoded value must be assignable to the pointee's type; no conversions are performed. Decode returns io.EOF if there are no more values.
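Because Decode signals the end of the stream with io.EOF, a stream of unknown length can be drained with a simple loop. The sketch below is written against a hypothetical valueDecoder interface that has the same Decode signature as the one shown above, so that it runs without this package; sliceDecoder is a stand-in used only to exercise the loop.

```go
package main

import (
	"fmt"
	"io"
)

// valueDecoder is a hypothetical interface with the same Decode
// signature as the Decoder documented above.
type valueDecoder interface {
	Decode(p interface{}) error
}

// decodeAll calls Decode until it returns io.EOF and collects the values.
func decodeAll(d valueDecoder) ([]interface{}, error) {
	var vals []interface{}
	for {
		var v interface{}
		err := d.Decode(&v)
		if err == io.EOF {
			return vals, nil // clean end of stream
		}
		if err != nil {
			return vals, err
		}
		vals = append(vals, v)
	}
}

// sliceDecoder is a stand-in that yields preset values, then io.EOF.
type sliceDecoder struct{ vals []interface{} }

func (s *sliceDecoder) Decode(p interface{}) error {
	if len(s.vals) == 0 {
		return io.EOF
	}
	ptr, ok := p.(*interface{})
	if !ok {
		return fmt.Errorf("unsupported target type %T", p)
	}
	*ptr = s.vals[0]
	s.vals = s.vals[1:]
	return nil
}

func main() {
	d := &sliceDecoder{vals: []interface{}{1, "hello", true}}
	vals, err := decodeAll(d)
	fmt.Println(vals, err)
}
```

This avoids hard-coding the number of values, unlike a loop that assumes a fixed count.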

type EncodeOptions

type EncodeOptions struct {
	// If TrackPointers is true, the encoder will keep track of pointers so it
	// can preserve the pointer topology of the encoded value. Cyclical and
	// shared values will decode to the same representation. If TrackPointers is
	// false, then shared pointers will decode to distinct values, and cycles
	// will result in stack overflow.
	//
	// Setting this to true will significantly slow down encoding.
	TrackPointers bool

	// If non-nil, Encode will use this buffer instead of creating one. If the
	// encoding is large, providing a buffer of sufficient size can speed up
	// encoding by reducing allocation.
	Buffer []byte
}

EncodeOptions holds options for encoding.

type Encoder

type Encoder struct {
	// contains filtered or unexported fields
}

An Encoder encodes Go values into a sequence of bytes.

func NewEncoder

func NewEncoder(w io.Writer, opts *EncodeOptions) *Encoder

NewEncoder returns an Encoder that writes to w.

func (*Encoder) Encode

func (e *Encoder) Encode(x interface{}) (err error)

Encode encodes x.

type GenerateOptions

type GenerateOptions struct {
	// FieldTag is the name that GenerateFile will use to look up
	// field tag information. The default is "codec".
	FieldTag string
}

Directories

Path              Synopsis
codecapi          Package codecapi is used by the codec package and by code generated by codec.GenerateFile.
internal/cmp      This is a package with the same name as github.com/google/go-cmp/cmp.
internal/testpkg  This is a package whose name is not the last component of its import path.
