xml

package module
v0.0.1 Latest
Published: Jun 19, 2020 License: Apache-2.0 Imports: 8 Imported by: 0

README

go-xml

Go XML parsing library alternative to encoding/xml

Disclaimer: This is not an official Google product.

This package uses buffers and reusable instances to reduce the amount of time and the number of allocations done during parsing for systems that are resource constrained. The package is mostly a drop-in replacement, with the exception of a few variable names that change.

Features

  • Optionally normalizes CharData whitespace
  • Optionally read Comment, ProcInst, and Directive contents
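
The package example's output shows the text node of `<msg>` reduced to ' Bat ', which suggests that whitespace runs are collapsed to a single space rather than trimmed. The library's exact rule is not documented here, so the following is only a minimal sketch of that assumed behavior; `normalizeSpace` is a hypothetical helper, not part of go-xml.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeSpace collapses every run of Unicode whitespace into one ASCII
// space. It does not trim, so a leading or trailing run survives as a single
// space. This rule is an assumption inferred from the example output.
func normalizeSpace(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	inSpace := false
	for _, r := range s {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ')
			}
			inSpace = true
			continue
		}
		b.WriteRune(r)
		inSpace = false
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", normalizeSpace("\n\t\tBat\n\t")) // prints " Bat "
}
```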

Not implemented yet

The library already allows manually unmarshaling well-formed XML files, but the following features are not implemented yet. Use it with caution; running it on a critical production system is not advised.

  • Option to disable whitespace normalization on CharData
  • Option to get ProcInst contents
  • Support attribute values without quotes, like <foo bar=baz>
  • Better Comment end-token (-->) validation
  • Support xml: struct tags
  • Marshal et al
  • Unmarshal et al
  • Encode et al
  • decodeElement
  • Optionally decode html entities like &quot; or &lt;
  • Better error handling - currently assumes proper format with only a few validations
  • Catch mismatching start/close tags.

Comparison

Notes on encoding/xml

The encoding/xml package has a couple of proposals, based on encoding/json, where buffers are added for names. Additionally, a few tricks can reduce resource use on the unmarshalling side by avoiding Unmarshal() and decoding the XML manually. Together these can lead to a 50% reduction in resource use within the standard library's package alone.

The techniques used in this library are well-known performance boosts that could be applied to encoding/xml. However, Go 1 has a backwards compatibility promise that applies to all standard libraries. These techniques would require breaking this compatibility promise and therefore can't be used in encoding/xml to their full potential.

Changes on go-xml

go-xml is essentially the same implementation (although not all features are supported), but instead of returning new token instances, it returns a buffered value. Additional buffers are also implemented for identifier tag and attribute names, as well as attribute objects themselves. These buffers are based off the proposals mentioned above.

A key difference is that go-xml tokens are not supposed to be stored. The token instances are pointers, and their values change every time decoder.Token() is called. You must call token.Copy() if you want to store the value, but this should rarely be necessary because the token should be evaluated as soon as it's received.
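
The hazard of storing a reused token can be shown with a small self-contained model. The `charData` and `reusingDecoder` types below are hypothetical stand-ins that only mimic the reuse pattern described above; they are not go-xml's implementation.

```go
package main

import "fmt"

// charData is a hypothetical token type holding a byte buffer.
type charData struct{ Data []byte }

// Copy returns an independent deep copy that later decoding cannot change.
func (c *charData) Copy() *charData {
	return &charData{Data: append([]byte(nil), c.Data...)}
}

// reusingDecoder hands out one token instance over and over.
type reusingDecoder struct {
	inputs []string
	tok    charData // the single reusable token instance
}

// Token overwrites the shared instance and returns a pointer to it.
func (d *reusingDecoder) Token() (*charData, bool) {
	if len(d.inputs) == 0 {
		return nil, false
	}
	d.tok.Data = append(d.tok.Data[:0], d.inputs[0]...)
	d.inputs = d.inputs[1:]
	return &d.tok, true
}

func main() {
	d := &reusingDecoder{inputs: []string{"first", "other"}}
	t1, _ := d.Token()
	kept := t1.Copy() // snapshot taken before the next Token call
	t2, _ := d.Token()

	fmt.Println(t1 == t2)          // true: both point at the same instance
	fmt.Println(string(t1.Data))   // other: the stored pointer was overwritten
	fmt.Println(string(kept.Data)) // first: the copy is unaffected
}
```

Reusing one instance avoids an allocation per token, at the cost of requiring an explicit copy whenever a value must outlive the next call.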

benchstat

Note: These benchmarks compare Raw tokenization between both libraries. Even though the library isn't fully implemented, these improvements will carry over and snowball once implementation is complete.

The benchmark reads an XML message bundle with 75k entries (30MB). It compares go-xml against encoding/xml with both the tricks and the performance improvement proposals mentioned above applied.

name                       time/op
DecodeAll/go-xml-16         398ms ± 1%
DecodeAll/encoding/xml-16   452ms ± 2%

name                       alloc/op
DecodeAll/go-xml-16        18.1MB ± 0%
DecodeAll/encoding/xml-16  76.9MB ± 0%

name                       allocs/op
DecodeAll/go-xml-16          676k ± 0%
DecodeAll/encoding/xml-16   1.99M ± 0%
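
For readers who want to take comparable measurements, the following is a minimal sketch of a raw-tokenization benchmark for the `encoding/xml` side only. The synthetic input and the `countRawTokens` helper are assumptions made for illustration; the 30MB bundle and the actual benchmark code behind the numbers above are not reproduced here.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
	"testing"
)

// countRawTokens tokenizes data with encoding/xml and returns the token count.
func countRawTokens(data []byte) (int, error) {
	d := xml.NewDecoder(bytes.NewReader(data))
	n := 0
	for {
		_, err := d.RawToken()
		if errors.Is(err, io.EOF) {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	// Synthetic input standing in for a real message bundle.
	data := []byte("<msgs>" + strings.Repeat(`<msg id="1">Bat</msg>`, 1000) + "</msgs>")

	// testing.Benchmark runs the function outside of "go test".
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if _, err := countRawTokens(data); err != nil {
				b.Fatal(err)
			}
		}
	})
	fmt.Println(res.String(), res.MemString())
}
```

Saving the output of several runs per implementation and feeding the files to benchstat produces tables in the format shown above.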

Code of Conduct

Same as Go

Contributing

Read our contributions doc.

tl;dr: Get Google's CLA and send a PR!

Licence

Apache 2.0

Documentation

Overview

Package xml is an alternative to the standard library `encoding/xml` package.

This package uses buffers and reusable object instances during unmarshalling to reduce allocations, struct initialization, and the cost of Go's copy-by-value behavior. This saves a considerable amount of resources on constrained systems.

The library is still incomplete; see the repository's README. It should nevertheless be ready for production use, assuming you currently unmarshal by manually extracting tokens out of the decoder.

10-34% faster
76% less allocated memory
66% fewer memory allocations

Example (ManualDecodingWithTokens)

This example demonstrates how to decode an XML file using manual tokenization into an object, and how to terminate the read-parse loop.

const data = `
	<msg id="123" desc="flying mammal">
		Bat
	</msg>
	<msg id="456" desc="baseball item">
		Bat
	</msg>
	`

type Msg struct {
	ID       string
	Desc     string
	Contents string
}

var msgs []Msg
var msg Msg
d := xml.NewDecoder(strings.NewReader(data))
for {
	tok, err := d.Token()
	if err != nil {
		// Decoding completes when EOF is returned.
		if errors.Is(err, io.EOF) {
			break
		}
		log.Fatal(err)
	}

	switch tok := tok.(type) {
	case *xml.StartTag:
		if tok.Name.Local() != "msg" {
			log.Fatalf("unexpected start tag: %s", tok.Name.Local())
		}
		for _, attr := range tok.Attr {
			switch attr.Name.Local() {
			case "id":
				msg.ID = attr.Value
			case "desc":
				msg.Desc = attr.Value
			}
		}
	case *xml.CloseTag:
		if tok.Name.Local() != "msg" {
			log.Fatalf("unexpected close tag: %s", tok.Name.Local())
		}
		msgs = append(msgs, msg)
		msg = Msg{}
	case *xml.CharData:
		msg.Contents = string(tok.Data)
	default:
		log.Fatalf("unexpected token: %T", tok)
	}
}

for _, m := range msgs {
	fmt.Printf("Msg{ID: '%s', Desc: '%s', Contents: '%s'}\n", m.ID, m.Desc, m.Contents)
}

Output:

Msg{ID: '123', Desc: 'flying mammal', Contents: ' Bat '}
Msg{ID: '456', Desc: 'baseball item', Contents: ' Bat '}

Constants

const (
	// UnexpectedChar is returned when an unexpected character appears outside of an attribute
	// value or CharData token.
	UnexpectedChar decodeError = "unexpected char"
)

Types

type Attr

type Attr struct {
	Name  *Name
	Value string
}

Attr is a tag attribute. For example, <foo bar="baz"> yields an Attr with name "bar" and value "baz".

type CharData

type CharData struct {
	Data []byte
}

CharData contains a text node

func (*CharData) Copy

func (t *CharData) Copy() Token

type CloseTag

type CloseTag struct {
	Name *Name
}

CloseTag is a closing XML tag </tag>

func (*CloseTag) Copy

func (t *CloseTag) Copy() Token

type Comment

type Comment struct {
	// Data contains the contents of the comment. It is empty by default.
	//
	// Enable `d.ReadComment` to include the contents in the token.
	Data []byte
}

Comment has the format <!-- -->

It can have two or more `-` at the beginning, but it must have two `-` at the end.

func (*Comment) Copy

func (t *Comment) Copy() Token

type Decoder

type Decoder struct {
	// ReadComment enables reading and returning back the comment contents. Otherwise returns an empty
	// node. Disabled by default.
	ReadComment bool

	// ReadDirective enables reading and returning back the directive contents. Otherwise returns an
	// empty node. Disabled by default.
	//
	// Note that we DO NOT process directives, we simply return back the string within `<! ... >`
	ReadDirective bool
	// contains filtered or unexported fields
}

Decoder processes an XML input and generates tokens or processes into a given struct.

func NewDecoder

func NewDecoder(r io.Reader) *Decoder

NewDecoder instantiates a Decoder to process a Reader input.

func (*Decoder) Token

func (d *Decoder) Token() (Token, error)

Token will decode the next token from the current XML position.

Each token is meant to be processed BEFORE Token is called again. The decoder reuses token instances, so the contents of a previously returned token can change at any time during tokenization.

type Directive

type Directive struct {
	// Data contains the contents of the directive. It is empty by default.
	//
	// Enable `d.ReadDirective` to include the contents in the token.
	Data []byte
}

Directive has the format <! ... >

Note: We do NOT process the directive token. We only read it.

func (*Directive) Copy

func (t *Directive) Copy() Token

type Name

type Name struct {
	// contains filtered or unexported fields
}

Name stores an identifier name from either a tag or an attribute. For example, <foo bar="baz"> generates the name "foo" for the tag and "bar" for the attribute.

func (*Name) Local

func (n *Name) Local() string

Local returns the identifier name without XML namespace.

For example, <a:b> generates the local name "b" with namespace "a". This method will return "b".

func (*Name) Space added in v0.0.1

func (n *Name) Space() string

Space returns the XML namespace of the identifier name.

For example, <a:b> generates the local name "b" with namespace "a". This method will return "a".
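
The relationship between Local and Space can be pictured as splitting a qualified name at the colon. `splitName` below is a hypothetical helper written for illustration; how go-xml stores and splits names internally is not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// splitName splits a qualified XML name of the form "space:local" at the
// first colon. A name without a colon has an empty space.
func splitName(qualified string) (space, local string) {
	i := strings.IndexByte(qualified, ':')
	if i < 0 {
		return "", qualified
	}
	return qualified[:i], qualified[i+1:]
}

func main() {
	space, local := splitName("a:b")
	fmt.Printf("space=%q local=%q\n", space, local) // space="a" local="b"
}
```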

type ProcInst

type ProcInst struct{}

ProcInst has the format <? ... ?>

func (*ProcInst) Copy

func (t *ProcInst) Copy() Token

type StartTag

type StartTag struct {
	Name *Name
	Attr []*Attr
}

StartTag is an opening XML tag <tag>

func (*StartTag) Copy

func (s *StartTag) Copy() Token

type Token

type Token interface {

	// Copy the token into a new instance.
	//
	// Tokens instances are constantly modified by the decoding process, this function makes a copy
	// for the unlikely case when the token value must be stored, and for testing!
	Copy() Token
	// contains filtered or unexported methods
}

Token represents an XML Token:

StartTag:  <foo> or <foo />
CloseTag:  </foo>, or implicitly <foo /> too
Comment:   <!-- foo -->
ProcInst:  <? foo ?>
Directive: <! foo >
CharData:  Any string outside of angle brackets <>
