xml

package module

v0.0.1 (latest) · Published: Jun 19, 2020 · License: Apache-2.0 · Imports: 8 · Imported by: 0

Repository

github.com/Goodwine/go-xml

README ¶

go-xml

Go XML parsing library alternative to encoding/xml

Disclaimer: This is not an official Google product.

This package uses buffers and reusable instances to reduce the time spent and the number of allocations made during parsing, which helps on resource-constrained systems. The package is mostly a drop-in replacement for encoding/xml, with the exception of a few changed variable names.

Features

Optionally normalizes CharData whitespace
Optionally read Comment, ProcInst, and Directive contents

Not implemented yet

The library already allows manually unmarshaling well-formed XML files, but the following features are not implemented yet, so the library should be used with caution. Using it on a critical production system is not advised.

Option to disable whitespace normalization on CharData
Option to get ProcInst contents
Support attribute values without quotes, like <foo bar=baz>
Better Comment end-token (-->) validation
Support xml: struct tags
Marshal et al
Unmarshal et al
Encode et al
decodeElement
Optionally decode HTML entities like &quot; or &lt;
Better error handling - currently assumes proper format with only a few validations
Catch mismatching start/close tags.

Comparison

Notes on encoding/xml

The encoding/xml package has a couple proposals based on encoding/json where buffers are added for names. Additionally there are a few tricks to reduce resource use that can be applied on the Unmarshalling implementation by avoiding Unmarshal() and manually decoding the XML. All of the above can lead to 50% reduction on resources just within the standard library's package.

The techniques used in this library are well-known performance boosts that could be applied to encoding/xml. However, Go 1.0 has a backwards compatibility promise that applies to all standard libraries. These techniques would require breaking this compatibility promise and therefore can't be used in encoding/xml to their full potential.

Changes on go-xml

go-xml is essentially the same implementation (although not all features are supported), but instead of returning new token instances, it returns a buffered value. Additional buffers are also implemented for identifier tag and attribute names, as well as attribute objects themselves. These buffers are based off the proposals mentioned above.

A key difference is go-xml tokens are not supposed to be stored. The token instances are pointers and the values change every time decoder.Token() is called. You must token.Copy() the value if you want to store it, but this should rarely be the case as the token should be evaluated as soon as it's received.
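
To see why storing a reused token goes wrong, here is a minimal, self-contained sketch. It does not use go-xml itself: `token`, `reusingDecoder`, and `collect` are hypothetical stand-ins that only mimic the "same instance returned on every call" behavior described above.

```go
package main

import "fmt"

// token is a hypothetical stand-in for a decoder-owned, reusable token.
type token struct {
	Data []byte
}

// Copy returns an independent token whose bytes do not alias the decoder's buffer.
func (t *token) Copy() *token {
	return &token{Data: append([]byte(nil), t.Data...)}
}

// reusingDecoder is a toy decoder that hands out the same token instance
// on every call and overwrites its contents in place.
type reusingDecoder struct {
	words []string
	pos   int
	tok   token
}

// Token returns the next word, or nil once the input is exhausted.
func (d *reusingDecoder) Token() *token {
	if d.pos >= len(d.words) {
		return nil
	}
	d.tok.Data = append(d.tok.Data[:0], d.words[d.pos]...)
	d.pos++
	return &d.tok
}

// collect stores every token twice: once by pointer and once via Copy.
func collect(words []string) (aliased, copied []string) {
	d := &reusingDecoder{words: words}
	var kept, safe []*token
	for t := d.Token(); t != nil; t = d.Token() {
		kept = append(kept, t)        // wrong: every entry is the same instance
		safe = append(safe, t.Copy()) // right: snapshot before the next Token call
	}
	for i := range kept {
		aliased = append(aliased, string(kept[i].Data))
		copied = append(copied, string(safe[i].Data))
	}
	return aliased, copied
}

func main() {
	aliased, copied := collect([]string{"one", "two", "six"})
	fmt.Println(aliased) // [six six six]
	fmt.Println(copied)  // [one two six]
}
```

In most loops neither slice is needed: reading the fields you care about (for example converting bytes to a string) before asking for the next token avoids the copy altogether.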

benchstat

Note: These benchmarks compare Raw tokenization between both libraries. Even though the library isn't fully implemented, these improvements will carry over and snowball once implementation is complete.

The benchmark reads an XML message bundle with 75k entries (30MB). It compares go-xml against encoding/xml with the tricks applied AND the performance improvement proposals mentioned above patched in.

name                       time/op
DecodeAll/go-xml-16         398ms ± 1%
DecodeAll/encoding/xml-16   452ms ± 2%

name                       alloc/op
DecodeAll/go-xml-16        18.1MB ± 0%
DecodeAll/encoding/xml-16  76.9MB ± 0%

name                       allocs/op
DecodeAll/go-xml-16          676k ± 0%
DecodeAll/encoding/xml-16   1.99M ± 0%

Code of Conduct

Same as Go

Contributing

Read our contributions doc.

tl;dr: Get Google's CLA and send a PR!

Licence

Apache 2.0

Documentation ¶

Overview ¶

Package xml is an alternative to the standard library `encoding/xml` package.

This package uses buffers and reusable object instances during unmarshalling to reduce allocations, struct initialization, and the cost of Go's copy-by-value behavior. This saves a considerable amount of resources on constrained systems.

The library is still incomplete; see the repository's README. It should nonetheless be ready for production use if you are currently unmarshalling by manually extracting tokens out of the decoder.

10-34% faster
76% less allocated memory
66% fewer memory allocations

Example (ManualDecodingWithTokens) ¶

This example demonstrates how to decode an XML file using manual tokenization into an object, and how to terminate the read-parse loop.

const data = `
	<msg id="123" desc="flying mammal">
		Bat
	</msg>
	<msg id="456" desc="baseball item">
		Bat
	</msg>
	`

type Msg struct {
	ID       string
	Desc     string
	Contents string
}

var msgs []Msg
var msg Msg
d := xml.NewDecoder(strings.NewReader(data))
for {
	tok, err := d.Token()
	if err != nil {
		// Decoding completes when EOF is returned.
		if errors.Is(err, io.EOF) {
			break
		}
		log.Fatal(err)
	}

	switch tok := tok.(type) {
	case *xml.StartTag:
		if tok.Name.Local() != "msg" {
			log.Fatalf("unexpected start tag: %s", tok.Name.Local())
		}
		for _, attr := range tok.Attr {
			switch attr.Name.Local() {
			case "id":
				msg.ID = attr.Value
			case "desc":
				msg.Desc = attr.Value
			}
		}
	case *xml.CloseTag:
		if tok.Name.Local() != "msg" {
			log.Fatalf("unexpected close tag: %s", tok.Name.Local())
		}
		msgs = append(msgs, msg)
		msg = Msg{}
	case *xml.CharData:
		msg.Contents = string(tok.Data)
	default:
		log.Fatalf("unexpected token: %T", tok)
	}
}

for _, m := range msgs {
	fmt.Printf("Msg{ID: '%s', Desc: '%s', Contents: '%s'}\n", m.ID, m.Desc, m.Contents)
}

Output:

Msg{ID: '123', Desc: 'flying mammal', Contents: ' Bat '}
Msg{ID: '456', Desc: 'baseball item', Contents: ' Bat '}

Constants ¶

const (
	// UnexpectedChar is returned when an unexpected character appears outside of an attribute
	// value or CharData token.
	UnexpectedChar decodeError = "unexpected char"
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Attr ¶

type Attr struct {
	Name  *Name
	Value string
}

Attr is a tag attribute. For example, <foo bar="baz"> produces an Attr with name "bar" and value "baz".

type CharData ¶

type CharData struct {
	Data []byte
}

CharData contains a text node

func (*CharData) Copy ¶

func (t *CharData) Copy() Token

type CloseTag ¶

type CloseTag struct {
	Name *Name
}

CloseTag is a closing XML tag </tag>

func (*CloseTag) Copy ¶

func (t *CloseTag) Copy() Token

type Comment ¶

type Comment struct {
	// Data contains the contents of the comment. It is empty by default.
	//
	// Enable `d.ReadComment` to include the contents in the token.
	Data []byte
}

Comment has the format <!-- -->

It can have two or more `-` at the beginning, but it must have two `-` at the end.

func (*Comment) Copy ¶

func (t *Comment) Copy() Token

type Decoder ¶

type Decoder struct {
	// ReadComment enables reading and returning back the comment contents. Otherwise returns an empty
	// node. Disabled by default.
	ReadComment bool

	// ReadDirective enables reading and returning back the directive contents. Otherwise returns an
	// empty node. Disabled by default.
	//
	// Note that we DO NOT process directives, we simply return back the string within `<! ... >`
	ReadDirective bool
	// contains filtered or unexported fields
}

Decoder processes an XML input and generates tokens or processes into a given struct.

func NewDecoder ¶

func NewDecoder(r io.Reader) *Decoder

NewDecoder instantiates a Decoder to process a Reader input.

func (*Decoder) Token ¶

func (d *Decoder) Token() (Token, error)

Token will decode the next token from the current XML position.

The token is meant to be processed BEFORE the next call to Token. The decoder reuses token instances, so the contents of previously returned tokens can change at any time during tokenization.

type Directive ¶

type Directive struct {
	// Data contains the contents of the directive. It is empty by default.
	//
	// Enable `d.ReadDirective` to include the contents in the token.
	Data []byte
}

Directive has the format <! ... >

Note: We do NOT process the directive token. We only read it.

func (*Directive) Copy ¶

func (t *Directive) Copy() Token

type Name ¶

type Name struct {
	// contains filtered or unexported fields
}

Name stores an identifier name from either a tag or an attribute. For example, <foo bar="baz"> generates the name "foo" for the tag and "bar" for the attribute.

func (*Name) Local ¶

func (n *Name) Local() string

Local returns the identifier name without XML namespace.

For example, <a:b> generates the local name "b" with namespace "a". This method will return "b".

func (*Name) Space ¶ added in v0.0.1

func (n *Name) Space() string

Space returns the XML namespace of the identifier, without the local name.

For example, <a:b> generates the local name "b" with namespace "a". This method will return "a".
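
The relationship between the two parts of a qualified name can be sketched as follows. `splitName` is a hypothetical helper written for this illustration, not the library's implementation. Returning an empty namespace for a name without a colon is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// splitName is a hypothetical helper: it splits a qualified name such as
// "a:b" into its namespace part ("a") and local part ("b").
// Assumption: a name without a colon has an empty namespace part.
func splitName(qualified string) (space, local string) {
	i := strings.IndexByte(qualified, ':')
	if i < 0 {
		return "", qualified
	}
	return qualified[:i], qualified[i+1:]
}

func main() {
	space, local := splitName("a:b")
	fmt.Printf("space=%q local=%q\n", space, local) // space="a" local="b"
}
```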

type ProcInst ¶

type ProcInst struct{}

ProcInst has the format <? ... ?>

func (*ProcInst) Copy ¶

func (t *ProcInst) Copy() Token

type StartTag ¶

type StartTag struct {
	Name *Name
	Attr []*Attr
}

StartTag is an opening XML tag <tag>

func (*StartTag) Copy ¶

func (s *StartTag) Copy() Token

type Token ¶

type Token interface {

	// Copy the token into a new instance.
	//
	// Token instances are constantly modified by the decoding process; this function makes a copy
	// for the unlikely case when the token value must be stored, and for testing!
	Copy() Token
	// contains filtered or unexported methods
}

Token represents an XML Token:

StartTag:  <foo> or <foo />
CloseTag:  </foo>, and implicitly <foo /> too
Comment:   <!-- foo -->
ProcInst:  <? foo ?>
Directive: <! foo >
CharData:  Any string outside of angle brackets <>
