zek

v0.1.24
Published: Feb 19, 2024 License: GPL-3.0 Imports: 12 Imported by: 4

README

zek

Zek is a prototype for creating a Go struct from an XML document. The resulting struct works best for reading XML (see also #14); to create XML, you might want to use something else.

It was developed at Leipzig University Library to shorten the time it takes to go from raw XML to a struct that gives access to the XML data in Go programs.

Skip the fluff, just the code.

Given some XML, run:

$ curl -s https://raw.githubusercontent.com/miku/zek/master/fixtures/e.xml | zek -e
// Rss was generated 2018-08-30 20:24:14 by tir on sol.
type Rss struct {
    XMLName xml.Name `xml:"rss"`
    Text    string   `xml:",chardata"`
    Rdf     string   `xml:"rdf,attr"`
    Dc      string   `xml:"dc,attr"`
    Geoscan string   `xml:"geoscan,attr"`
    Media   string   `xml:"media,attr"`
    Gml     string   `xml:"gml,attr"`
    Taxo    string   `xml:"taxo,attr"`
    Georss  string   `xml:"georss,attr"`
    Content string   `xml:"content,attr"`
    Geo     string   `xml:"geo,attr"`
    Version string   `xml:"version,attr"`
    Channel struct {
        Text          string `xml:",chardata"`
        Title         string `xml:"title"`         // ESS New Releases (Display...
        Link          string `xml:"link"`          // http://tinyurl.com/ESSNew...
        Description   string `xml:"description"`   // New releases from the Ear...
        LastBuildDate string `xml:"lastBuildDate"` // Mon, 27 Nov 2017 00:06:35...
        Item          []struct {
            Text        string `xml:",chardata"`
            Title       string `xml:"title"`       // Surficial geology, Aberde...
            Link        string `xml:"link"`        // https://geoscan.nrcan.gc....
            Description string `xml:"description"` // Geological Survey of Cana...
            Guid        struct {
                Text        string `xml:",chardata"` // 304279, 306212, 306175, 3...
                IsPermaLink string `xml:"isPermaLink,attr"`
            } `xml:"guid"`
            PubDate       string   `xml:"pubDate"`      // Fri, 24 Nov 2017 00:00:00...
            Polygon       []string `xml:"polygon"`      // 64.0000 -98.0000 64.0000 ...
            Download      string   `xml:"download"`     // https://geoscan.nrcan.gc....
            License       string   `xml:"license"`      // http://data.gc.ca/eng/ope...
            Author        string   `xml:"author"`       // Geological Survey of Cana...
            Source        string   `xml:"source"`       // Geological Survey of Cana...
            SndSeries     string   `xml:"SndSeries"`    // Bedford Institute of Ocea...
            Publisher     string   `xml:"publisher"`    // Natural Resources Canada,...
            Edition       string   `xml:"edition"`      // prelim., surficial data m...
            Meeting       string   `xml:"meeting"`      // Geological Association of...
            Documenttype  string   `xml:"documenttype"` // serial, open file, serial...
            Language      string   `xml:"language"`     // English, English, English...
            Maps          string   `xml:"maps"`         // 1 map, 5 maps, Publicatio...
            Mapinfo       string   `xml:"mapinfo"`      // surficial geology, surfic...
            Medium        string   `xml:"medium"`       // on-line; digital, digital...
            Province      string   `xml:"province"`     // Nunavut, Northwest Territ...
            Nts           string   `xml:"nts"`          // 066B, 095J; 095N; 095O; 0...
            Area          string   `xml:"area"`         // Aberdeen Lake, Mackenzie ...
            Subjects      string   `xml:"subjects"`
            Program       string   `xml:"program"`       // GEM2: Geo-mapping for Ene...
            Project       string   `xml:"project"`       // Rae Province Project Mana...
            Projectnumber string   `xml:"projectnumber"` // 340521, 343202, 340557, 3...
            Abstract      string   `xml:"abstract"`      // This new surficial geolog...
            Links         string   `xml:"links"`         // Online - En ligne (PDF, 9...
            Readme        string   `xml:"readme"`        // readme | https://geoscan....
            PPIid         string   `xml:"PPIid"`         // 34532, 35096, 35438, 2563...
        } `xml:"item"`
    } `xml:"channel"`
}
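
A struct like the one above can be handed directly to encoding/xml. The following is a minimal sketch, not part of zek: the Rss type is a hand-trimmed subset of the generated struct, and both the sample document and the parseRss helper are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// Rss is a hand-trimmed subset of the generated struct shown above.
type Rss struct {
	XMLName xml.Name `xml:"rss"`
	Version string   `xml:"version,attr"`
	Channel struct {
		Title string `xml:"title"`
		Item  []struct {
			Title string `xml:"title"`
			Link  string `xml:"link"`
		} `xml:"item"`
	} `xml:"channel"`
}

// sample is a made-up document, not the content of fixtures/e.xml.
const sample = `<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>https://example.org/1</link></item>
    <item><title>Second</title><link>https://example.org/2</link></item>
  </channel>
</rss>`

// parseRss is a hypothetical helper that unmarshals a document into Rss.
func parseRss(data []byte) (*Rss, error) {
	var doc Rss
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return &doc, nil
}

func main() {
	doc, err := parseRss([]byte(sample))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(doc.Channel.Title, doc.Version)
	// Repeated <item> elements end up in a slice.
	for _, item := range doc.Channel.Item {
		fmt.Println(item.Title, item.Link)
	}
}
```

Fields that are not needed can simply be deleted from the generated struct; encoding/xml ignores elements without a matching field.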

Online

About

Upsides:

  • it works fine for non-recursive structures,
  • does not need XSD or DTD,
  • it is relatively convenient to access attributes, children and text,
  • will generate a single struct, which makes for a quite compact representation,
  • simple user interface,
  • comments with examples,
  • schema inference across multiple files.

Downsides:

  • experimental, early, buggy, unstable prototype,
  • no support for recursive types (similar to Russian Doll strategy, [1])
  • no type inference, everything is accessible as string (without a schema, type inference may fail if the type guess is wrong)
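
Because every value arrives as a string, typed access has to be added by hand after decoding. The sketch below shows one way to do that with strconv and time; the Record and Typed types, the toTyped helper and the sample input are all hypothetical and not produced by zek.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"strconv"
	"strings"
	"time"
)

// Record mimics zek output: every value is a string.
type Record struct {
	XMLName xml.Name `xml:"record"`
	Pages   string   `xml:"pages"`
	Date    string   `xml:"date"`
}

// Typed is a hypothetical, hand-written view with concrete types.
type Typed struct {
	Pages int
	Date  time.Time
}

// toTyped converts the string fields and reports the first failure.
func toTyped(r Record) (Typed, error) {
	pages, err := strconv.Atoi(strings.TrimSpace(r.Pages))
	if err != nil {
		return Typed{}, fmt.Errorf("pages: %w", err)
	}
	date, err := time.Parse("2006-01-02", strings.TrimSpace(r.Date))
	if err != nil {
		return Typed{}, fmt.Errorf("date: %w", err)
	}
	return Typed{Pages: pages, Date: date}, nil
}

func main() {
	var r Record
	data := `<record><pages>42</pages><date>2017-11-24</date></record>`
	if err := xml.Unmarshal([]byte(data), &r); err != nil {
		log.Fatal(err)
	}
	typed, err := toTyped(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(typed.Pages, typed.Date.Year())
}
```

Keeping the conversion in a separate step means a wrong guess about a type surfaces as an explicit error instead of a failed decode.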

Bugs:

Mapping between XML elements and data structures is inherently flawed: an XML element is an order-dependent collection of anonymous values, while a data structure is an order-independent collection of named values.

https://golang.org/pkg/encoding/xml/#pkg-note-BUG
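
The order problem is easy to reproduce with the standard library alone. In this small sketch (the type A and the roundTrip helper are made up for illustration), two <b> elements surround a <c> element; after decoding and encoding again, the <b> elements are grouped together and the original interleaving is gone.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// A groups repeated <b> elements into a slice, so their position relative
// to <c> is not recorded.
type A struct {
	XMLName xml.Name `xml:"a"`
	B       []string `xml:"b"`
	C       string   `xml:"c"`
}

// roundTrip decodes a document into A and encodes it again.
func roundTrip(in string) (string, error) {
	var doc A
	if err := xml.Unmarshal([]byte(in), &doc); err != nil {
		return "", err
	}
	out, err := xml.Marshal(doc)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := roundTrip(`<a><b>1</b><c>2</c><b>3</b></a>`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out) // <a><b>1</b><b>3</b><c>2</c></a>
}
```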

Related projects:

And other awesome XML utilities.

Presentations:

Install

$ go install github.com/miku/zek/cmd/zek@latest

Debian and RPM packages:

It's in AUR, too.

Usage

$ zek -h
Usage of zek:
  -B    use a fixed banner string (e.g. for CI)
  -C    emit less compact struct
  -F    skip formatting
  -P string
        if set, write out struct within a package with the given name
  -S int
        read at most this many tags, approximately (0=unlimited)
  -c    emit more compact struct (noop, as this is the default since 0.1.7)
  -d    debug output
  -e    add comments with example
  -j    add JSON tags
  -m    omit empty Text fields
  -max-examples int
        limit number of examples (default 10)
  -n string
        use a different name for the top-level struct
  -o string
        if set, write to output file, not stdout
  -p    write out an example program
  -s    strict parsing and writing
  -t string
        emit struct for tag matching this name
  -u    filter out duplicated examples
  -version
        show version
  -x int
        max chars for example (default 25)

Examples:

$ cat fixtures/a.xml
<a></a>

$ zek -C < fixtures/a.xml
type A struct {
    XMLName xml.Name `xml:"a"`
    Text    string   `xml:",chardata"`
}

Debug output dumps the internal tree as JSON to stdout.

$ zek -d < fixtures/a.xml
{"name":{"Space":"","Local":"a"}}

Example program:

package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
	"log"
	"os"
)

// A was generated 2017-12-05 17:35:21 by tir on apollo.
type A struct {
	XMLName xml.Name `xml:"a"`
	Text    string   `xml:",chardata"`
}

func main() {
	dec := xml.NewDecoder(os.Stdin)
	var doc A
	if err := dec.Decode(&doc); err != nil {
		log.Fatal(err)
	}
	b, err := json.Marshal(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(b))
}

$ zek -C -p < fixtures/a.xml > sample.go && go run sample.go < fixtures/a.xml | jq . && rm sample.go
{
  "XMLName": {
    "Space": "",
    "Local": "a"
  },
  "Text": ""
}

More complex example:

$ zek < fixtures/d.xml
// Root was generated 2019-06-11 16:27:04 by tir on hayiti.
type Root struct {
        XMLName xml.Name `xml:"root"`
        Text    string   `xml:",chardata"`
        A       []struct {
                Text string `xml:",chardata"`
                B    []struct {
                        Text string `xml:",chardata"`
                        C    string `xml:"c"`
                        D    string `xml:"d"`
                } `xml:"b"`
        } `xml:"a"`
}

$ zek -p < fixtures/d.xml > sample.go && go run sample.go < fixtures/d.xml | jq . && rm sample.go
{
  "XMLName": {
    "Space": "",
    "Local": "root"
  },
  "Text": "\n\n\n\n",
  "A": [
    {
      "Text": "\n  \n  \n",
      "B": [
        {
          "Text": "\n    \n  ",
          "C": "Hi",
          "D": ""
        },
        {
          "Text": "\n    \n    \n  ",
          "C": "World",
          "D": ""
        }
      ]
    },
    {
      "Text": "\n  \n",
      "B": [
        {
          "Text": "\n    \n  ",
          "C": "Hello",
          "D": ""
        }
      ]
    },
    {
      "Text": "\n  \n",
      "B": [
        {
          "Text": "\n    \n  ",
          "C": "",
          "D": "World"
        }
      ]
    }
  ]
}

Annotate with comments:

$ zek -e < fixtures/l.xml
// Records was generated 2019-06-11 16:29:35 by tir on hayiti.
type Records struct {
        XMLName xml.Name `xml:"Records"`
        Text    string   `xml:",chardata"` // \n
        Xsi     string   `xml:"xsi,attr"`
        Record  []struct {
                Text   string `xml:",chardata"`
                Header struct {
                        Text       string `xml:",chardata"`
                        Status     string `xml:"status,attr"`
                        Identifier string `xml:"identifier"` // oai:ojs.localhost:article...
                        Datestamp  string `xml:"datestamp"`  // 2009-06-24T14:48:23Z, 200...
                        SetSpec    string `xml:"setSpec"`    // eppp:ART, eppp:ART, eppp:...
                } `xml:"header"`
                Metadata struct {
                        Text    string `xml:",chardata"`
                        Rfc1807 struct {
                                Text           string   `xml:",chardata"`
                                Xmlns          string   `xml:"xmlns,attr"`
                                Xsi            string   `xml:"xsi,attr"`
                                SchemaLocation string   `xml:"schemaLocation,attr"`
                                BibVersion     string   `xml:"bib-version"`  // v2, v2, v2...
                                ID             string   `xml:"id"`           // http://jou...
                                Entry          string   `xml:"entry"`        // 2009-06-24...
                                Organization   []string `xml:"organization"` // Proceeding...
                                Title          string   `xml:"title"`        // Introducti...
                                Type           string   `xml:"type"`
                                Author         []string `xml:"author"`       // KRAMPEN, G..
                                Copyright      string   `xml:"copyright"`    // Das Urhebe...
                                OtherAccess    string   `xml:"other_access"` // url:http:/...
                                Keyword        string   `xml:"keyword"`
                                Period         []string `xml:"period"`
                                Monitoring     string   `xml:"monitoring"`
                                Language       string   `xml:"language"` // en, en, en, e...
                                Abstract       string   `xml:"abstract"` // After a short...
                                Date           string   `xml:"date"`     // 2009-06-22 12...
                        } `xml:"rfc1807"`
                } `xml:"metadata"`
                About string `xml:"about"`
        } `xml:"Record"`
}

Only consider a nested element

$ zek -t metadata fixtures/z.xml
// Metadata was generated 2019-06-11 16:33:26 by tir on hayiti.
type Metadata struct {
        XMLName xml.Name `xml:"metadata"`
        Text    string   `xml:",chardata"`
        Dc      struct {
                Text  string `xml:",chardata"`
                Xmlns string `xml:"xmlns,attr"`
                Title struct {
                        Text  string `xml:",chardata"`
                        Xmlns string `xml:"xmlns,attr"`
                } `xml:"title"`
                Identifier struct {
                        Text  string `xml:",chardata"`
                        Xmlns string `xml:"xmlns,attr"`
                } `xml:"identifier"`
                Rights struct {
                        Text  string `xml:",chardata"`
                        Xmlns string `xml:"xmlns,attr"`
                        Lang  string `xml:"lang,attr"`
                } `xml:"rights"`
                AccessRights struct {
                        Text  string `xml:",chardata"`
                        Xmlns string `xml:"xmlns,attr"`
                } `xml:"accessRights"`
        } `xml:"dc"`
}
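
A struct generated for a nested element expects that element as its root, so the decoder has to be positioned on it first. One way to do this with encoding/xml is sketched below; the Metadata type is trimmed from the output above, while the findMetadata helper and the sample document are assumptions made for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"strings"
)

// Metadata is a trimmed version of the struct generated with -t metadata.
type Metadata struct {
	XMLName xml.Name `xml:"metadata"`
	Dc      struct {
		Title struct {
			Text string `xml:",chardata"`
		} `xml:"title"`
	} `xml:"dc"`
}

// sample is a made-up document, not the content of fixtures/z.xml.
const sample = `<record>
  <header><identifier>x</identifier></header>
  <metadata><dc><title>Hello</title></dc></metadata>
</record>`

// findMetadata is a hypothetical helper: it scans tokens until it sees the
// first <metadata> start tag and decodes only that element.
func findMetadata(doc string) (*Metadata, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err != nil {
			return nil, err // io.EOF if no <metadata> element exists
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "metadata" {
			var m Metadata
			if err := dec.DecodeElement(&m, &se); err != nil {
				return nil, err
			}
			return &m, nil
		}
	}
}

func main() {
	m, err := findMetadata(sample)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(m.Dc.Title.Text)
}
```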

Inference across files

$ zek fixtures/a.xml fixtures/b.xml fixtures/c.xml
// A was generated 2017-12-05 17:40:14 by tir on apollo.
type A struct {
	XMLName xml.Name `xml:"a"`
	Text    string   `xml:",chardata"`
	B       []struct {
		Text string `xml:",chardata"`
	} `xml:"b"`
}
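
The idea behind merging several inputs can be illustrated with a token walk that remembers how often a child occurs inside a single parent. This is a sketch of the general technique using only the standard library, not zek's implementation; the childCounts helper is hypothetical, and the third input document is made up (the first two match fixtures/a.xml and fixtures/b.xml as shown in this README).

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// childCounts records, for each "parent/child" pair, the maximum number of
// times the child occurred inside a single parent element, across all docs.
func childCounts(docs ...string) (map[string]int, error) {
	// frame tracks one open element and how often each child was seen in it.
	type frame struct {
		name   string
		counts map[string]int
	}
	seen := make(map[string]int)
	for _, doc := range docs {
		dec := xml.NewDecoder(strings.NewReader(doc))
		var stack []frame
		for {
			tok, err := dec.Token()
			if err == io.EOF {
				break
			}
			if err != nil {
				return nil, err
			}
			switch t := tok.(type) {
			case xml.StartElement:
				if n := len(stack); n > 0 {
					stack[n-1].counts[t.Name.Local]++
				}
				stack = append(stack, frame{name: t.Name.Local, counts: map[string]int{}})
			case xml.EndElement:
				top := stack[len(stack)-1]
				stack = stack[:len(stack)-1]
				for child, c := range top.counts {
					if key := top.name + "/" + child; c > seen[key] {
						seen[key] = c
					}
				}
			}
		}
	}
	return seen, nil
}

func main() {
	counts, err := childCounts(`<a></a>`, `<a><b></b></a>`, `<a><b></b><b></b></a>`)
	if err != nil {
		log.Fatal(err)
	}
	// A count above one suggests a slice field, as with B in the struct above.
	fmt.Println(counts["a/b"] > 1)
}
```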

This is also useful if you deal with archives containing XML files:

$ unzip -p 4082359.zip '*.xml' | zek -e

Given a directory full of zip files, you can combine find, unzip and zek:

$ for i in $(find ftp/b571 -type f -name "*zip"); do unzip -p $i '*xml'; done | zek -e

Another example (tarball with thousands of XML files, seemingly MARC):

$ tar -xOzf /tmp/20180725.125255.tar.gz | zek -e
// OAIPMH was generated 2018-09-26 15:03:29 by tir on sol.
type OAIPMH struct {
        XMLName        xml.Name `xml:"OAI-PMH"`
        Text           string   `xml:",chardata"`
        Xmlns          string   `xml:"xmlns,attr"`
        Xsi            string   `xml:"xsi,attr"`
        SchemaLocation string   `xml:"schemaLocation,attr"`
        ListRecords    struct {
                Text   string `xml:",chardata"`
                Record struct {
                        Text   string `xml:",chardata"`
                        Header struct {
                                Text       string `xml:",chardata"`
                                Identifier struct {
                                        Text string `xml:",chardata"` // aleph-pub:000000001, ...
                                } `xml:"identifier"`
                        } `xml:"header"`
                        Metadata struct {
                                Text   string `xml:",chardata"`
                                Record struct {
                                        Text           string `xml:",chardata"`
                                        Xmlns          string `xml:"xmlns,attr"`
                                        Xsi            string `xml:"xsi,attr"`
                                        SchemaLocation string `xml:"schemaLocation,attr"`
                                        Leader         struct {
                                                Text string `xml:",chardata"` // 00001nM2.01200024
                                        } `xml:"leader"`
                                        Controlfield []struct {
                                                Text string `xml:",chardata"` // 00001nM2.01200024
                                                Tag  string `xml:"tag,attr"`
                                        } `xml:"controlfield"`
                                        Datafield []struct {
                                                Text     string `xml:",chardata"`
                                                Tag      string `xml:"tag,attr"`
                                                Ind1     string `xml:"ind1,attr"`
                                                Ind2     string `xml:"ind2,attr"`
                                                Subfield []struct {
                                                        Text string `xml:",chardata"` // KM0000002
                                                        Code string `xml:"code,attr"`
                                                } `xml:"subfield"`
                                        } `xml:"datafield"`
                                } `xml:"record"`
                        } `xml:"metadata"`
                } `xml:"record"`
        } `xml:"ListRecords"`
}
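
A concatenated stream like the one piped into zek above can also be consumed with a generated struct by calling Decode in a loop until io.EOF. This is a minimal sketch under the assumption that the stream is a plain sequence of top-level elements; the Doc type, the readAll helper and the input are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// Doc stands in for a generated struct; it is not derived from a real file.
type Doc struct {
	XMLName xml.Name `xml:"doc"`
	ID      string   `xml:"id,attr"`
}

// readAll decodes one top-level element after another until the input ends.
func readAll(stream string) ([]Doc, error) {
	dec := xml.NewDecoder(strings.NewReader(stream))
	var docs []Doc
	for {
		var d Doc
		err := dec.Decode(&d)
		if err == io.EOF {
			return docs, nil
		}
		if err != nil {
			return nil, err
		}
		docs = append(docs, d)
	}
}

func main() {
	docs, err := readAll("<doc id=\"1\"></doc>\n<doc id=\"2\"></doc>\n")
	if err != nil {
		log.Fatal(err)
	}
	for _, d := range docs {
		fmt.Println(d.ID)
	}
}
```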

Generate a package

If you want to include the generated file in the build process, e.g. with go generate, you may find -P and -o helpful.

$ cat fixtures/b.xml
<a><b></b></a>

Run on the command line or via go generate:

$ zek -P mypkg -o data.go < fixtures/b.xml

This would write the following to data.go:

// Code generated by zek; DO NOT EDIT.

package mypkg

import "encoding/xml"

// A was generated 2021-09-16 11:23:06 by tir on trieste.
type A struct {
        XMLName xml.Name `xml:"a"`
        Text    string   `xml:",chardata"`
        B       string   `xml:"b"`
}

Note that any existing file will be overwritten, without any warning.
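
For the go generate route, a directive along the following lines could live in any Go file of the target package. This is a sketch under a few assumptions: zek is on the PATH, the file name and package comment are made up, and the input is passed as a file argument (as in the multi-file example above) because go generate does not run a shell and therefore cannot use "<" redirection.

```go
// Package mypkg contains types generated by zek.
package mypkg

// Assumes zek is installed and fixtures/b.xml exists relative to this
// package directory, which is where go generate runs the command.
//go:generate zek -P mypkg -o data.go fixtures/b.xml
```

Running go generate ./... would then rewrite data.go, overwriting the previous version as noted above.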

Misc

As a side effect, zek seems to be useful for debugging. Example:

This record is emitted from a typical OAI server (OJS, not even uncommon), yet one can quickly spot the flaw in the structure.

Over 30 different structs were generated manually in the course of a few hours (around five minutes per source): https://git.io/vbTDo.

-- Current extent leader: 1532 lines struct

Documentation

Index

Constants

const Version = "0.1.23"

Version of application.

Variables

var (
	// UppercaseByDefault is used during XML tag name to Go name conversion.
	// This is an opinionated list and could be made configurable.
	UppercaseByDefault = []string{
		"Id",
		"Xml",
		"eissn",
		"http",
		"id",
		"id",
		"isbn",
		"ismn",
		"issn",
		"json",
		"lccn",
		"rfc",
		"rsn",
		"svg",
		"uri",
		"url",
		"urn",
		"xml",
		"zdb",
	}
	// DefaultTextFieldNames list struct field names for chardata, most preferred first.
	DefaultTextFieldNames = []string{
		"Text",
		"Chardata",
	}
	// DefaultAttributePrefixes are used, if there are name clashes.
	DefaultAttributePrefixes = []string{
		"Attr",
		"Attribute",
	}
)

Functions

func CreateNameFunc added in v0.1.2

func CreateNameFunc(upper []string) func(string) string

CreateNameFunc returns a function that converts a tag into a canonical Go name. Strings in the given list will be upper-cased as a whole.
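
To make the conversion concrete, here is a hypothetical re-creation of the idea using only the standard library. It is not zek's code and newNameFunc is a made-up name; the rules (split on non-alphanumeric characters, capitalize each part, upper-case listed parts completely) are inferred from the field names in the examples in this README and may differ from the real implementation in edge cases.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// newNameFunc returns a converter from XML tag names to Go identifiers.
// Parts that appear in upper are upper-cased completely; all other parts
// only get their first letter capitalized.
func newNameFunc(upper []string) func(string) string {
	whole := make(map[string]bool, len(upper))
	for _, s := range upper {
		whole[s] = true
	}
	return func(tag string) string {
		// Split on anything that is neither a letter nor a digit.
		parts := strings.FieldsFunc(tag, func(r rune) bool {
			return !unicode.IsLetter(r) && !unicode.IsDigit(r)
		})
		var sb strings.Builder
		for _, p := range parts {
			if whole[p] {
				sb.WriteString(strings.ToUpper(p))
				continue
			}
			rs := []rune(p)
			rs[0] = unicode.ToUpper(rs[0])
			sb.WriteString(string(rs))
		}
		return sb.String()
	}
}

func main() {
	name := newNameFunc([]string{"id", "url", "xml"})
	for _, tag := range []string{"bib-version", "other_access", "id", "xmlns", "schemaLocation"} {
		fmt.Printf("%s -> %s\n", tag, name(tag))
	}
}
```

Matching whole parts rather than substrings is what keeps xmlns as Xmlns even though xml is in the list.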

Types

type Node

type Node struct {
	Name        xml.Name   `json:"name,omitempty"`
	Attr        []xml.Attr `json:"attr,omitempty"`
	Examples    []string   `json:"examples,omitempty"`
	Children    []*Node    `json:"children,omitempty"`
	Freqs       []int      `json:"-"` // Collect number of occurrences of this node within parent.
	MaxExamples int        `json:"-"` // Maximum number of examples to keep, gets passed to children.
	// contains filtered or unexported fields
}

Node represents an element in the XML tree. It keeps track of its name, attributes, childnodes and example chardata and basic statistics, e.g. how often a node has been seen within its parent node.

func (*Node) ByName added in v0.1.2

func (node *Node) ByName(name string) *Node

ByName finds a node in the tree (dfs) by name. Comparisons start at the current node. First match is returned. If nothing matches, nil is returned.

func (*Node) CreateOrGetChild

func (node *Node) CreateOrGetChild(name xml.Name, attr []xml.Attr) *Node

CreateOrGetChild creates a child if no child with the same tag name exists, otherwise returns the existing node with that name. We want to collect node and attribute information for a node and not replicate the XML tree.

func (*Node) End

func (node *Node) End()

End signals end of an element.

func (*Node) Height

func (node *Node) Height() int

Height returns the height of the tree. A tree with zero nodes has height zero, a single node tree has height 1.

func (*Node) IsMultivalued

func (node *Node) IsMultivalued() bool

IsMultivalued returns true, if this node appeared more than once.

func (*Node) ReadFrom

func (node *Node) ReadFrom(r io.Reader, opts *ReadOpts) (int64, error)

ReadFrom reads XML from a reader. TODO: pass read options.

type ReadOpts added in v0.1.20

type ReadOpts struct {
	MaxExamples int
	MaxTokens   int64
}

ReadOpts groups options for parsing.

type Stack

type Stack struct {
	sync.Mutex
	// contains filtered or unexported fields
}

Stack is a simple stack for arbitrary types.

func (*Stack) Len

func (s *Stack) Len() int

Len returns number of items on the stack.

func (*Stack) Peek

func (s *Stack) Peek() interface{}

Peek returns the top element without removing it. Panics if the stack is empty.

func (*Stack) Pop

func (s *Stack) Pop() interface{}

Pop item from stack. It's a panic if stack is empty.

func (*Stack) Put

func (s *Stack) Put(item interface{})

Put item onto stack.
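
The documented behaviour (last in, first out, a panic on an empty stack, an embedded sync.Mutex) can be sketched as follows. This is a hypothetical re-implementation for illustration; the unexported fields of the real Stack are not shown in the documentation, so the items slice is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// stack mirrors the documented method set of Stack.
type stack struct {
	sync.Mutex
	items []interface{} // assumed storage, not taken from zek
}

// Put pushes an item onto the stack.
func (s *stack) Put(item interface{}) {
	s.Lock()
	defer s.Unlock()
	s.items = append(s.items, item)
}

// Pop removes and returns the top item. It panics if the stack is empty.
func (s *stack) Pop() interface{} {
	s.Lock()
	defer s.Unlock()
	if len(s.items) == 0 {
		panic("pop from empty stack")
	}
	last := len(s.items) - 1
	item := s.items[last]
	s.items = s.items[:last]
	return item
}

// Peek returns the top item without removing it. It panics if the stack is empty.
func (s *stack) Peek() interface{} {
	s.Lock()
	defer s.Unlock()
	if len(s.items) == 0 {
		panic("peek on empty stack")
	}
	return s.items[len(s.items)-1]
}

// Len returns the number of items on the stack.
func (s *stack) Len() int {
	s.Lock()
	defer s.Unlock()
	return len(s.items)
}

func main() {
	s := &stack{}
	s.Put("a")
	s.Put("b")
	fmt.Println(s.Peek(), s.Len()) // b 2
	fmt.Println(s.Pop(), s.Pop())  // b a
}
```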

type StructWriter

type StructWriter struct {
	NameFunc          func(string) string // Turns xml tag names into Go names.
	TextFieldNames    []string            // Field name for chardata.
	AttributePrefixes []string            // In case of a name clash, try these prefixes.
	WithComments      bool                // Annotate struct with comments and examples.
	Banner            string              // Autogenerated note.
	ExampleMaxChars   int                 // Max length of example comment.
	Strict            bool                // Whether to ignore implementation holes.
	WithJSONTags      bool                // Include JSON struct tags.
	Compact           bool                // Emit more compact struct.
	UniqueExamples    bool                // Filter out duplicated examples
	OmitEmptyText     bool                // Don't generate Text fields if no example elements have chardata.
	// contains filtered or unexported fields
}

StructWriter can turn a node into a struct and can be configured. TODO(miku): Use templates.

func NewStructWriter

func NewStructWriter(w io.Writer) *StructWriter

NewStructWriter returns a StructWriter that can write a node to a given writer. Uses a default list of words to wholly uppercase.

func (*StructWriter) WriteNode

func (sw *StructWriter) WriteNode(node *Node) (err error)

WriteNode writes a node to a writer.
