xdiff

package module

v0.0.1 Latest Latest Go to latest Published: Dec 4, 2018 License: MIT Imports: 5 Imported by: 1

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/ajankovic/xdiff

Links

Open Source Insights

README ¶

X-Diff algorithm implemented in Go

This project should be considered as WORK IN PROGRESS. Breaking changes will most likely happen the API so refrain from using it for anything important until it reaches the stable version. X-Diff algorithm paper can be found here.

Background

By the official XML specification, XML tree model is ordered. In practice however, it's up to the users to determine how they want their data to be interpreted. So for most cases XML can be represented as unordered tree model.

Using an unordered model, change detection is substantially harder than using the ordered model, but the change result that it generates is more accurate.

X-Diff is an algorithm that proposes change detection between XML documents by treating them as unordered trees. The general problem becomes NP-hard, but by reducing the possible mappings between the documents X-Diff algorithm theoretically solves the problem in polynomial time. More precisely:

O(|T₁| * |T₂| * max{ deg(T₁), deg(T₂) } * log₂( max{ deg(T₁), deg(T₂) }))

Where |T₁| and |T₂| denote number of nodes in trees T₁ and T₂, and max{deg(T₁), deg(T₁)} denote maximum child nodes between the trees.

The core proposal and analysis of the paper are solid, but overall quality of the paper is not so good.

For example, proposed Figure 4.2 Matching Algorithm is flawed because it doesn't count for differences where tree levels are different. Also there are vague and unclear statements in the descriptions so lot is left to the interpretation (or correction) by the implementer.

About this implementation

This is a hobby project. The goal, at first, was to implement XML diffing by just following the mentioned paper and doing what's said there verbatim. But as I started understanding (and falling in love with) the problem more I have expanded the design so it's not actually a verbatim X-Diff implementation.

The core of the design mentioned in the paper will always be part of the project:

Create XTree with node signatures and hashes.
Match trees using minimum cost maximum flow and minimum cost matching.
Generate edit script from minimum cost matching.

Additions I intend to add are beyond the goal of just implementing an algorithm:

Implement custom XML parser to efficiently generate XTree. Using standard XML parser is very slow. Reasons for this are multiple, from creating copies of underling buffer to the way it's handling unicode. In my initial tests it takes over 20 seconds to process document of ~850 MB. With some tradeoffs, custom in-place parser brings this down to ~5 seconds. Improving this farther along with adding robustness to the parser will be the goal of the project in the future.
Add support for diffing directories of XML documents. Guess what else can be represented as an unordered tree, directories and files. The idea is to allow parsing of entire directories containing XML documents and representing them as part of the diffing tree. That way edit script will show operations on directories as well. For example this is useful in detecting changes to Salesforce metadata which is just bunch of XML files in directories.
Improve upon proposed design by the paper to improve performance of the algorithm. For example cited paper is vague about defining and describing how to reduce matching space before main matching of the algorithm. Goal of the project will be to define and implement those parts of the design.
Provide both library and CLI application for diffing XML documents.

Installation and Usage

The project is currently not stable enough so there are no official releases. Once design is set those will come.

For now you can use standard Go tooling to install the module:

go install github.com/ajankovic/xdiff/cmd/xdiff
// or
go get github.com/ajankovic/xdiff

CLI Usage

xdiff -left original.xml -right edited.xml

Library Usage

For a more elaborate example please check the cmd/xdiff/main.go.

import (
    "github.com/ajankovic/xdiff"
)

// Instantiate custom parser.
p := parser.New()
// Parse the XTree.
left, err := p.ParseFile(leftSource)
if err != nil {
	// handle error
}
right, err := p.ParseFile(rightSource)
if err != nil {
	// handle error
}
// Run compare algorithm on parsed trees.
diff, err := xdiff.Compare(left, right)
if err != nil {
	// handle error
}
// Output diff in plain text to the STDOUT.
enc := xdiff.PlainTextEncoder(os.Stdout)
if err := enc.Encode(diff); err != nil {
	// handle error
}

Author and Attribution

Owner: Aleksandar Janković (office@ajankovic.com)

Attribution:

Documentation ¶

Overview ¶

Package xdiff provides an implementation of the X-Diff algorithm used for finding minimum edit script between two xml documents.

Index ¶

type Delta
- func Compare(left *xtree.Node, right *xtree.Node) ([]Delta, error)
type Operation
- func (i Operation) String() string
type TextEncoder
- func NewTextEncoder(w io.Writer) *TextEncoder
- func (pte *TextEncoder) Encode(deltas []Delta) error

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Delta ¶

type Delta struct {
	Operation Operation
	Subject   *xtree.Node
	Object    *xtree.Node
}

Delta is a unit of change to the original doc that would change it into edited document.

func Compare ¶

func Compare(left *xtree.Node, right *xtree.Node) ([]Delta, error)

Compare generates slice of deltas that forms minimum-cost edit script to transform the left xtree into the right xtree.

type Operation ¶

type Operation int

Operation defines possible modifying operations of the edit script.

const (
	// Insert leaf node.
	Insert Operation = iota + 1
	// Update leaf node value.
	Update
	// Delete leaf node.
	Delete
	// InsertSubtree operation.
	InsertSubtree
	// DeleteSubtree operation.
	DeleteSubtree
)

func (Operation) String ¶

func (i Operation) String() string

type TextEncoder ¶

type TextEncoder struct {
	// contains filtered or unexported fields
}

TextEncoder knows how to convert edit script to plain text.

func NewTextEncoder ¶

func NewTextEncoder(w io.Writer) *TextEncoder

NewTextEncoder creates new text encoder.

func (*TextEncoder) Encode ¶

func (pte *TextEncoder) Encode(deltas []Delta) error

Encode sends edit script in plain text to the stream.

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
cmd
xdiff
parser Package parser provides implementations of xml parsers.	Package parser provides implementations of xml parsers.
xtree

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL