xdiff

package module
v0.0.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 4, 2018 License: MIT Imports: 5 Imported by: 1

README

X-Diff algorithm implemented in Go

GoDoc Badge GoReportCard Badge

This project should be considered as WORK IN PROGRESS. Breaking changes will most likely happen the API so refrain from using it for anything important until it reaches the stable version. X-Diff algorithm paper can be found here.

Background

By the official XML specification, XML tree model is ordered. In practice however, it's up to the users to determine how they want their data to be interpreted. So for most cases XML can be represented as unordered tree model.

Using an unordered model, change detection is substantially harder than using the ordered model, but the change result that it generates is more accurate.

X-Diff is an algorithm that proposes change detection between XML documents by treating them as unordered trees. The general problem becomes NP-hard, but by reducing the possible mappings between the documents X-Diff algorithm theoretically solves the problem in polynomial time. More precisely:

O(|T1| * |T2| * max{ deg(T1), deg(T2) } * log2( max{ deg(T1), deg(T2) }))

Where |T1| and |T2| denote number of nodes in trees T1 and T2, and max{deg(T1), deg(T1)} denote maximum child nodes between the trees.

The core proposal and analysis of the paper are solid, but overall quality of the paper is not so good.

For example, proposed Figure 4.2 Matching Algorithm is flawed because it doesn't count for differences where tree levels are different. Also there are vague and unclear statements in the descriptions so lot is left to the interpretation (or correction) by the implementer.

About this implementation

This is a hobby project. The goal, at first, was to implement XML diffing by just following the mentioned paper and doing what's said there verbatim. But as I started understanding (and falling in love with) the problem more I have expanded the design so it's not actually a verbatim X-Diff implementation.

The core of the design mentioned in the paper will always be part of the project:

  1. Create XTree with node signatures and hashes.
  2. Match trees using minimum cost maximum flow and minimum cost matching.
  3. Generate edit script from minimum cost matching.

Additions I intend to add are beyond the goal of just implementing an algorithm:

  1. Implement custom XML parser to efficiently generate XTree. Using standard XML parser is very slow. Reasons for this are multiple, from creating copies of underling buffer to the way it's handling unicode. In my initial tests it takes over 20 seconds to process document of ~850 MB. With some tradeoffs, custom in-place parser brings this down to ~5 seconds. Improving this farther along with adding robustness to the parser will be the goal of the project in the future.

  2. Add support for diffing directories of XML documents. Guess what else can be represented as an unordered tree, directories and files. The idea is to allow parsing of entire directories containing XML documents and representing them as part of the diffing tree. That way edit script will show operations on directories as well. For example this is useful in detecting changes to Salesforce metadata which is just bunch of XML files in directories.

  3. Improve upon proposed design by the paper to improve performance of the algorithm. For example cited paper is vague about defining and describing how to reduce matching space before main matching of the algorithm. Goal of the project will be to define and implement those parts of the design.

  4. Provide both library and CLI application for diffing XML documents.

Installation and Usage

The project is currently not stable enough so there are no official releases. Once design is set those will come.

For now you can use standard Go tooling to install the module:

go install github.com/ajankovic/xdiff/cmd/xdiff
// or
go get github.com/ajankovic/xdiff
CLI Usage
xdiff -left original.xml -right edited.xml
Library Usage

For a more elaborate example please check the cmd/xdiff/main.go.

import (
    "github.com/ajankovic/xdiff"
)

// Instantiate custom parser.
p := parser.New()
// Parse the XTree.
left, err := p.ParseFile(leftSource)
if err != nil {
	// handle error
}
right, err := p.ParseFile(rightSource)
if err != nil {
	// handle error
}
// Run compare algorithm on parsed trees.
diff, err := xdiff.Compare(left, right)
if err != nil {
	// handle error
}
// Output diff in plain text to the STDOUT.
enc := xdiff.PlainTextEncoder(os.Stdout)
if err := enc.Encode(diff); err != nil {
	// handle error
}

Author and Attribution

Owner: Aleksandar Janković (office@ajankovic.com)

Attribution:

Documentation

Overview

Package xdiff provides an implementation of the X-Diff algorithm used for finding minimum edit script between two xml documents.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Delta

type Delta struct {
	Operation Operation
	Subject   *xtree.Node
	Object    *xtree.Node
}

Delta is a unit of change to the original doc that would change it into edited document.

func Compare

func Compare(left *xtree.Node, right *xtree.Node) ([]Delta, error)

Compare generates slice of deltas that forms minimum-cost edit script to transform the left xtree into the right xtree.

type Operation

type Operation int

Operation defines possible modifying operations of the edit script.

const (
	// Insert leaf node.
	Insert Operation = iota + 1
	// Update leaf node value.
	Update
	// Delete leaf node.
	Delete
	// InsertSubtree operation.
	InsertSubtree
	// DeleteSubtree operation.
	DeleteSubtree
)

func (Operation) String

func (i Operation) String() string

type TextEncoder

type TextEncoder struct {
	// contains filtered or unexported fields
}

TextEncoder knows how to convert edit script to plain text.

func NewTextEncoder

func NewTextEncoder(w io.Writer) *TextEncoder

NewTextEncoder creates new text encoder.

func (*TextEncoder) Encode

func (pte *TextEncoder) Encode(deltas []Delta) error

Encode sends edit script in plain text to the stream.

Directories

Path Synopsis
cmd
Package parser provides implementations of xml parsers.
Package parser provides implementations of xml parsers.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL