deepdiff

package module
v0.2.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 4, 2021 License: LGPL-3.0 Imports: 13 Imported by: 5

README

deepdiff

Qri GoDoc License Codecov CI Go Report Card

deepdiff is a structured data differ that aims for near-linear time complexity. It's intended to calculate differences & apply patches to structured data ranging from 0-500MBish of encoded JSON.

Diffing structured data carries additional complexity when compared to the standard unix diff utility, which operates on lines of text. By using the structure of data itself, deepdiff is able to provide a rich description of changes that maps onto the structure of the data itself. deepdiff ignores semantically irrelevant changes like whitespace, and can isolate changes like column changes to tabular data to only the relevant switches

Most algorithms in this space have quadratic time complexity, which from our testing makes them very slow on 3MB JSON documents and unable to complete on 5MB or more. deepdiff currently hovers around the 0.9Sec/MB range on 4 core processors

Instead of operating on JSON directly, deepdiff operates on document trees consisting of the go types created by unmarshaling from JSON, which are two complex types:

  map[string]interface{}
  []interface{}

and five scalar types:

  string, int, float64, bool, nil

By operating on native go types deepdiff can compare documents encoded in different formats, for example decoded CSV or CBOR.

deepdiff is based off an algorithm designed for diffing XML documents outlined in Detecting Changes in XML Documents by Grégory Cobéna & Amélie Marian

It's been adapted to fit purposes of diffing for Qri: https://github.com/qri-io/qri, folding in parallelism primitives afforded by the go language

deepdiff also includes a tool for applying patches, see documentation for details.

Project Status:

👷♀ 👷♂ This is a very new project that hasn't been properly vetted in testing enviornments. Issues/PRs welcome & appriciated. 👷♀ 👷♂

Benchmarks

Run on a 4 core MacBook Pro:

$ go test -bench . --run XXX -v --benchmem
goos: darwin
goarch: amd64
pkg: github.com/qri-io/deepdiff
BenchmarkDiff1-4          	   20000	     88167 ns/op	   13324 B/op	     496 allocs/op
BenchmarkDiffDatasets-4   	    5000	    241119 ns/op	   53367 B/op	    1614 allocs/op
BenchmarkDiff5MB-4        	       1	4357009141 ns/op	783217944 B/op	29952860 allocs/op
PASS
ok  	github.com/qri-io/deepdiff	8.369s
Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

Basic Usage

Here's a quick example pulled from the godoc:

package main

import (
	"encoding/json"
  "fmt"
  
  "github.com/qri-io/deepdiff"
)

// start with two slightly different json documents
var aJSON = []byte(`{
  "a": 100,
  "foo": [1,2,3],
  "bar": false,
  "baz": {
    "a": {
      "b": 4,
      "c": false,
      "d": "apples-and-oranges"
    },
    "e": null,
    "g": "apples-and-oranges"
  }
}`)

var bJSON = []byte(`{
  "a": 99,
  "foo": [1,2,3],
  "bar": false,
  "baz": {
    "a": {
      "b": 5,
      "c": false,
      "d": "apples-and-oranges"
    },
    "e": "thirty-thousand-something-dogecoin",
    "f": false
  }
}`)

func main() {
	// unmarshal the data into generic interfaces
	var a, b interface{}
	if err := json.Unmarshal(aJSON, &a); err != nil {
		panic(err)
	}
	if err := json.Unmarshal(bJSON, &b); err != nil {
		panic(err)
	}

	// Diff will use default configuration to produce a slice of Changes
	// by default Diff will not generate update change only inserts & deletes
	diffs, err := deepdiff.Diff(a, b)
	if err != nil {
		panic(err)
	}

	// Format the changes for terminal output
	change, err := FormatPretty(diffs)
	if err != nil {
		panic(err)
	}

	fmt.Println(change)
	// Output: baz:
	//   + f: false
	//   - g: "apples-and-oranges"
	//   a:
	//     ~ b: 5
	//   ~ e: "thirty-thousand-something-dogecoin"
	// ~ a: 99
}

License

The deepdiff library is licensed under the GNU Lesser General Public License v3.0

Documentation

Overview

Package deepdiff is a structured data differ that aims for near-linear time complexity. It's intended to calculate differences & apply patches to structured data ranging from 0-500MBish of encoded JSON

Diffing structured data carries additional complexity when compared to the standard unix diff utility, which operates on lines of text. By using the structure of data itself, deepdiff is able to provide a rich description of changes that maps onto the structure of the data itself. deepdiff ignores semantically irrelevant changes like whitespace, and can isolate changes like column changes to tabular data to only the relevant switches

Most algorithms in this space have quadratic time complexity, which from our testing makes them very slow on 3MB JSON documents and unable to complete on 5MB or more. deepdiff currently hovers around the 0.9Sec/MB range on 4 core processors

Instead of operating on JSON directly, deepdiff operates on document trees consisting of the go types created by unmarshaling from JSON, which aretwo complex types:

map[string]interface{}
[]interface{}

and five scalar types:

string, int, float64, bool, nil

by operating on native go types deepdiff can compare documents encoded in different formats, for example decoded CSV or CBOR.

deepdiff is based off an algorithm designed for diffing XML documents outlined in: Detecting Changes in XML Documents by Grégory Cobéna & Amélie Marian https://ieeexplore.ieee.org/document/994696 it's been adapted to fit purposes of diffing for Qri: https://github.com/qri-io/qri the guiding use case for this work

deepdiff also includes a tool for applying patches, see documentation for details

Example
// we'll use the background as our execution context
ctx := context.Background()

// start with two slightly different json documents
aJSON := []byte(`{
		"a": 100,
		"baz": {
			"a": {
				"d": "apples-and-oranges"
			}
		}
	}`)

bJSON := []byte(`{
		"a": 99,
		"baz": {
			"a": {
				"d": "apples-and-oranges"
			},
			"e": "thirty-thousand-something-dogecoin"
		}
	}`)

// unmarshal the data into generic interfaces
var a, b interface{}
if err := json.Unmarshal(aJSON, &a); err != nil {
	panic(err)
}
if err := json.Unmarshal(bJSON, &b); err != nil {
	panic(err)
}

// create a differ, using the default configuration
dd := New()

// Diff will produce a slice of Deltas that describe the structured changes.
// by default Diff will not calculate moves, only inserts, deletes, and
// updates
diffs, err := dd.Diff(ctx, a, b)
if err != nil {
	panic(err)
}

// diffs use a custom compact JSON Marshaller
output, err := json.MarshalIndent(diffs, "", "  ")
if err != nil {
	panic(err)
}

fmt.Println(string(output))
Output:

[
  [
    "-",
    "a",
    100
  ],
  [
    "+",
    "a",
    99
  ],
  [
    " ",
    "baz",
    null,
    [
      [
        " ",
        "a",
        {
          "d": "apples-and-oranges"
        }
      ],
      [
        "+",
        "e",
        "thirty-thousand-something-dogecoin"
      ]
    ]
  ]
]

Index

Examples

Constants

View Source
const (
	// DTContext indicates unchanged contextual details present in both A and B
	DTContext = Operation(" ")
	// DTDelete means making the children of a node
	// become the children of a node's parent
	DTDelete = Operation("-")
	// DTInsert is the compliment of deleting, adding
	// children of a parent node to a new node, and making
	// that node a child of the original parent
	DTInsert = Operation("+")
	// DTUpdate is an alteration of a scalar data type (string, bool, float, etc)
	DTUpdate = Operation("~")
)

Variables

View Source
var NewHash = func() hash.Hash {
	return fnv.New64()
}

NewHash returns a new hash interface, wrapped in a function for easy hash algorithm switching, package consumers can override NewHash with their own desired hash.Hash implementation if the value space is particularly large. default is 64-bit FNV 1 for fast, cheap, (non-cryptographic) hashing

Functions

func FormatPretty

func FormatPretty(w io.Writer, changes Deltas, colorTTY bool) error

FormatPretty writes a text report to w. if colorTTY is true it will add red "-" for deletions green "+" for insertions blue "~" for changes (an insert & delete at the same path) This is very much a work in progress

func FormatPrettyStats

func FormatPrettyStats(w io.Writer, diffStat *Stats, colorTTY bool)

FormatPrettyStats writes stats info to a supplied writer destination, optionally adding terminal color tags

func FormatPrettyStatsString added in v0.2.0

func FormatPrettyStatsString(diffStat *Stats, colorTTY bool) string

FormatPrettyStatsString prints a string of stats info

func FormatPrettyString added in v0.2.0

func FormatPrettyString(changes Deltas, colorTTY bool) (string, error)

FormatPrettyString is a convenience wrapper that outputs to a string instead of an io.Writer

func Patch

func Patch(deltas Deltas, target interface{}) error

Patch applies a change script (patch) to a value

Types

type Addr added in v0.2.0

type Addr interface {
	json.Marshaler
	Value() interface{}
	String() string
	Eq(a Addr) bool
}

Addr is a single location within a data structure. Multiple path elements can be stitched together into a single

type Config added in v0.2.0

type Config struct {
	// Setting CalcChanges to true will have diff represent in-place value shifts
	// as changes instead of add-delete pairs
	CalcChanges bool
}

Config are any possible configuration parameters for calculating diffs

type DeepDiff added in v0.2.0

type DeepDiff struct {
	// contains filtered or unexported fields
}

DeepDiff is a configuration for performing diffs

func New added in v0.2.0

func New(opts ...DiffOption) *DeepDiff

New creates a deepdiff struct

func (*DeepDiff) Diff added in v0.2.0

func (dd *DeepDiff) Diff(ctx context.Context, a, b interface{}) (Deltas, error)

Diff computes a slice of deltas that define an edit script for turning a into b. currently Diff will never return an error, error returns are reserved for future use. specifically: bailing before delta calculation based on a configurable threshold

func (*DeepDiff) Stat added in v0.2.0

func (dd *DeepDiff) Stat(ctx context.Context, a, b interface{}) (*Stats, error)

Stat calculates the DiffStats between two documents

func (*DeepDiff) StatDiff added in v0.2.0

func (dd *DeepDiff) StatDiff(ctx context.Context, a, b interface{}) (Deltas, *Stats, error)

StatDiff calculates a diff script and diff stats

type Delta

type Delta struct {
	// the type of change
	Type Operation `json:"type"`
	// Path is a string representation of the patch to where the delta operation
	// begins in the destination documents
	// path should conform to the IETF JSON-pointer specification, outlined
	// in RFC 6901: https://tools.ietf.org/html/rfc6901
	Path Addr `json:"path"`
	// The value to change in the destination document
	Value interface{} `json:"value"`

	// To make delta's revesible, original values are included
	// the original path this change from
	SourcePath string `json:"SourcePath,omitempty"`
	// the original  value this was changed from, will not always be present
	SourceValue interface{} `json:"originalValue,omitempty"`

	// Child Changes
	Deltas `json:"deltas,omitempty"`
}

Delta represents a change between a source & destination document a delta is a single "edit" that describes changes to the destination document

func (*Delta) MarshalJSON added in v0.2.0

func (d *Delta) MarshalJSON() ([]byte, error)

MarshalJSON implements a custom JOSN Marshaller

type Deltas added in v0.2.0

type Deltas []*Delta

Deltas is a sortable slice of changes

func (Deltas) Len added in v0.2.0

func (ds Deltas) Len() int

Len returns the length of the slice

func (Deltas) Less added in v0.2.0

func (ds Deltas) Less(i, j int) bool

Less returns true if the value at index i is a smaller sort quantity than the value at index j

func (Deltas) Swap added in v0.2.0

func (ds Deltas) Swap(i, j int)

Swap reverses the values at indicies i & J

type DiffOption

type DiffOption func(cfg *Config)

DiffOption is a function that adjust a config, zero or more DiffOptions can be passed to the Diff function

type IndexAddr added in v0.2.0

type IndexAddr int

IndexAddr is the address of a location within list-type structures

func (IndexAddr) Eq added in v0.2.0

func (p IndexAddr) Eq(b Addr) bool

Eq tests for equality with another address

func (IndexAddr) MarshalJSON added in v0.2.0

func (p IndexAddr) MarshalJSON() ([]byte, error)

MarshalJSON implements the json.Marshaller interface

func (IndexAddr) String added in v0.2.0

func (p IndexAddr) String() string

String returns this address as a string

func (IndexAddr) Value added in v0.2.0

func (p IndexAddr) Value() interface{}

Value returns IndexAddr as an int, a common go type

type Operation

type Operation string

Operation defines the operation of a Delta item

type RootAddr added in v0.2.0

type RootAddr struct{}

RootAddr is a nihlic address, or a reference to the outside address space

func (RootAddr) Eq added in v0.2.0

func (RootAddr) Eq(b Addr) bool

Eq checks for Root Address equality

func (RootAddr) MarshalJSON added in v0.2.0

func (RootAddr) MarshalJSON() ([]byte, error)

MarshalJSON writes "null". RootAddress is represented as null in JSON

func (RootAddr) String added in v0.2.0

func (RootAddr) String() string

String root address is "/" by convention

func (RootAddr) Value added in v0.2.0

func (RootAddr) Value() interface{}

Value of RootAddr is nil

type Stats

type Stats struct {
	Left  int `json:"leftNodes"`  // count of nodes in the left tree
	Right int `json:"rightNodes"` // count of nodes in the right tree

	LeftWeight  int `json:"leftWeight"`  // byte-ish count of left tree
	RightWeight int `json:"rightWeight"` // byte-ish count of right tree

	Inserts int `json:"inserts,omitempty"` // number of nodes inserted
	Updates int `json:"updates,omitempty"` // number of nodes updated
	Deletes int `json:"deletes,omitempty"` // number of nodes deleted
}

Stats holds statistical metadata about a diff

func (Stats) NodeChange

func (s Stats) NodeChange() int

NodeChange returns a count of the shift between left & right trees

func (Stats) PctWeightChange

func (s Stats) PctWeightChange() float64

PctWeightChange returns a value from -1.0 to max(float64) representing the size shift between left & right trees

type StringAddr added in v0.2.0

type StringAddr string

StringAddr is an arbitrary key representation within a data structure. Most-often used to represent map keys

func (StringAddr) Eq added in v0.2.0

func (p StringAddr) Eq(b Addr) bool

Eq tests for equality with another address

func (StringAddr) MarshalJSON added in v0.2.0

func (p StringAddr) MarshalJSON() ([]byte, error)

MarshalJSON implements the json.Marshaller interface

func (StringAddr) String added in v0.2.0

func (p StringAddr) String() string

String returns this address as a string

func (StringAddr) Value added in v0.2.0

func (p StringAddr) Value() interface{}

Value returns StringAddr as a string, a common go type

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL