dedup

Version: v0.0.0-...-18ff271
Published: Aug 9, 2017 License: MIT Imports: 17 Imported by: 0

README

dedup - a deduplication tool (+ library)

Why

Deduplication can be thought of as a coarse-grained compression that detects duplicate data over much larger windows than most compressors work across. As a result, deduplication is a good first step: passing deduplicated data into a downstream compressor often results in a much better compression ratio, and sometimes better speed.
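
To make the "larger windows" point concrete, here is a minimal sketch that uses only the Go standard library (it does not use dedup itself). DEFLATE, the algorithm behind gzip, can only refer back to matches within the previous 32 KiB, so a 64 KiB block repeated back to back looks like entirely new data to gzip. The helper names gzipSize and randomBlock are made up for this illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"math/rand"
)

// gzipSize returns the number of bytes gzip produces for data.
func gzipSize(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data) // writes into a bytes.Buffer do not fail
	zw.Close()
	return buf.Len()
}

// randomBlock returns n pseudo-random (and therefore incompressible) bytes.
func randomBlock(n int, seed int64) []byte {
	r := rand.New(rand.NewSource(seed))
	block := make([]byte, n)
	for i := range block {
		block[i] = byte(r.Intn(256))
	}
	return block
}

func main() {
	block := randomBlock(64*1024, 1)
	// The same 64 KiB block twice: the copies are farther apart than
	// gzip's 32 KiB window, so gzip cannot see the duplication.
	doubled := append(append([]byte{}, block...), block...)

	fmt.Println("gzip(block+block):", gzipSize(doubled), "bytes")
	fmt.Println("gzip(block):      ", gzipSize(block), "bytes")
}
```

A deduplicator that stores the block once, plus a small reference for the repeat, would hand the compressor roughly half as much data in this contrived case.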

Library

dedup is a Go library that deduplicates arbitrary data read from an io.Reader. dedup.Deduplicator and dedup.Reduplicator are the workhorses. For examples of how to use them, see the dedup tool itself (in cmd/dedup/main.go).

Installation

As with any Go package, you can install it with go get:

shell> go get github.com/amoghe/dedup

Usage

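import (
  "log"
  "os"

  "github.com/amoghe/dedup"
)

// Deduplicate stdin to stdout (windowSize and mask are uint64 values you choose).
if err := dedup.NewDeduplicator(windowSize, mask).Do(os.Stdin, os.Stdout); err != nil {
  log.Fatal(err)
}

// Or, to reverse the process, reduplicate stdin to stdout.
if err := dedup.NewReduplicator().Do(os.Stdin, os.Stdout); err != nil {
  log.Fatal(err)
}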

Binary

This codebase also builds a cmdline tool named dedup (see cmd/dedup) that can be used to deduplicate data.

Installation

The tool can be installed by building from source:

shell> go get github.com/amoghe/dedup && \
  cd $GOPATH/src/github.com/amoghe/dedup && \
  go install

Alternatively, you can download a release binary from the Releases section of this GitHub project.

Usage

Consider this workload where we save two similar Docker images:

akshay@spitfire:~/$ time docker save redmine bitnami/redmine | gzip | wc --bytes
497548816 # <-- 474.49 MB (or MiB)

real  1m7.900s
user  0m58.536s
sys   0m1.780s

akshay@spitfire:~/$ time docker save redmine bitnami/redmine | dedup | gzip | wc --bytes
295793793 # <-- 282.09 MB (or MiB)

real  0m50.261s
user  0m56.312s
sys   0m3.688s

As you can see, some workloads benefit greatly from combining deduplication with compression. Here the output shrank from about 474 MiB to 282 MiB (roughly 40% smaller), and wall-clock time dropped from 1m7.9s to 50.3s.

Compression

Note that this lib (and tool) will probably never have built-in support for compressing the output stream. You should pick an appropriate compressor "downstream" from this lib/tool. Standalone compressors such as gzip, bzip2, and xz (and their parallel implementations pigz, pbzip2, and pxz) are readily available on most Linux distributions. These compressors read from stdin and write to stdout, so they can be chained via shell pipes, and there is no need for this library to provide this functionality.

TODO:

  • Currently the deduplication lib consumes memory that is proportional to the size of the input file. (See issue #1)
  • Document the usage and impact of the windowSize and zeroBits parameters used by the Deduplicator
  • Add progress reporting when input is a large file (not stdin)
  • Make cmdline args fully compatible with other compression tools ('-k', '-v')
  • Add tests! (unit tests, fuzz tests)

LICENSE

See the LICENSE file

Documentation

Types

type Deduplicator

type Deduplicator struct {
	// contains filtered or unexported fields
}

Deduplicator performs deduplication of the specified file

func NewDeduplicator

func NewDeduplicator(winsz, mask uint64) *Deduplicator

NewDeduplicator returns a Deduplicator

func (*Deduplicator) Do

func (d *Deduplicator) Do(input io.Reader, output io.Writer) error

Do runs the deduplication of the specified input stream

func (*Deduplicator) PrintStats

func (d *Deduplicator) PrintStats(out io.Writer) error

PrintStats prints stats to the given writer

type Differ

type Differ struct {
	// contains filtered or unexported fields
}

Differ computes patches between two streams and applies them

func NewDiffer

func NewDiffer(winsz, mask uint64) *Differ

NewDiffer returns a Differ

func (*Differ) ApplyPatch

func (d *Differ) ApplyPatch(old, patch io.Reader, new io.Writer) error

ApplyPatch applies the patch file to the 'old' and writes the result to 'new'

func (*Differ) MakePatch

func (d *Differ) MakePatch(old, new io.Reader, out io.Writer) error

MakePatch writes a "patch" file (between "old" and "new") to the specified output io.Writer

type Reduplicator

type Reduplicator struct {
	// contains filtered or unexported fields
}

Reduplicator performs reduplication of the specified file

func NewReduplicator

func NewReduplicator() *Reduplicator

NewReduplicator returns a Reduplicator

func (*Reduplicator) Do

func (r *Reduplicator) Do(input io.Reader, output io.Writer) error

Do runs the reduplication writing the output to the output stream

type SegmentHandler

type SegmentHandler func([]byte) error

SegmentHandler is something capable of processing the segments handed to it

type SegmentStat

type SegmentStat struct {
	ID     uint64 // ID is a (unique) numeric identifier for this segment
	Length int    // Length of segment
	Freq   int    // How many times this segment occurred in the file
}

SegmentStat holds stats for a single segment

type SegmentTracker

type SegmentTracker struct {
	SegHashes map[string]SegmentStat // map[crypto hash of seg] -> SegmentStat
	// contains filtered or unexported fields
}

SegmentTracker tracks segments

func NewSegmentTracker

func NewSegmentTracker() *SegmentTracker

NewSegmentTracker returns an initialized SegmentTracker struct

func (SegmentTracker) PrintMostFrequentSegStats

func (s SegmentTracker) PrintMostFrequentSegStats(out io.Writer, n int) error

PrintMostFrequentSegStats prints 'n' "hottest" segments (SegmentStat)

func (SegmentTracker) PrintSegLengthHistogram

func (s SegmentTracker) PrintSegLengthHistogram(out io.Writer) error

PrintSegLengthHistogram prints histogram (bars in csv) to out

func (SegmentTracker) PrintSegLengths

func (s SegmentTracker) PrintSegLengths(out io.Writer, sep string) error

PrintSegLengths prints segment lengths to the specified output separated by the specified separator

func (SegmentTracker) PrintStats

func (s SegmentTracker) PrintStats(out io.Writer) error

PrintStats prints the segment stats on the given output (io.Writer)

func (*SegmentTracker) Track

func (s *SegmentTracker) Track(segment, seghash []byte) SegmentStat

Track records the stats for the specified segment
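
As a rough illustration of what tracking segments by hash can look like, here is a standalone sketch. It is not the package's implementation: the tracker and stat types are hypothetical stand-ins that mirror the documented fields, SHA-256 is an assumption (the docs only say "crypto hash"), and assigning IDs from a counter is also an assumption.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// stat mirrors the fields documented for SegmentStat.
type stat struct {
	ID     uint64 // numeric identifier for the segment
	Length int    // length of the segment in bytes
	Freq   int    // how many times the segment has been seen
}

// tracker is a hypothetical stand-in for SegmentTracker.
type tracker struct {
	segHashes map[string]stat // hash of segment -> stats
	nextID    uint64
}

func newTracker() *tracker {
	return &tracker{segHashes: make(map[string]stat)}
}

// track records one occurrence of segment and returns its updated stats.
// Unlike the documented Track method, it computes the hash itself.
func (t *tracker) track(segment []byte) stat {
	sum := sha256.Sum256(segment) // assumption: any strong hash would do
	key := string(sum[:])
	s, seen := t.segHashes[key]
	if !seen {
		s = stat{ID: t.nextID, Length: len(segment)}
		t.nextID++
	}
	s.Freq++
	t.segHashes[key] = s
	return s
}

func main() {
	t := newTracker()
	for _, seg := range []string{"alpha", "beta", "alpha"} {
		s := t.track([]byte(seg))
		fmt.Printf("%q -> id=%d len=%d freq=%d\n", seg, s.ID, s.Length, s.Freq)
	}
}
```

The second "alpha" comes back with the same ID as the first and a frequency of 2, which is the information a deduplicator needs in order to emit a reference instead of the bytes.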

type Segmenter

type Segmenter struct {
	WindowSize       uint64
	Mask             uint64
	MaxSegmentLength uint64
}

Segmenter segments a file or stream

func (Segmenter) SegmentFile

func (s Segmenter) SegmentFile(file io.Reader, handler SegmentHandler) error

SegmentFile does the actual work of segmenting the specified file as per the params configured in the Segmenter struct. It reads the io.Reader till EOF, calling the specified handler each time it finds a segment
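
The general idea of cutting a stream into content-defined segments can be sketched as follows. This is an illustration under stated assumptions, not the package's algorithm: the rolling hash here is a plain sum of the last window bytes, a boundary is assumed to occur when the hash has all the mask bits clear, and the function name segment is hypothetical. The parameters loosely correspond to the WindowSize, Mask and MaxSegmentLength fields, and it works on a byte slice rather than an io.Reader to stay short.

```go
package main

import (
	"bytes"
	"fmt"
)

// segment splits data into content-defined segments and passes each one to
// handler. A boundary is declared when the rolling sum of the last `window`
// bytes has all `mask` bits clear, or when a segment reaches maxLen bytes.
// NOTE: the rolling sum is an illustrative stand-in, not the package's hash.
func segment(data []byte, window int, mask, maxLen uint64, handler func([]byte) error) error {
	var sum uint64
	start := 0
	for i, b := range data {
		sum += uint64(b)
		if i-start >= window {
			sum -= uint64(data[i-window]) // slide the window forward
		}
		segLen := uint64(i - start + 1)
		atBoundary := segLen >= uint64(window) && sum&mask == 0
		if atBoundary || segLen >= maxLen {
			if err := handler(data[start : i+1]); err != nil {
				return err
			}
			start, sum = i+1, 0
		}
	}
	if start < len(data) { // flush the trailing partial segment
		return handler(data[start:])
	}
	return nil
}

func main() {
	data := bytes.Repeat([]byte("deduplicate me, please. "), 8)
	var joined []byte
	count := 0
	err := segment(data, 8, 0x0F, 64, func(seg []byte) error {
		count++
		joined = append(joined, seg...)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d segments, reassembled correctly: %v\n", count, bytes.Equal(joined, data))
}
```

Because boundaries depend on the bytes inside the window rather than on fixed offsets, identical runs of data tend to produce identical segments even when they appear at different positions, which is what makes hashing segments useful for finding duplicates. A wider mask makes boundaries rarer and segments longer on average.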

Directories

Path Synopsis
cmd
