reedsolomon

package module
v0.0.0-...-9fd2511
Published: Jun 25, 2015 License: MIT Imports: 8 Imported by: 0

README

Reed-Solomon

Golang version of Reed-Solomon Erasure Coding, with speeds exceeding 1 GB/s per CPU core, implemented in pure Go.

This is a golang port of the JavaReedSolomon library released by Backblaze, with some additional optimizations.

For an introduction on erasure coding, see the post on the Backblaze blog.

Package home: https://github.com/sbberk/golang-reedsolomon

Installation

To get the package use the standard:

go get github.com/sbberk/golang-reedsolomon

Usage

This section assumes you know the basics of Reed-Solomon encoding. A good start is this Backblaze blog post.

This package performs the calculation of the parity sets. The usage is therefore relatively simple.

First of all, you need to choose your distribution of data and parity shards. A 'good' distribution is very subjective, and will depend a lot on your usage scenario. A good starting point is above 5 and below 257 data shards (the maximum supported number), and the number of parity shards to be 2 or above, and below the number of data shards.

To create an encoder with 10 data shards (where your data goes) and 3 parity shards (calculated):

    enc, err := reedsolomon.New(10, 3)

This encoder will work for all parity sets with this distribution of data and parity shards. The error will only be set if you specify 0 or negative values in any of the parameters, or if you specify more than 256 data shards.
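For example (a minimal sketch, not taken from the package's own examples), asking for more than 256 data shards makes New return that error instead of a usable Encoder:

    // 300 data shards exceeds the supported maximum of 256,
    // so err will be non-nil (ErrInvShardNum in this version).
    enc, err := reedsolomon.New(300, 4)
    if err != nil {
        log.Fatal(err)
    }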

The data you send and receive is a simple slice of byte slices: [][]byte. In the example above, the top slice must have a length of 13.

    data := make([][]byte, 13)

You should then fill the first 10 slices with equally sized data, and create parity shards that will be populated with parity data. In this case we create the data in memory, but you could for instance also use mmap to map files.

    // Create all shards, size them at 50000 each
    for i := range data {
      data[i] = make([]byte, 50000)
    }

    // Fill some data into the data shards
    for i, in := range data[:10] {
      for j := range in {
         in[j] = byte((i + j) & 0xff)
      }
    }

To populate the parity shards, you simply call Encode() with your data.

    err = enc.Encode(data)

The only case where you should get an error is if the data shards aren't of equal size. The last 3 shards now contain parity data. You can verify this by calling Verify():

    ok, err = enc.Verify(data)

The final (and important) part is to be able to reconstruct missing shards. For this to work, you need to know which parts of your data are missing. The encoder does not know which parts are invalid, so if data corruption is a likely scenario, you need to implement a hash check for each shard. If a byte has changed in your set and you don't know which shard it is in, there is no way to reconstruct the data set.
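A minimal sketch of such a check, storing a SHA-256 sum per shard (this uses the standard crypto/sha256 package; the hashes slice is only an illustration, not part of this package):

    // Record a checksum for every shard right after encoding.
    hashes := make([][32]byte, len(data))
    for i, shard := range data {
        hashes[i] = sha256.Sum256(shard)
    }

    // Later, mark any shard whose checksum no longer matches as missing,
    // so Reconstruct() knows which shards to rebuild.
    for i, shard := range data {
        if shard != nil && sha256.Sum256(shard) != hashes[i] {
            data[i] = nil
        }
    }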

To indicate missing data, you set the shard to nil before calling Reconstruct():

    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct the missing shards
    err := enc.Reconstruct(data)

The missing data and parity shards will be recreated. If more than 3 shards are missing, the reconstruction will fail.
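If you want to handle that failure explicitly, a sketch could compare against the package's ErrTooFewShards value (this deliberately destroys too many shards and is only an illustration):

    // With 3 parity shards, losing 4 shards cannot be recovered.
    data[0], data[1], data[2], data[3] = nil, nil, nil, nil
    if err := enc.Reconstruct(data); err == reedsolomon.ErrTooFewShards {
        log.Println("reconstruction failed:", err)
    }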

So to sum up reconstruction:

  • The number of data/parity shards must match the numbers used for encoding.
  • The order of shards must be the same as used when encoding.
  • You may only supply data you know is valid.
  • Invalid shards should be set to nil.

For complete examples of an encoder and decoder, see the examples folder.

Splitting/Joining Data

You might have a large slice of data. To help you split this, there are some helper functions that can split and join a single byte slice.

   bigfile, _ := ioutil.ReadFile("myfile.data")
   
   // Split the file
   split, err := enc.Split(bigfile)

This will split the file into the number of data shards set when creating the encoder and create empty parity shards.

An important thing to note is that you have to keep track of the exact input size. If the size of the input isn't divisible by the number of data shards, extra zeros will be inserted in the last shard.

To join a data set, use the Join() function, which will join the shards and write it to the io.Writer you supply:

   // Join a data set and write it to io.Discard.
   err = enc.Join(io.Discard, data, len(bigfile))
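Putting the pieces together, a sketch of a full split/encode/join round trip might look like this (the file name is illustrative, and bytes.Buffer is just a convenient io.Writer for the example):

    bigfile, _ := ioutil.ReadFile("myfile.data")
    size := len(bigfile) // remember the exact input size

    shards, _ := enc.Split(bigfile)
    _ = enc.Encode(shards)

    // Writing back exactly 'size' bytes drops any zero padding
    // that Split added to the last shard.
    var buf bytes.Buffer
    _ = enc.Join(&buf, shards, size)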

Streaming/Merging

It might seem like a limitation that all data should be in memory, but an important property is that as long as the number of data/parity shards is the same, you can merge/split data sets, and they will remain valid as a separate set.

    // Split the data set of 50000 elements into two of 25000
    splitA := make([][]byte, 13)
    splitB := make([][]byte, 13)
    
    // Merge into a 100000 element set
    merged := make([][]byte, 13)
    
    for i := range data {
      splitA[i] = data[i][:25000]
      splitB[i] = data[i][25000:]
      
      // Concatenate it to itself
      merged[i] = append(make([]byte, 0, len(data[i])*2), data[i]...)
      merged[i] = append(merged[i], data[i]...)
    }
    
    // Each part should still verify as ok.
    ok, err := enc.Verify(splitA)
    if ok && err == nil {
        log.Println("splitA ok")
    }
    
    ok, err = enc.Verify(splitB)
    if ok && err == nil {
        log.Println("splitB ok")
    }
    
    ok, err = enc.Verify(merged)
    if ok && err == nil {
        log.Println("merge ok")
    }

This means that if you have a data set that may not fit into memory, you can split processing into smaller blocks. For the best throughput, don't use too small blocks.

This also means that you can divide big input up into smaller blocks, and do reconstruction on parts of your data. This doesn't give the same flexibility of a higher number of data shards, but it will be much more performant.
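A sketch of such block-wise processing (the block size, the totalBlocks counter and the buffer reuse are assumptions for illustration, not helpers provided by the package):

    const blockSize = 1 << 20 // 1 MB per shard in each block (arbitrary)

    // Reuse one set of shard buffers for every block.
    block := make([][]byte, 13)
    for i := range block {
        block[i] = make([]byte, blockSize)
    }

    for n := 0; n < totalBlocks; n++ {
        // Fill the first 10 shards of 'block' with the next chunk of input,
        // then compute parity for just this block.
        if err := enc.Encode(block); err != nil {
            log.Fatal(err)
        }
        // Store or transmit all 13 shards of this block before reusing the buffers.
    }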

Performance

Performance depends mainly on the number of parity shards. In rough terms, doubling the number of parity shards will double the encoding time.

Here are the throughput numbers with some different selections of data and parity shards. For reference, each shard is 1 MB of random data, and 2 CPU cores are used for encoding.

Data | Parity | Parity % | MB/s   | SSSE3 MB/s | SSSE3 Speed | Rel. Speed
5    | 2      | 40%      | 576,11 | 2599,2     | 451%        | 100,00%
10   | 2      | 20%      | 587,73 | 3100,28    | 528%        | 102,02%
10   | 4      | 40%      | 298,38 | 2470,97    | 828%        | 51,79%
50   | 20     | 40%      | 59,81  | 713,28     | 1193%       | 10,38%

If runtime.GOMAXPROCS() is set to a value higher than 1, the encoder will use multiple goroutines to perform the calculations in Verify, Encode and Reconstruct.
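For example (a sketch; on newer Go releases GOMAXPROCS already defaults to the number of CPUs):

    // Allow the encoder to spread work across all logical cores.
    runtime.GOMAXPROCS(runtime.NumCPU())

    enc, _ := reedsolomon.New(10, 4)
    // Encode, Verify and Reconstruct will now run their calculations
    // on multiple goroutines.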

Example of performance scaling on Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4 physical cores, 8 logical cores. The example uses 10 blocks with 16MB data each and 4 parity blocks.

Threads | MB/s    | Speed
1       | 1355,11 | 100%
2       | 2339,78 | 172%
4       | 3179,33 | 235%
8       | 4346,18 | 321%

License

This code, like the original JavaReedSolomon, is published under an MIT license. See the LICENSE file for more information.

Documentation

Overview

Reed-Solomon Erasure Coding in Go

For usage and examples, see https://github.com/sbberk/golang-reedsolomon

Constants

This section is empty.

Variables

var ErrInvShardNum = errors.New("cannot create Encoder with zero or less data/parity shards")

ErrInvShardNum will be returned by New, if you attempt to create an Encoder where either data or parity shards is zero or less, or the number of data shards is higher than 256.

var ErrShardNoData = errors.New("no shard data")

ErrShardNoData will be returned if there are no shards, or if the length of all shards is zero.

var ErrShardSize = errors.New("shard sizes does not match")

ErrShardSize is returned if shard length isn't the same for all shards.

var ErrShortData = errors.New("not enough data to fill the number of requested shards")

ErrShortData will be returned by Split(), if there isn't enough data to fill the number of shards.

var ErrTooFewShards = errors.New("too few shards given")

ErrTooFewShards is returned if too few shards were given to Encode/Verify/Reconstruct. It will also be returned from Reconstruct if there were too few shards to reconstruct the missing data.
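As a rough illustration (a sketch only, assuming enc and shards as in the examples below; which of these values a given call can return depends on the function, see the descriptions above), the error values can be compared directly:

switch err := enc.Encode(shards); err {
case nil:
	// parity shards were written
case ErrTooFewShards:
	// fewer shards were passed than the data+parity count given to New
case ErrShardSize:
	// the shards are not all the same length
case ErrShardNoData:
	// there are no shards, or all shards are empty
}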

Functions

This section is empty.

Types

type Encoder

type Encoder interface {
	// Encodes parity for a set of data shards.
	// Input is 'shards' containing data shards followed by parity shards.
	// The number of shards must match the number given to New().
	// Each shard is a byte array, and they must all be the same size.
	// The parity shards will always be overwritten and the data shards
	// will remain the same, so it is safe for you to read from the
	// data shards while this is running.
	Encode(shards [][]byte) error

	// Verify returns true if the parity shards contain correct data.
	// The data is the same format as Encode. No data is modified, so
	// you are allowed to read from data while this is running.
	Verify(shards [][]byte) (bool, error)

	// Reconstruct will recreate the missing shards if possible.
	//
	// Given a list of shards, some of which contain data, fills in the
	// ones that don't have data.
	//
	// The length of the array must be equal to the total number of shards.
	// You indicate that a shard is missing by setting it to nil.
	//
	// If there are too few shards to reconstruct the missing
	// ones, ErrTooFewShards will be returned.
	//
	// The reconstructed shard set is complete, but integrity is not verified.
	// Use the Verify function to check if data set is ok.
	Reconstruct(shards [][]byte) error

	// Split a data slice into the number of shards given to the encoder,
	// and create empty parity shards.
	//
	// The data will be split into equally sized shards.
	// If the data size isn't divisible by the number of shards,
	// the last shard will contain extra zeros.
	//
	// There must be at least the same number of bytes as there are data shards,
	// otherwise ErrShortData will be returned.
	//
	// The data will not be copied, except for the last shard, so you
	// should not modify the data of the input slice afterwards.
	Split(data []byte) ([][]byte, error)

	// Join the shards and write the data segment to dst.
	//
	// Only the data shards are considered.
	// You must supply the exact output size you want.
	// If there are too few shards given, ErrTooFewShards will be returned.
	// If the total data size is less than outSize, ErrShortData will be returned.
	Join(dst io.Writer, shards [][]byte, outSize int) error
}

Encoder is an interface to encode Reed-Solomon parity sets for your data.

Example

Simple example of how to use all functions of the Encoder. Note that all error checks have been removed to keep it short.

// Create some sample data
var data = make([]byte, 250000)
fillRandom(data)

// Create an encoder with 17 data and 3 parity slices.
enc, _ := New(17, 3)

// Split the data into shards
shards, _ := enc.Split(data)

// Encode the parity set
_ = enc.Encode(shards)

// Verify the parity set
ok, _ := enc.Verify(shards)
if ok {
	fmt.Println("ok")
}

// Delete two shards
shards[10], shards[11] = nil, nil

// Reconstruct the shards
_ = enc.Reconstruct(shards)

// Verify the data set
ok, _ = enc.Verify(shards)
if ok {
	fmt.Println("ok")
}
Output:

ok
ok
Example (Slicing)

This demonstrates that shards can be arbitrarily sliced and merged and still remain valid.

// Create some sample data
var data = make([]byte, 250000)
fillRandom(data)

// Create 5 data slices of 50000 elements each
enc, _ := New(5, 3)
shards, _ := enc.Split(data)
err := enc.Encode(shards)
if err != nil {
	panic(err)
}

// Check that it verifies
ok, err := enc.Verify(shards)
if ok && err == nil {
	fmt.Println("encode ok")
}

// Split the data set of 50000 elements into two of 25000
splitA := make([][]byte, 8)
splitB := make([][]byte, 8)

// Merge into a 100000 element set
merged := make([][]byte, 8)

// Split/merge the shards
for i := range shards {
	splitA[i] = shards[i][:25000]
	splitB[i] = shards[i][25000:]

	// Concatenate it to itself
	merged[i] = append(make([]byte, 0, len(shards[i])*2), shards[i]...)
	merged[i] = append(merged[i], shards[i]...)
}

// Each part should still verify as ok.
ok, err = enc.Verify(splitA)
if ok && err == nil {
	fmt.Println("splitA ok")
}

ok, err = enc.Verify(splitB)
if ok && err == nil {
	fmt.Println("splitB ok")
}

ok, err = enc.Verify(merged)
if ok && err == nil {
	fmt.Println("merge ok")
}
Output:

encode ok
splitA ok
splitB ok
merge ok
Example (Xor)

This demonstrates that shards can be xor'ed and still remain a valid set.

The xor value must be the same for element 'n' in each shard, except if you xor with a similar sized encoded shard set.

// Create some sample data
var data = make([]byte, 25000)
fillRandom(data)

// Create 5 data slices of 5000 elements each
enc, _ := New(5, 3)
shards, _ := enc.Split(data)
err := enc.Encode(shards)
if err != nil {
	panic(err)
}

// Check that it verifies
ok, err := enc.Verify(shards)
if !ok || err != nil {
	fmt.Println("falied initial verify", err)
}

// Create an xor'ed set
xored := make([][]byte, 8)

// We xor by the index, so you can see that the xor can change.
// It should, however, be constant vertically through your slices.
for i := range shards {
	xored[i] = make([]byte, len(shards[i]))
	for j := range xored[i] {
		xored[i][j] = shards[i][j] ^ byte(j&0xff)
	}
}

// Each part should still verify as ok.
ok, err = enc.Verify(xored)
if ok && err == nil {
	fmt.Println("verified ok after xor")
}
Output:

verified ok after xor

func New

func New(dataShards, parityShards int) (Encoder, error)

New creates a new encoder and initializes it to the number of data shards and parity shards that you want to use. You can reuse this encoder. Note that the maximum number of data shards is 256.
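The note about reuse means a single Encoder can serve any number of shard sets with the same layout; a minimal sketch (sizes chosen arbitrarily):

enc, _ := New(10, 3)
for block := 0; block < 4; block++ {
	data := make([]byte, 100000) // any payload of your choosing
	shards, _ := enc.Split(data)
	_ = enc.Encode(shards)
}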
