rinzler

v0.0.0-...-3d5f8eb
Published: Mar 18, 2019 License: MIT Imports: 17 Imported by: 0

README

Rinzler

Highly redundant compressed big data record storage and retrieval system.

Rinzler is a high-performance big data indexing system that stores data efficiently using compression together with Reed-Solomon forward error correction (FEC) and the Berlekamp-Welch error correction algorithm.

Capabilities of Rinzler:

  • Random I/O access to compressed data
  • Two levels of error detection, with the ability to perform error correction on each row of data
  • Adjustable error correction parameters to increase redundancy for each record stored
  • Creation of indexes on specific fields with very fast reads (I/O bound in most cases)

What was the reason for creating Rinzler?

Working with big data presents many challenges. I wanted to learn how to program in Go, so I picked a project that would help make dealing with big data fun and exciting. Rinzler serves several purposes, but at its heart it is designed from the ground up to be a big data management tool.

Several major components of Rinzler help facilitate working with big data. I'll review each major component below.

Redundant Storage

Rinzler uses an advanced library for managing records of data. Each chunk of data is encoded using Reed-Solomon FEC, with a default of two redundant shards for error detection and correction, so each row of data can withstand several bit flips and other types of corruption and still be recovered. Decoding FEC with the Berlekamp-Welch error correction algorithm can be CPU intensive under high sustained I/O, so each data record also carries a one-byte checksum. This checksum is derived from a Castagnoli CRC32, which is hardware accelerated on CPUs with SSE4.2 extensions: the 8 least significant bits of the CRC32 are stored as the record's one-byte checksum. During reads, if the checksum matches, the data does not need to be passed through the Reed-Solomon decode method.

These two redundant checks allow for extremely fast I/O while still providing strong protection against bit rot and other data degradation.
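The checksum scheme described above can be sketched with Go's standard hash/crc32 package. This is a minimal illustration, not Rinzler's source: the lowercase function names are hypothetical stand-ins for the Checksum32, Checksum16, and Checksum8 methods documented below.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32-C table. Go's hash/crc32 uses SSE4.2
// instructions for this polynomial on amd64 when they are available.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// checksum32 returns the full Castagnoli CRC32 of data.
func checksum32(data []byte) uint32 {
	return crc32.Checksum(data, castagnoli)
}

// checksum16 keeps the 16 least significant bits of the CRC32.
func checksum16(data []byte) uint16 {
	return uint16(checksum32(data))
}

// checksum8 keeps the 8 least significant bits of the CRC32.
func checksum8(data []byte) uint8 {
	return uint8(checksum32(data))
}

func main() {
	// "123456789" is the conventional CRC test input; its CRC-32C
	// check value is 0xE3069283.
	data := []byte("123456789")
	fmt.Printf("%08x %04x %02x\n", checksum32(data), checksum16(data), checksum8(data))
	// prints "e3069283 9283 83"
}
```

A one-byte checksum is cheap to store and verify, but it will miss a small fraction of corruptions, which is presumably why it is paired with the Reed-Solomon layer rather than used alone.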

Random Access Compression

Rinzler strives to give the best of both worlds with regard to compression. Compressed data generally does not allow efficient random record retrieval. Rinzler employs Zstandard (zstd) compression and lets the end user create custom compression dictionaries to compress small objects efficiently (usually objects smaller than 5 kilobytes). During testing and benchmarking, Rinzler typically achieved compression ratios of 4-5x, including the redundant packets of data for error recovery. Using Twitter JSON blobs, Rinzler approaches compression ratios of 5x with custom dictionaries.
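The idea of compressing each record on its own against a shared dictionary, so that any record can be read back without touching its neighbours, can be sketched with the standard library alone. Note the substitution: this sketch uses compress/flate with a preset dictionary instead of zstd, and the store type, its methods, and the sample dictionary are all hypothetical names chosen for illustration, not Rinzler's API.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// store keeps independently compressed records back to back, plus the
// start offset of each one so a single record can be located directly.
type store struct {
	dict    []byte
	data    []byte
	offsets []int // offsets[i] is the start of record i; the last entry is len(data)
}

func newStore(dict []byte) *store {
	return &store{dict: dict, offsets: []int{0}}
}

// add compresses one record against the shared dictionary and appends it.
func (s *store) add(rec []byte) error {
	var buf bytes.Buffer
	w, err := flate.NewWriterDict(&buf, flate.BestCompression, s.dict)
	if err != nil {
		return err
	}
	if _, err := w.Write(rec); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	s.data = append(s.data, buf.Bytes()...)
	s.offsets = append(s.offsets, len(s.data))
	return nil
}

// get decompresses record i only; no other record is read.
func (s *store) get(i int) ([]byte, error) {
	chunk := s.data[s.offsets[i]:s.offsets[i+1]]
	r := flate.NewReaderDict(bytes.NewReader(chunk), s.dict)
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// The dictionary holds byte sequences expected to recur in records.
	s := newStore([]byte(`{"id":"","subreddit":"","author":""}`))
	for _, rec := range []string{
		`{"id":"a1","subreddit":"golang","author":"alice"}`,
		`{"id":"b2","subreddit":"golang","author":"bob"}`,
	} {
		if err := s.add([]byte(rec)); err != nil {
			panic(err)
		}
	}
	rec, err := s.get(1)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(rec))
}
```

The trade-off is that small records compressed in isolation have little internal redundancy to exploit, which is the gap a shared dictionary is meant to close.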

Documentation

Constants

const (
	BYTE_CHECKSUM  = 1
	LENGTH_MARKER  = 2
	RESERVED_BYTES = 2
)

Variables

This section is empty.

Functions

func ReedSolomonCorrect

func ReedSolomonCorrect(arr []byte, checksumSize ...int) error

This function is currently unavailable (in progress...)

Types

type BinarySearch

type BinarySearch struct {
	// contains filtered or unexported fields
}

func (*BinarySearch) CachePerformance

func (b *BinarySearch) CachePerformance() float64

func (*BinarySearch) Search

func (b *BinarySearch) Search(target string) int64

func (*BinarySearch) SearchLeft

func (b *BinarySearch) SearchLeft(target string) int64

func (*BinarySearch) SearchRight

func (b *BinarySearch) SearchRight(target string) int64

type CommentJSON

type CommentJSON struct {
	Id                 string `json:"id"`
	Subreddit          string `json:"subreddit"`
	Author             string `json:"author"`
	Author_fullname    string `json:"author_fullname"`
	Link               string `json:"link_id"`
	Permalink          string `json:"permalink"`
	Subreddit_id       string `json:"subreddit_id"`
	Score              int32  `json:"score"`
	Created_utc        uint32 `json:"created_utc"`
	Retrieved_on       uint32 `json:"retrieved_on"`
	Author_created_utc uint32 `json:"author_created_utc"`
}
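As a rough illustration of how this type might be filled, the sketch below decodes one JSON line with encoding/json. The struct is copied from the declaration above; the parseComment helper and the sample values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CommentJSON mirrors the exported type documented above.
type CommentJSON struct {
	Id                 string `json:"id"`
	Subreddit          string `json:"subreddit"`
	Author             string `json:"author"`
	Author_fullname    string `json:"author_fullname"`
	Link               string `json:"link_id"`
	Permalink          string `json:"permalink"`
	Subreddit_id       string `json:"subreddit_id"`
	Score              int32  `json:"score"`
	Created_utc        uint32 `json:"created_utc"`
	Retrieved_on       uint32 `json:"retrieved_on"`
	Author_created_utc uint32 `json:"author_created_utc"`
}

// parseComment decodes a single JSON object into a CommentJSON.
// Fields missing from the input keep their zero values.
func parseComment(line []byte) (CommentJSON, error) {
	var c CommentJSON
	err := json.Unmarshal(line, &c)
	return c, err
}

func main() {
	// Made-up sample data, for illustration only.
	line := []byte(`{"id":"abc","subreddit":"golang","link_id":"t3_xyz","score":-3,"created_utc":1552867200}`)
	c, err := parseComment(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Id, c.Subreddit, c.Link, c.Score, c.Created_utc)
}
```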

type FileDescription

type FileDescription struct {
	Complete         uint8
	Version          uint8
	ReservedBytes    uint64
	DataSegments     uint8
	CheckSumSegments uint8
	DictionaryLen    uint64
	RecordsStartPos  uint64
	IndexStartPos    uint64
	IndexEndPos      uint64
}
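Because every field of FileDescription has a fixed size, a header like this can be serialized directly with encoding/binary. The sketch below is an assumption about how such a header could be written and read back; the byte order (little-endian here), the packed layout, and the encodeHeader and decodeHeader helpers are not taken from Rinzler's source.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// FileDescription mirrors the exported type documented above.
type FileDescription struct {
	Complete         uint8
	Version          uint8
	ReservedBytes    uint64
	DataSegments     uint8
	CheckSumSegments uint8
	DictionaryLen    uint64
	RecordsStartPos  uint64
	IndexStartPos    uint64
	IndexEndPos      uint64
}

// encodeHeader packs the struct field by field with no padding.
// Little-endian is an assumption made for this sketch.
func encodeHeader(fd FileDescription) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, fd); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeHeader reverses encodeHeader.
func decodeHeader(b []byte) (FileDescription, error) {
	var fd FileDescription
	err := binary.Read(bytes.NewReader(b), binary.LittleEndian, &fd)
	return fd, err
}

func main() {
	fd := FileDescription{Complete: 1, Version: 1, DataSegments: 8, CheckSumSegments: 2}
	b, err := encodeHeader(fd)
	if err != nil {
		panic(err)
	}
	back, err := decodeHeader(b)
	if err != nil {
		panic(err)
	}
	// Four uint8 fields plus five uint64 fields pack into 44 bytes.
	fmt.Println(len(b), back == fd)
}
```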

type Rinzler

type Rinzler struct {
	DataFile             *os.File
	Crc32table           *crc32.Table
	ZstdCDict            *gozstd.CDict
	ZstdDDict            *gozstd.DDict
	ZstdCompressionLevel int
	DataSegments         int
	ChecksumSegments     int
	ZstdDictionary       []byte
	ZstdMagicHeader      []byte
	FileDescription      FileDescription
}

func New

func New() *Rinzler

func (*Rinzler) Checksum16

func (r *Rinzler) Checksum16(bs []byte) uint16

Calculate a 16-bit checksum from the 16 least significant bits of a CRC32 checksum.

func (*Rinzler) Checksum32

func (r *Rinzler) Checksum32(bs []byte) uint32

Calculate a Castagnoli CRC32 (Optimized for x86 SSE4.2 capable processors)

func (*Rinzler) Checksum8

func (r *Rinzler) Checksum8(bs []byte) uint8

Calculate an 8-bit checksum from the 8 least significant bits of a CRC32 checksum.

func (*Rinzler) Compress

func (r *Rinzler) Compress(bs []byte, use_dict bool) []byte

Apply zstandard (zstd) compression to a byte slice

func (*Rinzler) CreateRecord

func (r *Rinzler) CreateRecord(bs []byte) []byte

This function wraps the compression and Reed-Solomon encoding functions and creates a compressed record with error detection and correction capabilities.

func (*Rinzler) Decompress

func (r *Rinzler) Decompress(bs []byte, use_dict bool) ([]byte, error)

Decompress a zstandard (zstd) compressed byte slice

func (*Rinzler) GetDataPosition

func (r *Rinzler) GetDataPosition(indexPos uint64) uint64

func (*Rinzler) LoadFile

func (r *Rinzler) LoadFile(filename string)

func (*Rinzler) NewBinarySearch

func (r *Rinzler) NewBinarySearch(filename string, recordSize uint64, fieldSize uint64) *BinarySearch
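The recordSize and fieldSize parameters suggest a search over sorted, fixed-width records. The sketch below shows one way leftmost and rightmost lookups could work on such data, held in memory rather than in a file. The index type, the assumption that the key occupies the first fieldSize bytes of each record, and the -1 "not found" result are all illustrative guesses, not the package's documented behaviour.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// index is a sorted run of fixed-size records. The first fieldSize
// bytes of each record are treated as the search key (an assumption).
type index struct {
	data       []byte
	recordSize int
	fieldSize  int
}

// buildIndex lays out each key followed by zero padding up to recordSize.
// Keys must already be sorted and exactly fieldSize bytes long.
func buildIndex(keys []string, recordSize, fieldSize int) index {
	var data []byte
	for _, k := range keys {
		rec := make([]byte, recordSize)
		copy(rec, k)
		data = append(data, rec...)
	}
	return index{data: data, recordSize: recordSize, fieldSize: fieldSize}
}

func (ix index) count() int { return len(ix.data) / ix.recordSize }

func (ix index) key(i int) []byte {
	off := i * ix.recordSize
	return ix.data[off : off+ix.fieldSize]
}

// searchLeft returns the first record whose key equals target, or -1.
func (ix index) searchLeft(target []byte) int64 {
	n := ix.count()
	i := sort.Search(n, func(i int) bool { return bytes.Compare(ix.key(i), target) >= 0 })
	if i < n && bytes.Equal(ix.key(i), target) {
		return int64(i)
	}
	return -1
}

// searchRight returns the last record whose key equals target, or -1.
func (ix index) searchRight(target []byte) int64 {
	i := sort.Search(ix.count(), func(i int) bool { return bytes.Compare(ix.key(i), target) > 0 })
	if i > 0 && bytes.Equal(ix.key(i-1), target) {
		return int64(i - 1)
	}
	return -1
}

func main() {
	ix := buildIndex([]string{"aaaa", "bbbb", "bbbb", "bbbb", "cccc"}, 6, 4)
	fmt.Println(ix.searchLeft([]byte("bbbb")), ix.searchRight([]byte("bbbb")), ix.searchLeft([]byte("zzzz")))
	// prints "1 3 -1"
}
```

With both bounds available, every record sharing a key can be read as one contiguous range, which keeps lookups on duplicate keys to two logarithmic searches.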

func (*Rinzler) RSDecode

func (r *Rinzler) RSDecode(arr []byte, totalSegments int, checksumSegments int) ([]byte, error)

Decode a Reed-Solomon encoded byte string. This method first checks the stored 8-bit checksum and returns the record if it matches the calculated checksum. Otherwise, corruption is assumed and the record is processed using the Berlekamp-Welch algorithm to detect the corrupted bits and repair the record.
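The control flow described here, a cheap checksum test first with the expensive repair only on mismatch, can be sketched as follows. The record layout (payload followed by a single checksum byte), the seal and open helpers, and the repairFunc placeholder for the Reed-Solomon decoder are all assumptions made for illustration; Rinzler's real on-disk layout is not documented here.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// repairFunc stands in for the Reed-Solomon / Berlekamp-Welch decoder,
// which is outside the scope of this sketch.
type repairFunc func(record []byte) ([]byte, error)

// seal appends the low 8 bits of the payload's CRC32-C as a trailing byte.
func seal(payload []byte) []byte {
	out := append([]byte(nil), payload...)
	return append(out, uint8(crc32.Checksum(payload, castagnoli)))
}

// open returns the payload. The second result reports whether the slow
// repair path had to be taken.
func open(record []byte, repair repairFunc) ([]byte, bool, error) {
	if len(record) < 1 {
		return nil, false, errors.New("record too short")
	}
	payload, stored := record[:len(record)-1], record[len(record)-1]
	if uint8(crc32.Checksum(payload, castagnoli)) == stored {
		return payload, false, nil // fast path: checksum matches
	}
	fixed, err := repair(record) // slow path: assume corruption
	return fixed, true, err
}

func main() {
	notImplemented := func([]byte) ([]byte, error) {
		return nil, errors.New("repair not implemented in this sketch")
	}

	rec := seal([]byte("example record"))
	payload, repaired, err := open(rec, notImplemented)
	fmt.Println(string(payload), repaired, err)

	rec[len(rec)-1] ^= 0xFF // corrupt the stored checksum byte
	_, repaired, err = open(rec, notImplemented)
	fmt.Println(repaired, err)
}
```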

func (*Rinzler) RSEncode

func (r *Rinzler) RSEncode(arr []byte, totalSegments int, checksumSegments int, pad bool) ([]byte, error)

This method encodes a byte string using Reed-Solomon FEC. This adds redundant data so that error correction is possible if the record's data becomes corrupted.

func (*Rinzler) ReadRecord

func (r *Rinzler) ReadRecord(pos int64) []byte

func (*Rinzler) SearchLeftB

func (r *Rinzler) SearchLeftB(target int) int64

func (*Rinzler) SetDictionary

func (r *Rinzler) SetDictionary(b []byte) error
