multigz

package module
v0.0.0-...-c45a234
Published: Dec 3, 2015 License: MIT Imports: 7 Imported by: 0

README

multigz

gzip implementation which allows fast seeking within compressed files


Documentation

Overview

Multigz - a pure-Go package implementing efficient seeking within gzip files

Abstract

This library allows creating, reading, and writing a special kind of gzip file, called a "multi-gzip", that allows for efficient seeking. Multi-gzips are fully compatible with existing gzip libraries and tools, so they can be treated as regular gzip files, but this library is also able to seek efficiently within them. So, if you are manipulating a normal gzip file, you first need to convert it to the multi-gzip format before you can seek to random offsets, but the conversion keeps it compatible with any other existing software that manipulates gzip files.

How to use

Most uses of seeking within gzip files do not require arbitrary random offsets, but only seeking to specific points within the decompressed stream; this library assumes this use-case. Both the Reader and the Writer types have an Offset() method that returns an Offset value, which represents a "pointer" to the current position in the decompressed stream. Reader also has a Seek() method that receives an Offset as an argument and seeks to that point.

Basically, we support two main scenarios:

  • If your application generates the gzip file that you need to seek into, change it to use multigz.Writer and, as you reach the points you will later need to seek back to, call Offset() and store the offsets in a data structure (which you can even marshal to disk if you want, like an index). Then, open the multi-gzip with multigz.Reader and use Seek() to jump to one of the previously-generated offsets (see the sketch after this list).

  • If your application receives an already-compressed multi-gzip, open it with multigz.Reader and scan it. When you reach points that you might need to seek to later, call Offset() and store the Offset. Afterwards, you can call Seek() at any time on the same Reader object to get back to the saved positions. You can serialize the Offsets to disk to skip the initial indexing phase the next time you process the same file.
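As a concrete illustration of the first scenario, here is a minimal sketch; the file name and record payload are made up for illustration, error handling is omitted for brevity, and the signatures are those listed in the index below:

package main

import (
	"os"

	"github.com/rasky/multigz"
)

func main() {
	// First pass: write records and remember where each one starts.
	f, _ := os.Create("records.gz")
	w, _ := multigz.NewWriterLevel(f, 6, multigz.DefaultBlockSize)
	var marks []multigz.Offset
	for i := 0; i < 100; i++ {
		marks = append(marks, w.Offset())
		w.Write([]byte("some record payload\n"))
	}
	w.Close()
	f.Close()

	// Second pass: reopen and jump straight to record 42,
	// decompressing at most one block to get there.
	f2, _ := os.Open("records.gz")
	r, _ := multigz.NewReader(f2)
	r.Seek(marks[42])
	buf := make([]byte, 20)
	r.Read(buf)
	r.Close()
}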

Command line tool

This package contains a command line tool called "multigz", which can be installed with the following command:

$ go get github.com/rasky/multigz/cmd/multigz

The tool is mostly compatible with "gzip", supporting all its main options. It can be used in automated scripts to generate multi-gzip files instead of gzip files. For instance, to create a .tar.gz archive that you can later seek into efficiently, use:

$ tar c <directory> | multigz -c > archive.tar.gz

Description of multi-gzip

Normally, it is impossible to seek to an arbitrary offset within a gzip stream without decompressing all previous bytes. The only possible workaround is to generate a special gzip in which the compressor status has been flushed multiple times during the stream; for instance, if we flush the status every 64k bytes, we will need to decompress at most 64k bytes before reaching any point in the decompressed stream, putting an upper bound on the cost of random seeking.

Flushing a deflate stream is not really supported by the deflate format itself, but fortunately gzip helps here: it is possible to concatenate multiple gzip files, and the resulting file is itself a valid gzip file; a gzip-compatible decompressor is in fact expected to decompress multiple consecutive gzip streams until EOF is reached. This means that, instead of just flushing the deflate state (which would be incompatible with existing decompressors), we flush and close the gzip stream, and start a new one within the same file. The resulting file is a valid gzip file, compatible with all existing gzip libraries and tools, but it can be seeked efficiently by knowing in advance where each internal gzip stream begins.
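This concatenation property is easy to verify with the standard library alone: compress/gzip decompresses across member boundaries by default (see gzip.Reader.Multistream):

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// member compresses s into one complete, self-terminated gzip stream.
func member(s string) []byte {
	var b bytes.Buffer
	zw := gzip.NewWriter(&b)
	zw.Write([]byte(s))
	zw.Close()
	return b.Bytes()
}

func main() {
	// Two gzip streams glued back to back are still one valid gzip file.
	file := append(member("hello, "), member("world")...)
	zr, _ := gzip.NewReader(bytes.NewReader(file))
	out, _ := io.ReadAll(zr)
	fmt.Printf("%s\n", out) // prints: hello, world
}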

Index

Constants

const DefaultBlockSize = 64 * 1024
const DefaultPeekSize = DefaultBlockSize * 2

Variables

This section is empty.

Functions

func Convert

func Convert(w io.Writer, r io.Reader, mode ConvertMode) error

Convert a whole gzip file into a multi-gzip file. mode selects between the normal writer and the rsync-friendly writer.
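For example, a small helper that converts a plain gzip file on disk might look like the following sketch (file names are hypothetical; imports are "os" plus this package):

// ConvertFile rewrites an existing gzip file as a multi-gzip.
func ConvertFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	return multigz.Convert(out, in, multigz.ConvertNormal)
}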

func IsProbablyMultiGzip

func IsProbablyMultiGzip(r io.ReadSeeker, peeksize int64) bool

Returns true if the file is (statistically) a multi-gzip. It tries to read peeksize bytes of decompressed data, stopping as soon as it sees a single gzip termination. Returns true if it found at least one termination, false if it didn't (or if there is any corruption while decoding the stream). If the stream is fully exhausted before peeksize bytes, the function returns true, as it is technically still a single-block multi-gzip.

Technically, a file is a multi-gzip even if there is just one split near its end; but the use-case we're aiming at is fast seeking, so we prefer to consider files with very large blocks as not proper multi-gzips.
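A typical check before deciding whether a conversion pass is needed might look like this sketch (the file name is made up, and we assume the caller must rewind the ReadSeeker afterwards, since the documentation does not specify the final position):

f, _ := os.Open("archive.tar.gz")
if !multigz.IsProbablyMultiGzip(f, multigz.DefaultPeekSize) {
	// Not seek-friendly yet: convert it first with Convert().
}
f.Seek(0, io.SeekStart) // assumption: rewind before reusing the file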

Types

type ConvertMode

type ConvertMode int
const (
	ConvertNormal ConvertMode = iota
	ConvertRsyncable
)

type GzipWriterRsyncable

type GzipWriterRsyncable struct {
	*gzip.Writer
	// contains filtered or unexported fields
}

func (*GzipWriterRsyncable) Offset

func (w *GzipWriterRsyncable) Offset() Offset

func (*GzipWriterRsyncable) Write

func (w *GzipWriterRsyncable) Write(data []byte) (int, error)

type Offset

type Offset struct {
	Block int64
	Off   int64
}

Offset represents a specific point in the decompressed stream to which we want to seek. The normal way to obtain an Offset is to call Reader.Offset() or Writer.Offset() at the specific point in the stream we are interested in; later, it is possible to call Reader.Seek() passing the Offset to efficiently get back to that point.
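Since Offset is a plain struct with exported fields, it can be persisted to build the on-disk index mentioned in the usage section above; a sketch using encoding/gob (the index file name is hypothetical, errors are ignored for brevity):

// Save the recorded offsets to an index file.
var index []multigz.Offset // filled by calling Offset() while scanning
f, _ := os.Create("archive.idx")
gob.NewEncoder(f).Encode(index)
f.Close()

// Later: load the index and seek without rescanning.
var loaded []multigz.Offset
f2, _ := os.Open("archive.idx")
gob.NewDecoder(f2).Decode(&loaded)
f2.Close()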

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

A multigz.Reader is 100% equivalent to a gzip.Reader, but allows seeking to specific positions within the compressed file.

The idea is to use a multi-pass approach; in the first pass, you can go through the file and record the positions of interest by calling Offset(). Then, you can seek to a specific offset by calling Seek().

func NewReader

func NewReader(r io.ReadSeeker) (*Reader, error)

func (*Reader) Close

func (or *Reader) Close() error

func (*Reader) IsProbablyMultiGzip

func (or *Reader) IsProbablyMultiGzip() bool

Returns true if we found at least one multi-gzip separator while reading this file. This function does not take into account the fact that short files can effectively be treated as multi-gzips even if technically they aren't. Unless you know that you've read enough bytes out of this file, you should use the global function IsProbablyMultiGzip(), which is a more general solution.

func (*Reader) Offset

func (or *Reader) Offset() Offset

func (*Reader) Read

func (or *Reader) Read(data []byte) (int, error)

func (*Reader) Seek

func (or *Reader) Seek(o Offset) error

type Writer

type Writer interface {
	io.WriteCloser

	// Returns an offset that points to the current point within the
	// decompressed stream.
	Offset() Offset
}

This interface represents an object that generates a multi-gzip file. In addition to implementing a standard WriteCloser, it gives access to an Offset method for fetching a pointer to the current position in the stream.

In the current version, there are two different implementations of Writer:

  • A writer that segments the multi-gzip file based on a fixed block size. Create it with NewWriterLevel().
  • A writer that segments the multi-gzip file in a way that makes it more friendly to rsync and binary diffs. Create it with NewWriterLevelRsyncable().

func NewWriterLevel

func NewWriterLevel(w io.Writer, level int, blocksize int) (Writer, error)

Create a new compressing writer that will generate a multi-gzip, segmenting the compressed stream at fixed offsets. This is similar to gzip.NewWriterLevel, but takes an additional argument that specifies the size of each gzip block. You can use multigz.DefaultBlockSize as a reasonable default (64 KiB) that balances decompression speed and compression overhead.

func NewWriterLevelRsyncable

func NewWriterLevelRsyncable(w io.Writer, level int) (Writer, error)

Create a new compressing writer that will generate a multi-gzip, segmenting the compressed stream in a way that makes it efficient to transfer over rsync when there are slight differences in the uncompressed stream.

This function is similar to NewWriterLevel in that it creates a multi-gzip file, but segmenting happens at data-dependent offsets that let the compressed stream resynchronize after localized changes in the uncompressed stream. In other words, we use the same algorithm as "gzip --rsyncable", but for a multi-gzip file.
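A small helper that picks between the two writers, under the assumption that compression level 6 is acceptable for both (imports are "io" plus this package):

func newWriter(w io.Writer, rsyncable bool) (multigz.Writer, error) {
	if rsyncable {
		return multigz.NewWriterLevelRsyncable(w, 6)
	}
	return multigz.NewWriterLevel(w, 6, multigz.DefaultBlockSize)
}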

Directories

Path Synopsis
cmd
