Documentation ¶
Overview ¶
Multigz - a pure-Go package implementing efficient seeking within gzip files
Abstract ¶
This library lets you create, read, and write a special kind of gzip file called a "multi-gzip" that allows for efficient seeking. Multi-gzips are fully compatible with existing gzip libraries and tools, so they can be treated as plain gzip files, but this library is also able to seek efficiently within them. So, if you are manipulating a normal gzip file, you first need to convert it to the multi-gzip format before being able to seek to random offsets, but the conversion keeps it compatible with any other existing software that manipulates gzip files.
How to use ¶
Most uses of seeking within gzip files do not require arbitrary random offsets, but only seeking to specific points within the decompressed stream; this library assumes this use case. Both the Reader and the Writer types have an Offset() method that returns an Offset value, which represents a "pointer" to the current position in the decompressed stream. Reader also has a Seek() method that receives an Offset as argument and seeks to that point.
Basically, we support two main scenarios:
If your application generates the gzip file that you need to seek into, then change it to use multigz.Writer and, as you reach the points where you will need to seek back to, call Offset() and store the offsets into a data structure (that you can even marshal to disk if you want, like an index). Then, open the multi-gzip with multigz.Reader and use Seek() to seek to one of the previously generated offsets.
If your application receives an already-compressed multi-gzip, open it with multigz.Reader and scan it. When you reach points that you might need to seek to later, call Offset() and store the Offset. Afterwards, you can call Seek() at any time on the same Reader object to seek back to the saved positions. You can serialize the Offsets to disk to skip the initial indexing phase for the same file.
Command line tool ¶
This package contains a command line tool called "multigz", which can be installed with the following command:
$ go get github.com/rasky/multigz/cmd/multigz
The tool is mostly compatible with "gzip", supporting all its main options. It can be used in automated scripts to generate multi-gzip files instead of gzip files. For instance, to create a .tar.gz archive that you can later seek into easily, use:
$ tar c <directory> | multigz -c > archive.tar.gz
Description of multi-gzip ¶
Normally, it is impossible to seek to arbitrary offsets within a gzip stream without decompressing all previous bytes. The only possible workaround is to generate a special gzip in which the compressor state has been flushed multiple times during the stream; for instance, if we flush the state every 64k bytes, we will need to decompress at most 64k bytes before reaching any point in the decompressed stream, putting an upper bound on the cost of a random seek.
Flushing a deflate stream is not really supported by the deflate format itself, but fortunately gzip helps here: it is possible to concatenate multiple gzip files, and the resulting file is itself a valid gzip file, because a gzip-compatible decompressor is expected to decompress multiple consecutive gzip streams until EOF is reached. This means that, instead of just flushing the deflate state (which would be incompatible with existing decompressors), we flush and close the gzip stream, and start a new one within the same file. The resulting file is a valid gzip file, compatible with all existing gzip libraries and tools, but it can be seeked efficiently by knowing in advance where each internal gzip stream begins.
Index ¶
Constants ¶
const DefaultBlockSize = 64 * 1024
const DefaultPeekSize = DefaultBlockSize * 2
Variables ¶
This section is empty.
Functions ¶
func Convert ¶
Convert a whole gzip file into a multi-gzip file. mode can be used to select between using a normal writer, or the rsync-friendly writer.
func IsProbablyMultiGzip ¶
func IsProbablyMultiGzip(r io.ReadSeeker, peeksize int64) bool
Returns true if the file is (statistically) a multi-gzip. It tries to read peeksize bytes of decompressed data, but stops when it sees a single gzip termination. Returns true if it found at least one termination, false if it didn't (or if there is any corruption in decoding the stream). If the stream is fully exhausted before peeksize bytes, the function returns true, as it is technically still a single-block multi-gzip.
Technically, a file is a multi-gzip even if there is just one split near the end of it; but the use case we are aiming at is seeking performance, so we prefer to consider files with large blocks as not proper multi-gzips.
Types ¶
type GzipWriterRsyncable ¶
func (*GzipWriterRsyncable) Offset ¶
func (w *GzipWriterRsyncable) Offset() Offset
type Offset ¶
Offset represents a specific point in the decompressed stream that we want to seek to. The normal way to obtain an Offset is to call Reader.Offset() or Writer.Offset() at the specific point in the stream we are interested in; later, it is possible to call Reader.Seek() passing the Offset to efficiently get back to that point.
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
A multigz.Reader is 100% equivalent to a gzip.Reader, but allows seeking to specific positions within the compressed file.
The idea is to use a multi-pass approach: in the first pass, you go through the file and record the positions of interest by calling Offset(); then, you can seek to a specific offset by calling Seek().
func (*Reader) IsProbablyMultiGzip ¶
Returns true if at least one multi-gzip separator was found while reading this file. This method does not take into account the fact that short files can be effectively treated as multi-gzips even if technically they aren't. Unless you know that you've read enough bytes out of this file, you should use the global function IsProbablyMultiGzip(), which is a more general solution.
type Writer ¶
type Writer interface {
	io.WriteCloser

	// Returns an offset that points to the current point within the
	// decompressed stream.
	Offset() Offset
}
This interface represents an object that generates a multi-gzip file. In addition to implementing the standard WriteCloser, it also provides an Offset method for fetching a pointer to the current position in the stream.
In the current version, there are two different implementations of Writer:
- A writer that segments the multi-gzip file based on a fixed block size. Create it with NewWriterLevel().
- A writer that segments the multi-gzip file in a way that makes it more friendly to rsync and binary diffs. Create it with NewWriterLevelRsyncable().
func NewWriterLevel ¶
Create a new compressing writer that will generate a multi-gzip, segmenting the compressed stream at fixed offsets. This is similar to gzip.NewWriterLevel, but takes an additional argument that specifies the size of each gzip block. You can use multigz.DefaultBlockSize as a reasonable default (64 KiB) that balances decompression speed and compression overhead.
func NewWriterLevelRsyncable ¶
Create a new compressing writer that will generate a multi-gzip, segmenting the compressed stream in a way to be efficient when transferred over rsync with slight differences in the uncompressed stream.
This function is similar to NewWriterLevel in that it creates a multi-gzip file, but segmenting happens at data-dependent offsets that make the compressed stream resynchronize after localized changes in the uncompressed stream. In other words, we use the same algorithm as "gzip --rsyncable", but for a multi-gzip file.