xflate

package
v0.0.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 6, 2019 License: BSD-3-Clause Imports: 11 Imported by: 1

Documentation

Overview

Package xflate implements the XFLATE compressed data format.

The XFLATE format is a backwards compatible extension to DEFLATE (RFC 1951); meaning that any data compressed as XFLATE can also be decompressed by any RFC 1951 compliant decoder. XFLATE extends DEFLATE by enabling efficient random access reading of the stream. This is accomplished by compressing the stream as independent chunks and by encoding an index table at the end of the compressed stream. Since the index table contains information about each chunk, an XFLATE reader can read this index and determine where to seek to in order to satisfy a random access read request.

It is important to remember that all XFLATE streams are DEFLATE streams; but, not all DEFLATE streams are XFLATE streams. Only streams written in the XFLATE format can be read by xflate.Reader. Thus, do not expect xflate.Reader to be able to provide random access to any arbitrary DEFLATE stream.

XFLATE was designed so that current applications of DEFLATE could easily switch with few detriments. While XFLATE offers random access reading, it does cause the compressed stream to be slightly larger than if the input had been compressed as a single DEFLATE stream. The factor that has the most effect on the compression overhead is the choice of chunk size.

This compression overhead occurs because XFLATE requires that chunks be individually compressed (the LZ77 dictionary is reset before every chunk). This means that each chunk cannot benefit from the data that preceded it. There is a trade-off in the choice of an appropriate chunk size; Smaller chunk sizes allow for better random access properties (since it reduces that amount data that a reader may need to discard), but hurts the compression ratio. The default chunk size was chosen so that the compression overhead was about 1% for most workloads.

Format specification:

https://github.com/dsnet/compress/blob/master/doc/xflate-format.pdf
Example (GzipFile)

The Gzip format (RFC 1952) is a framing format for DEFLATE (RFC 1951). For this reason, we can provide random access decompression to Gzip files that are compressed with XFLATE. The example below adds a lightweight header and footer to the XFLATE stream to make it compliant with the Gzip format. This has the advantage that these files remain readable by standard implementations of Gzip. Note that regular Gzip files are not seekable because they are not compressed in the XFLATE format.

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"io"
	"io/ioutil"
	"log"

	"github.com/dsnet/compress/internal/testutil"
	"github.com/dsnet/compress/xflate"
)

func main() {
	// Test file of non-trivial size.
	twain := testutil.MustLoadFile("../testdata/twain.txt")

	// The Gzip header without using any extra features is 10 bytes long.
	const header = "\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"

	// Write the Gzip file.
	buffer := new(bytes.Buffer)
	{
		// Write Gzip header.
		buffer.WriteString(header)

		// Instead of using flate.Writer, we use xflate.Writer instead.
		// We choose a relative small chunk size of 64KiB for better
		// random access properties, at the expense of compression ratio.
		xw, err := xflate.NewWriter(buffer, &xflate.WriterConfig{
			Level:     xflate.BestSpeed,
			ChunkSize: 1 << 16,
		})
		if err != nil {
			log.Fatal(err)
		}

		// Write the test data.
		crc := crc32.NewIEEE()
		mw := io.MultiWriter(xw, crc) // Write to both compressor and hasher
		if _, err := io.Copy(mw, bytes.NewReader(twain)); err != nil {
			log.Fatal(err)
		}
		if err := xw.Close(); err != nil {
			log.Fatal(err)
		}

		// Write Gzip footer.
		binary.Write(buffer, binary.LittleEndian, uint32(crc.Sum32()))
		binary.Write(buffer, binary.LittleEndian, uint32(len(twain)))
	}

	// Verify that Gzip file is RFC 1952 compliant.
	{
		gz, err := gzip.NewReader(bytes.NewReader(buffer.Bytes()))
		if err != nil {
			log.Fatal(err)
		}
		buf, err := ioutil.ReadAll(gz)
		if err != nil {
			log.Fatal(err)
		}
		if !bytes.Equal(buf, twain) {
			log.Fatal("gzip content does not match")
		}
	}

	// Read the Gzip file.
	{
		// Parse and discard the Gzip wrapper.
		// This does not work for back-to-back Gzip files.
		var hdr [10]byte
		rd := bytes.NewReader(buffer.Bytes())
		if _, err := rd.ReadAt(hdr[:], 0); err != nil {
			log.Fatal(err)
		}
		if string(hdr[:3]) != header[:3] || rd.Size() < 18 {
			log.Fatal("not a gzip file")
		}
		if hdr[3]&0xfe > 0 {
			log.Fatal("no support for extra gzip features")
		}
		rds := io.NewSectionReader(rd, 10, rd.Size()-18) // Strip Gzip header/footer

		// Since we know that the writer used the XFLATE format, we can open
		// the compressed file as an xflate.Reader. If the file was compressed
		// with regular DEFLATE, then this will return an error.
		xr, err := xflate.NewReader(rds, nil)
		if err != nil {
			log.Fatal(err)
		}

		// Read from the middle of the stream.
		buf := make([]byte, 80)
		pos := int64(len(twain) / 2)
		if _, err := xr.Seek(pos, io.SeekStart); err != nil {
			log.Fatal(err)
		}
		if _, err := io.ReadFull(xr, buf); err != nil {
			log.Fatal(err)
		}

		// Close the Reader.
		if err := xr.Close(); err != nil {
			log.Fatal(err)
		}

		got := string(buf)
		want := string(twain[pos : pos+80])
		fmt.Printf("got:  %q\nwant: %q\n", got, want)
	}

}
Output:

got:  "ver, white with foam, the driving spray of spume-flakes, the dim\noutlines of the"
want: "ver, white with foam, the driving spray of spume-flakes, the dim\noutlines of the"
Example (ZipFile)

Zip archives allow for efficient random access between files, however, they do not easily allow for efficient random access within a given file, especially if compressed. In this example, we use XFLATE to compress each file. This is particularly useful for seeking within a relatively large file in a Zip archive.

package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"log"

	"github.com/dsnet/compress/internal/testutil"
	"github.com/dsnet/compress/xflate"
)

func init() { log.SetFlags(log.Lshortfile) }

func main() {
	// Test files of non-trivial sizes.
	files := map[string][]byte{
		"twain.txt":   testutil.MustLoadFile("../testdata/twain.txt"),
		"digits.txt":  testutil.MustLoadFile("../testdata/digits.txt"),
		"huffman.txt": testutil.MustLoadFile("../testdata/huffman.txt"),
	}

	// Write the Zip archive.
	buffer := new(bytes.Buffer)
	zw := zip.NewWriter(buffer)
	zw.RegisterCompressor(zip.Deflate, func(wr io.Writer) (io.WriteCloser, error) {
		// Instead of the default DEFLATE compressor, register one that uses
		// XFLATE instead. We choose a relative small chunk size of 64KiB for
		// better random access properties, at the expense of compression ratio.
		return xflate.NewWriter(wr, &xflate.WriterConfig{
			Level:     xflate.BestSpeed,
			ChunkSize: 1 << 16,
		})
	})
	for _, name := range []string{"twain.txt", "digits.txt", "huffman.txt"} {
		body := files[name]
		f, err := zw.Create(name)
		if err != nil {
			log.Fatal(err)
		}
		if _, err = f.Write(body); err != nil {
			log.Fatal(err)
		}
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}

	// Read the Zip archive.
	rd := bytes.NewReader(buffer.Bytes())
	zr, err := zip.NewReader(rd, rd.Size())
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range zr.File {
		// Verify that the new compression format is backwards compatible with
		// a standard DEFLATE decompressor.
		rc, err := f.Open()
		if err != nil {
			log.Fatal(err)
		}
		buf, err := ioutil.ReadAll(rc)
		if err != nil {
			log.Fatal(err)
		}
		if err := rc.Close(); err != nil {
			log.Fatal(err)
		}
		if !bytes.Equal(buf, files[f.Name]) {
			log.Fatal("file content does not match")
		}
	}
	for _, f := range zr.File {
		// In order for XFLATE to provide random access, it needs to be provided
		// an io.ReadSeeker in order to operate. Thus, get low-level access to
		// the compressed file data in archive.
		off, err := f.DataOffset()
		if err != nil {
			log.Fatal(err)
		}
		rds := io.NewSectionReader(rd, off, int64(f.CompressedSize64))

		// Since we know that the writer used the XFLATE format, we can open
		// the compressed file as an xflate.Reader. If the file was compressed
		// with regular DEFLATE, then this will return an error.
		xr, err := xflate.NewReader(rds, nil)
		if err != nil {
			log.Fatal(err)
		}

		// Read from the middle of the file.
		buf := make([]byte, 80)
		pos := int64(f.UncompressedSize64 / 2)
		if _, err := xr.Seek(pos, io.SeekStart); err != nil {
			log.Fatal(err)
		}
		if _, err := io.ReadFull(xr, buf); err != nil {
			log.Fatal(err)
		}

		// Close the Reader.
		if err := xr.Close(); err != nil {
			log.Fatal(err)
		}

		got := string(buf)
		want := string(files[f.Name][pos : pos+80])
		fmt.Printf("File: %s\n\tgot:  %q\n\twant: %q\n\n", f.Name, got, want)
	}

}
Output:

File: twain.txt
	got:  "ver, white with foam, the driving spray of spume-flakes, the dim\noutlines of the"
	want: "ver, white with foam, the driving spray of spume-flakes, the dim\noutlines of the"

File: digits.txt
	got:  "63955008002334767618706808652687872278317742021406898070341050620023527363226729"
	want: "63955008002334767618706808652687872278317742021406898070341050620023527363226729"

File: huffman.txt
	got:  "E+uXeMsjFSXvhrGmRZCF7ErSVMWoWEzqMdW8uRyjCRxkQxOrWrQgkSdHshJyTbsBajQUoNfPY1zuLRvy"
	want: "E+uXeMsjFSXvhrGmRZCF7ErSVMWoWEzqMdW8uRyjCRxkQxOrWrQgkSdHshJyTbsBajQUoNfPY1zuLRvy"

Index

Examples

Constants

View Source
const (
	// Compression levels to be used with the underlying DEFLATE compressor.
	NoCompression      = -1
	BestSpeed          = 1
	DefaultCompression = 6
	BestCompression    = 9

	// DefaultChunkSize specifies the default uncompressed size for each chunk.
	// This value was chosen so that the overhead of the XFLATE format would be
	// approximately 1% for the typical dataset.
	DefaultChunkSize = 1 << 18 // 256 KiB

	// DefaultIndexSize specifies the default number of records that each index
	// should contain. This value was chosen so that the maximum amount of
	// memory used for the index is comparable to the memory requirements for
	// DEFLATE compression itself.
	DefaultIndexSize = 1 << 12 // 96 KiB
)

Writer configuration constants. The values can be set in WriterConfig and passed to NewWriter to provide finer granularity control.

Variables

This section is empty.

Functions

This section is empty.

Types

type FlushMode

type FlushMode int

The FlushMode constants can be passed to Writer.Flush to control the specific type of flushing performed.

const (
	// FlushSync flushes the write buffer to the underlying writer, but does not
	// reset the dictionary. The intended use case for this is if the underlying
	// writer is a socket or pipe and the caller intends for the recipient to
	// be able to decompress all data written thus far.
	//
	// This is equivalent to SYNC_FLUSH in zlib terminology.
	FlushSync FlushMode = iota

	// FlushFull flushes the write buffer to the underlying writer, and also
	// resets the dictionary. The raw and compressed sizes of the chunk will be
	// inserted into the in-memory index table. The intended use case for this
	// is if the caller wants the current offset to be one that is efficient to
	// seek to by the Reader.
	//
	// Performing this action often can be detrimental to the compression ratio.
	// This is equivalent to FULL_FLUSH in zlib terminology.
	FlushFull

	// FlushIndex flushes the write buffer to the underlying writer, resets the
	// dictionary, write the contents of the current index table, and then
	// clears the index. The intended use case for this is for callers with
	// limited memory to ensure that the index does not grow without limit.
	// Rather than calling this manually, consider using WriterConfig.IndexSize.
	//
	// Performing this action often can be detrimental to the compression ratio
	// and should be use sparingly.
	FlushIndex
)

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

A Reader is an io.ReadSeeker that can read the XFLATE format. Only the stream produced by Writer (or some other valid XFLATE stream) can be read by Reader. Regular DEFLATE streams produced by flate.Writer cannot be read by Reader.

func NewReader

func NewReader(rs io.ReadSeeker, conf *ReaderConfig) (*Reader, error)

NewReader creates a new Reader reading the given reader rs. This reader can only decompress files in the XFLATE format. If the underlying stream is regular DEFLATE and not XFLATE, then this returns error.

Regardless of the current offset in rs, this function Seeks to the end of rs in order to determine the total compressed size. The Reader returned has its offset set to the start of the stream.

If conf is nil, then default configuration values are used. Reader copies all configuration values as necessary and does not store conf.

func (*Reader) Close

func (xr *Reader) Close() error

Close ends the XFLATE stream.

func (*Reader) Read

func (xr *Reader) Read(buf []byte) (int, error)

Read reads decompressed data from the underlying io.Reader. This method automatically proceeds to the next chunk when the current one has been fully read.

func (*Reader) Reset

func (xr *Reader) Reset(rs io.ReadSeeker) error

Reset discards the Reader's state and makes it equivalent to the result of a call to NewReader, but reading from rd instead. This method may return an error if it is unable to parse the index.

This is used to reduce memory allocations.

func (*Reader) Seek

func (xr *Reader) Seek(offset int64, whence int) (int64, error)

Seek sets the offset for the next Read operation, interpreted according to the whence value provided. It is permitted to seek to offsets in the middle of a compressed chunk. The next call to Read will automatically discard some number of bytes before returning the requested data.

type ReaderConfig

type ReaderConfig struct {
	// contains filtered or unexported fields
}

ReaderConfig configures the Reader. There are currently no configuration options for Reader.

type Writer

type Writer struct {
	// These statistics fields are automatically updated by Writer.
	// It is safe to set these values to any arbitrary value.
	InputOffset  int64 // Total number of bytes issued to Write
	OutputOffset int64 // Total number of bytes written to underlying io.Writer
	// contains filtered or unexported fields
}

A Writer is an io.Writer that can write the XFLATE format. The XFLATE stream outputted by this Writer can be read by both Reader and flate.Reader.

func NewWriter

func NewWriter(wr io.Writer, conf *WriterConfig) (*Writer, error)

NewWriter creates a new Writer writing to the given writer. It is the caller's responsibility to call Close to complete the stream.

If conf is nil, then default configuration values are used. Writer copies all configuration values as necessary and does not store conf.

func (*Writer) Close

func (xw *Writer) Close() error

Close ends the XFLATE stream and flushes all buffered data. This method automatically writes an index if any chunks have been written since the last FlushIndex.

func (*Writer) Flush

func (xw *Writer) Flush(mode FlushMode) error

Flush flushes the current write buffer to the underlying writer. Flushing is entirely optional and should be used sparingly.

func (*Writer) Reset

func (xw *Writer) Reset(wr io.Writer) error

Reset discards the Writer's state and makes it equivalent to the result of a call to NewWriter, but writes to wr instead. Any configurations from a prior call to NewWriter will be preserved.

This is used to reduce memory allocations.

func (*Writer) Write

func (xw *Writer) Write(buf []byte) (int, error)

Write writes the compressed form of buf to the underlying io.Writer. This automatically breaks the input into multiple chunks, writes them out, and records the sizes of each chunk in the index table.

type WriterConfig

type WriterConfig struct {
	// Underlying DEFLATE compression level.
	//
	// This compression level will be passed directly to the underlying DEFLATE
	// compressor. Higher values provide better compression ratio at the expense
	// of CPU time.
	Level int

	// Uncompressed size of each independent chunk.
	//
	// Each chunk will be compressed independently. This has that advantage that
	// the chunk can be decompressed without knowledge about the preceding
	// chunks, but has the disadvantage that it reduces the compression ratio.
	// Smaller ChunkSizes provide better random access properties, while larger
	// sizes provide better compression ratio.
	ChunkSize int64

	// The number of records in each index.
	//
	// When this number is reached, the index is automatically flushed. This is
	// done to ensure that there is some limit on the amount of memory needed to
	// represent the index. A negative value indicates that the Writer will
	// not automatically flush the index.
	//
	// The multiplication of the IndexSize and the size of each record (24 B)
	// gives an approximation for how much memory the index will occupy.
	// The multiplication of the IndexSize and the ChunkSize gives an
	// approximation for how much uncompressed data each index represents.
	IndexSize int64
	// contains filtered or unexported fields
}

WriterConfig configures the Writer. The zero value for any field uses the default value for that field type.

Directories

Path Synopsis
internal
meta
Package meta implements the XFLATE meta encoding scheme.
Package meta implements the XFLATE meta encoding scheme.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL