compression

package
v0.0.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 13, 2023 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Index

Constants

View Source
const (
	HASH_SHIFT = 5
	HASH_MASK  = 32767
)
View Source
const (
	// Minimum and maximum length that can be encoded in deflate.
	MAX_MATCH = 258
	MIN_MATCH = 3

	// The window size for deflate. Must be a power of two. This should be
	// 32768, the maximum possible by the deflate spec. Anything less hurts
	// compression more than speed.
	WINDOW_SIZE = 32768

	// The window mask used to wrap indices into the window. This is why the
	// window size must be a power of two.
	WINDOW_MASK = (WINDOW_SIZE - 1)

	// A block structure of huge, non-smart, blocks to divide the input into, to allow
	// operating on huge files without exceeding memory, such as the 1GB wiki9 corpus.
	// The whole compression algorithm, including the smarter block splitting, will
	// be executed independently on each huge block.
	// Dividing into huge blocks hurts compression, but not much relative to the size.
	// Set this to, for example, 20MB (20000000). Set it to 0 to disable master blocks.
	MASTER_BLOCK_SIZE = 20000000

	// For longest match cache. max 256. Uses huge amounts of memory but makes it
	// faster. Uses this many times three bytes per single byte of the input data.
	// This is so because longest match finding has to find the exact distance
	// that belongs to each length for the best lz77 strategy.
	// Good values: e.g. 5, 8.
	CACHE_LENGTH = 8

	// limit the max hash chain hits for this hash value. This has an effect only
	// on files where the hash value is the same very often. On these files, this
	// gives worse compression (the value should ideally be 32768, which is the
	// WINDOW_SIZE, while zlib uses 4096 even for best level), but makes it
	// faster on some specific files.
	// Good value: e.g. 8192.
	MAX_CHAIN_HITS = 8192

	// Whether to use the longest match cache for FindLongestMatch. This cache
	// consumes a lot of memory but speeds it up. No effect on compression size.
	LONGEST_MATCH_CACHE = true

	// Enable to remember amount of successive identical bytes in the hash chain for
	// finding longest match
	// required for HASH_SAME_HASH and SHORTCUT_LONG_REPETITIONS
	// This has no effect on the compression result, and enabling it increases speed.
	HASH_SAME = true

	// Switch to a faster hash based on the info from HASH_SAME once the
	// best length so far is long enough. This is way faster for files with lots of
	// identical bytes, on which the compressor is otherwise too slow. Regular files
	// are unaffected or maybe a tiny bit slower.
	// This has no effect on the compression result, only on speed.
	HASH_SAME_HASH = true

	// Enable this, to avoid slowness for files which are a repetition of the same
	// character more than a multiple of MAX_MATCH times. This should not affect
	// the compression result.
	SHORTCUT_LONG_REPETITIONS = true

	// Whether to use lazy matching in the greedy LZ77 implementation. This gives a
	// better result of LZ77Greedy, but the effect this has on the optimal LZ77
	// varies from file to file.
	LAZY_MATCHING = true
)
View Source
const (
	FORMAT_GZIP = iota
	FORMAT_ZLIB
	FORMAT_DEFLATE
)

Output format

View Source
const (
	UNCOMPRESSED_BLOCK = iota
	FIXED_BLOCK
	DYNAMIC_BLOCK
)

Block type

Variables

This section is empty.

Functions

func CalculateEntropy

func CalculateEntropy(count []float64) (bitLengths []float64)

Calculates the entropy of each symbol, based on the counts of each symbol. The result is similar to the result of CalculateBitLengths, but with the actual theoritical bit lengths according to the entropy. Since the resulting values are fractional, they cannot be used to encode the tree specified by DEFLATE.

func Compress

func Compress(options *Options, outputType int, in []byte, out io.Writer) error

func DeflateCompress

func DeflateCompress(options *Options, in []byte, out io.Writer) error

func GzipCompress

func GzipCompress(options *Options, in []byte, out io.Writer) error

Compresses according to the gzip specification and writes the compressed result to the output.

options: global program options out: writer to which the result is appended

func ZlibCompress

func ZlibCompress(options *Options, in []byte, out io.Writer) error

Types

type BlockState

type BlockState struct {
	// contains filtered or unexported fields
}

Some state information for compressing a block. This is currently a bit under-used (with mainly only the longest match cache), but is kept for easy future expansion.

func NewBlockState

func NewBlockState(options *Options, in []byte, inStart, inEnd int) (s BlockState)

func (*BlockState) LZ77Greedy

func (s *BlockState) LZ77Greedy(inStart, inEnd int) (store LZ77Store)

Does LZ77 using an algorithm similar to gzip, with lazy matching, rather than with the slow but better "squeeze" implementation. The result is placed in the LZ77Store. If inStart is larger than 0, it uses values before inStart as starting dictionary.

func (*BlockState) LZ77Optimal

func (s *BlockState) LZ77Optimal(inStart, inEnd int) LZ77Store

Calculates lit/len and dist pairs for given data. If instart is larger than 0, it uses values before instart as starting dictionary.

func (*BlockState) LZ77OptimalFixed

func (s *BlockState) LZ77OptimalFixed(inStart, inEnd int) LZ77Store

Does the same as LZ77Optimal, but optimized for the fixed tree of the deflate standard. The fixed tree never gives the best compression. But this gives the best possible LZ77 encoding possible with the fixed tree. This does not create or output any fixed tree, only LZ77 data optimized for using with a fixed tree. If inStart is larger than 0, it uses values before inStart as starting dictionary.

type Deflator

type Deflator struct {
	// contains filtered or unexported fields
}

func NewDeflator

func NewDeflator(wr io.Writer, options *Options) Deflator

func (*Deflator) Deflate

func (z *Deflator) Deflate(final bool, in []byte) (err error)

Compresses according to the deflate specification and append the compressed result to the output. This function will usually output multiple deflate blocks. If final is 1, then the final bit will be set on the last block.

final: whether this is the last section of the input, sets the final bit to the

last deflate block.

in: the input bytes

func (*Deflator) DeflatePart

func (z *Deflator) DeflatePart(final bool, in []byte, inStart, inEnd int) (err error)

Deflate a part, to allow Deflate() to use multiple master blocks if needed.

Like Deflate, but allows to specify start and end byte with inStart and inEnd. Only that part is compressed, but earlier bytes are still used for the back window.

It is possible to call this function multiple times in a row, shifting inStart and inEnd to next bytes of the data. If inStart is larger than 0, then previous bytes are used as the initial dictionary for LZ77. This function will usually output multiple deflate blocks. If final is 1, then the final bit will be set on the last block.

func (*Deflator) WriteLZ77Block

func (z *Deflator) WriteLZ77Block(blockType byte, final bool, store LZ77Store, expectedDataSize int)

Adds a deflate block with the given LZ77 data to the output. z: the stream to write to blockType: the block type, must be 1 or 2 final: whether to set the "final" bit on this block, must be the last block store: literal/length/distance array of the LZ77 data expectedDataSize: the uncompressed block size, used for panic, but you can

set it to 0 to not do the assertion.

type LZ77Store

type LZ77Store []lz77Pair

Stores lit/length and dist pairs for LZ77.

func (LZ77Store) CalculateBlockSize

func (store LZ77Store) CalculateBlockSize(blockType byte) uint64

Calculates block size in bits. litLens: lz77 lit/lengths dists: ll77 distances

type Options

type Options struct {
	// Whether to print output
	Verbose bool

	// Whether to print more detailed output
	VerboseMore bool

	// Maximum amount of times to rerun forward and backward pass to optimize
	// LZ77 compression cost. Good values: 10, 15 for small files, 5 for files
	// over several MB in size or it will be too slow.
	NumIterations int

	// If true, splits the data in multiple deflate blocks with optimal choice
	// for the block boundaries. Block splitting gives better compression. Default:
	// true.
	BlockSplitting bool

	// If true, chooses the optimal block split points only after doing the iterative
	// LZ77 compression. If false, chooses the block split points first, then does
	// iterative LZ77 on each individual block. Depending on the file, either first
	// or last gives the best compression. Default: false.
	BlockSplittingLast bool

	// Maximum amount of blocks to split into (0 for unlimited, but this can give
	// extreme results that hurt compression on some files). Default value: 15.
	BlockSplittingMax int

	// The deflate block type. Use 2 for best compression.
	//	 -0: non compressed blocks (00)
	//	 -1: blocks with fixed tree (01)
	//	 -2: blocks with dynamic tree (10)
	BlockType byte
}

Options used throughout the program.

func DefaultOptions

func DefaultOptions() (options Options)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL