sloppy

package

v0.0.0-...-e5fa29d Latest Latest Go to latest Published: Aug 27, 2021 License: Apache-2.0 Imports: 1 Imported by: 2

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/attic-labs/noms

Links

Open Source Insights

Documentation ¶

Index ¶

type Sloppy
- func New(f func(b byte) bool) *Sloppy
- func (sl *Sloppy) Reset()
- func (sl *Sloppy) Update(src []byte)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Sloppy ¶

type Sloppy struct {
	// contains filtered or unexported fields
}

Sloppy is a logical variant of Snappy. Its purpose to provide a kind of estimate of how a given byte sequence *will be* compressed by Snappy. It is useful when a byte stream is fed into a rolling hash with the goal of achieving a given average chunk byte length *after compression*. Sloppy is logically similar to snappy, but prefers "copies" which are closer to the repeated byte sequence (snappy prefers to refer to the *first* instance of a repeated byte sequence). This is important for mitigating the likelihood that altering any byte in an input stream will cause chunk boundaries to be redrawn downstream.

The high-level approach is to maintain a logical mapping between four-byte sequences which have been observed in the stream so-far and the integer offset of observed sequence (the mapping is done with a "cheap" hash-function which permits false-positives because they can be trivial filtered out). In the non-matched state, for each new byte consumed, a uint32 is computed from the next 4 bytes and then a look-up is performed to check for a matching 4 bytes earlier in the stream. Snappy and sloppy behave roughly identical thus far.

When in the "matched state" (attempting to extend the current match), Snappy does not re-index new 4-byte sequences, but Sloppy does. The reason for this is that Sloppy would like match the most recent occurence as it moves forward.

Lastly, Sloppy adds two novel heuritics, both aimed at further mitigating the chance of chunk boundaries being redrawn because of byte value changes:

1) During the first 2 bytes of match, it *continues* to look for closer matches (effectively prefering a closer but shorter copy to a further but longer one). The reason for this is that when sequences repeat frequently in a byte stream, randomness provides for a good chance that a one or two byte prefix on a repeated sequence will match "far away". E.g.

"23hello my friend, 12hello my friend, 01hello my friend, 23hello my friend"

In the above sequence, sloppy would prefer to copy the final "hello my friend" 19 bytes backwards rather than "23hello my friend" quite a bit further.

2) Sloppy will only emit copies which are "worth it". I.e. The longer the reference back, the longer the length of the copy must be.

func New ¶

func New(f func(b byte) bool) *Sloppy

New returns a new sloppy encoder which will encode to |f|. If |f| ever returns false, then encoding ends immediately. |f| is a callback because the primary use is that the "encoded" byte stream is fed byte-by-byte into a rolling hash function.

func (*Sloppy) Reset ¶

func (sl *Sloppy) Reset()

func (*Sloppy) Update ¶

func (sl *Sloppy) Update(src []byte)

Update continues the encoding of a given input stream. The caller is expected to call update after having (ONLY) appended bytes to |src|. When |Update| returns, sloppy will have emitted 0 or more literals or copies by calling the |sf.f|. Note that sloppy will ALWAYS buffer the final three bytes of input.

Source Files ¶

View all Source files

sloppy.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL