imohash

Version: v1.0.3 (Latest)
Note: this package is not in the latest version of its module.
Published: Feb 23, 2024 | License: MIT | Imports: 5 | Imported by: 13

README

imohash is a fast, constant-time hashing library for Go. It uses file size and sampling to calculate hashes quickly, regardless of file size.

imosum is a sample application to hash files from the command line, similar to md5sum.

Alternative implementations

Installation

go get github.com/kalafut/imohash/...

The API is described in the package documentation.

Uses

Because imohash only reads a small portion of a file's data, it is very fast and well suited to file synchronization and deduplication, especially over a fairly slow network. A need to manage media (photos and video) over Wi-Fi between a NAS and multiple family computers is how the library was born.

If you just need to check whether two files are the same, and understand the limitations that sampling imposes (see below), imohash may be a good fit.

Misuses

Because imohash only reads a small portion of a file's data, it is not suitable for:

  • file verification or integrity monitoring
  • cases where fixed-size files are manipulated
  • anything cryptographic

Design

(Note: a more precise description is provided in the algorithm description.)

imohash works by hashing small chunks of data from the beginning, middle and end of a file. It also incorporates the file size into the final 128-bit hash. This approach is based on a few assumptions which will vary by application. First, file size alone tends [1] to be a pretty good differentiator, especially as file size increases. And when people do things to files (such as editing photos), size tends to change. So size is used directly in the hash, and any files that have different sizes will have different hashes.

Size is an effective differentiator but isn't sufficient. It can show that two files aren't the same, but to increase confidence that like-size files are the same, a few segments are hashed using murmur3, a fast and effective hashing algorithm. By default, 16K chunks from the beginning, middle and end of the file are used. The ends of files often contain metadata which is more prone to changing without affecting file size. The middle is for good measure. The sample size can be changed for your application.

[1] Try du -a . | sort -nr | less on a sample of your files to check this assertion.

Small file exemption

Small files are more likely to collide on size than large ones. They're also probably more likely to change in subtle ways that sampling will miss (e.g. editing a large text file). For this reason, imohash will simply hash the entire file if it is less than 128K. This parameter is also configurable.
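The exemption amounts to a single comparison. The sketch below restates the documented rules (files smaller than the threshold are hashed whole, and a sample size below 1 disables sampling) in a hypothetical helper named hashesWholeFile. It is not the library's code, and the constant names are local to the example.

```go
package main

import "fmt"

// Default values as documented for the package constants.
const (
	defaultSampleSize      = 16 * 1024
	defaultSampleThreshold = 128 * 1024
)

// hashesWholeFile reports whether, under the documented rules, a file of the
// given size would be hashed in full rather than sampled.
func hashesWholeFile(size int64, sampleSize, sampleThreshold int) bool {
	return sampleSize < 1 || size < int64(sampleThreshold)
}

func main() {
	sizes := []int64{4 * 1024, 128*1024 - 1, 128 * 1024, 500 * 1024 * 1024}
	for _, size := range sizes {
		whole := hashesWholeFile(size, defaultSampleSize, defaultSampleThreshold)
		fmt.Printf("%d bytes: whole file = %v\n", size, whole)
	}
}
```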

Performance

The standard hash performance metrics make no sense for imohash since it's only reading a limited set of the data. That said, the real-world performance is very good. If you are working with large files and/or a slow network, expect huge speedups. (spoiler: reading 48K is quicker than reading 500MB.)
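The 48K figure follows from three samples at the default 16K sample size. A small worked calculation, treating "500MB" as 500 MiB for illustration (sampledBytes is a hypothetical helper):

```go
package main

import "fmt"

// sampledBytes returns how many bytes three samples of the given size cover.
func sampledBytes(sampleSize int64) int64 {
	const samples = 3 // beginning, middle and end
	return samples * sampleSize
}

func main() {
	const fileSize int64 = 500 * 1024 * 1024
	read := sampledBytes(16 * 1024)
	fmt.Printf("sampled: %d bytes (%dK)\n", read, read/1024) // sampled: 49152 bytes (48K)
	fmt.Printf("share of file read: %.4f%%\n", 100*float64(read)/float64(fileSize))
}
```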

Name

Inspired by ILS marker beacons.

Credits

  • The "sparseFingerprints" used in TMSU gave me some confidence in this approach to hashing.
  • The twmb/murmur3 library, which does all of the heavy lifting.

Documentation

Overview

Package imohash implements a fast, constant-time hash for files. It is based atop murmurhash3 and uses file size and sample data to construct the hash.

For more information, including important caveats on usage, consult https://github.com/kalafut/imohash.

Constants

const SampleSize = 16 * 1024
const SampleThreshold = 128 * 1024

Files smaller than this will be hashed in their entirety.

const Size = 16

Variables

This section is empty.

Functions

func Sum

func Sum(data []byte) [Size]byte

Sum hashes a byte slice using default sample parameters.

func SumFile

func SumFile(filename string) ([Size]byte, error)

SumFile hashes a file using default sample parameters.

Types

type ImoHash

type ImoHash struct {
	// contains filtered or unexported fields
}

func New

func New() ImoHash

New returns a new ImoHash using the default sample size and sample threshold values.

func NewCustom

func NewCustom(sampleSize, sampleThreshold int) ImoHash

NewCustom returns a new ImoHash using the provided sample size and sample threshold values. The entire file will be hashed (i.e. no sampling) if sampleSize < 1.

func (*ImoHash) Sum

func (imo *ImoHash) Sum(data []byte) [Size]byte

Sum hashes a byte slice using the ImoHash parameters.

func (*ImoHash) SumFile

func (imo *ImoHash) SumFile(filename string) ([Size]byte, error)

SumFile hashes a file using the ImoHash parameters.

Directories

Path        Synopsis
cmd/imosum  imosum is a sample application using imohash.
