hardlinkable

package module
v1.0.3
Published: Oct 29, 2018 License: MIT Imports: 21 Imported by: 0

README

hardlinkable is a tool to scan directories and report files that could be hardlinked together because they have identical content, and (by default) other matching criteria such as modification time, permissions and ownership. It can optionally perform the linking as well, saving storage space (but by default, it only reports information).

This program is faster and reports results more accurately than other variants that I have tried. It works by gathering full inode information before deciding what action (if any) to take, which allows it to report exactly what will happen before any modifications occur. It also uses content information from previous comparisons to drastically reduce search times.


Help

$ hardlinkable --help
A tool to scan directories and report on the space that could be saved
by hardlinking identical files.  It can also perform the linking.

Usage:
  hardlinkable [OPTIONS] dir1 [dir2...] [files...]

Flags:
  -v, --verbose           Increase verbosity level (up to 3 times)
      --no-progress       Disable progress output while processing
      --json              Output results as JSON
      --enable-linking    Perform the actual linking (implies --quiescence)
  -f, --same-name         Filenames need to be identical
  -t, --ignore-time       File modification times need not match
  -p, --ignore-perm       File permission (mode) need not match
  -o, --ignore-owner      File uid/gid need not match
  -x, --ignore-xattr      Xattrs need not match
  -c, --content-only      Only file contents have to match (ie. -potx)
  -s, --min-size N        Minimum file size (default 1)
  -S, --max-size N        Maximum file size
  -i, --include RE        Regex(es) used to include files (overrides excludes)
  -e, --exclude RE        Regex(es) used to exclude files
  -E, --exclude-dir RE    Regex(es) used to exclude dirs
  -d, --debug             Increase debugging level
      --ignore-walkerr    Continue on file/dir read errs
      --ignore-linkerr    Continue when linking fails
      --quiescence        Abort if filesystem is being modified
      --disable-newest    Disable using newest link mtime/uid/gid
      --search-thresh N   Ino search length before enabling digests (default 1)
  -h, --help              help for hardlinkable
      --version           version for hardlinkable

The include/exclude options can be given multiple times to support multiple regex matches.

--debug outputs additional information about program state in the final stats and the progress information.

--ignore-walkerr allows the program to skip over unreadable files and directories, and continue with the information gathering.

--ignore-linkerr allows the program to skip any links that cannot be made due to permission problems or other errors, and continue with the processing. It is only applicable when linking is enabled, and should be used with caution.

--quiescence checks that files haven't changed between the initial scan and the attempt to link (for example, file sizes or timestamps changing), which would suggest they are being modified; the program stops when such changes are detected. Specifying --quiescence during a normal scan, where linking is not enabled, performs these checks anyway at a small performance cost.

--disable-newest will turn off the default behavior of attempting to set the src inode to the most recent modification time of the linked inodes, and also change the uid/gid to those of the more recent inode. This behavior can be useful for backup programs, so that they see inodes as being newer, and will back them up. Only applicable when linking is enabled.

--search-thresh can be set to -1 to disable the use of digests, which may save a small amount of memory (at the cost of possibly many more comparisons). Otherwise it controls the length that the inode hash lists must grow to before digests are enabled. This option is safe to ignore; it will not affect results, only possibly the time required to complete a run.


Example output

$ hardlinkable download_dirs
Hard linking statistics
-----------------------
Directories               : 3408
Files                     : 89177
Hardlinkable this run     : 2462
Removable inodes          : 2462
Currently linked bytes    : 23480519   (22.393 MiB)
Additional saveable bytes : 245927685  (234.535 MiB)
Total saveable bytes      : 269408204  (256.928 MiB)
Total run time            : 4.691s

Additional verbosity levels provide extended stats, a list of linkable files, and a list of previously linked files:

$ hardlinkable -vvv download_dirs
Currently hardlinked files
--------------------------
from: download_dir/testfont/BlackIt/testfont.otf
  to: download_dir/testfont/BoldIt/testfont.otf
  to: download_dir/testfont/ExtraLightIt/testfont.otf
  to: download_dir/testfont/It/testfont.otf
Filesize: 4.146 KiB  Total saved: 12.438 KiB
...

Files that are hardlinkable
---------------------------
from: download_dir/bak1/some_image1.png
  to: download_dir/bak2/some_image1.png
...
from: download_dir/fonts1/some_font.otf
  to: download_dir/other_fonts1/some_font.otf

Hard linking statistics
-----------------------
Directories                 : 3408
Files                       : 89177
Hardlinkable this run       : 2462
Removable inodes            : 2462
Currently linked bytes      : 23480519   (22.393 MiB)
Additional saveable bytes   : 245927685  (234.535 MiB)
Total saveable bytes        : 269408204  (256.928 MiB)
Total run time              : 4.765s
Comparisons                 : 21479
Inodes                      : 80662
Existing links              : 8515
Total old + new links       : 10977
Total too small files       : 71
Total bytes compared        : 246099717  (234.699 MiB)
Total remaining inodes      : 78200

A more detailed breakdown of the various stats can be found in Results.md.


Methodology

This program is named hardlinkable to indicate that, by default, it does not perform any linking, and the user has to explicitly opt-in to having it perform the linking step. This (to me) is a safer and more-sensible default than the alternatives; it's not unusual to want to run it a few times with different options to see what would result, before actually deciding whether to perform the linking.

The program first gathers all the information from the directory and file walk, and uses this information to execute a linking strategy which minimizes the number of moved links required to reach the final state.

Besides having more accurate statistics, this version can be significantly faster than other versions, due to opportunistically keeping track of simple file content hashes as the inode hash comparison lists grow. At first it computes these content hashes only when comparing files (when the file data will be read anyway), to avoid unnecessary I/O. Using this data and quick set operations, it can drastically reduce the number of file comparisons attempted as the number of walked files grows.
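The sketch below illustrates only that filtering idea; the names and types are hypothetical and simplified, not the package's actual data structures. Inodes are grouped by an inode-parameter hash, and content digests learned during earlier comparisons are used to discard candidates that cannot match, before any file data is read again.

package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type inoKey uint64  // stands in for an inode identifier
type inoHash uint64 // hash of inode parameters (size, mtime, etc.)
type digest uint32  // cheap content digest learned while comparing files

type candidateIndex struct {
	byHash  map[inoHash][]inoKey // inodes sharing the same inode-parameter hash
	digests map[inoKey]digest    // digests gathered opportunistically during comparisons
}

// candidates returns previously seen inodes whose stored digest matches the
// new file's digest (or is not yet known), so files whose content clearly
// differs are never re-read for a full byte-by-byte comparison.
func (c *candidateIndex) candidates(h inoHash, d digest, haveDigest bool) []inoKey {
	var out []inoKey
	for _, ino := range c.byHash[h] {
		known, ok := c.digests[ino]
		if !haveDigest || !ok || known == d {
			out = append(out, ino)
		}
	}
	return out
}

func main() {
	c := &candidateIndex{
		byHash:  map[inoHash][]inoKey{42: {1, 2, 3}},
		digests: map[inoKey]digest{1: 0xaaaa, 2: 0xbbbb},
	}
	// Inode 1's digest differs, so it is filtered out; inode 3 has no stored
	// digest yet, so it remains a candidate for comparison.
	fmt.Println(c.candidates(42, 0xbbbb, true)) // [2 3]
}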


History

There are a number of programs that will perform hardlinking of identical files, and Red Hat and Debian/Ubuntu each include a hardlink program, with different implementations and capabilities. The Red Hat variant is based upon hardlink.c, originally written by Jakub Jelinek, which later inspired John Villalovos to write his own version in Python, now known as hardlinkpy, with multiple additional contributors (Antti Kaihola, Carl Henrik Lunde, et al.). The Python version inspired Julian Andres Klode to do yet another re-implementation in C, which also added support for xattrs. There are numerous other variants floating around as well.

The previous versions that I've encountered do the hardlinking while walking the directory tree, before gathering complete information on all the inodes and pathnames. This tends to lead to inaccurate statistics reported during a "dry run", and can also cause links to be needlessly moved from inode to inode multiple times during a run. They also don't use "dry run" mode as the default, so you have to remember to enable "dry run" if you just want to play with different options, or find information on the number of duplicate files that exist.

This version is written in Go and incorporates ideas from previous versions, as well as its own innovations, to ensure exact results in both "dry run" mode and actual linking mode. I expect and intend for it to be the fastest version, due to avoiding unnecessary I/O, minimizing extraneous searches and comparisons, and because it never moves a link more than once during a run.


Contributing

Contributions are welcome, including bug reports, suggestions, and code patches/pull requests/etc. I'm interested in hearing what you use hardlinkable for, and what could make it more useful to you. If you've used other space-recovery hardlinking programs, I'm also interested to know if hardlinkable bests them in speed and report accuracy, or if you've found a regression in performance or capability.

Build

go test ./...
go test -tags slowtests ./...  # Could take a minute
go install ./...  # installs to GOPATH/bin

or

cd cmd/hardlinkable && go build  # builds in cmd/hardlinkable

Install hardlinkable command

go get github.com/chadnetzer/hardlinkable/cmd/hardlinkable

License

hardlinkable is released under the MIT license.

Documentation

Overview

Package hardlinkable determines which files in the given directories have equal content and compatible inode properties, and returns information on the space that would be saved by hardlinking them all together. It can also, optionally, perform the hardlinking.
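A minimal sketch of library usage (assuming the import path github.com/chadnetzer/hardlinkable, matching the command's install path in the README; the directory argument is only an example):

package main

import (
	"log"

	"github.com/chadnetzer/hardlinkable"
)

func main() {
	// Default options: report only; no linking is performed.
	opts := hardlinkable.SetupOptions()

	results, err := hardlinkable.Run([]string{"download_dirs"}, opts)
	if err != nil {
		log.Fatal(err)
	}
	// Print the same stats summary that the command-line tool shows.
	results.OutputRunStats()
}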

Index

Constants

const DefaultMinFileSize = 1
const DefaultSearchThresh = 1
const DefaultShowExtendedRunStats = false // Non-cli default
const DefaultShowRunStats = true // Non-cli default
const DefaultStoreExistingLinkResults = true // Non-cli default
const DefaultStoreNewLinkResults = true // Non-cli default
const DefaultUseNewestLink = true

Variables

This section is empty.

Functions

func CheckQuiescence

func CheckQuiescence(o *Options)

CheckQuiescence enables quiescence checking which can detect changes to the filesystem during the file/directory walk.

func ContentOnly

func ContentOnly(o *Options)

ContentOnly uses only file content to determine equality (not inode parameters like time, permission, ownership, etc.)

func DebugLevel

func DebugLevel(debugLevel uint) func(*Options)

DebugLevel sets the debugging level (1, 2, or 3)

func Humanize

func Humanize(n uint64) string

Humanize returns a string with the byte count "humanized" to a shortened form

func HumanizeWithPrecision

func HumanizeWithPrecision(n uint64, prec int) string

HumanizeWithPrecision allows providing a FormatFloat precision value

func HumanizedUint64

func HumanizedUint64(s string) (uint64, error)

HumanizedUint64 converts humanized size strings like "1k" into a uint64 (i.e. "1k" -> 1024)
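A small usage sketch of these helpers; the exact formatting of the humanized strings is inferred from the README output above and may differ:

package main

import (
	"fmt"
	"log"

	"github.com/chadnetzer/hardlinkable"
)

func main() {
	// Shorten a byte count; per the README output this is expected to look
	// something like "22.393 MiB".
	fmt.Println(hardlinkable.Humanize(23480519))

	// Same, but with an explicit FormatFloat precision.
	fmt.Println(hardlinkable.HumanizeWithPrecision(23480519, 1))

	// Parse a humanized size string back into bytes.
	n, err := hardlinkable.HumanizedUint64("1k")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(n) // 1024
}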

func IgnoreLinkErrors

func IgnoreLinkErrors(o *Options)

IgnoreLinkErrors allows the Run to continue during Link phase errors (typically the actual linking itself)

func IgnoreOwner

func IgnoreOwner(o *Options)

IgnoreOwner allows linked files to have unequal uid or gid

func IgnorePerm

func IgnorePerm(o *Options)

IgnorePerm allows linked files to have unequal mode bits

func IgnoreTime

func IgnoreTime(o *Options)

IgnoreTime allows linked files to have unequal modification times

func IgnoreWalkErrors

func IgnoreWalkErrors(o *Options)

IgnoreWalkErrors allows the Run to continue during Walk phase errors (such as permission errors reading dirs or files)

func IgnoreXAttr

func IgnoreXAttr(o *Options)

IgnoreXAttr allows linked files to have unequal xattrs

func LinkingDisabled

func LinkingDisabled(o *Options)

LinkingDisabled forbids Run() from actually linking the files

func LinkingEnabled

func LinkingEnabled(o *Options)

LinkingEnabled allows Run() to actually perform linking of files

func MaxFileSize

func MaxFileSize(size uint64) func(*Options)

MaxFileSize sets the maximum size of files that can be linked

func MinFileSize

func MinFileSize(size uint64) func(*Options)

MinFileSize sets the minimum size of files that can be linked

func SameName

func SameName(o *Options)

SameName requires linked files to have equal filenames

func ShowExtendedRunStats

func ShowExtendedRunStats(o *Options)

ShowExtendedRunStats enables printing of additional stats in OutputRunStats()

func ValidateDirsAndFiles

func ValidateDirsAndFiles(dirsAndFiles []string) (dirs []string, files []string, err error)

ValidateDirsAndFiles will ensure only dirs and files are provided, and remove duplicates. It is called by Run() to check the 'dirsAndFiles' arg.

Types

type Options

type Options struct {
	// SameName enabled ensures only files with matching filenames can be
	// linked
	SameName bool

	// IgnoreTime enabled allows files with different mtime values to be
	// linked
	IgnoreTime bool

	// IgnorePerm enabled allows files with different inode mode values to
	// be linked
	IgnorePerm bool

	// IgnoreOwner enabled allows files with different uid or gid to be
	// linked
	IgnoreOwner bool

	// IgnoreXAttr enabled allows files with different xattrs to be linked
	IgnoreXAttr bool

	// LinkingEnabled causes the Run to perform the linking step
	LinkingEnabled bool

	// MinFileSize controls the minimum size of files that are eligible to
	// be considered for linking.
	MinFileSize uint64

	// MaxFileSize controls the maximum size of files that are eligible to
	// be considered for linking.
	MaxFileSize uint64

	// DebugLevel controls the amount of debug information reported in the
	// results output, as well as debug logging.
	DebugLevel uint

	// UseNewestLink requests setting the inode to the mtime and uid/gid of
	// the more recent inode when files are linked.
	UseNewestLink bool

	// FileIncludes is a slice of regex expressions that control what
	// filenames will be considered for linking.  If given without any
	// FileExcludes, the walked files must match one of the includes.  If
	// FileExcludes are provided, the FileIncludes can override them.
	FileIncludes []string

	// FileExcludes is a slice of regex expressions that control what
	// filenames will be excluded from consideration for linking.
	FileExcludes []string

	// DirExcludes is a slice of regex expressions that control what
	// directories will be excluded from the file discovery walk.
	DirExcludes []string

	// StoreExistingLinkResults allows controlling whether to store
	// discovered existing links in Results. Command line option Verbosity
	// > 2 can override.
	StoreExistingLinkResults bool

	// StoreNewLinkResults allows controlling whether to store discovered
	// new hardlinkable pathnames in Results. Command line option Verbosity
	// > 1 can override.
	StoreNewLinkResults bool

	// ShowExtendedRunStats enabled displays additional Result stats
	// output.  Command line option Verbosity > 0 can override.
	ShowExtendedRunStats bool

	// ShowRunStats enabled displays Result stats output.
	ShowRunStats bool

	// IgnoreWalkErrors allows Run to continue when errors occur during the
	// walk phase, such as not having permission to walk a directory, or
	// being unable to read a file for comparison.
	IgnoreWalkErrors bool

	// IgnoreLinkErrors allows Run to continue when linking fails (or any
	// errors during the Link phase)
	IgnoreLinkErrors bool

	// CheckQuiescence enabled looks for signs of the filesystems changing
	// during walk.  Always enabled when LinkingEnabled is true.
	CheckQuiescence bool

	// SearchThresh determines the length that the lists of files with
	// equivalent inode hashes can grow to, before also enabling content
	// digests (which can drastically reduce the number of compared files
	// when there are many with the same hash, but differing content at the
	// start of the file).  Can be disabled with -1.  May save a small
	// amount of memory, but potentially at greatly increased runtime in
	// worst case scenarios with many, many files.
	SearchThresh int
}

Options is passed to the Run() func, and controls the operation of the hardlinkable algorithm, including what inode parameters must match for files to be compared for equality, what files and directories are included or excluded, and whether linking is actually enabled or not.

func SetupOptions

func SetupOptions(args ...func(*Options)) Options

SetupOptions returns an Options struct with the defaults initialized and the given setup functions applied.
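For example, several of the option functions above can be combined when building Options (a sketch; the particular combination is chosen only for illustration):

package main

import (
	"log"

	"github.com/chadnetzer/hardlinkable"
)

func main() {
	// Roughly the library equivalent of: hardlinkable -c -s 4096 --ignore-walkerr
	opts := hardlinkable.SetupOptions(
		hardlinkable.ContentOnly,
		hardlinkable.MinFileSize(4096),
		hardlinkable.IgnoreWalkErrors,
	)
	if err := opts.Validate(); err != nil {
		log.Fatal(err)
	}
	log.Printf("linking enabled: %v", opts.LinkingEnabled)
}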

func (*Options) Validate

func (o *Options) Validate() error

Validate will ensure that contradictory Options aren't set, and that dependent Options are set. An error will be returned if Options is invalid.

type Results

type Results struct {
	// Link member strings are pathnames
	ExistingLinks     map[string][]string `json:"existingLinks"`
	ExistingLinkSizes map[string]uint64   `json:"existingLinkSizes"`
	LinkPaths         [][]string          `json:"linkPaths"`
	SkippedLinkPaths  [][]string          `json:"skippedLinkPaths"` // Skipped when link failed
	RunStats
	StartTime time.Time `json:"startTime"`
	EndTime   time.Time `json:"endTime"`
	RunTime   string    `json:"runTime"`
	Opts      Options   `json:"options"`

	// Set to true when Run() has completed successfully
	RunSuccessful bool `json:"runSuccessful"`

	// Record which 'phase' we've gotten to in the algorithms, in case of
	// early termination of the run.
	Phase RunPhases `json:"phase"`
}

Results contains the RunStats information, as well as the found existing and new links. It also includes a measurement of how long the Run() took to execute, and the Options that were used to perform the Run().

func Run

func Run(dirsAndFiles []string, opts Options) (Results, error)

Run performs a scan of the supplied directories and files, with the given Options, and outputs information on which files could be linked to save space.

func RunWithProgress

func RunWithProgress(dirsAndFiles []string, opts Options) (Results, error)

RunWithProgress performs a scan of the supplied directories and files, with the given Options, and outputs information on which files could be linked to save space. A progress line is continually updated as the directories and files are scanned.
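A sketch combining RunWithProgress with the JSON output and a couple of the Results fields documented below (the directory names are placeholders):

package main

import (
	"log"

	"github.com/chadnetzer/hardlinkable"
)

func main() {
	opts := hardlinkable.SetupOptions() // defaults; linking stays disabled

	// A progress line is continually updated while scanning.
	results, err := hardlinkable.RunWithProgress([]string{"dir1", "dir2"}, opts)
	if err != nil {
		log.Fatal(err)
	}
	if !results.RunSuccessful {
		log.Fatalf("run stopped early in phase %v", results.Phase)
	}
	// Emit the gathered information as a JSON object instead of text.
	results.OutputJSONResults()
}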

func (*Results) OutputExistingLinks

func (r *Results) OutputExistingLinks()

OutputExistingLinks shows in text form the existing links that were found by Run.

func (*Results) OutputJSONResults

func (r *Results) OutputJSONResults()

OutputJSONResults outputs a JSON formatted object with all the information gathered by Run() about existing and new links, and stats on space saved, etc.

func (*Results) OutputNewLinks

func (r *Results) OutputNewLinks()

OutputNewLinks shows in text form the pathnames that were discovered to be linkable.

func (*Results) OutputResults

func (r *Results) OutputResults()

OutputResults prints results in text form, including existing links that were found, new pathnames that were discovered to be linkable, and stats about the run giving information on the amount of data that can be saved (or was saved if linking was enabled).

func (*Results) OutputRunStats

func (r *Results) OutputRunStats()

OutputRunStats show information about how many files could be linked, how much space would be saved, and other information on inodes, comparisons, etc. If linking was enabled, it displays the information on links that were actually made and space actually saved (which should equal the predicted amounts).

func (*Results) OutputSkippedNewLinks

func (r *Results) OutputSkippedNewLinks()

OutputSkippedNewLinks shows in text form the pathnames that were skipped due to linking errors.

type RunPhases

type RunPhases int

RunPhases is an enum that indicates which phase of the Run() algorithm is being executed.

const (
	// StartPhase indicates the Run() algorithm hasn't started
	StartPhase RunPhases = iota
	// WalkPhase indicates the directory/file walk which gathers info
	WalkPhase
	// LinkPhase indicates that the pathname link pairs are being computed
	LinkPhase
	// EndPhase indicates the Run() has finished
	EndPhase
)

type RunStats

type RunStats struct {
	DirCount               int64  `json:"dirCount"`
	FileCount              int64  `json:"fileCount"`
	FileTooSmallCount      int64  `json:"fileTooSmallCount"`
	FileTooLargeCount      int64  `json:"fileTooLargeCount"`
	ComparisonCount        int64  `json:"comparisonCount"`
	InodeCount             int64  `json:"inodeCount"`
	InodeRemovedCount      int64  `json:"inodeRemovedCount"`
	NlinkCount             int64  `json:"nlinkCount"`
	ExistingLinkCount      int64  `json:"existingLinkCount"`
	NewLinkCount           int64  `json:"newLinkCount"`
	ExistingLinkByteAmount uint64 `json:"existingLinkByteAmount"`
	InodeRemovedByteAmount uint64 `json:"inodeRemovedByteAmount"`
	BytesCompared          uint64 `json:"bytesCompared"`

	// Some stats on files that compared equal, but which had some
	// mismatching inode parameters.  This can be helpful for tuning the
	// command line options on subsequent runs.
	MismatchedMtimeCount int64  `json:"mismatchedMtimeCount"`
	MismatchedModeCount  int64  `json:"mismatchedModeCount"`
	MismatchedUIDCount   int64  `json:"mismatchedUIDCount"`
	MismatchedGIDCount   int64  `json:"mismatchedGIDCount"`
	MismatchedXAttrCount int64  `json:"mismatchedXAttrCount"`
	MismatchedTotalCount int64  `json:"mismatchedTotalCount"`
	MismatchedMtimeBytes uint64 `json:"mismatchedMtimeBytes"`
	MismatchedModeBytes  uint64 `json:"mismatchedModeBytes"`
	MismatchedUIDBytes   uint64 `json:"mismatchedUIDBytes"`
	MismatchedGIDBytes   uint64 `json:"mismatchedGIDBytes"`
	MismatchedXAttrBytes uint64 `json:"mismatchedXAttrBytes"`
	MismatchedTotalBytes uint64 `json:"mismatchedTotalBytes"`

	// Counts of file I/O errors (reading, linking, etc.)
	SkippedDirErrCount  int64 `json:"skippedDirErrCount"`
	SkippedFileErrCount int64 `json:"skippedFileErrCount"`
	SkippedLinkErrCount int64 `json:"skippedLinkErrCount"`

	// Counts of files and dirs excluded by the Regex matches
	ExcludedDirCount  int64 `json:"excludedDirCount"`
	ExcludedFileCount int64 `json:"excludedFileCount"`
	IncludedFileCount int64 `json:"includedFileCount"`

	// Count of how many setuid and setgid files were encountered (and skipped)
	SkippedSetuidCount int64 `json:"skippedSetuidCount"`
	SkippedSetgidCount int64 `json:"skippedSetgidCount"`

	// Also keep track of files with bits other than the permission bits
	// set (other than setuid/setgid and bits already excluded by "regular"
	// file bits)
	SkippedNonPermBitCount int64 `json:"skippedNonPermBitCount"`

	// Debugging counts
	EqualComparisonCount int64 `json:"equalComparisonCount"`
	FoundHashCount       int64 `json:"foundHashCount"`
	MissedHashCount      int64 `json:"missedHashCount"`
	HashMismatchCount    int64 `json:"hashMismatchCount"`
	InoSeqSearchCount    int64 `json:"inoSeqSearchCount"`
	InoSeqIterationCount int64 `json:"inoSeqIterationCount"`
	DigestComputedCount  int64 `json:"digestComputedCount"`

	// Counts of how many times the hardlinkFiles() func wasn't able to
	// successfully change inode times and/or uid/gid.  Since we ignore
	// such errors and continue anyway (ie. it's a best-effort attempt,
	// rather than a guarantee), the counts are debugging info.
	FailedLinkChtimesCount int64 `json:"failedLinkChtimesCount"`
	FailedLinkChownCount   int64 `json:"failedLinkChownCount"`
}

RunStats holds information about counts, the number of files found to be linkable, the bytes that linking would save (or did save), and a variety of related, useful, or just interesting information gathered during the Run().
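As noted in the struct comments, the mismatch counters can help tune options for a follow-up run. For instance (a sketch; suggestTuning is a hypothetical helper, not part of the package):

package main

import (
	"log"

	"github.com/chadnetzer/hardlinkable"
)

// suggestTuning reads the mismatch counters (promoted from the embedded
// RunStats) and hints at options that might recover more space next run.
func suggestTuning(r hardlinkable.Results) {
	if r.MismatchedMtimeCount > 0 {
		log.Printf("consider -t/--ignore-time: %d equal-content files had mismatched mtimes (%s)",
			r.MismatchedMtimeCount, hardlinkable.Humanize(r.MismatchedMtimeBytes))
	}
	if r.MismatchedUIDCount > 0 || r.MismatchedGIDCount > 0 {
		log.Printf("consider -o/--ignore-owner: %d files differed in uid/gid",
			r.MismatchedUIDCount+r.MismatchedGIDCount)
	}
}

func main() {
	opts := hardlinkable.SetupOptions()
	results, err := hardlinkable.Run([]string{"some_dir"}, opts)
	if err != nil {
		log.Fatal(err)
	}
	suggestTuning(results)
}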

Directories

Path Synopsis
cmd
internal
cli
