fdups

command module
v0.0.0-...-8685a14 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 20, 2021 License: MIT Imports: 25 Imported by: 0

README

fdups

Сommand line utility that searches for duplicate files in specified directories [roots] matching with [patterns] based on equality of file content as well as various customizable combinations of meta information.

Main priority of this program is fast finding duplicates (trying to get the most out of Go) with flexible settings and at the same time with full monitoring of processing progress / statistics on all stages (which often comes at the expense of speed))

Idea inspired by https://github.com/pauldreik/rdfind

Features:

  • glob patterns with extended support []** and classes { ... [, ...] };
  • glob search is concurrent, found paths are returned asap (not blocked) for further processing;
  • correct path resolving in sudo mode
  • grouping of duplicates can be refined based on coincidence of file meta info combinations such as base name, modification date, owner user / group, permissions (use of size is assumed)
  • to resolve links, program deals with file inodes internally
  • to speed up content filtering (especially for large files) uses multi-stage filters, based on hash file head and / or tail checksums ("pre-filters") that can be enabled by specifying size and hashing algorithm, hashing algo used on final filtering stage (for full file content) also can be specified;
  • during program execution, user receives all necessary processing statistics and can interrupt execution to get intermediate results;
  • results are saved to text file(s) as grouped sorted list of found duplicates;
  • in order to facilitate making further decision on duplicates (program does not delete anything!) multilevel sorting of results is used;
    • at top level, dup groups are sorted by metakey (size, mt, etc),
    • within dup groups, duplicates are grouped by priority ([roots] membership), file type (regular / link), modification time, path depth etc.

Install and usage:

> git clone github.com/nj-eka/fdups
> cd fdups
> go mod tidy
> go build .
> ./fdups --help
Usage of ./fdups:
  -blocks
    Prefilter (head/tail) size is given in file blocks (otherwise in bytes)
  -dry
    Run mode without saving duplications into files
  -full string
    Final hash filter settings in format [algo] (default "sha256")
  -head string
    Head hash filter settings in format [algo;size]
  -l string
    Logging level: panic fatal error warn info debug trace (short) (default "info")
  -log string
    Path to log output file; empty = os.Stdout (default "fdups.log")
  -log_format string
    Logging format: text json (default "text")
  -log_level string
    Logging level: panic fatal error warn info debug trace (default "info")
  -max int
    Max file size to search, -1 = no upper limit (default -1)
  -mg string
    Dup grouping based on meta info; string combination of file base (n)ame - (m)odification time - (p)ermition owner - (u)ser owner - (g)roup
  -min int
    Min file size to search (default 1)
  -output_dir string
    Output dir for found duplication results
  -prefix string
    Base prefix of output file in output dir (default "fdups")
  -groups int
    Maximum number of groups of duplicates per output file (default 100)
  -patterns value
    Glob patterns (including ** and {}) to search in roots. (default **/*)
  -refresh duration
    Statistics update rate (how often stats are printed out to os.Stdout) (default 5s)
  -roots value
    List of dirs to search. Order sets priority of sorting found duplicates. Empty = pwd. (default "")
  -tail string
    Tail hash filter settings in format [algo;size]
  -trace string
    Trace file; tracing is on if LogLevel = trace; empty = os.Stderr (default "fdups.trace.out")

Output example:

os.Stdout:

==========    Final stats   ==============
Time elapsed:  5s
Runtime mem usage:      Alloc = 236 MiB TotalAlloc = 1.3 GiB    Sys = 542 MiB   Mallocs = 7.2 MiB       Frees = 5.4 MiB GCSys = 22 MiB  NumGC = 10
Search & validation:               0/0 files (found/unique)        58956(786 MiB) validated           58944(786 MiB) inodes
sizing (quantiles):
14738:          10-3551        
14737:        3551-8200        
14740:        8200-19351       
14741:       19351-9258642

Hash filters:
[ 0]:    24704(groups)    49207(inodes)      3.0 MiB(read)
[ 1]:     9481(groups)    33312(inodes)      4.1 MiB(read)
[ 2]:    10527(groups)    35859(inodes)      344 MiB(read)

Duplicates found:
8999(groups)    34335(inodes)      100 MiB(unique)      341 MiB(total)      241 MiB(can be freed)
sizing (quantiles):
8581 :          10-882         
8586 :         882-2609        
8583 :        2609-6938        
8585 :        6938-5447984

output files:

#1: 3(3) mid{size:     5447983;mt:*;uid:*;gid:*;perm:*;name:*};cid{-&0:64:md5:affbc761b58b05c7a72c97adeae3541b&1:128:sha1:b8907e8e731c470d9e94e7540824999fe034a4ab&2:5447983:sha256:a78a559398239038f67c5737bc73b3674f74eccfcaa2a0339c49af904495dfee}
10753278( 1)|-r--r--r--|     5447983|Thu, 08 Jul 2021 16:37:09 MSK|UUU:GGG|XXX/go/pkg/mod/golang.org/x/text@v0.3.4/date/tables.go
11671045( 1)|-r--r--r--|     5447983|Fri, 03 Sep 2021 10:49:57 MSK|UUU:GGG|XXX/go/pkg/mod/golang.org/x/text@v0.3.1-0.20181227161524-e6919f6577db/date/tables.go
9982144( 1)|-rw-rw-r--|     5447983|Fri, 03 Sep 2021 11:26:19 MSK|UUU:GGG|XXX/go/src/golang.org/x/text/date/tables.go
...
#8901: 2(2) mid{size:          36;mt:*;uid:*;gid:*;perm:*;name:*};cid{-&2:36:sha256:050cb1c50ab73cd5b3a784564c52de4d5d37a613e7e5bc748e7c6f5f2497c295}
10751052( 1)|-rw-rw-r--|          36|Wed, 07 Jul 2021 14:16:08 MSK|UUU:GGG|XXX/go/pkg/mod/cache/download/github.com/prometheus/common/@v/v0.0.0-20181126121408-4724e9255275.mod
10761423( 1)|-rw-rw-r--|          36|Wed, 21 Jul 2021 19:05:04 MSK|UUU:GGG|XXX/go/pkg/mod/cache/download/github.com/prometheus/common/@v/v0.0.0-20181113130724-41aa239b4cce.mod

links example output:
#1: 3(9) mid{size:       29572;mt:*;uid:*;gid:*;perm:*;name:*};cid{-&0:64:md5:c41e6a113ede3180d4917aa1d725dd82&1:128:sha1:03a26676924b45469937d1abdaeeb3860885b3e5&2:29572:sha256:e6b5118193cc6f9dd061020f86722322b29dac9f94fe897655b589cfcfbc2435}
9843512( 3)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/ps1
9843512( 3)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/ps1_h1
9843512( 3)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/ps1_h2
9843626( 1)|-rw-rw-r--|       29572|Wed, 04 Aug 2021 20:36:48 MSK|user:group|XXX/dups/ps1_1
9843696( 1)|-rwxr-xr-x|       29572|Fri, 06 Aug 2021 20:57:07 MSK|root:root|XXX/dups/ps11
9843524( 1)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/Link to ps1 -> XXX/dups/ps1
9843525( 1)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/ps1_s1 -> XXX/dups/ps1
9843526( 1)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/ps1_s2 -> XXX/dups/ps1
9843618( 1)|-rwxrwxrwx|       29572|Wed, 05 May 2021 07:43:20 MSK|root:root|XXX/dups/ps1_s1_s1 -> XXX/dups/ps1

Ideas for the future:

  • save intermediate results in runtime by sending some (like user1/2) signals to process without interrupting / stopping program
  • if it's not about cross-platform, glob function can be rewritten to use os system calls directly (example: https://habr.com/ru/post/281382/)
  • add support for finding duplicates based on file types (especially media types with parsing media containers, etc.)
  • implement self-tuning on the basis of collected statistics in runtime and os resources, to set optimal parameters (size/algo for pre-filters, number of workers, etc.)

Documentation

The Go Gopher

There is no documentation for this package.

Directories

Path Synopsis
Package filestat defines and implements basic file statistics interface
Package filestat defines and implements basic file statistics interface
searching
Package wfs implements concurrent glob functionality with recursive "**" supports "**" in glob represents a recursive wildcard matching zero-or-more directory levels deep base on src: https://github.com/yargevad/filepathx/blob/master/filepathx.go todo: add rglob
Package wfs implements concurrent glob functionality with recursive "**" supports "**" in glob represents a recursive wildcard matching zero-or-more directory levels deep base on src: https://github.com/yargevad/filepathx/blob/master/filepathx.go todo: add rglob

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL