DupeFinder

This is a simple CLI tool for finding duplicate files.

Duplicates are files with the same content, i.e., files with matching checksums (MD5).
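
As a rough illustration of that idea (a minimal sketch in Go, not DupeFinder's actual code), each file can be hashed with MD5 and the paths grouped by checksum; every group with more than one entry is a group of duplicates:

package main

import (
    "crypto/md5"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

// checksum returns the hex-encoded MD5 sum of the file's content.
func checksum(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
    groups := make(map[string][]string) // checksum -> file paths
    for _, path := range os.Args[1:] {
        sum, err := checksum(path)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            continue
        }
        groups[sum] = append(groups[sum], path)
    }
    for sum, paths := range groups {
        if len(paths) > 1 {
            fmt.Println(sum, paths) // a duplicate group
        }
    }
}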

Build

Use the standard commands provided by the Go toolchain.

$ go get github.com/c0xc/dupefinder

The dupefinder binary should then be in $GOPATH/bin.

If you want to use DupeFinder on another operating system, say, on FreeBSD (for example, if you have an HP MicroServer as a NAS box), you can simply cross-compile it:

$ git clone https://github.com/c0xc/dupefinder.git
$ cd dupefinder
$ GOOS=freebsd go build
$ file dupefinder
dupefinder: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, not stripped

Map files

A map file contains the metadata that DupeFinder collects for a scanned directory. It can be exported after a scan and imported before a later scan of the same directory to speed it up: files that have not changed (same name, size, mtime) are not hashed again. This limits a subsequent scan to new or modified files, rather than rescanning everything, which may take hours.
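
Conceptually, the unchanged-file check can be sketched like this (an illustration only; the map format, types and field names here are assumptions, not the actual implementation):

package dedupe

import (
    "os"
    "time"
)

// MapEntry is a hypothetical record of what a map file stores per file.
type MapEntry struct {
    Size     int64
    ModTime  time.Time
    Checksum string
}

// cachedChecksum returns the stored checksum if the file looks unchanged
// (same size and mtime); otherwise the file has to be hashed again.
func cachedChecksum(entries map[string]MapEntry, path string, info os.FileInfo) (string, bool) {
    e, ok := entries[path] // matched by name/path
    if !ok {
        return "", false // new file
    }
    if e.Size != info.Size() || !e.ModTime.Equal(info.ModTime()) {
        return "", false // modified file
    }
    return e.Checksum, true
}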

Use the help option (-h) for details.

Example

Here is a simple example with 5 files:

$ ls -gi file* */*
7835586 -rw-r--r-- 1 users 12 May 12 16:09 file1
7836166 -rw-r--r-- 1 users 12 May 12 15:52 file1_copy
7835614 -rw-r--r-- 1 users 12 May 12 15:52 file2
7835626 -rw-r--r-- 1 users 12 May 12 15:52 'other directory/foo'
7835600 -rw-r--r-- 1 users 12 May 12 15:51 'some directory/file1'

Change into that directory and run DupeFinder on the current directory (.). By default, DupeFinder shows duplicates in groups, in alphabetical order.

$ ~/bin/dupefinder .
Scanning...

file1
file1_copy
some directory/file1

file2
other directory/foo

Files:                  5
Total size:             60 B (60 B)
Duplicate groups:       2
Duplicate count:        3
Size of duplicates:     36 B (36 B)

As shown in the summary, DupeFinder has found two duplicate groups, one with three identical files and another one with two identical files.

If you decide to delete all duplicates, DupeFinder will keep the first file of each duplicate group:

$ ~/bin/dupefinder -delete-duplicates .
Scanning...

file2
other directory/foo

file1
file1_copy
some directory/file1

Files:                  5
Total size:             60 B (60 B)
Duplicate groups:       2
Duplicate count:        3
Size of duplicates:     36 B (36 B)

Deleted other directory/foo
Deleted file1_copy
Deleted some directory/file1

$ ls -gi *
7835586 -rw-r--r-- 1 users 12 May 12 16:09 file1
7835614 -rw-r--r-- 1 users 12 May 12 15:52 file2

'other directory':
total 0

'some directory':
total 0

Now, DupeFinder will confirm that there are no more duplicates left:

$ ~/bin/dupefinder .
Scanning...

Files:                  2
Total size:             24 B (24 B)
Duplicate groups:       0
Duplicate count:        0
Size of duplicates:     0 B (0 B)
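
The keep-the-first rule used above is simple; a minimal Go sketch (with a hypothetical groups map, checksum to file paths in display order, standing in for the tool's internal state) might look like this:

package dedupe

import (
    "fmt"
    "os"
)

// deleteDuplicates keeps the first path of each duplicate group and
// removes the rest.
func deleteDuplicates(groups map[string][]string) {
    for _, paths := range groups {
        if len(paths) < 2 {
            continue // nothing to delete
        }
        for _, path := range paths[1:] { // paths[0] is kept
            if err := os.Remove(path); err != nil {
                fmt.Fprintln(os.Stderr, err)
                continue
            }
            fmt.Println("Deleted", path)
        }
    }
}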

If this is your archive, you might want to keep all your files but still release the space wasted by the duplicates. This is done by linking them together, i.e., keeping the first file and replacing the other files with hardlinks. Note that this only works within a single filesystem, since hardlinks cannot span filesystems.
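
A minimal sketch of such a replacement in Go (not the tool's actual code; the temporary name is made up) could look like this:

package dedupe

import "os"

// linkDuplicate replaces dup with a hardlink to keep. The duplicate is
// renamed out of the way first so it can be restored if linking fails,
// for example when keep and dup are on different filesystems.
func linkDuplicate(keep, dup string) error {
    backup := dup + ".dupefinder-tmp" // hypothetical temporary name
    if err := os.Rename(dup, backup); err != nil {
        return err
    }
    if err := os.Link(keep, dup); err != nil {
        os.Rename(backup, dup) // restore the original duplicate
        return err
    }
    return os.Remove(backup)
}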

You might want to keep the modification time of the oldest file of each duplicate group in your archive. In this example, file1 is the newest file, so you need to sort by time, reversed, so that the oldest file of each group comes first and is the one that is kept:

$ ~/bin/dupefinder -sort-time -sort-reversed -link-duplicates .
Scanning...

file2
other directory/foo

some directory/file1
file1_copy
file1

Files:                  5
Total size:             60 B (60 B)
Duplicate groups:       2
Duplicate count:        3
Size of duplicates:     36 B (36 B)

Replaced file1_copy
Replaced file1
Replaced other directory/foo

As you can see, the oldest files have been kept unchanged and the newer ones have been replaced with links:

$ ls -gi file* */*
7839140 -rw-r--r-- 3 users 12 May 12 15:51 file1
7839140 -rw-r--r-- 3 users 12 May 12 15:51 file1_copy
7839138 -rw-r--r-- 2 users 12 May 12 15:52 file2
7839138 -rw-r--r-- 2 users 12 May 12 15:52 'other directory/foo'
7839140 -rw-r--r-- 3 users 12 May 12 15:51 'some directory/file1'
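
Conceptually, that ordering step just sorts each duplicate group by modification time, oldest first, as in this small sketch (an illustration only, not DupeFinder's implementation):

package dedupe

import (
    "os"
    "sort"
)

// sortOldestFirst sorts a duplicate group by modification time, oldest
// first, so that the oldest file ends up as the one that is kept.
// Calling os.Stat inside the comparator is fine for a sketch; a real
// implementation would cache the file info.
func sortOldestFirst(paths []string) {
    sort.Slice(paths, func(i, j int) bool {
        fi, err1 := os.Stat(paths[i])
        fj, err2 := os.Stat(paths[j])
        if err1 != nil || err2 != nil {
            return false
        }
        return fi.ModTime().Before(fj.ModTime())
    })
}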

Scanning larger directories takes time, so it is a good idea to create a map file:

$ time ~/bin/dupefinder -export-map-file ~/100M_files.map .
Scanning...

other_file_1
other_file_10
other_file_11
other_file_12
other_file_13
other_file_14
other_file_15
other_file_16
other_file_17
other_file_2
other_file_3
other_file_4
other_file_5
other_file_6
other_file_7
other_file_8
other_file_9

file_2
file_2_2

file_1
file_1_2
file_1_3
file_1_4
file_1_5

Files:                  39
Total size:             3.8 GiB (4089446400 B)
Duplicate groups:       3
Duplicate count:        21
Size of duplicates:     2.1 GiB (2202009600 B)


real    0m20.137s
user    0m6.546s
sys     0m2.171s

Now, you can import the map file and DupeFinder will only hash new or changed files.

$ mv -vi file_1 file_1_1
'file_1' -> 'file_1_1'
$ touch file_2
$ time ~/bin/dupefinder -import-map-file ~/100M_files.map .
Imported files: 39
Scanning...

file_1_1
file_1_2
file_1_3
file_1_4
file_1_5

file_2
file_2_2

other_file_1
other_file_10
other_file_11
other_file_12
other_file_13
other_file_14
other_file_15
other_file_16
other_file_17
other_file_2
other_file_3
other_file_4
other_file_5
other_file_6
other_file_7
other_file_8
other_file_9

Files:                  39
Total size:             3.8 GiB (4089446400 B)
Duplicate groups:       3
Duplicate count:        21
Size of duplicates:     2.1 GiB (2202009600 B)


real    0m0.518s
user    0m0.306s
sys     0m0.089s

If you know that nothing has changed, you can skip the scan completely; the program will then simply print the contents of the map. This might save you a few minutes if you have hundreds of thousands of files. Don't use this option unless you know what you are doing.

$ time ~/bin/dupefinder -import-map-file ~/100M_files.map -skip-scan .
Imported files: 39
Skipping scan
other_file_1
other_file_10
other_file_11
other_file_12
other_file_13
other_file_14
other_file_15
other_file_16
other_file_17
other_file_2
other_file_3
other_file_4
other_file_5
other_file_6
other_file_7
other_file_8
other_file_9

file_1
file_1_2
file_1_3
file_1_4
file_1_5

file_2
file_2_2

Files:                  39
Total size:             3.8 GiB (4089446400 B)
Duplicate groups:       3
Duplicate count:        21
Size of duplicates:     2.1 GiB (2202009600 B)


real    0m0.009s
user    0m0.001s
sys     0m0.003s

Note that it still shows file_1, which doesn't exist anymore.

It is possible to export a hash file (MD5SUMS). In this case, you could use the -skip-scan option to disable the scan and just copy the imported data to a hash file (even from a different directory).

$ ~/bin/dupefinder -import-map-file ~/100M_files.map -export-md5sums-file /tmp/100M.md5 -list-duplicate-groups=false -show-summary=false . && cat /tmp/100M.md5
Imported files: 39
Scanning...

1a4d8a2535a1111bf656fa316385b1ff  file_2
2f282b84e7e608d5852449ed940bfc51  other_file_16
2f282b84e7e608d5852449ed940bfc51  other_file_6
2f282b84e7e608d5852449ed940bfc51  other_file_4
2f282b84e7e608d5852449ed940bfc51  other_file_15
dc7b3ef428499c47bb86eff4fbf21d8c  file_13
2f282b84e7e608d5852449ed940bfc51  other_file_13
8dd8bcac634b52d26038b59566270c2f  file_4
c3151728913ed69fd8d9293fba6fa6ae  file_5
d499f005ab78958a00ba7e9b80f934e3  file_1_4
2f282b84e7e608d5852449ed940bfc51  other_file_14
5c00bb921d8f7d50243192c25625ad76  file_6
d499f005ab78958a00ba7e9b80f934e3  file_1_5
2f282b84e7e608d5852449ed940bfc51  other_file_7
d499f005ab78958a00ba7e9b80f934e3  file_1_2
d499f005ab78958a00ba7e9b80f934e3  file_1_1
2f282b84e7e608d5852449ed940bfc51  other_file_8
185d6a73385fafee6458007de7aace5b  file_17
2f282b84e7e608d5852449ed940bfc51  other_file_17
3cdaa2bb215c0ab803098f792c92c5a7  file_16
1a4d8a2535a1111bf656fa316385b1ff  file_2_2
2f282b84e7e608d5852449ed940bfc51  other_file_12
2f282b84e7e608d5852449ed940bfc51  other_file_2
79c42d817cac42b5441556c23453390b  file_10
8925c8e1feee71bd0abd6e3aa1080b05  file_15
2f282b84e7e608d5852449ed940bfc51  other_file_9
400687945e95e7d7f84fbee9a3eddd1c  file_3
11c60fb54e571af5cb1bee3dcaff09ec  file_7
96f4950095f98cdb36817b34158da79d  file_9
418a60cee64e85170a642dea73c19ef4  file_12
8333e43c8302fc857c04c3d06c07e03f  file_11
c5961fd5ab54cd97af69593784c5dede  file_14
2f282b84e7e608d5852449ed940bfc51  other_file_11
d499f005ab78958a00ba7e9b80f934e3  file_1_3
0f56081d87d47a6180fc217f931cda3e  file_8
2f282b84e7e608d5852449ed940bfc51  other_file_10
2f282b84e7e608d5852449ed940bfc51  other_file_5
2f282b84e7e608d5852449ed940bfc51  other_file_1
2f282b84e7e608d5852449ed940bfc51  other_file_3

DupeFinder works with huge directories too, although a minimum of 2 GB of available RAM is recommended. In this particular example, more than half of the space used by all scanned files is wasted by duplicates and can be freed by linking them together:

Files:                  2342714
Total size:             10 TiB (11203860850082 B)
Duplicate groups:       315596
Duplicate count:        1865709
Size of duplicates:     6.4 TiB (7085231446475 B)

In such extreme cases, you might want to clone the filesystem (zfs clone ..., btrfs subvolume snapshot ...), deduplicate the clone and then use it to replace the original filesystem (for example, by deleting the original and creating another clone with the original name).

Author

Philip Seeger (philip@philip-seeger.de)

License

Please see the file called LICENSE.
