dupescout

v0.0.0-...-ed77a38
Published: Jan 27, 2024 License: MIT Imports: 19 Imported by: 1

README

dupescout

A tiny Go package that concurrently finds duplicate files in the given directories. By default, two files are treated as duplicates when their hashes match (several hashing functions are provided), but the logic can be customized by providing a custom dupescout.KeyGeneratorFunc function (see: key-generator).

Installation

go get github.com/ricci2511/riccis-homelab-utils/dupescout

Usage

The package exposes two functions: GetResults and StreamResults. Both take a dupescout.Cfg struct to configure the search.

  • GetResults blocks until the search is complete, then returns a slice of duplicate file paths.
  • StreamResults takes a channel of type chan []string, to which it sends duplicate file paths as they are found. This is useful if you want to process results as they come in instead of receiving them all at once when the search completes.

Check out dedupsc for an example of how to use this package.

package main

import (
    "fmt"
    "log"

    "github.com/ricci2511/riccis-homelab-utils/dupescout"
)

func main() {
    filters := dupescout.Filters{
        HiddenInclude: true,
        DirsExclude:   []string{"node_modules"},
        ExtInclude:    []string{".txt", ".json", ".go"}, // only search for .txt, .json and .go files
    }
    cfg := dupescout.Cfg{
        Paths:   []string{"~/Dev", "~/Documents"},
        Filters: filters,
    }

    fmt.Println("Searching...")

    // Blocks until the search is complete
    dupes, err := dupescout.GetResults(cfg)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Search complete")

    // Print every duplicate path that was found
    for _, path := range dupes {
        fmt.Println(path)
    }
}

The dupescout.Cfg struct has the following fields as of now:

type Cfg struct {
	Paths                         // paths to search in for duplicates
	Filters                       // various filters for the search (see filters.go)
	KeyGenerator KeyGeneratorFunc // key generator function to use
	Workers      int              // number of workers (defaults to GOMAXPROCS)
}

key-generator

The KeyGenerator field allows you to specify a custom function that generates a key for a given file path. Each key maps to a slice of the file paths that share it, i.e. the duplicates.

Several functions are already provided. The default is dupescout.Crc32HashKeyGenerator, which hashes the first 16KB of the file contents with crc32. The functions prefixed with Full hash the entire file contents instead of just the first 16KB, which is much slower but should be more accurate in the rare cases where the first 16KB are not enough to tell files apart. The available KeyGenerator functions are:

  • dupescout.Crc32HashKeyGenerator
  • dupescout.FullCrc32HashKeyGenerator
  • dupescout.Sha256HashKeyGenerator
  • dupescout.FullSha256HashKeyGenerator

If you want to use custom logic to generate keys, pass any function that matches the dupescout.KeyGeneratorFunc signature. An example can be found here.

Documentation

Index

Constants

This section is empty.

Variables

var (
	// Used to skip a file during key generation.
	//
	// This kind of error is ignored and not returned to the caller
	// of dupescout.GetResults() or dupescout.StreamResults().
	ErrSkipFile = fmt.Errorf("skip file")
)

Functions

func Crc32HashKeyGenerator

func Crc32HashKeyGenerator(path string) (string, error)

Crc32HashKeyGenerator is the default if no KeyGenerator is specified.

Generates a crc32 hash of the first 16KB of the file contents as the key, which should be enough to achieve a good balance of uniqueness, collision resistance, and performance for most files.

func FullCrc32HashKeyGenerator

func FullCrc32HashKeyGenerator(path string) (string, error)

Generates a crc32 hash of the entire file contents as the key, which is a lot slower than Crc32HashKeyGenerator but should be more accurate.

func FullSha256HashKeyGenerator

func FullSha256HashKeyGenerator(path string) (string, error)

Generates a sha256 hash of the entire file contents as the key.

func GetResults

func GetResults(c Cfg) ([]string, error)

Runs the duplicate search and returns a slice of all duplicate paths.

func Sha256HashKeyGenerator

func Sha256HashKeyGenerator(path string) (string, error)

Generates a sha256 hash of the first 16KB of the file contents as the key.

func StreamResults

func StreamResults(c Cfg, dupesChan chan []string) error

Runs the duplicate search and streams the duplicate paths to the provided channel as they are found.
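
One way to consume such a stream is sketched below. The docs above do not say whether StreamResults closes the channel, so the loop tolerates both cases. streamStub is a hypothetical stand-in with the same shape as StreamResults (minus the Cfg argument), and consume is an illustrative helper, not part of the package.

```go
package main

import "fmt"

// streamStub is a hypothetical producer standing in for
// dupescout.StreamResults: it sends slices of paths on out and returns an
// error when done. It does not close the channel.
func streamStub(groups [][]string, out chan []string) error {
	for _, g := range groups {
		out <- g
	}
	return nil
}

// consume runs produce in a goroutine and calls handle for every slice it
// sends, until produce returns. It works whether or not the producer closes
// the channel, as long as the producer stops sending before it returns.
func consume(produce func(chan []string) error, handle func([]string)) error {
	out := make(chan []string) // unbuffered: every send is received before produce returns
	done := make(chan error, 1)
	go func() { done <- produce(out) }()

	for {
		select {
		case g, ok := <-out:
			if !ok {
				out = nil // channel was closed: stop selecting on it
				continue
			}
			handle(g)
		case err := <-done:
			return err
		}
	}
}

func main() {
	groups := [][]string{
		{"/a/report.txt", "/b/report.txt"},
		{"/a/logo.png", "/c/logo.png", "/d/logo.png"},
	}
	err := consume(
		func(out chan []string) error { return streamStub(groups, out) },
		func(g []string) { fmt.Println(len(g), "paths:", g) },
	)
	if err != nil {
		panic(err)
	}
}
```

With the real package, the producer closure would presumably become func(out chan []string) error { return dupescout.StreamResults(cfg, out) }.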

Types

type Cfg

type Cfg struct {
	KeyGenerator KeyGeneratorFunc // Function to generate a key based on the file path.
	Paths                         // List of paths to search in for duplicates.
	Filters                       // Filters to apply when searching for duplicates.
	Workers      int              // Number of workers to use when searching for duplicates.
}

func (*Cfg) String

func (c *Cfg) String() string

Returns a prettified string representation of the Cfg struct.

type Filters

type Filters struct {
	ExtInclude    FiltersList // List of file extensions to include.
	ExtExclude    FiltersList // List of file extensions to exclude.
	DirsExclude   FiltersList // List of directories or subdirectories to exclude.
	SkipSubdirs   bool        // Skip subdirectories.
	HiddenInclude bool        // Include hidden files and directories.
}

func (*Filters) String

func (f *Filters) String() string

Returns a prettified string representation of the Filters struct.

type FiltersList

type FiltersList []string

Satisfies the flag.Value interface. String values can be provided as a comma-separated or space-separated list.

`flag.Var(&cfg.DirsExclude, "exclude-dirs", "exclude directories or subdirectories")`

func (*FiltersList) Set

func (fl *FiltersList) Set(val string) error

func (*FiltersList) String

func (fl *FiltersList) String() string

type KeyGeneratorFunc

type KeyGeneratorFunc func(path string) (string, error)

KeyGeneratorFunc generates a key for a given file path. Each key is then mapped to the list of file paths that share it (the duplicates).

The provided KeyGeneratorFuncs hash the file contents to generate the key, but the logic can be anything as long as it's deterministic. For example, you could generate a key based on the file name, size, etc.

type Paths

type Paths []string

Satisfies the flag.Value interface. String values can be provided as a comma-separated or space-separated list.

`flag.Var(&cfg.Paths, "p", "list of paths to search in for duplicates")`

func (*Paths) Set

func (p *Paths) Set(val string) error

func (*Paths) String

func (p *Paths) String() string
