pathwalk

v0.0.0-...-9f62fcc
Published: May 12, 2020 License: MIT Imports: 13 Imported by: 0

README

Build Status Coverage Status GoDoc Go Report Card

Overview

This package walks and processes a filesystem tree in parallel. A CLI frontend is also provided.
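To make the idea of a parallel walk concrete, here is a minimal standard-library sketch. It is not this package's implementation: the names `visitFunc`, `walkParallel`, and `countEntries` are hypothetical, and the goroutine-per-directory design with a semaphore is an assumption chosen for brevity.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"sync/atomic"
)

// visitFunc has the same shape as the WalkFunc type documented for this
// package. It is redeclared here so the sketch compiles on its own.
type visitFunc func(parentPath string, info os.FileInfo) error

// walkParallel visits every entry below root. Each directory is read in its
// own goroutine, and at most limit directory reads are in flight at once.
// The callback may be invoked from several goroutines concurrently.
func walkParallel(root string, limit int, visit visitFunc) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	sem := make(chan struct{}, limit)

	// record keeps only the first error that any goroutine reports.
	record := func(err error) {
		mu.Lock()
		defer mu.Unlock()
		if firstErr == nil {
			firstErr = err
		}
	}

	var walkDir func(dir string)
	walkDir = func(dir string) {
		defer wg.Done()

		sem <- struct{}{} // acquire a read slot
		entries, err := os.ReadDir(dir)
		<-sem // release before visiting, so child directories never wait on us
		if err != nil {
			record(err)
			return
		}

		for _, entry := range entries {
			info, err := entry.Info()
			if err != nil {
				continue // assumption: skip entries that cannot be stat'ed
			}
			if err := visit(dir, info); err != nil {
				record(err)
				continue
			}
			if entry.IsDir() {
				wg.Add(1)
				go walkDir(filepath.Join(dir, entry.Name()))
			}
		}
	}

	wg.Add(1)
	go walkDir(root)
	wg.Wait()

	return firstErr
}

// countEntries stages a small temporary tree (3 files plus 1 subdirectory
// holding 2 files), walks it, and returns how many entries were visited.
func countEntries() (int64, error) {
	root, err := os.MkdirTemp("", "walk-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)

	if err := os.Mkdir(filepath.Join(root, "sub"), 0o755); err != nil {
		return 0, err
	}
	for _, p := range []string{"a", "b", "c", "sub/d", "sub/e"} {
		full := filepath.Join(root, filepath.FromSlash(p))
		if err := os.WriteFile(full, nil, 0o644); err != nil {
			return 0, err
		}
	}

	var visited int64
	err = walkParallel(root, 4, func(parentPath string, info os.FileInfo) error {
		atomic.AddInt64(&visited, 1)
		return nil
	})
	return visited, err
}

func main() {
	n, err := countEntries()
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	fmt.Println("entries visited:", n)
}
```

The semaphore is held only while a directory is being read, so a deep tree cannot exhaust the slots and stall. The package itself exposes worker-count, queue-length, and batch-size parameters, which suggests a worker-pool design rather than the one sketched here.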

Features

  • The worker-count, queue-length, and batch-size parameters can be set to non-default values (for technical nit-pickers).
  • Stat errors on directories and files will be ignored.
  • Output can be formatted as JSON.
  • Non-JSON output lines can include a file-type prefix.
  • Both filename/extension- and directory-based filters are supported.
  • Filters support both include and exclude rules.
  • Directory-based filters support ** for recursive matching.
  • Filters support case-insensitivity.
  • There is full reporting with performance and directory metrics.
  • MIME types can be detected and included in the output (just in the CLI, for convenience).
  • Verbosity can be enabled to provide insight into include/exclude-related disqualifications.
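The filename-filtering features above can be sketched with the standard library alone. This is an illustration and not the package's own matching code: `nameFilter` and its methods are hypothetical names, and the precedence used here (an exclude hit always wins, then an include must match if any include is given) is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// nameFilter is a hypothetical stand-in for the filename half of a filter.
type nameFilter struct {
	Include         []string
	Exclude         []string
	CaseInsensitive bool
}

// matchesAny reports whether name matches at least one glob pattern.
// Case-insensitivity is approximated by lowercasing both sides, which also
// lowercases character classes such as [A-Z] inside a pattern.
func (f nameFilter) matchesAny(patterns []string, name string) bool {
	if f.CaseInsensitive {
		name = strings.ToLower(name)
	}
	for _, pattern := range patterns {
		if f.CaseInsensitive {
			pattern = strings.ToLower(pattern)
		}
		// A malformed pattern is treated as a non-match.
		if ok, err := filepath.Match(pattern, name); err == nil && ok {
			return true
		}
	}
	return false
}

// keep decides whether a filename survives the filter. Assumed precedence:
// any exclude hit rejects the name; otherwise, if include patterns were
// given, at least one of them has to match.
func (f nameFilter) keep(name string) bool {
	if f.matchesAny(f.Exclude, name) {
		return false
	}
	if len(f.Include) > 0 {
		return f.matchesAny(f.Include, name)
	}
	return true
}

func main() {
	f := nameFilter{Exclude: []string{"*.zip"}, CaseInsensitive: true}
	for _, name := range []string{"blogs.zip", "ARCHIVE.ZIP", "20news-19997.tar.gz"} {
		fmt.Printf("%-22s keep=%v\n", name, f.keep(name))
	}
}
```

The sketch covers base filenames only. The recursive `**` matching that the directory-based filters support is not available in `filepath.Match` and is left out.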

Library Support

For source-code documentation and examples, see the GoDoc badge/link above.

Command-Line Support

Default output format:

$ go run command/go-walk/main.go ~/Downloads/nlp
20news-19997.tar.gz
ICPSR_34802-V1.zip
gdelt_20191018051500
trainingandtestdata.zip
blogs.zip
gdelt_20191018051500/20191018051500.gkg.csv.zip
gdelt_20191018051500/20191018051500.mentions.CSV.zip
gdelt_20191018051500/20191018051500.gkg.csv
gdelt_20191018051500/20191018051500.mentions.CSV
gdelt_20191018051500/20191018051500.export.CSV
gdelt_20191018051500/20191018051500.export.CSV.zip

With types:

$ go run command/go-walk/main.go ~/Downloads/nlp --type
f 20news-19997.tar.gz
f ICPSR_34802-V1.zip
d gdelt_20191018051500
f trainingandtestdata.zip
f blogs.zip
f gdelt_20191018051500/20191018051500.mentions.CSV
f gdelt_20191018051500/20191018051500.export.CSV
f gdelt_20191018051500/20191018051500.gkg.csv.zip
f gdelt_20191018051500/20191018051500.export.CSV.zip
f gdelt_20191018051500/20191018051500.mentions.CSV.zip
f gdelt_20191018051500/20191018051500.gkg.csv

With mime-types:

$ go run command/go-walk/main.go ~/Downloads/nlp --type --mime-type
d - gdelt_20191018051500
f application/zip ICPSR_34802-V1.zip
f application/zip trainingandtestdata.zip
f application/x-gzip 20news-19997.tar.gz
f application/zip gdelt_20191018051500/20191018051500.export.CSV.zip
f application/zip gdelt_20191018051500/20191018051500.mentions.CSV.zip
f text/plain charset=utf-8 gdelt_20191018051500/20191018051500.gkg.csv
f text/plain charset=utf-8 gdelt_20191018051500/20191018051500.mentions.CSV
f application/zip gdelt_20191018051500/20191018051500.gkg.csv.zip
f application/zip blogs.zip
f text/plain charset=utf-8 gdelt_20191018051500/20191018051500.export.CSV

As JSON:

$ go run command/go-walk/main.go ~/Downloads/nlp --type --mime-type --json
[
    {
        "is_directory": false,
        "mime_type": "application/x-gzip",
        "mode": 420,
        "modified_time": "2019-10-17T02:05:47-04:00",
        "path": "20news-19997.tar.gz",
        "size": 17332201
    },
    {
        "is_directory": false,
        "mime_type": "application/zip",
        "mode": 420,
        "modified_time": "2019-10-17T02:19:03-04:00",
        "path": "trainingandtestdata.zip",
        "size": 81363704
    },
    {
        "is_directory": false,
        "mime_type": "application/zip",
        "mode": 420,
        "modified_time": "2019-10-17T01:58:02-04:00",
        "path": "blogs.zip",
        "size": 312949121
    },
    {
        "is_directory": false,
...

Just directories:

$ go run command/go-walk/main.go ~/Downloads/nlp --just-directories
gdelt_20191018051500

Just include the one subdirectory (with verbosity):

$ go run command/go-walk/main.go ~/Downloads/nlp --include-path 'gdelt_20191018051500' --verbose
2020/05/12 04:08:19 pathwalk.walk: [DEBUG]  Directory excluded: []
gdelt_20191018051500
gdelt_20191018051500/20191018051500.export.CSV
gdelt_20191018051500/20191018051500.export.CSV.zip
gdelt_20191018051500/20191018051500.mentions.CSV.zip
gdelt_20191018051500/20191018051500.gkg.csv
gdelt_20191018051500/20191018051500.gkg.csv.zip
gdelt_20191018051500/20191018051500.mentions.CSV

Exclude all ZIP-files (with verbosity):

$ go run command/go-walk/main.go ~/Downloads/nlp --exclude-filename '*.zip' --verbose
20news-19997.tar.gz
2020/05/12 04:09:34 pathwalk.walk: [DEBUG]  File excluded: [ICPSR_34802-V1.zip]
2020/05/12 04:09:34 pathwalk.walk: [DEBUG]  File excluded: [trainingandtestdata.zip]
2020/05/12 04:09:34 pathwalk.walk: [DEBUG]  File excluded: [blogs.zip]
gdelt_20191018051500
2020/05/12 04:09:34 pathwalk.walk: [DEBUG]  File excluded: [20191018051500.export.CSV.zip]
2020/05/12 04:09:34 pathwalk.walk: [DEBUG]  File excluded: [20191018051500.mentions.CSV.zip]
2020/05/12 04:09:34 pathwalk.walk: [DEBUG]  File excluded: [20191018051500.gkg.csv.zip]
gdelt_20191018051500/20191018051500.mentions.CSV
gdelt_20191018051500/20191018051500.gkg.csv
gdelt_20191018051500/20191018051500.export.CSV

Show statistics:

$ time go run command/go-walk/main.go ~/Pictures --stats >/dev/null
Processing Statistics
=====================
JobsDispatchedToNewWorker: (400)
JobsDispatchedToIdleWorker: (1001)
FilesVisited: (1361)
DirectoriesVisited: (15)
EntryBatchesProcessed: (25)
IdleWorkerTime: (2.180) seconds
DirectoriesIgnored: (0)
PathFilterIncludes: (0)
PathFilterExcludes: (0)
FileFilterIncludes: (0)
FileFilterExcludes: (0)



real    0m1.553s
user    0m0.852s
sys     0m0.277s


$ time go run command/go-walk/main.go ~/Downloads --stats >/dev/null
Processing Statistics
=====================
JobsDispatchedToNewWorker: (400)
JobsDispatchedToIdleWorker: (33560)
FilesVisited: (31014)
DirectoriesVisited: (1361)
EntryBatchesProcessed: (1585)
IdleWorkerTime: (172.312) seconds
DirectoriesIgnored: (0)
PathFilterIncludes: (0)
PathFilterExcludes: (0)
FileFilterIncludes: (0)
FileFilterExcludes: (0)



real    0m1.577s
user    0m1.434s
sys     0m0.629s



$ time go run command/go-walk/main.go ~/Downloads --stats --include-filename "*.jpg" >/dev/null
Processing Statistics
=====================
JobsDispatchedToNewWorker: (266)
JobsDispatchedToIdleWorker: (2955)
FilesVisited: (275)
DirectoriesVisited: (1361)
EntryBatchesProcessed: (1585)
IdleWorkerTime: (17.404) seconds
DirectoriesIgnored: (0)
PathFilterIncludes: (1361)
PathFilterExcludes: (0)
FileFilterIncludes: (275)
FileFilterExcludes: (30739)



real    0m1.561s
user    0m0.885s
sys     0m0.430s

The examples above invoke the tool with "go run", but it is recommended to build the tool first and then call the resulting binary.

Documentation

Constants

This section is empty.

Variables

var (
	// ErrSkipDirectory can be returned by the visitor for a directory in order
	// to skip walking its contents.
	ErrSkipDirectory = errors.New("skip directory")
)
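As an illustration of how a visitor might use this sentinel, the sketch below returns it for dot-directories. `ErrSkipDirectory` and `WalkFunc` are redeclared locally as stand-ins that mirror the documented declarations, so the example compiles without the module; `skipHidden` and `fakeInfo` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
	"time"
)

// Stand-ins that mirror the documented declarations. In real code these come
// from the package itself.
var ErrSkipDirectory = errors.New("skip directory")

type WalkFunc func(parentPath string, info os.FileInfo) (err error)

// skipHidden asks the walker not to descend into directories whose names
// start with a dot. Hidden files are left alone.
func skipHidden(parentPath string, info os.FileInfo) error {
	if info.IsDir() && strings.HasPrefix(info.Name(), ".") {
		return ErrSkipDirectory
	}
	return nil
}

// fakeInfo is a minimal os.FileInfo used only to exercise the callback.
type fakeInfo struct {
	name string
	dir  bool
}

func (fi fakeInfo) Name() string { return fi.name }
func (fi fakeInfo) Size() int64  { return 0 }
func (fi fakeInfo) Mode() os.FileMode {
	if fi.dir {
		return os.ModeDir | 0o755
	}
	return 0o644
}
func (fi fakeInfo) ModTime() time.Time { return time.Time{} }
func (fi fakeInfo) IsDir() bool        { return fi.dir }
func (fi fakeInfo) Sys() interface{}   { return nil }

func main() {
	var visit WalkFunc = skipHidden
	for _, info := range []fakeInfo{{".git", true}, {"src", true}, {".env", false}} {
		err := visit("/project", info)
		fmt.Printf("%-5s skip=%v\n", info.name, errors.Is(err, ErrSkipDirectory))
	}
}
```

Only the `.git` directory produces the sentinel here. According to the Stats documentation, directories skipped this way are counted in `DirectoriesIgnored`.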

Functions

This section is empty.

Types

type Filter

type Filter struct {
	IncludePaths     []string
	ExcludePaths     []string
	IncludeFilenames []string
	ExcludeFilenames []string

	IsCaseInsensitive bool
}

Filter defines the parameters that can be provided by the user to control the walk.

type Stats

type Stats struct {
	// JobsDispatchedToNewWorker is the number of workers that were started to
	// process a job.
	JobsDispatchedToNewWorker int

	// JobsDispatchedToIdleWorker is the number of jobs that were dispatched to
	// an available, idle worker rather than starting a new one.
	JobsDispatchedToIdleWorker int

	// FilesVisited is the number of files that were visited.
	FilesVisited int

	// DirectoriesVisited is the number of directories that were visited.
	DirectoriesVisited int

	// EntryBatchesProcessed is the number of batches that directory entries
	// were parceled into while processing.
	EntryBatchesProcessed int

	// IdleWorkerTime is the duration of all between-job time spent by workers.
	// Only includes time between jobs and time between last job and timeout
	// (leading to shutdown). Does not include time between the last job and a
	// closed channel being detected (which is not true idleness).
	IdleWorkerTime time.Duration

	// DirectoriesIgnored is the number of directories that were signaled to be
	// skipped using `ErrSkipDirectory`.
	DirectoriesIgnored int

	// PathFilterIncludes is the number of path include hits or exclude misses
	// if at least one filter rule was provided.
	PathFilterIncludes int

	// PathFilterExcludes is the number of path include misses or exclude hits
	// if at least one filter rule was provided.
	PathFilterExcludes int

	// FileFilterIncludes is the number of file include hits or exclude misses
	// if at least one filter rule was provided.
	FileFilterIncludes int

	// FileFilterExcludes is the number of file include misses or exclude hits
	// if at least one filter rule was provided.
	FileFilterExcludes int
}

Stats describes all stats collected by the walking process.
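One way to read these counters is to compare the two dispatch fields. The sketch below computes the share of jobs that went to an already-running idle worker. `workerStats` is a local stand-in that copies two field names from the struct above, and `idleReuseRatio` is a hypothetical helper that is not part of the package.

```go
package main

import "fmt"

// workerStats copies two fields of the documented Stats struct so that the
// sketch compiles on its own.
type workerStats struct {
	JobsDispatchedToNewWorker  int
	JobsDispatchedToIdleWorker int
}

// idleReuseRatio returns the fraction of all dispatched jobs that were handed
// to an idle worker instead of a newly started one. It returns 0 when no job
// was dispatched.
func idleReuseRatio(s workerStats) float64 {
	total := s.JobsDispatchedToNewWorker + s.JobsDispatchedToIdleWorker
	if total == 0 {
		return 0
	}
	return float64(s.JobsDispatchedToIdleWorker) / float64(total)
}

func main() {
	// Figures taken from the "~/Downloads --stats" run shown in the README.
	s := workerStats{
		JobsDispatchedToNewWorker:  400,
		JobsDispatchedToIdleWorker: 33560,
	}
	fmt.Printf("idle-worker reuse: %.1f%%\n", 100*idleReuseRatio(s))
}
```

A ratio close to 1 suggests that the pool reached its worker limit early and then reused existing workers for most of the walk.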

func (Stats) Dump

func (stats Stats) Dump()

Dump prints all statistics.

type Walk

type Walk struct {
	// contains filtered or unexported fields
}

Walk knows how to traverse a tree in parallel.

func NewWalk

func NewWalk(rootPath string, walkFunc WalkFunc) (walk *Walk)

NewWalk returns a new Walk struct.

func (*Walk) HasFinished

func (walk *Walk) HasFinished() bool

HasFinished returns whether all entries have been visited and processed.

func (*Walk) InitSync

func (walk *Walk) InitSync()

InitSync sets up the synchronization state. This is isolated as a separate step to support testing.

func (*Walk) Run

func (walk *Walk) Run() (err error)

Run forks workers to process the tree. All workers will have quit by the time we return.

Example
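// Stage a test directory.

fileCount := 20
tempPath, _ := pwtesting.FillFlatTempPath(fileCount, nil)

// Remove the staged directory once the walk is done.

defer func() {
	os.RemoveAll(tempPath)
}()

// Walk. The callback receives the parent path and the entry's file info.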
walkFunc := func(parentPath string, info os.FileInfo) (err error) {
	// Do your business.

	return nil
}

walk := NewWalk(tempPath, walkFunc)

err := walk.Run()
log.PanicIf(err)

func (*Walk) SetBatchSize

func (walk *Walk) SetBatchSize(batchSize int)

SetBatchSize sets an alternative size for the parcels of directory entries dispatched into jobs.

func (*Walk) SetBufferSize

func (walk *Walk) SetBufferSize(bufferSize int)

SetBufferSize sets an alternative size for the job channel.

func (*Walk) SetConcurrency

func (walk *Walk) SetConcurrency(concurrency int)

SetConcurrency sets an alternative maximum number of workers.

func (*Walk) SetFilter

func (walk *Walk) SetFilter(filter Filter)

SetFilter sets filtering parameters for the next call to Run(). Behavior is undefined if this is changed *during* a call to `Run()`. The filters will be sorted automatically.

func (*Walk) SetGlobalTimeoutDuration

func (walk *Walk) SetGlobalTimeoutDuration(timeoutDuration time.Duration)

SetGlobalTimeoutDuration sets a non-default duration after which, if no activity has happened, the walk considers itself deadlocked.

func (*Walk) Stats

func (walk *Walk) Stats() Stats

Stats returns statistics about the last walking operation.

func (*Walk) Stop

func (walk *Walk) Stop()

Stop will signal all of the workers to terminate if Run() has not yet returned. This is provided for the user to call as a result of some logic in the callback that calls for immediate return.

type WalkFunc

type WalkFunc func(parentPath string, info os.FileInfo) (err error)

WalkFunc is the function type for the callback.
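Because the walk runs in parallel, it is safest to assume that the callback can be invoked from several workers at once (the documentation does not state this explicitly). The sketch below shows a callback that guards its shared state with a mutex. `WalkFunc` is redeclared locally as a stand-in, and `sizeByExt` and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// WalkFunc mirrors the documented callback type so the sketch compiles
// without the module.
type WalkFunc func(parentPath string, info os.FileInfo) (err error)

// sizeByExt accumulates the total file size per extension. The mutex makes
// Visit safe to call from several goroutines.
type sizeByExt struct {
	mu     sync.Mutex
	totals map[string]int64
}

func newSizeByExt() *sizeByExt {
	return &sizeByExt{totals: make(map[string]int64)}
}

// Visit has the WalkFunc shape. Directories are ignored.
func (s *sizeByExt) Visit(parentPath string, info os.FileInfo) error {
	if info.IsDir() {
		return nil
	}
	ext := filepath.Ext(info.Name())

	s.mu.Lock()
	s.totals[ext] += info.Size()
	s.mu.Unlock()

	return nil
}

// Total returns the accumulated size for one extension.
func (s *sizeByExt) Total(ext string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.totals[ext]
}

// demo stages two small files (5 and 3 bytes), stats them, and feeds them to
// the callback from separate goroutines, as a parallel walker might.
func demo() (int64, error) {
	dir, err := os.MkdirTemp("", "walkfunc-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	agg := newSizeByExt()
	var visit WalkFunc = agg.Visit

	var wg sync.WaitGroup
	for name, body := range map[string]string{"a.csv": "12345", "b.csv": "678"} {
		p := filepath.Join(dir, name)
		if err := os.WriteFile(p, []byte(body), 0o644); err != nil {
			return 0, err
		}
		info, err := os.Stat(p)
		if err != nil {
			return 0, err
		}
		wg.Add(1)
		go func(info os.FileInfo) {
			defer wg.Done()
			_ = visit(dir, info) // Visit never fails in this sketch
		}(info)
	}
	wg.Wait()

	return agg.Total(".csv"), nil
}

func main() {
	total, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("bytes in .csv files:", total)
}
```

With the real package, a method value such as `agg.Visit` could be passed as the second argument to `NewWalk`, since it has the same signature as `WalkFunc`.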

Directories

Path Synopsis
command
internal
