idu

command module
v0.0.0-...-2c34a8a
Published: Feb 12, 2024 License: Apache-2.0 Imports: 47 Imported by: 0

README

idu - incremental, database backed, du.

idu analyzes a file system to build a database that supports incremental re-scanning, which makes it practical for large local and cloud based filesystems. The analysis takes the form of scanning a filesystem, much like du does, to gather information on file counts and sizes. An important difference from du is that idu is heavily optimized for concurrent execution and can easily handle issuing 1000s of simultaneous stat requests and directory scans. For example, it can scan an Apple Silicon MacBook in around 10 minutes and a 14M+ file Lustre filesystem in around 50 minutes. idu is designed to be extensible to cloud based filesystems such as AWS' S3 or GCP's Cloud Storage, though this is not yet implemented. From the database it can report aggregate statistics, such as total file counts and disk usage, and generate reports in JSON and markdown formats. It is also possible to query the database in a variety of ways, including on a per-user basis.

Note that cloud based filesystems generally do not have a concrete directory structure in the same way that local filesystems do. S3 filenames, for example, have slash separated components, but each component is not a directory in the sense that a user can cd to it and list files relative to it. Instead, the separators are purely a convention and S3 filenames can be accessed independently of the slash separated components. For example, s3:/aa/bb/cc can be listed as aws s3 ls s3:/aa/b or aws s3 ls s3:/aa/bb/. The former will list all files whose names start with the prefix s3:/aa/b, whereas the latter will only list files whose names start with s3:/aa/bb/. Since idu is intended to work with cloud based filesystems, the term prefix is often used instead of, or along with, directory. Differences in behaviour for different filesystems will be called out as they are added. Currently only local filesystems are supported.
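The difference between prefix matching and directory listing can be illustrated with a minimal sketch. The key names and the listByPrefix helper are hypothetical and are not part of idu; the sketch only shows that a plain string prefix match gives the slash no special meaning.

```go
package main

import (
	"fmt"
	"strings"
)

// listByPrefix returns every key that starts with prefix. This mirrors
// how object stores match names: the slash is just another character.
func listByPrefix(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	// Hypothetical object names, used only for illustration.
	keys := []string{"/aa/b", "/aa/bb/cc", "/aa/bb/dd", "/aa/bcd"}
	fmt.Println(listByPrefix(keys, "/aa/b"))   // every key matches
	fmt.Println(listByPrefix(keys, "/aa/bb/")) // only the keys below /aa/bb/
}
```

A local filesystem has no equivalent of the first call: /aa/b is either a directory or it is not, and /aa/bcd would never be returned when listing it.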

Configuration.

idu is configured using a yaml file, typically $HOME/.idu.yml, but this can be overridden with --config. This configuration file is organized as a list of 'prefix' entries, each of which specifies a filesystem tree to be used with idu.

Each prefix entry specifies the tree to be scanned, the location of the database to be used/created and various, optional, configuration parameters. A minimal entry is as follows:

- prefix: /my/home/tree
  database: /my/home/database/location

Common options control the degree of concurrency to use when analyzing a prefix. These are:

  concurrent_scans: 5000 # scan up to 5000 directories concurrently.
  concurrent_stats: 2000 # issue at most 2000 concurrent stat operations.
  concurrent_stats_threshold: 10 # issue asynchronous stats if the number of files in a directory exceeds 10.
  scan_size: 2000 # scan 2000 items at a time from each directory
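How a stat limit and a threshold of this kind can interact is sketched below. This is not idu's implementation: statAll and sampleDir are hypothetical helpers, and the buffered channel is simply one common Go way to cap the number of operations in flight, standing in for concurrent_stats, with the threshold argument standing in for concurrent_stats_threshold.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// statAll returns the size of every name in dir (0 if the stat fails).
// Directories with at most threshold entries are handled sequentially;
// larger ones use one goroutine per entry, with sem capping how many
// stat calls are in flight at once.
func statAll(dir string, names []string, threshold int, sem chan struct{}) []int64 {
	sizes := make([]int64, len(names))
	statOne := func(i int) {
		if info, err := os.Lstat(filepath.Join(dir, names[i])); err == nil {
			sizes[i] = info.Size()
		}
	}
	if len(names) <= threshold {
		for i := range names {
			statOne(i)
		}
		return sizes
	}
	var wg sync.WaitGroup
	for i := range names {
		sem <- struct{}{} // blocks while the limit is reached
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			statOne(i)
		}(i)
	}
	wg.Wait()
	return sizes
}

// sampleDir creates a temporary directory holding n files, where file i
// contains i bytes. The caller removes it with the returned cleanup func.
func sampleDir(n int) (string, []string, func()) {
	dir, err := os.MkdirTemp("", "idu-sketch-")
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, n)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("file%02d", i)
		if err := os.WriteFile(filepath.Join(dir, name), make([]byte, i), 0o600); err != nil {
			panic(err)
		}
		names = append(names, name)
	}
	return dir, names, func() { os.RemoveAll(dir) }
}

func main() {
	dir, names, cleanup := sampleDir(20)
	defer cleanup()
	sem := make(chan struct{}, 4)         // at most 4 stats in flight
	sizes := statAll(dir, names, 10, sem) // 20 > 10, so stats run concurrently
	var total int64
	for _, s := range sizes {
		total += s
	}
	fmt.Println("total bytes:", total)
}
```

The threshold exists because spawning goroutines for a handful of files costs more than it saves; the limit exists because an unbounded number of simultaneous stat calls can overwhelm a filesystem.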

Additional options are available to specify exclusions and file system specific options.

The exclusions section can be used to exclude directories/prefixes and/or files that match the supplied regular expressions. On macOS systems, for example, it may be desirable to ignore the .DS_Store file and the CloudStorage directory, which can be achieved as follows:

  exclusions:
  - '.DS_Store$'
  - '^/Users/someone/Library/CloudStorage'
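The effect of such patterns can be checked with Go's standard regexp package. The sketch below is illustrative only: excluded and mustCompileAll are hypothetical helpers, the paths are made up, and it assumes a path is excluded when any pattern matches anywhere in it. The dot is escaped here because an unescaped . matches any character.

```go
package main

import (
	"fmt"
	"regexp"
)

// mustCompileAll compiles each expression, panicking on an invalid one.
func mustCompileAll(exprs ...string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(exprs))
	for _, e := range exprs {
		out = append(out, regexp.MustCompile(e))
	}
	return out
}

// excluded reports whether path matches any of the exclusion patterns.
func excluded(patterns []*regexp.Regexp, path string) bool {
	for _, re := range patterns {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	// Patterns modeled on the configuration above; paths are hypothetical.
	patterns := mustCompileAll(`\.DS_Store$`, `^/Users/someone/Library/CloudStorage`)
	for _, p := range []string{
		"/Users/someone/Documents/.DS_Store",
		"/Users/someone/Library/CloudStorage/Drive/a.txt",
		"/Users/someone/Documents/notes.txt",
	} {
		fmt.Printf("%-50s excluded=%v\n", p, excluded(patterns, p))
	}
}
```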

It is possible to specify the file system separator (/ for Unix, \ for Windows).

  separator: \

The layout section is used to calculate disk usage by taking into account file system block sizes, or more complex structures such as RAID.

  layout:
    type: block
    block_size: 4096
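As a worked example of why block size matters, the sketch below rounds a file size up to a whole number of blocks. It assumes that a block layout charges each file for complete blocks; blockUsage is a hypothetical helper and is not taken from idu.

```go
package main

import "fmt"

// blockUsage rounds size up to a whole number of blocks of blockSize
// bytes. A non-positive blockSize leaves the size unchanged.
func blockUsage(size, blockSize int64) int64 {
	if size <= 0 {
		return 0
	}
	if blockSize <= 0 {
		return size
	}
	blocks := (size + blockSize - 1) / blockSize
	return blocks * blockSize
}

func main() {
	for _, size := range []int64{0, 1, 4096, 4097} {
		fmt.Printf("%d bytes -> %d bytes on disk\n", size, blockUsage(size, 4096))
	}
}
```

With a 4096 byte block, a 1 byte file and a 4096 byte file both occupy one block, while a 4097 byte file occupies two, so a tree of many small files can use far more disk than the sum of its file sizes suggests.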

Common Use

Given a valid configuration file (shown below), idu can be used as outlined below.

- prefix: /projects/yourshared-project/
  database: /projects/yourshared-project/.idu/database

Common usage is as follows:

$ idu analyze /projects/yourshared-project/
$ idu errors /projects/yourshared-project/
$ idu stats compute --stats-dir=./stats /projects/yourshared-project/
$ idu stats view ./stats/latest.idustats
$ idu reports generate ./stats/latest.idustats

As idu runs it will print various statistics that follow its progress. idu may be safely interrupted and restarted (see Incremental Updates below).

Once complete, it's good practice to check whether idu analyze encountered any errors, which are also written to the database, by running idu errors as shown above. Note that errors are common and are most often due to permissions problems. idu records errors and leaves it to the user to decide whether they are relevant or not; for example, is a lot of disk usage hidden behind a path that is inaccessible due to permissions?

stats compute <prefix> will compute stats from the database and store them in a timestamped file in --stats-dir (it will also create a soft link, latest.idustats, to the file produced). stats view <idustats-file> can be used to read the stats from the database and print them to stdout. reports generate <idustats-file> will generate a markdown report of the stats and write it to stdout.

Per-user or per-group statistics can be viewed as follows:

$ idu stats view --user=<user> <idustats-file>
$ idu stats view --group=<group> <idustats-file>


Anticipated Changes and Improvements

Cloud

idu was designed with cloud filesystems in mind, and support for GCP's Cloud Storage and AWS S3 will be added in the near future.


Documentation

Overview

Usage of idu

analyze disk usage using a database for incremental updates. Many of the commands
accept an expression that is used to restrict which prefixes/directories and
files are processed.

expression-syntax - display the syntax for the expression language supported by commands such as analyze, find etc.
          analyze - analyze the file system to build a database of directory and file metadata.
             logs - list the log of past operations stored in the database.
           errors - list the errors stored in the database
             find - find prefixes/files in the database that match the supplied expression.
            stats - compute and display statistics from the database.
          reports - generate and manage reports.
           config - describe the current configuration.
         database - database management commands.

global flags: [--config=$HOME/.idu.yml --gcpercent=50 --http= --log-dir= --profile= --stderr=false --units=decimal --v=0]

-config string
  configuration file (default "$HOME/.idu.yml")
-gcpercent int
  value to use for runtime/debug.SetGCPercent (default 50)
-http string
  set to a port to enable http serving of /debug/vars and profiling
-log-dir string
  directory to write log files to
-profile value
  write a profile on exit; the format is <profile-name>:<file> and the
  flag may be repeated to request multiple profile types, use cpu to request
  cpu profiling in addition to predefined profiles in runtime/pprof
-stderr
  write log messages to stderr
-units string
  display usage in decimal (KB) or binary (KiB) formats (default "decimal")
-v int
  lower values show more debugging output
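
The difference between the two -units settings can be illustrated as follows. This is a minimal sketch of decimal (powers of 1000) versus binary (powers of 1024) formatting; formatSize is a hypothetical helper and the exact output format used by idu may differ.

```go
package main

import "fmt"

// formatSize renders n bytes using decimal units (powers of 1000) or
// binary units (powers of 1024).
func formatSize(n int64, binary bool) string {
	base, units := 1000.0, []string{"B", "KB", "MB", "GB", "TB"}
	if binary {
		base, units = 1024.0, []string{"B", "KiB", "MiB", "GiB", "TiB"}
	}
	v, i := float64(n), 0
	for v >= base && i < len(units)-1 {
		v /= base
		i++
	}
	return fmt.Sprintf("%.1f %s", v, units[i])
}

func main() {
	const size = 1500000
	fmt.Println(formatSize(size, false)) // decimal
	fmt.Println(formatSize(size, true))  // binary
}
```

The same byte count yields a smaller number in binary units, which matters when comparing idu's output with tools that report in the other convention.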

Directories

Path Synopsis
boolexpr
Package boolexpr provides a wrapper for cloudeng.io/cmdutil/boolexpr for use with idu.