sys-file-indexer
A custom parallel file indexer and hasher
ABOUT
sys-file-indexer indexes the directory specified as the last argument, or
the current directory by default. It always writes the result to stdout.
MODES OF OPERATION
sys-file-indexer has the following modes of operation:
- Normal mode: outputs a special CSV file that combines the two datasets
  to be generated (sys_file and sys_file_metadata) and does not contain
  unique IDs. It must be processed further by split mode to be useful.
  No options are necessary.
  Normal mode can reuse the results of a previous run if they are
  supplied with the -delta option. In this case, sys-file-indexer reuses
  the data generated by the previous run for every file whose
  modification time has not changed.
- Split mode: takes the output of normal mode as input and generates the
  CSV for either the sys_file dataset or the sys_file_metadata dataset.
  See the options -ofile and -ometa.
- SQL mode: outputs ready-to-use SQL INSERT statements that can be piped
  directly to the database.
- SQL transform mode: reads a normal mode CSV and outputs SQL statements.
  This makes it possible to combine SQL output with partitioning, in two
  steps.
- Single mode: outputs one single CSV dataset. Useful for testing only.
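The -delta behaviour of normal mode can be pictured with a small shell sketch. It is not the tool's implementation: the function name hash_with_delta, the tab-separated "mtime, checksum, path" index format, and the use of cksum as the hash are all assumptions made for illustration, and stat -c %Y requires GNU coreutils.

```shell
# Sketch only: reuse a stored checksum when the file's mtime is unchanged.
# Hypothetical index format, one line per file: <mtime>\t<checksum>\t<path>
hash_with_delta() {
    file="$1"        # file to index
    prev_index="$2"  # index produced by a previous run (may be empty)

    # Modification time in epoch seconds (GNU stat).
    mtime=$(stat -c %Y "$file") || return 1

    # Look for an entry with the same path and the same mtime.
    # Values are passed through the environment so awk does not
    # interpret backslashes in the path.
    cached=$(P="$file" M="$mtime" awk -F'\t' \
        '$3 == ENVIRON["P"] && $1 == ENVIRON["M"] { print $2; exit }' \
        "$prev_index")

    if [ -n "$cached" ]; then
        sum="$cached"                            # unchanged: skip reading the file
    else
        sum=$(cksum < "$file" | cut -d' ' -f1)   # new or modified: hash it again
    fi

    printf '%s\t%s\t%s\n' "$mtime" "$sum" "$file"
}
```

Calling hash_with_delta on every file and collecting the output would produce the next index; only files whose modification time differs from the previous run are read again.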
EXAMPLE
Generate the normal mode CSV output:
$ sys-file-indexer . >../normal.csv
Update a previously generated normal mode CSV:
$ sys-file-indexer -delta=../normal.csv >../new-normal.csv
Split normal mode CSV to generate two datasets:
$ sys-file-indexer -ofile=normal.csv >sys_file.csv
$ sys-file-indexer -ometa=normal.csv >sys_file_metadata.csv
Generate metadata directly into the database (cannot use -delta):
$ sys-file-indexer -sql | mysql ...
Transform a normal-mode CSV into SQL:
$ sys-file-indexer -osql sys_file_metadata.csv | mysql ...
Delta mode and output to SQL (use tee to update the normal file in one go):
$ sys-file-indexer -delta normal.csv | sys-file-indexer -osql - | mysql ...
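The tee variant mentioned above can be sketched as a small wrapper. It uses only the -delta and -osql flags shown in this section; the function name delta_to_sql and the INDEXER override variable are hypothetical, added so the pipeline can be tried with a stub.

```shell
# Sketch only: run a delta update, keep the refreshed normal mode CSV,
# and emit SQL on stdout.
# INDEXER is a hypothetical override for the command name.
INDEXER="${INDEXER:-sys-file-indexer}"

delta_to_sql() {
    old_csv="$1"   # normal mode CSV from the previous run
    new_csv="$2"   # where tee stores the refreshed normal mode CSV

    "$INDEXER" -delta "$old_csv" \
        | tee "$new_csv" \
        | "$INDEXER" -osql -
}

# Usage (mysql arguments elided as in the examples above):
#   delta_to_sql normal.csv new-normal.csv | mysql ...
```

The refreshed CSV is written to a separate file because tee truncates its output file on open, which would be unsafe if the indexer were still reading normal.csv. Once the pipeline has succeeded, new-normal.csv can replace normal.csv.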
PARTITIONING
sys-file-indexer can be run on multiple machines if that leads to an
increase in I/O throughput.
host1$ sys-file-indexer -w 1 -wg 3 ... > result1.csv
host2$ sys-file-indexer -w 2 -wg 3 ... > result2.csv
host3$ sys-file-indexer -w 3 -wg 3 ... > result3.csv
host1$ cat result1.csv result2.csv result3.csv > result.csv
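How the files are divided between workers is not described here. One common way to make such a split deterministic is to hash each path and take the result modulo the group size. The sketch below illustrates that idea only; owns_path, the use of cksum, and the reading of -w as a 1-based worker number and -wg as the group size are assumptions.

```shell
# Sketch only: decide whether worker W (1-based) out of WG workers
# is responsible for PATH. Not necessarily what sys-file-indexer does.
owns_path() {
    w="$1"      # worker number, 1..wg (assumed meaning of -w)
    wg="$2"     # number of workers in the group (assumed meaning of -wg)
    path="$3"   # file path to assign

    # Stable numeric hash of the path (POSIX cksum CRC).
    sum=$(printf '%s' "$path" | cksum | cut -d' ' -f1)

    # The path belongs to exactly one worker in 1..wg.
    [ $(( sum % wg + 1 )) -eq "$w" ]
}
```

With a scheme like this every path is claimed by exactly one worker, so concatenating the per-host results yields a complete index without duplicates.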
TODO
- Can scan multiple directories