filetrove

Version: v1.0.0-DEV-9
Published: Jan 3, 2024 License: AGPL-3.0 Imports: 32 Imported by: 0

README

FileTrove

STATUS: Development

VERSION: v1.0.0-DEV-8


FileTrove indexes files and creates metadata from them.

The single binary application walks a directory tree and identifies all regular files by type with siegfried, giving you the

  • MIME type
  • PRONOM identifier
  • Format version
  • Identification proof and note

os.Stat() gives you the

  • File size
  • File creation time
  • File modification time
  • File access time
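As a minimal sketch of the portable part of this step: Go's os.FileInfo exposes size and modification time directly, while creation and access times are platform-specific and are not shown here. The helper name statSummary is hypothetical and not part of FileTrove.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// statSummary is a hypothetical helper (not part of FileTrove).
// It returns the size and modification time reported by os.Stat.
func statSummary(path string) (size int64, modTime time.Time, err error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, time.Time{}, err
	}
	return info.Size(), info.ModTime(), nil
}

func main() {
	// The current directory is used only because it always exists.
	size, mod, err := statSummary(".")
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Printf("size=%d bytes, modified=%s\n", size, mod.Format(time.RFC3339))
}
```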

Furthermore it creates and calculates

  • UUIDv4s as unique identifiers (not stable across sessions)
  • hash sums (md5, sha1, sha256, sha512 and blake2b-512)
  • the entropy of each file (up to 1GB)

It also extracts some EXIF metadata, and you can add your own DublinCore metadata to scans.

FileTrove also checks if the file is in the NSRL (https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl).

This check needs a 3.2GB BoltDB, which FileTrove can download during installation.

You can also create your own database for the NSRL check. You just need a text file with SHA1 hashes, one per line, and the tool admftrove from this repository. With this tool you can also add your own hashes to an existing database.
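To illustrate the input format only, here is a sketch that validates a list of SHA1 hashes, one per line, and loads it into an in-memory set. This is not how admftrove works internally: the real tool writes a BoltDB, and the function names below (loadSHA1Set, parseSHA1Lines, known) and the lowercase normalization are assumptions made for the example.

```go
package main

import (
	"bufio"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// loadSHA1Set reads one hex-encoded SHA1 hash per line and returns them as a set.
// Blank lines are skipped; hashes are normalized to lowercase (an assumption).
func loadSHA1Set(r io.Reader) (map[string]struct{}, error) {
	set := make(map[string]struct{})
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.ToLower(strings.TrimSpace(sc.Text()))
		if line == "" {
			continue
		}
		// A SHA1 digest is 20 bytes, i.e. 40 hex characters.
		raw, err := hex.DecodeString(line)
		if err != nil || len(raw) != 20 {
			return nil, fmt.Errorf("invalid SHA1 line %q", line)
		}
		set[line] = struct{}{}
	}
	return set, sc.Err()
}

// parseSHA1Lines is a convenience wrapper for in-memory text.
func parseSHA1Lines(text string) (map[string]struct{}, error) {
	return loadSHA1Set(strings.NewReader(text))
}

// known reports whether a hex-encoded SHA1 hash is in the set.
func known(set map[string]struct{}, sha1hex string) bool {
	_, ok := set[strings.ToLower(sha1hex)]
	return ok
}

func main() {
	// SHA1 of the empty input and of "abc", used as sample data.
	list := "da39a3ee5e6b4b0d3255bfef95601890afd80709\n" +
		"a9993e364706816aba3e25717850c26c9cd0d89d\n"
	set, err := parseSHA1Lines(list)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(len(set), known(set, "DA39A3EE5E6B4B0D3255BFEF95601890AFD80709"))
}
```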

All results are written into a SQLite database and can be exported to TSV files.

How to install

  1. Download a release from https://github.com/steffenfritz/FileTrove/releases or compile from source.

  2. Copy the file where you want to install ftrove (the downloaded file has a suffix, omitted in the following documentation)

  3. Run ./ftrove --install . (Mind the period)

    a) If you don't already have an NSRL database, you have to download it. Please be patient.

    b) If you have an NSRL database, copy or move it to the "db" directory that ftrove just created.

  4. You are ready to go!

How to run

./ftrove -h gives you all flags ftrove understands.

A run only with necessary flags looks like this:

./ftrove -i $DIRECTORY

where $DIRECTORY is the directory you want to use as a starting point. FileTrove will walk this directory recursively.

How to see the results

You can export the results via ./ftrove -t $UUID where $UUID is the session id. Every indexing run gets its own session id. You get a list of all sessions using ./ftrove -l.

Example:

  1. ./ftrove -l
  2. ./ftrove -t 926be141-ab75-4106-8236-34edfcf102f2

This will create several TSV files that can be read with Excel, Numbers or your preferred text editor.
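The exported files can also be read programmatically. The sketch below parses tab-separated text with Go's encoding/csv package. The column names and the sample row are invented for illustration; the actual export layout is not documented in this section.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readTSV parses tab-separated text into rows of fields.
func readTSV(text string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(text))
	r.Comma = '\t'      // fields are separated by tabs, not commas
	r.LazyQuotes = true // tolerate stray quote characters in values
	return r.ReadAll()
}

func main() {
	// Hypothetical sample data; real exports will have different columns.
	sample := "filename\tfilesize\n/tmp/a.txt\t4\n"
	rows, err := readTSV(sample)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, row := range rows[1:] { // skip the header row
		fmt.Println(row[0], row[1])
	}
}
```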

You can also work with SQL on the database, using sqlite on the console or a GUI like sqlitebrowser (https://sqlitebrowser.org/). Sqliteviz is also a neat tool to visualize the data (https://sqliteviz.com/app/#/).

Background

FileTrove is the successor of filedriller and is based on my iPres 2021 paper "Marrying siegfried and the National Software Reference Library".

Documentation

Index

Constants

const (
	// MaxFileSize is the max size file that should be processed. This defaults to 1 GB.
	MaxFileSize = 1073741824
	// MaxEntropyChunk is the max byte size of a chunk read
	MaxEntropyChunk = 256000
)

Variables

This section is empty.

Functions

func CheckInstall

func CheckInstall(version string) error

CheckInstall checks if all necessary files are available

func CheckVersion

func CheckVersion(db *sql.DB, version string) (bool, string, error)

CheckVersion checks if the filetrove version is compatible with the database. Compatible means same version for now. This could change in the future.

func ChecksumNSRL

func ChecksumNSRL(nsrldbfile string)

ChecksumNSRL checks a NSRL BoltDB's checksum that is provided with a sidecar file

func ConnectFileTroveDB

func ConnectFileTroveDB(dbpath string) (*sql.DB, error)

ConnectFileTroveDB creates a connection to an existing sqlite database.

func ConnectNSRL

func ConnectNSRL(nsrldbfile string) (*bbolt.DB, error)

ConnectNSRL connects to local bbolt NSRL file

func CreateFileList

func CreateFileList(rootDir string) ([]string, []string, error)

CreateFileList creates a list of file paths and a directory listing
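One way such a listing could be built with the standard library is filepath.WalkDir. The sketch below is not the package's implementation; listTree and buildDemoTree are hypothetical names, and including the root itself in the directory list is an assumption.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listTree walks root and returns the paths of regular files and of directories.
// The root directory itself is included in dirs.
func listTree(root string) (files, dirs []string, err error) {
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr // abort on the first unreadable entry
		}
		switch {
		case d.IsDir():
			dirs = append(dirs, path)
		case d.Type().IsRegular():
			files = append(files, path)
		}
		return nil
	})
	return files, dirs, err
}

// buildDemoTree creates root/a.txt and root/sub/b.txt in a temporary directory.
func buildDemoTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "ftrove-walk-*")
	if err != nil {
		return "", func() {}, err
	}
	cleanup = func() { os.RemoveAll(root) }
	if err = os.Mkdir(filepath.Join(root, "sub"), 0o755); err != nil {
		return root, cleanup, err
	}
	for _, name := range []string{"a.txt", filepath.Join("sub", "b.txt")} {
		if err = os.WriteFile(filepath.Join(root, name), []byte("demo"), 0o644); err != nil {
			return root, cleanup, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := buildDemoTree()
	defer cleanup()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	files, dirs, err := listTree(root)
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	fmt.Println(len(files), "files,", len(dirs), "directories")
}
```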

func CreateFileTroveDB

func CreateFileTroveDB(dbpath string, version string, initdate string) error

CreateFileTroveDB creates a new, empty sqlite database for FileTrove. It contains information like configurations, sessions and db versions.

func CreateNSRLBoltDB

func CreateNSRLBoltDB(nsrlsourcefile string, nsrldbfile string) error

func CreateUUID

func CreateUUID() (string, error)

CreateUUID returns a UUID v4 as a string
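For readers unfamiliar with the format, a version 4 UUID is 16 random bytes with the version and variant bits set. The sketch below builds one with crypto/rand only. Which library the package itself uses is not shown in this section, so newUUIDv4 is a hypothetical stand-in.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 returns a random (version 4) UUID in canonical text form.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version nibble to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		fmt.Println("uuid failed:", err)
		return
	}
	fmt.Println(id)
}
```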

func Entropy

func Entropy(path string) (entropy float64, err error)

Entropy calculates the entropy of a file up to a hard-coded file size.
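A sketch of a Shannon entropy calculation over byte frequencies, read in chunks, may help here. It assumes the result is in bits per byte (0 to 8) and reuses the documented chunk size; how the package treats files above MaxFileSize is not stated, so the size limit is left out. The names entropyOf and entropyOfBytes are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"math"
)

// maxEntropyChunk mirrors the documented MaxEntropyChunk constant.
const maxEntropyChunk = 256000

// entropyOf returns the Shannon entropy of the stream in bits per byte.
func entropyOf(r io.Reader) (float64, error) {
	var counts [256]uint64
	var total uint64
	buf := make([]byte, maxEntropyChunk)
	for {
		n, err := r.Read(buf)
		// Count whatever was read before looking at the error.
		for _, b := range buf[:n] {
			counts[b]++
		}
		total += uint64(n)
		if err == io.EOF {
			break
		}
		if err != nil {
			return 0, err
		}
	}
	if total == 0 {
		return 0, nil // empty input: define the entropy as 0
	}
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h, nil
}

// entropyOfBytes is a convenience wrapper for in-memory data.
func entropyOfBytes(data []byte) (float64, error) {
	return entropyOf(bytes.NewReader(data))
}

func main() {
	h, err := entropyOfBytes([]byte("hello, filetrove"))
	if err != nil {
		fmt.Println("entropy failed:", err)
		return
	}
	fmt.Printf("%.4f bits per byte\n", h)
}
```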

func ExportSessionDCTSV

func ExportSessionDCTSV(sessionuuid string) error

ExportSessionDCTSV exports all DublinCore metadata from a session to a TSV file. Filtering is done by session UUID.

func ExportSessionDirectoriesTSV

func ExportSessionDirectoriesTSV(sessionuuid string) error

ExportSessionDirectoriesTSV exports all directory metadata from a session to a TSV file. Filtering is done by session UUID.

func ExportSessionEXIFTSV

func ExportSessionEXIFTSV(sessionuuid string) error

ExportSessionEXIFTSV exports all exif metadata from a session to a TSV file. Filtering is done by session UUID.

func ExportSessionFilesTSV

func ExportSessionFilesTSV(sessionuuid string) error

ExportSessionFilesTSV exports all file metadata from a session to a TSV file. Filtering is done by session UUID.

func ExportSessionSessionTSV

func ExportSessionSessionTSV(sessionuuid string) ([]string, error)

ExportSessionSessionTSV exports all session metadata from a session to a TSV file. Filtering is done by session UUID.

func GetImageFiles

func GetImageFiles(db *sql.DB, sessionuuid string) (map[string]string, error)

GetImageFiles queries all files that have mime type image from a session

func GetNSRL

func GetNSRL() error

GetNSRL downloads a prepared BoltDB database file from an online storage

func GetSiegfriedDB

func GetSiegfriedDB() error

GetSiegfriedDB downloads the signature db

func GetValueNSRL

func GetValueNSRL(db *bbolt.DB, sha1hash []byte) (bool, error)

GetValueNSRL reads bbolt database and checks if a given sha1 hash is present in the database

func Hashit

func Hashit(inFile string, hashalg string) ([]byte, error)

Hashit hashes a file using the provided hash algorithm
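Selecting a hash by name can be sketched with the standard library's hash.Hash interface. The algorithm name strings below are assumptions based on the README's list, and blake2b-512 is omitted because it lives outside the standard library (golang.org/x/crypto/blake2b). hashReader and hexDigest are hypothetical names, not the package's API.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"hash"
	"io"
	"strings"
)

// hashReader streams r through the named algorithm and returns the raw digest.
func hashReader(r io.Reader, alg string) ([]byte, error) {
	var h hash.Hash
	switch alg {
	case "md5":
		h = md5.New()
	case "sha1":
		h = sha1.New()
	case "sha256":
		h = sha256.New()
	case "sha512":
		h = sha512.New()
	default:
		return nil, fmt.Errorf("unsupported hash algorithm %q", alg)
	}
	// io.Copy avoids loading the whole input into memory.
	if _, err := io.Copy(h, r); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// hexDigest hashes an in-memory string and returns the hex-encoded digest.
func hexDigest(data, alg string) (string, error) {
	sum, err := hashReader(strings.NewReader(data), alg)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(sum), nil
}

func main() {
	for _, alg := range []string{"md5", "sha1", "sha256", "sha512"} {
		digest, err := hexDigest("abc", alg)
		if err != nil {
			fmt.Println("hash failed:", err)
			return
		}
		fmt.Println(alg, digest)
	}
}
```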

func InsertDC

func InsertDC(db *sql.DB, sessionuuid string, dcuuid string, dc DublinCore) error

InsertDC adds DublinCore metadata to the database

func InsertExif

func InsertExif(db *sql.DB, exifuuid string, sessionid string, fileuuid string, e ExifParsed) error

InsertExif inserts exif metadata into the FileTrove database

func InsertSession

func InsertSession(db *sql.DB, s SessionMD) error

InsertSession adds session metadata to the database

func InstallFT

func InstallFT(installPath string, version string, initdate string) (error, error, error, error, error)

InstallFT creates and downloads necessary directories and databases and copies them to installPath

func ListSessions

func ListSessions(db *sql.DB) error

ListSessions lists all sessions from the FileTrove database

func PrepInsertDir

func PrepInsertDir(db *sql.DB) (*sql.Stmt, error)

PrepInsertDir prepares a statement for the addition of a single directory

func PrepInsertFile

func PrepInsertFile(db *sql.DB) (*sql.Stmt, error)

PrepInsertFile prepares a statement for the addition of a single file

func PrintBanner

func PrintBanner()

PrintBanner prints a pre-generated ascii banner with the program name

func PrintLicense

func PrintLicense(version string, build string)

PrintLicense prints a short license text

func ReturnSupportedHashes

func ReturnSupportedHashes() [5]string

ReturnSupportedHashes returns a list of supported hashes

Types

type DublinCore

type DublinCore struct {
	Title       string `json:"title"`
	Creator     string `json:"creator"`
	Contributor string `json:"contributor"`
	Publisher   string `json:"publisher"`
	Subject     string `json:"subject"`
	Description string `json:"description"`
	Date        string `json:"date"`
	Language    string `json:"language"`
	Type        string `json:"type"`
	Format      string `json:"format"`
	Identifier  string `json:"identifier"`
	Source      string `json:"source"`
	Relation    string `json:"relation"`
	Rights      string `json:"rights"`
	Coverage    string `json:"coverage"`
}

DublinCore is a struct that holds 15 core elements of DC https://datatracker.ietf.org/doc/html/rfc5013

func ReadDC

func ReadDC(dcjson string) (DublinCore, error)

ReadDC reads a json file and unmarshals it into the DublinCore struct
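The JSON keys follow the struct tags shown above. As a sketch of the unmarshalling step only, the example below copies the DublinCore struct from this page and decodes an in-memory document. The helper name parseDC and the sample values are made up for illustration; ReadDC itself takes a file path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DublinCore is copied from the type definition documented on this page.
type DublinCore struct {
	Title       string `json:"title"`
	Creator     string `json:"creator"`
	Contributor string `json:"contributor"`
	Publisher   string `json:"publisher"`
	Subject     string `json:"subject"`
	Description string `json:"description"`
	Date        string `json:"date"`
	Language    string `json:"language"`
	Type        string `json:"type"`
	Format      string `json:"format"`
	Identifier  string `json:"identifier"`
	Source      string `json:"source"`
	Relation    string `json:"relation"`
	Rights      string `json:"rights"`
	Coverage    string `json:"coverage"`
}

// parseDC is a hypothetical helper that decodes JSON bytes into DublinCore.
// Keys missing from the input leave the matching fields empty.
func parseDC(data []byte) (DublinCore, error) {
	var dc DublinCore
	err := json.Unmarshal(data, &dc)
	return dc, err
}

func main() {
	// Sample values are invented for this example.
	doc := []byte(`{"title": "Example scan", "creator": "Jane Doe", "date": "2024-01-03"}`)
	dc, err := parseDC(doc)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(dc.Title, dc.Creator, dc.Date)
}
```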

type ExifParsed

type ExifParsed struct {
	ExifVersion  string
	DateTime     string
	DateTimeOrig string
	Artist       string
	Copyright    string
	Make         string
	Software     string
	XPTitle      string
	XPComment    string
	XPAuthor     string
	XPKeywords   string
	XPSubject    string
}

func ExifDecode

func ExifDecode(fileName string) (ExifParsed, error)

type FileMD

type FileMD struct {
	Filename            string
	Filesize            int64
	Filemd5             string
	Filesha1            string
	Filesha256          string
	Filesha512          string
	Fileblake2b         string
	Filesffmt           string
	Filesfmime          string
	Filesfformatname    string
	Filesfformatversion string
	Filesfidentnote     string
	Filesfidentproof    string
	Filesfregistry      string
	Filectime           string
	Filemtime           string
	Fileatime           string
	Filensrl            string
	Fileentropy         float64
}

FileMD holds the metadata for each inspected file and that is written to the table files

type FileTime

type FileTime struct {
	Atime time.Time
	Btime time.Time
	Ctime time.Time
	Mtime time.Time
}

FileTime holds all metadata times of a file

func GetFileTimes

func GetFileTimes(filename string) (FileTime, error)

GetFileTimes returns a type that holds the access, change and birth time of a file if available.

type HashSumsFile

type HashSumsFile struct {
	MD5        []byte
	SHA1       []byte
	SHA256     []byte
	SHA512     []byte
	BLAKE2B512 []byte
}

HashSumsFile contains all hashes for a single file

type ResumeInfo

type ResumeInfo struct {
	Rowid          int
	LastFile       string
	Mountpoint     string
	ProcessedFiles int
	NSRLFiles      int
}

ResumeInfo holds information from the database needed for resuming a session

func ResumeLatestEntry

func ResumeLatestEntry(db *sql.DB, sessionuuid string) (ResumeInfo, error)

ResumeLatestEntry gets the rowid and filepath of the latest entry of a session.

type SessionMD

type SessionMD struct {
	UUID           string
	Starttime      string
	Endtime        string
	Project        string
	Archivistname  string
	Mountpoint     string
	ExifFlag       string
	Dublincoreflag string
}

SessionMD holds the metadata written to table sessionsmd

type SiegfriedType

type SiegfriedType struct {
	FileName            string
	SizeInByte          int64
	Registry            string
	FMT                 string
	FormatName          string
	FormatVersion       string
	MIMEType            string
	IdentificationNote  string
	IdentificationProof string
	SiegOutput          string
}

SiegfriedType is a struct for all the strings siegfried returns

func SiegfriedIdent

func SiegfriedIdent(s *siegfried.Siegfried, inFile string) (SiegfriedType, error)

SiegfriedIdent gets PRONOM metadata and the size of a single file

Directories

Path Synopsis
cmd
