README ¶
timefind_indexer ================ `timefind_indexer' reads in a configuration file describing a source and outputs an index in CSV format containing a list of filenames, timestamp of the earliest record, timestamp of the latest record, and the time that the file was last modified. Using `timefind' in conjunction with these indexes, a user can downselect the number of files based on a time range. Dependencies and Building ========================= 1. Build configuration file. (e.g., SOURCENAME.conf.json) See [Single Source Configuration File]. 2. Run timefind_indexer. ./timefind_indexer -h ``` Usage: timefind_indexer [-huv] [-c PATH] -c, --config=PATH Path to configuration file (can be used multiple times) -h, --help Show this help message and exit -u, --unixtime write Unix time to indexes instead of RFC 3339 -v, --verbose Verbose progress indicators and messages ``` After building your configuration file, you can run the timefind_indexer: ./timefind_indexer -c SOURCENAME.conf.json Single Source Configuration File ================================ Each distinct data source requires its own configuration file. The name of the configuration file (or source) will be the name of the index: source name => source configuration filename => index filename dns => dns.conf.json => dns.csv Note that the configuration filename MUST end in ".conf.json". Some example valid configuration filenames: dns.conf.json great_pcap.conf.json http_traffic.conf.json A basic source file for DNS data (named "dns.conf.json") might look like this: { "indexDir": "/index/pcap", "type": "pcap", "paths": ["/data/pcap"], "include": ["*.gz"], "exclude": [] } The index directory ("indexDir") is where the indexed data will be stored. After the timefind_indexer has finished running, the indexes can be found in the in a .csv file located in "indexDir". The index filename is the same as the source name. Recursive directory support: When the timefind_indexer is run, the directory structure that is traversed when indexing "paths" is created in the "indexDir." One index file is created per directory. For directories containing subdirectories with an index file in them, the index file in the subdirectory is indexed and written to the directory currently being indexed. For example, if we use the above configuration file ("dns.conf.json"), and the directory structure is as follows: /data/pcap/ /data/pcap/a/ /data/pcap/a/b/ /data/pcap/c/ The indexes will be generated in the following format: /index/pcap/example.csv /index/pcap/a/example.csv /index/pcap/a/b/example.csv /index/pcap/c/example.csv See [Index Format] for more details. Each source config file has the components "type", "paths", "include", and "exclude". "type" depends on the file format and which dates you wish to record from each file. See [Data Types and Processors] for the types of data that the timefind_indexer supports. If you don't see your data type listed, you will probably have to write a processor for it. "paths" is one or more filepaths containing files that you wish to index. "include" is a file pattern that specifies which files you wish to index. "exclude" is a file pattern specifies which files you do not want indexed. Index Format ============ Indexes are in CSV format: filename,begin_timestamp,end_timestamp,last_modified_time Timestamps are in Unix timestamp format with nanosecond precision. Recursive directory support: An index can contain entries that are files (absolute path) or directories (relative path). An index entry that is a directory is a pointer to the existence of an index within that directory and the time range it covers. A sample index file "pcap.csv": 2010-01-01, 100, 199, 9999 2010-01-02, 200, 299, 9999 /data/pcap/example.gz, 300, 302, 9999 Since /data/pcap/example.gz is an absolute path, that entry references a particular file. "2010-01-01" and "2010-01-02" are directories, which means we need to look into "2010-01-01/pcap.csv" and "2010-01-02/pcap.csv" to potentially pull out files in a given time range. The index file "2010-01-01/pcap.csv" might look something like: /data/pcap/2010-01-01/ab1.gz, 100, 105, 9999 /data/pcap/2010-01-01/cd2.gz, 103, 107, 9999 /data/pcap/2010-01-01/ef3.gz, 180, 199, 9999 another_directory, 150, 180, 9999 Again, after searching this index, if an index entry that matches our desired time range is a directory (denoted by a relative path), we traverse to that directory's index and recursively process until we find the matching file entries, if any. Data Types and Processors ========================= The timefind_indexer reads data files and indexes the earliest and latest time found in each file. It has the ability to index data classified under the following categories: 1. "cpp": Unix timestamp is the first number listed on each line. Stores timestamp as a string and parses it to time. 2. "bomgar": Searches for the expression "when='Unix timestamp'" on each line. Stores timestamp as a string and parses it to time. 3. "bluecoat": Searches for a date of the format "YYYY-MM-DD HH:MM:SS" on each line. Stores date as a string and parses it to time. 4. "codevision": Searches for the expression "timestamp=YYYY-MM-DDTHH:MM:SS-ZZ:ZZ" on each line. Stores date listed inside the expressison as a string and parses it to time. 5. "cer": Searches for the expression "receieved='YYYY-MM-DD HH:MM:SS.SSSSSS-ZZ:ZZ'" on each line. Stores date listed inside the expression as a string and parses it to time. 6. "sep": Searches for the expression "Event Time: YYYY-MM-DD HH:MM:SS" on each line. Stores date listed inside the expression as a string and parses it to time. If the expression is not found, timefind_indexer searches for the expression "Begin: YYYY-MM-DD HH:MM:SS" on each line. The date listed inside the expression is stored as a string and is parsed to a time. If the expression is not found, timefind_indexer uses the time listed at the beginning of each line. This time is either of the format "Jan 2 2006 15:04:05" or the format "Jan 2 15:04:05" 7. "juniper": Searches for a date of the format "YYYY-MM-DD HH:MM:SS" on each line. Stores date as a string and parses it to time. If a date of this format is not found, timefind_indexer uses the time listed at the beginning of each line. This time is either of the format "Jan 2 2006 15:04:05" or the format "Jan 2 15:04:05" 8. "email": Searches for the expression "[DATETIME]YYYY.MM.DD HH:MM:SS.SSSSSSS" on each line. Stores date listed inside the expression as a string and parses it to time. 9. "text": Stores the time listed at the beginning of each line as a string and parses it to time. This time is either of the format "Jan 2 2006 15:04:05 or the format "Jan 2 15:04:05" 10. "snare": Searches for a date of the format "Mon Jan 02 15:04:05 2006" on each line. Stores date as a string and parses it to time. If a date of this format is not found, the time listed at the beginning of each line is used. This time is of the format "YYYY-MM-DDTHH:MM:SS-ZZZZ" 11. "iod": Searches for a date of the format "YYYY-MM-DDTHH:MM:SS-ZZZZ" on each line. Stores date as a string and parses it to time. 12. "win_messages": Searches for a date of the format "Mon Jan 2 15:04:05 2006" on each line. If a date of this format is not found, timefind_indexer searches for a date of the format "YYYY-MM-DDTHH:MM:SS-ZZ:ZZ" on each line. Stores date as a string and parses it to time. 13. "wireless": Searches for the expression "Time=YYYY-MM-DDTHH:MM:SS" on each line. Stores date listed inside the expression as a string and parses it to time. If the expression is not found, the date listed at the beginning of each line is used. This time is either of the format "Jan 2 15:04:05 2006" or the format "Jan 2 15:04:05" 14. "stealthwatch": Searches for a date of the format "YYYY-MM-DDTHH:MM:SS" on each line. Stores the date listed inside the expression as a string and parses it to time. 15. "pcap": Retrieves time found in pcap file type 16. "fsdb_time_col_1": Retrieves time found in the *first* column of an fsdb-formatted, tab-delimited file. At the moment, this timefind_indexer does not read the fsdb header; it simply ignores it (along with any comments). If you're getting errors with reading timestamps, check to make sure the file is tab-delimited. 16. "fsdb_time_col_2": Retrieves time found in the *second* column of an fsdb-formatted, tab-delimited file. See "fsdb_time_col_1" for additional details.
Documentation ¶
There is no documentation for this package.
Click to show internal directories.
Click to hide internal directories.