datalab

module
v1.0.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 22, 2019 License: MIT

README

Datalab

Datalab Logo

Datalab is a collection of UNIX-y tools designed for working with CSV data. Read the individual READMEs to find out what they each do, but they are all designed along the following princples:

  • Speed. We want to be able to deal with gigabytes of data quickly.
  • Constant memory usage. The RAM used by the suite should not grow unbounded. We're happy to pay a CPU penalty for this, but we want to be able to run on commodity hardware.
  • UNIX-y. Take data on STDIN and emit data on STDOUT where possible.
  • Parallel. Where possible given the constraints of the operation being performed, use all available cores on the given machine.

You can use these tools individually or stitch them together into an ETL pipeline with some simple bash scripting. eg:

#!/bin/bash

set -e

export DATA_PATH=./big_data

echo "foo,bar,lalala" > $DATA_PATH/colcat_targets.csv

# Union related datasets
marx --quiet --input $DATA_PATH/masterfiles \

  # Add missing values to START_DATE_TIME, END_DATE_TIME, SHIFTEND_DATETIME, SHIFTSTART_DATETIME - 01/JAN/1900 00:00:00
  | gumption --quiet --add-missing "01/JAN/1900 00:00:00" --columns START_DATE_TIME,END_DATE_TIME,SHIFTEND_DATETIME,SHIFTSTART_DATETIME \

  # Create WorkDayID from ACTUAL_DATE
  | gumption --quiet --split " " --columns ACTUAL_DATE \

  | gumption --quiet --rename WorkDayID --columns ACTUAL_DATE \

  # Inner Join in a payperiod ID to based on the workday date
  | stanley --quiet --left $DATA_PATH/pp_id_mapping.csv --join-key WorkDayID \

  # Bring ID fields to the beginning of the file
  | trogdor --quiet --columns "EMPLOYEE_NUMBER,PP ID" \

  # Concatenate PayPeriod_End_Date with PayPeriodID on '/'
  | colcat --quiet --target_file $DATA_PATH/colcat_targets.csv \

  # Replace '-' in COSTCENTER_NUMBER and COSTCENTER_NAME with "Not Provided"
  | gumption --quiet --replace-cell "-,Not Provided" --columns COSTCENTER_NUMBER,COSTCENTER_NAME \

  > $DATA_PATH/cleaned_masterfile.csv

Directories

Path Synopsis
src

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL