datalab

module

v1.0.2 Latest Latest Go to latest Published: Oct 22, 2019 License: MIT

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/paidright/datalab

Links

Open Source Insights

README ¶

Datalab

Datalab Logo

Datalab is a collection of UNIX-y tools designed for working with CSV data. Read the individual READMEs to find out what they each do, but they are all designed along the following princples:

Speed. We want to be able to deal with gigabytes of data quickly.
Constant memory usage. The RAM used by the suite should not grow unbounded. We're happy to pay a CPU penalty for this, but we want to be able to run on commodity hardware.
UNIX-y. Take data on STDIN and emit data on STDOUT where possible.
Parallel. Where possible given the constraints of the operation being performed, use all available cores on the given machine.

You can use these tools individually or stitch them together into an ETL pipeline with some simple bash scripting. eg:

#!/bin/bash

set -e

export DATA_PATH=./big_data

echo "foo,bar,lalala" > $DATA_PATH/colcat_targets.csv

# Union related datasets
marx --quiet --input $DATA_PATH/masterfiles \

  # Add missing values to START_DATE_TIME, END_DATE_TIME, SHIFTEND_DATETIME, SHIFTSTART_DATETIME - 01/JAN/1900 00:00:00
  | gumption --quiet --add-missing "01/JAN/1900 00:00:00" --columns START_DATE_TIME,END_DATE_TIME,SHIFTEND_DATETIME,SHIFTSTART_DATETIME \

  # Create WorkDayID from ACTUAL_DATE
  | gumption --quiet --split " " --columns ACTUAL_DATE \

  | gumption --quiet --rename WorkDayID --columns ACTUAL_DATE \

  # Inner Join in a payperiod ID to based on the workday date
  | stanley --quiet --left $DATA_PATH/pp_id_mapping.csv --join-key WorkDayID \

  # Bring ID fields to the beginning of the file
  | trogdor --quiet --columns "EMPLOYEE_NUMBER,PP ID" \

  # Concatenate PayPeriod_End_Date with PayPeriodID on '/'
  | colcat --quiet --target_file $DATA_PATH/colcat_targets.csv \

  # Replace '-' in COSTCENTER_NUMBER and COSTCENTER_NAME with "Not Provided"
  | gumption --quiet --replace-cell "-,Not Provided" --columns COSTCENTER_NUMBER,COSTCENTER_NAME \

  > $DATA_PATH/cleaned_masterfile.csv

Directories ¶

Path	Synopsis
src
colcat
gumption
marx
shoppadocket
stanley
trogdor
util

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL