Forklift

A simple utility that takes stdin and redirects it to templated paths on S3.

Installing

go install github.com/dacort/forklift/cmd/forklift@latest

Then pipe the sample file to a bucket!

curl -o - \
    "https://raw.githubusercontent.com/dacort/forklift/main/sample_data.json" \
    | forklift -w 's3://forklift-demo/{{json "event_type"}}/{{today}}.json'

Overview

Usage is pretty simple - pipe some content to forklift and it will upload it to the desired S3 bucket and path.

echo "Hello Damon" | forklift -w s3://bucket/some/file.txt

While that in itself isn't too exciting (you could just use aws s3 cp - !), where it gets interesting is when you want to pipe JSON data and have it uploaded to a dynamic location based on the content of the data itself. For example, imagine a JSON file with the following content:

sample_data.json:
{"event_type": "click", "data": {"uid": 1234, "path": "/signup"}}
{"event_type": "login", "data": {"uid": 1234, "referer": "yak.shave"}}

And imagine we want to pipe this to S3, but split it by event_type. Well, forklift can do that for us!

cat sample_data.json | forklift -w 's3://bucket/{{json "event_type"}}/{{today}}.json'

That will upload two different files:

  1. s3://bucket/click/2021-02-18.json
  2. s3://bucket/login/2021-02-18.json
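
How could that splitting work? Roughly speaking, each input line gets its own destination path rendered from the template. The sketch below uses Go's text/template with two stand-in helpers named after the functions above (json and today). It's illustrative only; note that it passes the current line explicitly as ., which the real forklift template doesn't require.

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
    "text/template"
    "time"
)

func main() {
    // Stand-in helpers mirroring the {{json ...}} and {{today}} functions above.
    funcs := template.FuncMap{
        "json": func(field, line string) (string, error) {
            var record map[string]interface{}
            if err := json.Unmarshal([]byte(line), &record); err != nil {
                return "", err
            }
            return fmt.Sprintf("%v", record[field]), nil
        },
        "today": func() string { return time.Now().Format("2006-01-02") },
    }

    // Unlike the real flag, this sketch passes the current line in as "." explicitly.
    tmpl := template.Must(template.New("path").Funcs(funcs).
        Parse(`s3://bucket/{{json "event_type" .}}/{{today}}.json`))

    // Render one destination path per input line; grouping lines by rendered
    // path and uploading each group as its own object gives the split above.
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        if err := tmpl.Execute(os.Stdout, scanner.Text()); err != nil {
            panic(err)
        }
        fmt.Println()
    }
}
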
Default behavior

Note that the default behavior of forklift is to simply echo whatever is passed to it to stdout. This is partially because I build forklift into another project, as noted in the section below.
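
In other words, without -w it behaves like a plain pass-through, conceptually nothing more than:

package main

import (
    "io"
    "os"
)

func main() {
    // Default (no -w): copy stdin straight to stdout.
    if _, err := io.Copy(os.Stdout, os.Stdin); err != nil {
        panic(err)
    }
}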

Advanced Usage

Again, while not terribly interesting as a standalone CLI, where this becomes particularly useful is with cargo-crates. This is a sample project that makes it easy to capital-E Extract data from third-party services without having to be a data engineering wizard.

For example, I've got an Oura ring and want to extract my sleep data. With the Oura Crate, I can simply do:

docker run -e OURA_PAT ghcr.io/dacort/crates-oura sleep

And that'll return a JSON blob with my sleep data for the past 7 days. But let's say I want to drop that sleep data into a location on S3 based on when I went to bed:

docker run -e OURA_PAT ghcr.io/dacort/crates-oura sleep | forklift -w 's3://bucket/{{json "bedtime_start" | ymdFromTimestamp }}/sleep_data.json'
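
The ymdFromTimestamp helper presumably just turns a timestamp field into a yyyy-mm-dd path segment. A hypothetical version (the real one may differ) could be as small as:

package main

import (
    "fmt"
    "time"
)

// ymdFromTimestamp converts an RFC 3339 timestamp, such as Oura's
// bedtime_start, into a yyyy-mm-dd path segment. Hypothetical sketch.
func ymdFromTimestamp(ts string) (string, error) {
    t, err := time.Parse(time.RFC3339, ts)
    if err != nil {
        return "", err
    }
    return t.Format("2006-01-02"), nil
}

func main() {
    segment, err := ymdFromTimestamp("2024-04-10T23:41:00-07:00")
    if err != nil {
        panic(err)
    }
    fmt.Println(segment) // 2024-04-10
}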

Cool. Now imagine I want to drop a single Docker container into an ETL workflow that does both of these for me. Well, forklift is integrated into Cargo Crates.

docker run \
    -e OURA_PAT \
    -e FORKLIFT_URI='s3://bucket/{{json "bedtime_start" | ymdFromTimestamp }}/sleep_data.json' \
    ghcr.io/dacort/crates-oura sleep

That will automatically take any stdout of the Docker container and pipe it to that location!
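
One way a wrapper could wire that up, purely as a sketch (the made-up extract-sleep command stands in for the crate's own extractor, and the actual Cargo Crates integration may work differently): check FORKLIFT_URI and, if set, connect the extractor's stdout to forklift's stdin.

package main

import (
    "os"
    "os/exec"
)

func main() {
    // "extract-sleep" is a stand-in for whatever produces the crate's JSON.
    crate := exec.Command("extract-sleep")
    crate.Stderr = os.Stderr

    uri := os.Getenv("FORKLIFT_URI")
    if uri == "" {
        // No destination configured: just print to stdout as usual.
        crate.Stdout = os.Stdout
        if err := crate.Run(); err != nil {
            panic(err)
        }
        return
    }

    // Destination configured: pipe the extractor's stdout into forklift.
    forklift := exec.Command("forklift", "-w", uri)
    forklift.Stdout = os.Stdout
    forklift.Stderr = os.Stderr

    pipe, err := crate.StdoutPipe()
    if err != nil {
        panic(err)
    }
    forklift.Stdin = pipe

    if err := crate.Start(); err != nil {
        panic(err)
    }
    if err := forklift.Start(); err != nil {
        panic(err)
    }
    if err := crate.Wait(); err != nil {
        panic(err)
    }
    if err := forklift.Wait(); err != nil {
        panic(err)
    }
}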

Why?

This seems like a lot of work to just ... upload a file. Well, a few reasons.

  1. I started playing around with the idea of Docker containers that could very simply extract data from an API, leaving the consumer with nothing to worry about except having Docker and the proper authentication tokens.
  2. Then I wanted to upload the data to S3. But I wanted the Docker containers to remain as lightweight as possible.
  3. It's just a fun experiment. 🤷
