Forklift
A simple utility that can take stdin and redirect it to templatized paths on S3.
Installing
go install github.com/dacort/forklift/cmd/forklift@latest
Then pipe the sample file to a bucket!
curl -o - \
"https://raw.githubusercontent.com/dacort/forklift/main/sample_data.json" \
| forklift -w 's3://forklift-demo/{{json "event_type"}}/{{today}}.json'
Overview
Usage is pretty simple - pipe some content to forklift
and it will upload it to the desired S3 bucket and path.
echo "Hello Damon" | forklift -w s3://bucket/some/file.txt
While that in itself isn't too exciting (you could just use aws s3 cp -!), where it gets interesting is when you want to pipe JSON data and have it uploaded to a dynamic location based on the content of the data itself. For example, imagine a JSON file with the following content:
{"event_type": "click", "data": {"uid": 1234, "path": "/signup"}}
{"event_type": "login", "data": {"uid": 1234, "referer": "yak.shave"}}
And imagine we want to pipe this to S3, but split it by event_type. Well, forklift can do that for us!
cat sample_data.json | forklift -w 's3://bucket/{{json "event_type"}}/{{today}}.json'
That will upload two different files:
s3://bucket/click/2021-02-18.json
s3://bucket/login/2021-02-18.json
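To make the template concrete, here's roughly what that one-liner does, sketched in plain shell with jq. This is illustrative only — forklift does the split itself in one pass, and the upload step is shown as an echo rather than a real aws call:

```shell
# Recreate the sample file from above.
cat > sample_data.json <<'EOF'
{"event_type": "click", "data": {"uid": 1234, "path": "/signup"}}
{"event_type": "login", "data": {"uid": 1234, "referer": "yak.shave"}}
EOF

today=$(date +%F)
for type in $(jq -r '.event_type' sample_data.json | sort -u); do
  # Gather the lines matching this event_type into one file...
  jq -c --arg t "$type" 'select(.event_type == $t)' sample_data.json > "$type.json"
  # ...which forklift would then upload to the templated key, e.g.:
  echo "would upload $type.json to s3://bucket/$type/$today.json"
done
```

Each distinct event_type value becomes its own file, keyed by today's date — exactly the two objects listed above.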
Default behavior
Note that the default behavior of forklift is to simply echo whatever is passed to it to stdout. This is partially because I build forklift into another project, as noted in the section below.
Advanced Usage
Again, while not terribly interesting as a standalone CLI, where this becomes particularly useful is with cargo-crates. This is a sample project that makes it easy to capital-E Extract data from third-party services without having to be a data engineering wizard.
For example, I've got an Oura ring and want to extract my sleep data. With the Oura Crate, I can simply do:
docker run -e OURA_PAT ghcr.io/dacort/crates-oura sleep
And that'll return a JSON blob with my sleep data for the past 7 days. But let's say I want to drop that sleep data into a location on S3 based on when I went to bed:
docker run -e OURA_PAT ghcr.io/dacort/crates-oura sleep | forklift -w 's3://bucket/{{json "bedtime_start" | ymdFromTimestamp }}/sleep_data.json'
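To see what the pipe inside that template resolves to, here's a rough equivalent of {{json "bedtime_start" | ymdFromTimestamp}} for a single record, sketched with jq and GNU date. The timestamp value is made up, and the YYYY-MM-DD output format is an assumption about ymdFromTimestamp's behavior:

```shell
# Pull bedtime_start out of a sample sleep record (hypothetical value)...
ts=$(echo '{"bedtime_start": "2021-02-18T07:12:00+00:00"}' | jq -r '.bedtime_start')
# ...and reduce the timestamp to a date, as ymdFromTimestamp presumably does.
date -u -d "$ts" +%Y-%m-%d  # GNU date; macOS needs `date -j -f` instead
```

So this record would land under a path segment like 2021-02-18 in the bucket.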
Cool. Now imagine I want to drop a single Docker container into an ETL workflow that does both of these for me. Well, forklift is integrated into Cargo Crates.
docker run \
-e OURA_PAT \
-e FORKLIFT_URI='s3://bucket/{{json "bedtime_start" | ymdFromTimestamp }}/sleep_data.json' \
ghcr.io/dacort/crates-oura sleep
That will automatically take any stdout of the Docker container and pipe it to that location!
Why?
This seems like a lot of work to just ... upload a file. Well, a few reasons.
- I started playing around with the idea of Docker containers that could very simply extract data from an API, giving the consumer nothing to worry about except having Docker and the proper authentication tokens.
- Then I wanted to upload the data to S3. But I wanted the Docker containers to remain as lightweight as possible.
- It's just a fun experiment. 🤷
Resources
These resources came in handy while building this: