templater

package module
v1.5.2
Published: Nov 21, 2022 License: GPL-2.0 Imports: 18 Imported by: 0

README


templater

Quick Install

go install github.com/mr-joshcrane/templater/cmd/templater@latest

Usage

$ templater [FIELDS_TO_UNPACK ...]

FIELDS_TO_UNPACK is an optional list of fields that contain JSON objects and are capable of further unpacking.
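
For example, to unpack two JSON columns in one pass (the field names here are hypothetical):

$ templater statistics metadata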


Why would you use templater?

Data engineering often requires taking raw, untyped, and unsanitised data and running it through a series of preliminary transformations before it can be presented in its final format.

Templater was created with DBT as the transformation tool in mind, and is designed to generate a complete DBT project from a set of CSV files. These CSVs are typically exported Snowflake tables, but any CSVs can be used.

Theoretically templater can be used with any data warehouse solution, but it has only been tested with Snowflake.

Data doesn't always play nice

Sometimes you'll have some untyped data that needs to be typed. This usually isn't difficult, but is very tedious and labour intensive.

Sometimes you will have column names that contain special characters that are not allowed in Snowflake (without special handling).

Sometimes you will have nested JSON data that needs to be flattened before it can be used for any real business application.

Templater can be used to alleviate some of the burden here!


Imagine we had a table in Snowflake called ENERGY and we dumped it out to a CSV file as ENERGY.csv:

| sourceAsMetric | target(idealised) | value(per million tonnes) | statistics |
| --- | --- | --- | --- |
| Bio-conversion(natural) | Liquid | 124.729 | "{ ""attributes"": { ""on_hand"": false, ""available_in"": ""days"", ""componentDemands"": [99,2.5,"{5,2,5}"]}}" |
| BiofuelImports | Solid | 35 | "{ ""attributes"": { ""on_hand"": true, ""available_in"": ""minutes"", ""componentDemands"": [1,0,"{0.1,9,0}"]}}" |
| coal imports(55%) | Coal | null | "{ ""attributes"": { ""on_hand"": true, ""available_in"": ""hours"", ""componentDemands"": [1,0,"{0.1,00.1,NA}"]}}" |

Gross! As we can see, the format of this data leaves much to be desired. We could munge it with DBT, but that would be a non-trivial amount of effort!

We can use templater to generate a DBT project that will transform this data into something more useful.

Let's call templater and indicate that we want to unpack the CSV field called statistics, which appears to be a JSON field.

$ templater statistics

Templater will generate the core of a new DBT project, including our transformation below.

output/transform/TRANS01_ENERGY.sql

{{ config(materialized='table') }}
SELECT
  "statistics":"attributes"."available_in"::STRING AS ATTRIBUTES__AVAILABLE_IN
  ,"statistics":"attributes"."componentDemands"::ARRAY AS ATTRIBUTES__COMPONENT_DEMANDS
  ,"statistics":"attributes"."on_hand"::BOOLEAN AS ATTRIBUTES__ON_HAND
  ,"sourceAsMetric"::STRING AS SOURCE_AS_METRIC
  ,"target(idealised)"::STRING AS TARGET_IDEALISED
  ,"value(per million tonnes)"::STRING AS VALUE_PER_MILLION_TONNES
FROM
  {{ source('TEMPLATER', 'ENERGY') }}

We can see that templater has:

  1. Generated our transformation SQL
  2. Tried to infer Snowflake Types for each column
  3. Normalised the column names to be Snowflake friendly

It has also generated some suggested DBT models that you can tweak to your liking in output/transform/_models_schema.yml, which will go a long way when you're trying to generate some dbt docs.

version: 2
models:
  - name: TRANS01_ENERGY
    columns:
      - name: ATTRIBUTES__AVAILABLE_IN
      - name: ATTRIBUTES__COMPONENT_DEMANDS
      - name: ATTRIBUTES__ON_HAND
      - name: SOURCE_AS_METRIC
      - name: TARGET_IDEALISED
      - name: VALUE_PER_MILLION_TONNES
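
Since the generated column entries support optional descriptions and tests, a hypothetical tweak (not generated output) might look like:

version: 2
models:
  - name: TRANS01_ENERGY
    columns:
      - name: ATTRIBUTES__ON_HAND
        description: Whether the component is currently on hand
        tests:
          - not_null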

Documentation


Constants

This section is empty.

Variables

View Source
var SnowflakeTypes = map[string]string{
	"string": "STRING",
	"int":    "INTEGER",
	"float":  "FLOAT",
	"struct": "OBJECT",
	"list":   "ARRAY",
	"null":   "VARCHAR",
	"bool":   "BOOLEAN",
}

SnowflakeTypes is a map of CUE types to Snowflake types.

CUE Lang Type Reference: https://cuelang.org/docs/references/spec/#types.

Snowflake Type Reference: https://docs.snowflake.com/en/sql-reference/data-types.html.
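As a quick sketch of how the mapping is applied, a CUE type name looks up the Snowflake type that the generated SQL casts to:

package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	// A CUE "float" becomes a Snowflake FLOAT cast; a CUE "list" becomes ARRAY.
	fmt.Println(templater.SnowflakeTypes["float"]) // FLOAT
	fmt.Println(templater.SnowflakeTypes["list"])  // ARRAY
}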

Functions

func CleanTableName

func CleanTableName(path string) string

CleanTableName derives a table name from a file name in a Snowflake-friendly format.
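No example ships with this function, so here is a minimal sketch of the intent; the file name is hypothetical, and the result is an assumption based on ENERGY.csv becoming the ENERGY table above:

package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	// Assumed behaviour: the extension is dropped and the base name is
	// normalised, e.g. ENERGY_DATA for the path below.
	fmt.Println(templater.CleanTableName("exports/energy data.csv"))
}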

func ContainsNonLeadingArray

func ContainsNonLeadingArray(path string) bool
Example
package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	path1 := "meta.mass_edit_custom_type_ids[123]"
	path2 := "meta.mass_edit_custom_type_ids"
	path3 := "[123]meta.mass_edit_custom_type_ids"
	fmt.Println(templater.ContainsNonLeadingArray(path1))
	fmt.Println(templater.ContainsNonLeadingArray(path2))
	fmt.Println(templater.ContainsNonLeadingArray(path3))
}
Output:

true
false
false

func EscapePath

func EscapePath(s string) string

EscapePath escapes each section of the path into "delimited identifiers". If we didn't do this, we'd end up with a path like: foo.bar.baz instead of "foo"."bar"."baz", which would cause errors.

Reference: https://docs.snowflake.com/en/sql-reference/identifiers-syntax.html#delimited-identifiers.
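The doc comment's own example, as a runnable sketch:

package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	// Each dot-separated section becomes a quoted, delimited identifier.
	fmt.Println(templater.EscapePath("foo.bar.baz")) // "foo"."bar"."baz"
}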

func GenerateColumnsSQL

func GenerateColumnsSQL(f map[string]Field) string

GenerateColumnsSQL generates the SQL required to declare, rename, and typecast the columns of a table in a DBT Project Model.

Example
package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	fields := map[string]templater.Field{
		"Team": {
			Path:         "Team",
			Node:         "Team",
			InferredType: "STRING",
		},
		"Payroll(millions)": {
			Path:         "Payroll(millions)",
			Node:         "Payroll(millions)",
			InferredType: "FLOAT",
		},
		"Wins": {
			Path:         "Wins",
			Node:         "Wins",
			InferredType: "INTEGER",
		},
	}
	fmt.Println(templater.GenerateColumnsSQL(fields))
}
Output:

  "Payroll(millions)"::FLOAT AS PAYROLL_MILLIONS
  ,"Team"::STRING AS TEAM
  ,"Wins"::INTEGER AS WINS

func GenerateReferenceSQL

func GenerateReferenceSQL(table string) string

GenerateReferenceSQL generates a relation for a referenced model in a DBT Project Model.

Reference: https://docs.getdbt.com/reference/dbt-jinja-functions/ref.
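
This function has no shipped example; by analogy with GenerateSourceSQL and GenerateTagsSQL below, here is a sketch whose expected output is an assumption:

package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	// Assumed output, by analogy with the source example:
	//   {{ ref('A_TABLENAME') }}
	fmt.Println(templater.GenerateReferenceSQL("A_TableName"))
}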

func GenerateSourceSQL

func GenerateSourceSQL(project, table string) string

GenerateSourceSQL generates a relation for a source table in a DBT Project Model.

Reference: https://docs.getdbt.com/reference/dbt-jinja-functions/source

Example
package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	PROJECT := "A_ProjectName"
	TABLE := "A_TableName"
	fmt.Println(templater.GenerateSourceSQL(PROJECT, TABLE))
}
Output:

  {{ source('A_PROJECTNAME', 'A_TABLENAME') }}

func GenerateTagsSQL

func GenerateTagsSQL(project, table string) string

GenerateTagsSQL generates the config block tags suitable for use in a DBT Project Model.

Reference: https://docs.getdbt.com/reference/resource-configs/tags

Example
package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	PROJECT := "A_ProjectName"
	TABLE := "A_TableName"
	fmt.Println(templater.GenerateTagsSQL(PROJECT, TABLE))
}
Output:

{{ config(tags=['A_PROJECTNAME', 'A_TABLENAME']) }}

func Main

func Main() int

Main is the entrypoint for the templater. Working in the context of the current working directory as an fs.FS, and taking os.Args as a list of fields to unpack, it will generate the following artifacts (sketched below):

  • A [transform] directory containing the DBT SQL transformations of the tables.
  • A [public] directory containing the DBT SQL clone transforms.
  • Schemas for source, transform, and public models.
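
Based on the ENERGY walkthrough above, the generated layout looks roughly like this (the contents of the public directory are an assumption):

output/
  transform/
    TRANS01_ENERGY.sql
    _models_schema.yml
  public/
    ...            clone transforms and their schema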

func NormaliseKey

func NormaliseKey(s string) string

NormaliseKey takes a key and attempts to normalise it to a Snowflake-friendly format. It will:

  • Convert from camelCase to SCREAMING_SNAKE_CASE
  • Cast to uppercase
  • Remove any non-underscore/non-alphanumeric characters
  • Remove any leading/trailing spaces
  • Convert spaces to underscores
  • Replace any double underscores with single underscores
  • Replace any dots with double underscores
Example
package main

import (
	"fmt"

	"github.com/mr-joshcrane/templater"
)

func main() {
	rule_uppercased := "thisisakey"
	rule_spaces_to_underscore := "this is a key"
	rule_nonalphanumeric_stripped := "this%^@is``a()*key"
	rule_dot_separators_to_double_underscore := "json.payload.and_children"
	rule_leading_and_trailing_space_trimmed := "       THISISAKEY          "
	rule_parenthesised_words_considered_word_boundaries := "(THIS)IS(A)KEY"
	rule_camel_case_considered_separate_words := "thisIsAKey"

	fmt.Println(templater.NormaliseKey(rule_uppercased))
	fmt.Println(templater.NormaliseKey(rule_spaces_to_underscore))
	fmt.Println(templater.NormaliseKey(rule_nonalphanumeric_stripped))
	fmt.Println(templater.NormaliseKey(rule_dot_separators_to_double_underscore))
	fmt.Println(templater.NormaliseKey(rule_leading_and_trailing_space_trimmed))
	fmt.Println(templater.NormaliseKey(rule_parenthesised_words_considered_word_boundaries))
	fmt.Println(templater.NormaliseKey(rule_camel_case_considered_separate_words))

}
Output:

THISISAKEY
THIS_IS_A_KEY
THIS_IS_A_KEY
JSON__PAYLOAD__AND_CHILDREN
THISISAKEY
THIS_IS_A_KEY
THIS_IS_A_KEY

func UnmarshalJSONFromCUE added in v1.4.1

func UnmarshalJSONFromCUE(c cue.Value) (cue.Value, error)

UnmarshalJSONFromCUE takes a cue.Value that is assumed to be a JSON string and attempts to unmarshal it, returning an error if unable to do so.
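
A minimal sketch, assuming the input cue.Value holds a JSON string such as the statistics column above:

package main

import (
	"fmt"

	"cuelang.org/go/cue/cuecontext"

	"github.com/mr-joshcrane/templater"
)

func main() {
	// A CUE string value whose contents are embedded JSON.
	v := cuecontext.New().CompileString(`"{\"on_hand\": true}"`)
	unpacked, err := templater.UnmarshalJSONFromCUE(v)
	if err != nil {
		panic(err)
	}
	fmt.Println(unpacked) // a cue.Value for the parsed JSON object
}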

func Unpack

func Unpack(t *Table, c cue.Value, opts ...NameOption)

Unpack constructs a Field from a cue.Value and adds it to the Table. A NameOption can be passed to modify the path of the field. Objects are recursively unpacked. Arrays are not.
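
A rough sketch, assuming the caller supplies a Table with an initialised Fields map:

package main

import (
	"fmt"

	"cuelang.org/go/cue/cuecontext"

	"github.com/mr-joshcrane/templater"
)

func main() {
	t := &templater.Table{Name: "ENERGY", Fields: map[string]templater.Field{}}
	// The object is unpacked recursively; the array is kept whole.
	v := cuecontext.New().CompileString(`{attributes: {on_hand: true, demands: [1, 2]}}`)
	templater.Unpack(t, v)
	for path, f := range t.Fields {
		fmt.Println(path, f.InferredType)
	}
}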

Types

type Column

type Column struct {
	Name        string   `yaml:"name"`
	Description *string  `yaml:"description,omitempty"`
	Tests       []string `yaml:"tests,omitempty"`
}

Column: DBT Reference: https://docs.getdbt.com/reference/resource-properties/columns.

type Field

type Field struct {
	Node         string
	Path         string
	InferredType string
}

A Field represents a column and information about how it should be transformed.

Node: Represents the post-transformation target identifier in Snowflake.

Path: Represents the pre-transformation path to the data in the source table.

InferredType: Represents the current best guess at the Snowflake type, inferred from exemplars.

type Model

type Model struct {
	Name        string   `yaml:"name"`
	Description *string  `yaml:"description,omitempty"`
	Tests       []Test   `yaml:"tests,omitempty"`
	Columns     []Column `yaml:"columns"`
}

Model: DBT Reference: https://docs.getdbt.com/docs/dbt-cloud-apis/metadata-schema-model.

type Models

type Models struct {
	Version int     `yaml:"version"`
	Models  []Model `yaml:"models"`
}

Models: DBT Reference: https://docs.getdbt.com/docs/dbt-cloud-apis/metadata-schema-model.

func GenerateProjectModel

func GenerateProjectModel(tables []*Table) Models

GenerateProjectModel generates the Models required in the _models_schema.yml files that help define a (potentially multi-table) DBT project.

type NameOption

type NameOption func(string) string

type SQLTemplate

type SQLTemplate struct {
	Tags      string
	Columns   string
	Source    string
	Reference string
}

SQLTemplate is an intermediate data structure that represents the table to be rendered as a SQL Model in a DBT Project.

type Source

type Source struct {
	Name   string   `yaml:"name"`
	Schema string   `yaml:"schema"`
	Tables []Column `yaml:"tables,omitempty"`
}

Source: DBT Reference: https://docs.getdbt.com/reference/dbt-jinja-functions/source.

type Sources

type Sources struct {
	Version int      `yaml:"version"`
	Sources []Source `yaml:"sources"`
}

Sources: DBT Reference: https://docs.getdbt.com/reference/dbt-jinja-functions/source.

type Table

type Table struct {
	Name    string
	Project string
	Fields  map[string]Field
	// contains filtered or unexported fields
}

A Table represents a source table. It is the intermediate representation of the untyped semi-structured data.

func (*Table) InferFields

func (t *Table) InferFields(iter cue.Iterator, unpackPaths ...string) error

InferFields takes a cue.Iterator and walks through it, adding fields to the table. It will also unpack any JSON fields whose column name matches one of the (optional) unpackPaths.
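
A rough sketch, assuming rows arrive as a CUE list of structs (the Table initialisation details are assumptions):

package main

import (
	"fmt"

	"cuelang.org/go/cue/cuecontext"

	"github.com/mr-joshcrane/templater"
)

func main() {
	// Two exemplar rows; "statistics" holds embedded JSON to unpack.
	v := cuecontext.New().CompileString(
		`[{a: "x", statistics: "{\"n\": 1}"}, {a: "y", statistics: "{\"n\": 2}"}]`)
	iter, err := v.List()
	if err != nil {
		panic(err)
	}
	t := &templater.Table{Name: "EXAMPLE", Fields: map[string]templater.Field{}}
	if err := t.InferFields(iter, "statistics"); err != nil {
		panic(err)
	}
	for path, f := range t.Fields {
		fmt.Println(path, f.InferredType)
	}
}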

type Test

type Test struct {
	Test string `yaml:"test,omitempty"`
}

Test: DBT Reference: https://docs.getdbt.com/reference/resource-properties/tests.

Directories

Path Synopsis
cmd
