etl

Module version: v2.4.3+incompatible
Published: Aug 5, 2020 License: Apache-2.0

branch       travis-ci            report-card     coveralls
master       Travis Build Status                  Coverage Status
integration  Travis Build Status  Go Report Card  Coverage Status

MeasurementLab data ingestion pipeline.

To create a table, e.g. the NDT table (this should rarely be required!), run:

$ bq mk --time_partitioning_type=DAY --schema=schema/repeated.json mlab-sandbox:mlab_sandbox.ndt

Also see schema/README.md.

Generating Schema Docs

To build a new Docker image containing the generate_schema_docs command, and then run it against the current directory:

$ docker build -t measurementlab/generate-schema-docs .
$ docker run -v $PWD:/workspace -w /workspace \
  -it measurementlab/generate-schema-docs

Writing schema_ndtresultrow.md
...

Moving to GKE

The universal parser will run in GKE, using parser-pool node pools, defined like this:

gcloud --project=mlab-sandbox container node-pools create parser-pool-1 \
  --cluster=data-processing   --num-nodes=3   --region=us-east1 \
  --scopes storage-ro,compute-rw,bigquery,datastore \
  --node-labels=parser-node=true   --enable-autorepair --enable-autoupgrade \
  --machine-type=n1-standard-16

The images come from gcr.io and are built by Google Cloud Build. The build trigger can currently be found with:

gcloud beta builds triggers list --filter=m-lab/etl

Deployment requires adding the cloud-kubernetes-deployer role to etl-travis-deploy@ in IAM. This has already been done for sandbox and staging.

Migrating to Sink interface

The parsers currently use etl.Inserter as the backend for writing records. This API is overly shaped by BigQuery, which complicates testing and extension.

The row.Sink interface and row.Buffer define cleaner APIs for the back end and for buffering and annotating. This will streamline migration to Gardener-driven table selection, column-partitioned tables, and possibly a future migration to BigQuery loads instead of streaming inserts.
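
The separation between a back-end sink and a buffer in front of it can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the Sink and Buffer types, their method names and signatures, and the in-memory memSink are hypothetical and are not the actual row.Sink or row.Buffer APIs. The point is only that a narrow sink interface lets a test swap in an in-memory back end instead of BigQuery.

```go
package main

import (
	"errors"
	"fmt"
)

// Sink is a hypothetical back-end interface (not the real row.Sink):
// it accepts a batch of rows and reports how many were written.
type Sink interface {
	Commit(rows []interface{}, label string) (int, error)
	Close() error
}

// memSink is an in-memory Sink, standing in for BigQuery in tests.
type memSink struct {
	rows   []interface{}
	closed bool
}

func (s *memSink) Commit(rows []interface{}, label string) (int, error) {
	if s.closed {
		return 0, errors.New("sink closed")
	}
	s.rows = append(s.rows, rows...)
	return len(rows), nil
}

func (s *memSink) Close() error {
	s.closed = true
	return nil
}

// Buffer is a hypothetical row buffer (not the real row.Buffer):
// it accumulates rows and commits them to a Sink in batches.
type Buffer struct {
	sink  Sink
	label string
	size  int
	rows  []interface{}
}

func NewBuffer(sink Sink, label string, size int) *Buffer {
	return &Buffer{sink: sink, label: label, size: size}
}

// Put appends a row, flushing first if the buffer is already full.
func (b *Buffer) Put(row interface{}) error {
	if len(b.rows) >= b.size {
		if err := b.Flush(); err != nil {
			return err
		}
	}
	b.rows = append(b.rows, row)
	return nil
}

// Flush commits all pending rows and starts a fresh batch.
func (b *Buffer) Flush() error {
	if len(b.rows) == 0 {
		return nil
	}
	_, err := b.sink.Commit(b.rows, b.label)
	// Allocate a new slice so the sink may safely retain the old one.
	b.rows = make([]interface{}, 0, b.size)
	return err
}

func main() {
	sink := &memSink{}
	buf := NewBuffer(sink, "ndt", 2)
	for i := 0; i < 5; i++ {
		if err := buf.Put(i); err != nil {
			fmt.Println("put failed:", err)
			return
		}
	}
	if err := buf.Flush(); err != nil {
		fmt.Println("flush failed:", err)
		return
	}
	fmt.Println("rows committed:", len(sink.rows))
}
```

Because the buffer depends only on the interface, replacing streaming inserts with a different back end would, in a design like this, mean writing another Sink implementation rather than changing the parsers.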

Factories

The TaskFactory aggregates the other factories that produce the elements required for a Task. Factory injection is used to generalize ProcessGKETask and to simplify testing.

  • SinkFactory produces a Sink for output.
  • SourceFactory produces a Source for the input data.
  • AnnotatorFactory produces an Annotator to be used to annotate rows.
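
The factory-injection idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the etl package's code: the function-typed factories, the string-based Source, Annotator and Sink interfaces, the ProcessTask function, the fakeFactory helper, and the gs://example-bucket path are all hypothetical. It shows why aggregating factories helps testing: the processing function never constructs its dependencies, so a test can hand it fakes.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, simplified element interfaces (not the etl package's own).
type Source interface{ Next() (string, bool) }
type Annotator interface{ Annotate(row string) string }
type Sink interface{ Commit(rows []string) (int, error) }

// Hypothetical factories, modeled here as plain function types.
type SourceFactory func(path string) (Source, error)
type AnnotatorFactory func() (Annotator, error)
type SinkFactory func(path string) (Sink, error)

// TaskFactory aggregates the factories needed to assemble one task.
type TaskFactory struct {
	Source    SourceFactory
	Annotator AnnotatorFactory
	Sink      SinkFactory
}

// ProcessTask builds every element through the injected factories,
// annotates each input row, and commits the result to the sink.
func ProcessTask(path string, tf TaskFactory) (int, error) {
	src, err := tf.Source(path)
	if err != nil {
		return 0, err
	}
	ann, err := tf.Annotator()
	if err != nil {
		return 0, err
	}
	sink, err := tf.Sink(path)
	if err != nil {
		return 0, err
	}
	var rows []string
	for {
		line, ok := src.Next()
		if !ok {
			break
		}
		rows = append(rows, ann.Annotate(line))
	}
	return sink.Commit(rows)
}

// Fakes for testing: a slice-backed source, an upper-casing annotator,
// and an in-memory sink.
type sliceSource struct {
	lines []string
	pos   int
}

func (s *sliceSource) Next() (string, bool) {
	if s.pos >= len(s.lines) {
		return "", false
	}
	s.pos++
	return s.lines[s.pos-1], true
}

type upperAnnotator struct{}

func (upperAnnotator) Annotate(row string) string { return strings.ToUpper(row) }

type memSink struct{ rows []string }

func (m *memSink) Commit(rows []string) (int, error) {
	m.rows = append(m.rows, rows...)
	return len(rows), nil
}

// fakeFactory wires the fakes into a TaskFactory.
func fakeFactory(lines []string, out *memSink) TaskFactory {
	return TaskFactory{
		Source:    func(string) (Source, error) { return &sliceSource{lines: lines}, nil },
		Annotator: func() (Annotator, error) { return upperAnnotator{}, nil },
		Sink:      func(string) (Sink, error) { return out, nil },
	}
}

func main() {
	out := &memSink{}
	// The path is a made-up example; the fake source ignores it.
	n, err := ProcessTask("gs://example-bucket/task.tgz", fakeFactory([]string{"a", "b"}, out))
	fmt.Println(n, err, out.rows)
}
```

In production wiring, the same ProcessTask would receive factories that open real archives and real output tables; only the TaskFactory value changes.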

Directories

Path Synopsis
Package active provides code for managing processing of an entire directory of task files.
appengine
queue_pusher
Package pushqueue provides a microservice that accepts HTTP requests, creates a Task from given parameters, and adds the Task to a TaskQueue.
Package bq includes all code related to BigQuery.
cloud
gcs
cmd
etl_worker
Sample
generate_schema_docs
generate_schema_docs uses ETL schema field descriptions to generate documentation in various formats.
web100_cli
web100_cli provides a simple CLI interface to web100 functions.
Package etl provides all major interfaces used across packages.
Package factory provides factories for constructing Task components.
Package metrics defines prometheus metric types and provides convenience methods to add accounting to various parts of the pipeline.
Package parser defines the Parser interface and implementations for the different test types, NDT, Paris Traceroute, and SideStream.
Package schema generated by go-bindata. Sources: descriptions/NDT5ResultRow.yaml, descriptions/NDT7ResultRow.yaml, descriptions/PTTest.yaml, descriptions/README.md, descriptions/TCPRow.yaml, descriptions/toplevel.yaml. This file contains the schema for Paris Traceroute tests.
Package task provides the tracking of state for a single task pushed by the external task queue.
web100 provides tools for reading web100 snapshot logs, and parsing snapshots.
