Firestore to BigQuery Extension Installer

The Stream Collections to BigQuery extension for Firebase sends realtime, incremental updates from a specified Cloud Firestore Collection to BigQuery. The current implementation of the extension must be installed separately for every collection that you want to keep in sync. Additionally, the extension does not backload existing data into BigQuery once installed, nor does it convert the data to a typed columnar format. Instead, separate scripts must be run to backload historical data and to flatten the nested JSON of your Firestore documents into typed columns. For large-scale Firestore instances, this manual, multi-step process makes the extension extremely cumbersome to leverage.

This utility will automatically install multiple instances of the extension in parallel, reducing deployment time from ~5m * N (extensions) to just ~5m. It also takes care of backloading existing Firestore data, while creating typed BigQuery Views of your data by auto-generating schema definitions for your existing Collections.
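
For example, once the installer and backload workers finish, the exported data can be queried directly in BigQuery. The sketch below uses the standard bq CLI and assumes a project with ID myproject and a synced top-level collection named users; the dataset name follows the default described under Usage, and the displayName field extracted from the raw JSON data column is a hypothetical placeholder.

# Illustrative only: query the exported table for a hypothetical `users`
# collection. document_id, timestamp, and data are extension schema fields
# (see -clustering under Usage); displayName is a placeholder.
bq query --use_legacy_sql=false '
SELECT
  document_id,
  timestamp,
  JSON_EXTRACT_SCALAR(data, "$.displayName") AS display_name
FROM `myproject.myproject_firestore_export.users`
ORDER BY timestamp DESC
LIMIT 10'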

Disclaimer

This is not an officially supported Google product.

How To

This utility is most easily run as a Docker container within a Cloud Shell instance. Cloud Shell comes pre-installed with Docker and is already authenticated to GCP.

  • In the GCP console, navigate to the project that contains your Firestore instance (the project that is backing your Firebase project)

  • Open Cloud Shell

  • Start the utility container (Cloud Shell sets $CLOUDSDK_CONFIG automatically; outside of Cloud Shell, $HOME/.config/gcloud is the typical equivalent)
docker run -it -v $CLOUDSDK_CONFIG:/home/installer/.config/gcloud noahmercado/fs-to-bq-installer:latest
  • When the container boots, it will create a service account in your project with the permissions required to install the extension (project-scoped Editor permissions, to be specific, in case your InfoSec team is asking 🙂)

  • fs-to-bq-installer --help will show you all the available options that can be passed through as extension configuration parameters

  • fs-to-bq-installer -include ALL -max-workers 20 will deploy the extension & backload data for all top level collections in your current working project's Firestore DB (using 20 parallel workers). A BigQuery Dataset named <YOUR_PROJECT_ID>_firestore_export will be created, with BQ tables matching the collection names.

  • fs-to-bq-installer -include ALL -exclude users -max-workers 20 will deploy the extension & backload data for all top level collections in your current working project's Firestore DB (using 20 parallel workers) EXCEPT for the collection named users.

  • fs-to-bq-installer -include metadata,users,topics,/users/noahm/posts will deploy the extension & backload data for the top level collections metadata, users, and topics as well as the nested collection users/noahm/posts in your current working project's Firestore DB.

  • Please reference the official extension.yml manifest to understand each individual parameter

  • Once your installs and data loading workers have completed, running cleanup will remove all keys for the service account it created.
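
To double-check the cleanup, you can list any remaining keys with the standard gcloud CLI. The account name below is a hypothetical placeholder, since the README does not document the exact name the utility chooses; substitute the account you see under IAM & Admin > Service Accounts.

# Illustrative only: confirm no user-managed keys remain for the installer's
# service account. The account name is a hypothetical placeholder.
gcloud iam service-accounts keys list \
  --iam-account="fs-to-bq-installer@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
  --managed-by=user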

Usage

Usage of fs-to-bq-installer:
  --wildcard-ids
        Creates a column containing a JSON object of all wildcard ids from a document's path.
  -backup-collection string
        This (optional) parameter will allow you to specify a collection to which failed BigQuery updates will be written.
  -clustering data,document_id,timestamp
        This parameter will allow you to set up Clustering for the BigQuery Table created by the extension (for example: data,document_id,timestamp, with no whitespace). You can select up to 4 comma-separated fields (order matters). Available extension table schema fields for clustering: `document_id, timestamp, event_id, operation, data`.
  -dataset-id string
        The ID of the BigQuery dataset (default "<YOUR_PROJECT_ID>_firestore_export")
  -dataset-location string
        Where do you want to deploy the BigQuery dataset created for this extension? For help selecting a location, refer to the [location selection guide](https://cloud.google.com/bigquery/docs/locations). (default "us")
  -exclude string
        A comma separated list of collection names to exclude when include is set to 'ALL'
  -include string
        A comma separated list of collection names to include. Set to 'ALL' to include all collections
  -location string
        Where do you want to deploy the functions created for this extension?  You usually want a location close to your database. For help selecting a location, refer to the [location selection guide](https://firebase.google.com/docs/functions/locations). (default "us-central1")
  -max-workers int
        The maximum number of workers to run in parallel when deploying the extension (default 5)
  -project-id string
        The GCP Project ID (default: will look for a local `.firebaserc` file, followed by the env var `$GOOGLE_PROJECT_ID`; if neither is found, an empty string)
  -table-partitioning string
        This parameter will allow you to partition the BigQuery table and BigQuery view created by the extension based on data ingestion time. You may select the granularity of partitioning as one of: HOUR, DAY, MONTH, YEAR. This will generate one partition per hour, day, month, or year, respectively. (default "NONE")
  -time-partitioning-field timestamp
        BigQuery table column/schema field name for TimePartitioning. You can choose the schema field available as timestamp OR a new custom-defined column that will be assigned to the Firestore Document field selected below. Defaults to the pseudo column _PARTITIONTIME if unspecified. Cannot be changed if the Table is already partitioned.
  -time-partitioning-field-type string
        Parameter for the BigQuery SQL schema field type of the selected Time Partitioning Firestore Document field. Cannot be changed if the Table is already partitioned. (default "omit")
  -time-partitioning-firestore-field postDate
        This parameter will allow you to partition the BigQuery table created by the extension based on the selected Firestore Document field. The field value must be a top-level TIMESTAMP, DATETIME, or DATE field in BigQuery string format, or a Firestore timestamp (which will be converted to a BigQuery TIMESTAMP). Cannot be changed if the Table is already partitioned. Example: postDate
  -transform-function-url string
        Specify a function URL to call that will transform the payload that will be written to BigQuery. See the pre-install documentation for more details.
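
As an illustration of combining these parameters, the sketch below deploys the extension for a single collection and configures both clustering and ingestion-aligned partitioning. The orders collection and the orderDate field are hypothetical placeholders; only flags documented above are used, and the value choices (e.g. TIMESTAMP for the field type) are assumptions based on the descriptions.

# Illustrative only: deploy for a hypothetical `orders` collection, cluster on
# document_id and timestamp, and partition by day on a hypothetical top-level
# `orderDate` Firestore timestamp field.
fs-to-bq-installer \
  -include orders \
  -max-workers 10 \
  -clustering document_id,timestamp \
  -table-partitioning DAY \
  -time-partitioning-field orderDate \
  -time-partitioning-field-type TIMESTAMP \
  -time-partitioning-firestore-field orderDate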

TODO:
  • Implement -recursive flag logic
  • Build for multiple archs
  • Cleanup docs
