gae_filepublisher

command module
v0.0.0-...-c51c1fb
Published: Feb 14, 2017 | License: BSD-3-Clause | Imports: 13 | Imported by: 0

README

filepublisher

An AppEngine handler that moves files from one CloudStorage location to another and publishes the names of those files to PubSub.

Why is this useful?

Consider the situation where a cron job is copying files to a bucket on Google Cloud Storage. You want to process those files, and you need some way of knowing which ones are new. You also want to make sure that they only get processed once. One way to achieve that is to move the files from the incoming directory to a processing directory. When a file is moved to the processing directory, an event is published to a Google PubSub topic with the name of the file (in its new location). This way you can have multiple subscribers to the queue, and each one of them can process a file.
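The move-then-publish flow described above can be sketched with in-memory stand-ins. This is only an illustration, not the package's actual code: the Store and Publisher interfaces, the memStore and memTopic fakes, and the moveAndPublish function are all hypothetical names. Publishing the full gs:// location and keeping only the base name of each file under the destination path are assumptions as well.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// Store is a hypothetical stand-in for a Cloud Storage client.
type Store interface {
	List(bucket, prefix string) ([]string, error)
	Move(srcBucket, srcName, dstBucket, dstName string) error
}

// Publisher is a hypothetical stand-in for a PubSub topic.
type Publisher interface {
	Publish(msg string) error
}

// moveAndPublish moves every object whose name starts with srcPrefix into
// dstPath and publishes the new location of each moved object.
// Assumption: the published message is the gs:// URL of the new location.
func moveAndPublish(s Store, p Publisher, srcBucket, srcPrefix, dstBucket, dstPath string) ([]string, error) {
	names, err := s.List(srcBucket, srcPrefix)
	if err != nil {
		return nil, err
	}
	var moved []string
	for _, name := range names {
		// Assumption: only the base name is kept under the destination path.
		dstName := path.Join(dstPath, path.Base(name))
		if err := s.Move(srcBucket, name, dstBucket, dstName); err != nil {
			return moved, err
		}
		loc := "gs://" + dstBucket + "/" + dstName
		if err := p.Publish(loc); err != nil {
			return moved, err
		}
		moved = append(moved, loc)
	}
	return moved, nil
}

// memStore is an in-memory fake; keys have the form "bucket/name".
type memStore map[string]bool

func (m memStore) List(bucket, prefix string) ([]string, error) {
	var names []string
	for key := range m {
		if strings.HasPrefix(key, bucket+"/"+prefix) {
			names = append(names, strings.TrimPrefix(key, bucket+"/"))
		}
	}
	sort.Strings(names) // deterministic order for the example
	return names, nil
}

func (m memStore) Move(srcBucket, srcName, dstBucket, dstName string) error {
	src := srcBucket + "/" + srcName
	if !m[src] {
		return fmt.Errorf("object %s not found", src)
	}
	delete(m, src)
	m[dstBucket+"/"+dstName] = true
	return nil
}

// memTopic is an in-memory fake that records published messages.
type memTopic struct{ msgs []string }

func (t *memTopic) Publish(msg string) error {
	t.msgs = append(t.msgs, msg)
	return nil
}

func main() {
	store := memStore{
		"my-log-files/inbound/a.log": true,
		"my-log-files/inbound/b.log": true,
		"my-log-files/other/c.log":   true,
	}
	topic := &memTopic{}
	moved, err := moveAndPublish(store, topic, "my-log-files", "inbound", "my-log-files", "processing")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, loc := range moved {
		fmt.Println(loc)
	}
}
```

Because a file leaves the incoming location before its name is published, a second run over the same prefix finds nothing to announce again. That is the "processed only once" property the paragraph describes, at least in this simplified model.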

Usage

The handler takes the following query-string parameters:

Parameter    Description
topic        The name of the PubSub topic that events should be published to
dst_bucket   The bucket the files should be moved to
dst_path     The path in the destination bucket where files should be copied to
src_bucket   The bucket where files will be looked for (can be same as destination bucket)
src_prefix   A prefix used to find the files

For example: https://my-project.appspot.com/tasks/filepublisher?topic=freshfiles&dst_bucket=my-log-files&dst_path=processing&src_bucket=my-log-files&src_prefix=inbound

This URL would copy all files matching gs://my-log-files/inbound* to gs://my-log-files/processing/ and publish a message on the PubSub topic named "freshfiles" for each one.
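A caller could assemble such a URL programmatically rather than by hand. The following is a minimal sketch using Go's standard net/url package; the buildRequestURL helper is a hypothetical name and not part of this package. Note that url.Values.Encode sorts keys alphabetically, so the parameter order differs from the example above, which should not matter for query-string parsing.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRequestURL (hypothetical helper) assembles a filepublisher request
// URL from the documented query-string parameters.
func buildRequestURL(base, topic, dstBucket, dstPath, srcBucket, srcPrefix string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := url.Values{}
	q.Set("topic", topic)
	q.Set("dst_bucket", dstBucket)
	q.Set("dst_path", dstPath)
	q.Set("src_bucket", srcBucket)
	q.Set("src_prefix", srcPrefix)
	// Encode escapes the values and sorts the keys alphabetically.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	u, err := buildRequestURL(
		"https://my-project.appspot.com/tasks/filepublisher",
		"freshfiles", "my-log-files", "processing", "my-log-files", "inbound",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```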

TODO

  • There is a dry-run parameter that needs to be implemented.
  • Needs unit tests.

Development

The handler can be run locally using "goapp serve" by updating the app.yaml file and adding project information.

Documentation

Overview

An AppEngine service that copies files from one CloudStorage location to another and publishes the names of the files (in their new location) to a PubSub topic.

This could be used as the first step of a data pipeline that brings log files into GCP. An external (to GCP) service would copy the logs into a well-known location on CloudStorage. An AppEngine cron job would make a call to this service with the right parameters, which would then move the files to a staging area and publish them to a queue to be consumed by other workers.

The handler will listen at:

http://PROJECT_ID.appspot.com/tasks/filepublisher

The parameters required by the handler defined in this service are:

topic      - name of the pubsub topic to which names of staged files should be published
dst_bucket - CloudStorage bucket name where files should be staged
dst_path   - path in the CloudStorage bucket where files should be staged
src_bucket - CloudStorage bucket where files can be found
src_prefix - prefix used to identify files in the source bucket
dry_run    - if true, will only show what action would've been taken (TBD)

An example call could look like:

http://PROJECT_ID.appspot.com/tasks/filepublisher?topic=MY_TOPIC&dst_bucket=MY_BUCKET&dst_path=staged&src_bucket=MY_BUCKET&src_prefix=inbound

To run the service locally for development the project ID must be specified in the environment and 'go run' can be used:

GCLOUD_PROJECT=<PROJECT_ID> go run filepublisher.go
