govuk_crawler_worker

command module
v0.0.0-...-f37c7c2

Published: Sep 5, 2014 License: MIT Imports: 22 Imported by: 0

README

GOV.UK Crawler Worker

This is a worker that will consume GOV.UK URLs from a message queue and crawl them, saving the output to disk.

Requirements

To run this worker you will need:

Development

You can run the tests locally by running the following:

go get -v -t ./...
go test -v ./...

Alternatively, to localise the dependencies, you can use make. This uses the third_party.go tool to vendorise dependencies into a folder within the project.

Running

To run the worker you'll first need to build it with go build, which generates a binary. You can then run the built binary directly using ./govuk_crawler_worker. All configuration is injected using environment variables. For details, see the main.go file.
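As an illustration of configuration through environment variables, here is a minimal sketch in Go. The helper getEnvDefault, the variable names EXAMPLE_AMQP_ADDR and EXAMPLE_ROOT_DIR, and the default values are all hypothetical and are not taken from this project; main.go remains the reference for the real names.

```go
package main

import (
	"fmt"
	"os"
)

// getEnvDefault returns the value of the environment variable key,
// or fallback when the variable is unset or empty.
// Hypothetical helper for illustration only.
func getEnvDefault(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// Both variable names and both defaults are assumptions made for this
	// sketch; the worker's real settings are defined in main.go.
	amqpAddr := getEnvDefault("EXAMPLE_AMQP_ADDR", "amqp://guest:guest@localhost:5672/")
	rootDir := getEnvDefault("EXAMPLE_ROOT_DIR", "/tmp/example_mirror")

	fmt.Println("AMQP address:", amqpAddr)
	fmt.Println("output directory:", rootDir)
}
```

With a pattern like this, a setting is overridden by prefixing the command, for example EXAMPLE_ROOT_DIR=/srv/mirror followed by the binary name.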

How it works

This is a message queue worker that will consume URLs from a queue and crawl them, saving the output to disk. Whilst this is the main reason for the worker to exist, it covers a few other activities before the page gets written to disk.

Workflow

The workflow for the worker can be defined as the following set of steps:

  1. Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
  2. Crawl the received URL
  3. Write the body of the crawled URL to disk
  4. Extract any matching URLs from the HTML body of the crawled URL
  5. Publish the extracted URLs to the worker's own exchange
  6. Acknowledge that the URL has been crawled
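
The steps above can be sketched with the Go standard library alone. This is not the worker's actual code: the function names, the regular-expression link finder, the on-disk layout, and the rule that a "matching" URL is one on the same host are all assumptions made for illustration. Fetching and publishing are passed in as functions so the sketch runs without a network connection or a message broker.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"path"
	"path/filepath"
	"regexp"
	"strings"
)

// hrefPattern is a deliberately naive link finder; a real crawler would
// more likely use an HTML parser.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// extractURLs resolves every href in body against rawBase and keeps the
// ones on the same host (this sketch's assumed meaning of "matching").
func extractURLs(rawBase, body string) ([]string, error) {
	base, err := url.Parse(rawBase)
	if err != nil {
		return nil, err
	}
	var found []string
	for _, m := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip links that do not parse
		}
		abs := base.ResolveReference(ref)
		if abs.Host == base.Host {
			found = append(found, abs.String())
		}
	}
	return found, nil
}

// pathOnDisk maps a URL to a file beneath root (hypothetical layout).
// path.Clean on a rooted path removes any ".." segments.
func pathOnDisk(root string, u *url.URL) string {
	p := strings.Trim(path.Clean("/"+u.Path), "/")
	if p == "" {
		p = "index"
	}
	return filepath.Join(root, u.Host, filepath.FromSlash(p)+".html")
}

// processOne covers steps 2 to 5 for a URL already read from the queue.
// The caller would acknowledge the message (step 6) only on a nil error.
func processOne(rawURL, root string, fetch func(string) (string, error), publish func(string)) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	body, err := fetch(rawURL) // step 2: crawl
	if err != nil {
		return err
	}
	dest := pathOnDisk(root, u)
	if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(dest, []byte(body), 0o644); err != nil { // step 3: write
		return err
	}
	links, err := extractURLs(rawURL, body) // step 4: extract
	if err != nil {
		return err
	}
	for _, link := range links {
		publish(link) // step 5: publish
	}
	return nil
}

func main() {
	root, err := os.MkdirTemp("", "crawl")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	// Stand-ins for the HTTP client and the message queue.
	fetch := func(string) (string, error) {
		return `<a href="/browse">Browse</a> <a href="https://example.com/">Elsewhere</a>`, nil
	}
	var published []string
	publish := func(u string) { published = append(published, u) }

	if err := processOne("https://www.gov.uk/bank-holidays", root, fetch, publish); err != nil {
		panic(err)
	}
	fmt.Println(published)
}
```

Acknowledging only after the write and publish steps succeed means a failed crawl leaves the message unacknowledged, so the queue can redeliver it.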

The Interface

The public interface for the worker is the exchange labelled govuk_crawler_exchange. When the worker starts, it creates this exchange and binds it to its own queue for consumption.

If you provide user credentials for RabbitMQ that aren't on the root vhost /, you may wish to bind a global exchange yourself for easier publishing by other applications.
