ukpr (working title)

Ben Campbell (ben@scumways.com), at the Media Standards Trust

Overview

This program:

  1. periodically scrapes a bunch of press release sources
  2. stores them in a database
  3. serves up the press releases to any interested clients via HTTP (as server-sent events).

The idea is that eventually it'll keep an archive of a week or so, so that clients which go down have a chance to catch up.

When ukpr is running, clients can connect to:

http://<host>:<port>/<source>/

where <source> is one of the scrapers. You can list them with:

$ ukpr -l

Connected clients receive a stream of press releases as they are scraped. Clients can send a Last-Event-ID header to access archived press releases, or to resume after a disconnection.

You can connect and view the raw stream using any HTTP client, eg:

$ curl http://localhost:9998/72point/ -H "Last-Event-ID: 0"

This will serve up all the stored 72point press releases.

Without a Last-Event-ID header, the client is served only new press releases as they come in.
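
To consume the stream programmatically, something like the following works. This is a minimal sketch in Go, assuming only the standard SSE wire format (id: and data: lines, events separated by blank lines); the host, port and source name are taken from the curl example above.

    package main

    import (
    	"bufio"
    	"fmt"
    	"log"
    	"net/http"
    	"strings"
    )

    func main() {
    	req, err := http.NewRequest("GET", "http://localhost:9998/72point/", nil)
    	if err != nil {
    		log.Fatal(err)
    	}
    	// Ask for the whole archive; omit this header to receive only
    	// new press releases as they come in.
    	req.Header.Set("Last-Event-ID", "0")

    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer resp.Body.Close()

    	scanner := bufio.NewScanner(resp.Body)
    	// Press releases can be long; grow the scanner's buffer.
    	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)

    	var id, data string
    	for scanner.Scan() {
    		line := scanner.Text()
    		switch {
    		case strings.HasPrefix(line, "id:"):
    			id = strings.TrimSpace(line[len("id:"):])
    		case strings.HasPrefix(line, "data:"):
    			data += strings.TrimSpace(line[len("data:"):]) + "\n"
    		case line == "": // a blank line ends an event
    			if data != "" {
    				fmt.Printf("press release %s:\n%s\n", id, data)
    			}
    			id, data = "", ""
    		}
    	}
    	if err := scanner.Err(); err != nil {
    		log.Fatal(err)
    	}
    }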

Usage

ukpr <flags> [scraper1 scraper2 ...]

Specific scrapers can be listed after the flags; only those scrapers will be run. By default, all scrapers are used.

flags:

-l
list available scrapers and exit

-historical
use the history-collecting version of all scrapers that
have one (only 72point at the moment)

-t
test mode. Scrape, but output to stdout and don't touch
the database. Also disables the SSE serving.

-b
brief output (test mode only): dump just the titles of press
releases to stdout rather than the whole thing.

It uses glog for logging, so it also supports all the standard glog flags.
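
For example, to dry-run just the 72point scraper, dumping press release titles to stdout:

$ ukpr -t -b 72point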

TODOs

  • we've already got an HTTP server running, so we should implement a simple browsing interface for visual sanity-checking of press releases.
  • implement a proper config file system
  • run the scrapers in parallel with proper interval timing

Motivation & Goals

The main aim for this is to provide press releases for use by http://churnalism.com, hence the UK bias.

A major goal is to make it simple enough to customise for coverage of any set of press release sources you like.

It's not designed to be a full historical archive of press releases, merely a conduit to stream them out to interested clients, with a bit of buffering to make things more fault-tolerant.

Documentation

Overview

This program runs a server which:

  1. scrapes UK press releases
  2. serves them up as HTTP server-sent events
  3. stashes them in a database for persistence, keeping a few days' worth of history (at least)

For more details, see prscrape, which provides all the implementation.
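
prscrape is where the real work happens; purely for orientation, here is a minimal sketch of what the serving side of a server-sent-events handler looks like in Go. The Event type, the events channel and the route are hypothetical stand-ins, not this package's actual API.

    package main

    import (
    	"fmt"
    	"log"
    	"net/http"
    )

    // Event is a hypothetical stand-in for a stored press release.
    type Event struct {
    	ID   int
    	Data string // assumed single-line; real SSE splits multi-line
    	             // payloads across several data: lines
    }

    // serveSSE streams events to one client in server-sent-events format.
    func serveSSE(w http.ResponseWriter, r *http.Request, events <-chan Event) {
    	flusher, ok := w.(http.Flusher)
    	if !ok {
    		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
    		return
    	}
    	w.Header().Set("Content-Type", "text/event-stream")
    	w.Header().Set("Cache-Control", "no-cache")

    	// A real server would read the Last-Event-ID header here and
    	// first replay any archived press releases the client missed.
    	for ev := range events {
    		fmt.Fprintf(w, "id: %d\ndata: %s\n\n", ev.ID, ev.Data)
    		flusher.Flush()
    	}
    }

    func main() {
    	events := make(chan Event) // a real server would fan out per client
    	http.HandleFunc("/72point/", func(w http.ResponseWriter, r *http.Request) {
    		serveSSE(w, r, events)
    	})
    	log.Fatal(http.ListenAndServe(":9998", nil))
    }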
