rugburn

command module
v0.0.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 4, 2017 License: MIT Imports: 24 Imported by: 0

README

rugburn

Go Report Card Build Status MIT licensed

A performant configuration-based caching web scraper

What is rugburn?

Rugburn is a web scraping framework. Unlike other web scraping frameworks which require writing software, rugburn attempts to be a completely configuration-based web scraper.

rugburn is installed as a CLI tool and allows you to create and modify a "sraping" environment using only configuration. To see an example of a scraper for the website Hacker News, simply run rugburn init in a new directory. To see all available options run rugburn help.

Rugburn configuraiton files specify which pages to download (spider) and which elements to extract via XPath (scraper). For cases where additional custom behavior is required, rugburn supports transformations. Transformations are scripts written in LUA to allow for you to "transform" your data into a more desirable format. You can re-use transformations between scrapers.

Rugburn supports caching of requests and responses into a local on-disk database. This is recommended in order to avoid IP bans, improve performance, and in order to preserve the backwards compatability of selectors.

Caching also means you can check your configuration alongside your database into version control. This means your configuration and transforms will remain deterministic and the history of them will be retained.

Caching is optional in the case where it is not desired, or not infeasible due to a larger dataset.

Status

Rugburn is still experimental and likely to contain breaking changes going forward.

Installation

Mac OS X release binaries are available on the releases page. Other platforms coming soon.

Windows and Linux can still manually build and install:

go get github.com/prettymuchbryce/rugburn
cd $GOPATH/src/github.com/prettymuchbryce/rugburn
make deps
make install

# Now make sure it installed successfully
rugburn help

CLI Commands

rugburn init - Initialize a new rugburn project in the current directory.

rugburn run - Run the rugburn project in this directory.

rugburn clean - Clean the rugburn cache of the project in this directory.

rugburn help - Print some help information.

Configuration options

Configuration Example:

{
	"name": "Hacker News Scraper",
	"options": {
		"store": {
			"strategy": "disk"
		},
		"spiders": {
			"concurrency": 3,
			"max": 5
		}
	},
	"spider": {
		"urls": [
			"https://news.ycombinator.com/news"
		],
		"links": [
			"//a[@class=\"morelink\"]/@href"
		]
	},
	"scrapers": [
		{
			"name": "Links",
			"output": "links.jsonl",
			"context": "//a[@class=\"storylink\"]",
			"fields": {
				"title": "/text()"
			},
			"transforms": [
			  "./transforms/UppercaseTitle.lua"
			]
		}
	]
}

Transform Example

function transform (state)
	-- Uppercase the title field
	if state["title"] ~= nil then
		state["title"] = string.upper(state["title"])
	end
	return state
end

Documentation

The Go Gopher

There is no documentation for this package.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL