getgo

package module
v0.0.0-...-92cf2b6 Latest
Published: Oct 25, 2018 License: BSD-2-Clause Imports: 9 Imported by: 0

README

Getgo: a concurrent, simple and extensible web scraping framework

Getgo is a concurrent, simple and extensible web scraping framework written in Go.

Quick start

Get Getgo

go get -u github.com/hailiang/getgo

Define a task

This example is under the examples/goblog directory. To use Getgo to scrape structured data from a web page, define the structured data as a Go struct (golangBlogEntry) and define a corresponding task (golangBlogIndexTask).

type golangBlogEntry struct {
	Title string
	URL   string
	Tags  *string
}

type golangBlogIndexTask struct {
	// Variables in task URL, e.g. page number
}

func (t golangBlogIndexTask) Request() *http.Request {
	return getReq(`http://blog.golang.org/index`)
}

func (t golangBlogIndexTask) Handle(root *query.Node, s getgo.Storer) (err error) {
	root.Div(_Id("content")).Children(_Class("blogtitle")).For(func(item *query.Node) {
		title := item.Ahref().Text()
		url := item.Ahref().Href()
		tags := item.Span(_Class("tags")).Text()
		if url != nil && title != nil {
			store(&golangBlogEntry{Title: *title, URL: *url, Tags: tags}, s, &err)
		}
	})
	return
}

Run the task

Use util.Run to run the task and print all results to standard output.

	util.Run(golangBlogIndexTask{})

To store the parsed results in a database, pass a storage backend that satisfies the getgo.Tx interface to the getgo.Run function.

Understand Getgo

getgo.Task is an interface that represents an HTTP crawler task: it provides an HTTP request and a method to handle the HTTP response.

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type Requester interface {
	Request() *http.Request
}
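As a minimal sketch of what satisfying Task can look like, the program below declares local copies of the two interfaces above so that it compiles without the getgo module. The pageTask type, its fields, the cannedResponse helper and the example URL are all hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Local copies of the interfaces shown above, so the sketch is self-contained.
type Requester interface {
	Request() *http.Request
}

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

// pageTask is a hypothetical task that records the size of the response body.
type pageTask struct {
	url  string
	size int
}

// Request builds a new request on every call.
func (t *pageTask) Request() *http.Request {
	req, err := http.NewRequest(http.MethodGet, t.url, nil)
	if err != nil {
		return nil
	}
	return req
}

// Handle reads the whole body and stores its length.
func (t *pageTask) Handle(resp *http.Response) error {
	if resp == nil {
		return fmt.Errorf("no response for %s", t.url)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	t.size = len(body)
	return nil
}

// cannedResponse is a hypothetical helper that fakes a response without any network access.
func cannedResponse(body string) *http.Response {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	t := &pageTask{url: "http://example.invalid/"}
	var task Task = t
	if err := task.Handle(cannedResponse("hello")); err != nil {
		fmt.Println("handle failed:", err)
		return
	}
	fmt.Println(task.Request().Method, t.size)
}
```

Feeding Handle a canned response keeps the sketch independent of any runner, which is also a convenient way to unit test a task.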

A getgo.Runner is responsible for running a getgo.Task. Two concrete runners are provided: SequentialRunner and ConcurrentRunner.

type Runner interface {
	Run(task Task) error // Run runs a task
	Close()              // Close closes the runner
}

A task that stores data into a storage backend should satisfy the getgo.StorableTask interface.

type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}

A storage backend is simply an object that satisfies the getgo.Tx interface.

type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}
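To make the Storer and Tx contracts concrete, here is a sketch of an in-memory backend. It is not one of Getgo's backends: the memTx type and its pending/committed fields are hypothetical, and the interfaces are redeclared locally so the program compiles on its own. A mutex guards the state because the package documentation asks Tx implementations to allow concurrent use.

```go
package main

import (
	"fmt"
	"sync"
)

// Local copies of the interfaces shown above.
type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}

// memTx is a hypothetical in-memory backend.
// Store buffers values, Commit publishes them, Rollback discards them.
type memTx struct {
	mu        sync.Mutex
	pending   []interface{}
	committed []interface{}
}

func (m *memTx) Store(v interface{}) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.pending = append(m.pending, v)
	return nil
}

func (m *memTx) Commit() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.committed = append(m.committed, m.pending...)
	m.pending = nil
	return nil
}

func (m *memTx) Rollback() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.pending = nil
	return nil
}

func main() {
	m := &memTx{}
	var tx Tx = m

	tx.Store("draft")
	tx.Rollback() // the buffered value is dropped

	tx.Store("final")
	tx.Commit() // the buffered value becomes visible

	fmt.Println(len(m.committed), len(m.pending))
}
```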

See the getgo.Run function to understand how a StorableTask is combined with a storage backend and adapted into a normal Task so that a Runner can run it.

Getgo currently provides a PostgreSQL storage backend, and it is not hard to support more backends (see the getgo/db package for details).

The easier way to define a task for an HTML page is to satisfy getgo.HTMLTask rather than getgo.Task. Adapters internally convert an HTMLTask to a Task so that a Runner can run it. The Handle method of HTMLTask receives an HTML DOM object that has already been parsed by the html-query package.

type HTMLTask interface {
	Requester
	Handle(root *query.Node, s Storer) error
}

Similarly, a task that retrieves a JSON page should satisfy the getgo.TextTask interface. Its Handle method receives an io.Reader that can be decoded with the encoding/json package.

type TextTask interface {
	Requester
	Handle(r io.Reader, s Storer) error
}
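The following sketch shows one way a TextTask could decode a JSON array from the io.Reader and hand each element to the Storer. The interfaces are local copies, and the release struct, releaseTask, sliceStorer, decodeInto and the URL are hypothetical names chosen for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Local copies of the interfaces shown above.
type Requester interface {
	Request() *http.Request
}

type Storer interface {
	Store(v interface{}) error
}

type TextTask interface {
	Requester
	Handle(r io.Reader, s Storer) error
}

// release is a hypothetical record found in the JSON page.
type release struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

// releaseTask is a hypothetical task for a JSON endpoint.
type releaseTask struct{}

func (releaseTask) Request() *http.Request {
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/releases.json", nil)
	if err != nil {
		return nil
	}
	return req
}

// Handle decodes a JSON array and stores every element.
func (releaseTask) Handle(r io.Reader, s Storer) error {
	var items []release
	if err := json.NewDecoder(r).Decode(&items); err != nil {
		return err
	}
	for _, it := range items {
		if err := s.Store(it); err != nil {
			return err
		}
	}
	return nil
}

// sliceStorer collects stored values in memory.
type sliceStorer struct{ items []interface{} }

func (s *sliceStorer) Store(v interface{}) error {
	s.items = append(s.items, v)
	return nil
}

// decodeInto runs the task's Handle method against a literal payload.
func decodeInto(payload string) ([]interface{}, error) {
	s := &sliceStorer{}
	var task TextTask = releaseTask{}
	err := task.Handle(strings.NewReader(payload), s)
	return s.items, err
}

func main() {
	items, err := decodeInto(`[{"name":"a","version":"1.0"},{"name":"b","version":"2.0"}]`)
	fmt.Println(len(items), err)
}
```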

Documentation

Overview

Package getgo is a concurrent web scraping framework.

Index

Constants

This section is empty.

Variables

var RetryNum = 3

RetryNum is the number of retries when fetching a page fails.

Functions

func Run

func Run(runner Runner, tx Tx, tasks ...interface{}) error

Run runs HTMLTask, TextTask or Task values. tx is committed on success and rolled back on failure.

Types

type Atomized

type Atomized struct {
	StorableTask
	Tx
}

Atomized is an adapter that converts a StorableTask to an atomized Task that supports transactions.

func (Atomized) Handle

func (h Atomized) Handle(resp *http.Response) error

Handle implements the Handle method of Task interface.
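The sketch below illustrates one plausible shape for such an adapter: commit when the wrapped Handle succeeds, roll back when it fails or when the response is nil. It is an assumption about the behaviour, not a copy of getgo's implementation. The lower-case atomized type, logTx, statusTask and trace are hypothetical, and the interfaces are redeclared locally.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// Local copies of the interfaces from this package's documentation.
type Requester interface {
	Request() *http.Request
}

type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}

// atomized is a hypothetical adapter with the same fields as Atomized.
type atomized struct {
	StorableTask
	Tx
}

// Handle commits on success and rolls back on failure (assumed behaviour).
func (a atomized) Handle(resp *http.Response) error {
	if resp == nil {
		// A nil response means the fetch failed.
		return a.Rollback()
	}
	if err := a.StorableTask.Handle(resp, a.Tx); err != nil {
		a.Rollback()
		return err
	}
	return a.Commit()
}

// logTx records the calls it receives.
type logTx struct{ log []string }

func (t *logTx) Store(v interface{}) error {
	t.log = append(t.log, fmt.Sprintf("store %v", v))
	return nil
}
func (t *logTx) Commit() error   { t.log = append(t.log, "commit"); return nil }
func (t *logTx) Rollback() error { t.log = append(t.log, "rollback"); return nil }

// statusTask stores the status code, or fails when fail is set.
type statusTask struct{ fail bool }

func (statusTask) Request() *http.Request {
	req, _ := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	return req
}

func (t statusTask) Handle(resp *http.Response, s Storer) error {
	if t.fail {
		return errors.New("parse error")
	}
	return s.Store(resp.StatusCode)
}

// trace runs one adapted task and returns the calls seen by the transaction.
func trace(fail, haveResp bool) string {
	tx := &logTx{}
	var task Task = atomized{StorableTask: statusTask{fail: fail}, Tx: tx}
	var resp *http.Response
	if haveResp {
		resp = &http.Response{StatusCode: http.StatusOK}
	}
	task.Handle(resp)
	return strings.Join(tx.log, ",")
}

func main() {
	fmt.Println(trace(false, true))
	fmt.Println(trace(true, true))
	fmt.Println(trace(false, false))
}
```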

type ConcurrentRunner

type ConcurrentRunner struct {
	// contains filtered or unexported fields
}

ConcurrentRunner runs tasks concurrently.

func NewConcurrentRunner

func NewConcurrentRunner(workerNum int, client Doer, errHandler ErrorHandler) ConcurrentRunner

NewConcurrentRunner creates a concurrent runner.

func (ConcurrentRunner) Close

func (r ConcurrentRunner) Close()

Close implements the Close method of the Runner interface.

func (ConcurrentRunner) Run

func (r ConcurrentRunner) Run(task Task) error

Run implements the Run method of the Runner interface.

type Doer

type Doer interface {
	Do(req *http.Request) (resp *http.Response, err error)
}

Doer processes an HTTP request and returns an HTTP response.
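Because Doer has the same Do signature as *http.Client, a standard client can be used directly, and small decorators can be layered on top. The sketch below shows both points with a locally declared copy of the interface. The userAgentDoer and recordingDoer types and the seenAgent helper are hypothetical and not part of getgo.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Local copy of the interface shown above.
type Doer interface {
	Do(req *http.Request) (resp *http.Response, err error)
}

// Compile-time check: *http.Client already satisfies Doer.
var _ Doer = (*http.Client)(nil)

// userAgentDoer is a hypothetical decorator that sets a User-Agent header
// before delegating to the wrapped Doer.
type userAgentDoer struct {
	next  Doer
	agent string
}

func (d userAgentDoer) Do(req *http.Request) (*http.Response, error) {
	req = req.Clone(req.Context()) // do not mutate the caller's request
	req.Header.Set("User-Agent", d.agent)
	return d.next.Do(req)
}

// recordingDoer is a test double that remembers the last User-Agent it saw
// and never touches the network.
type recordingDoer struct{ seen string }

func (r *recordingDoer) Do(req *http.Request) (*http.Response, error) {
	r.seen = req.Header.Get("User-Agent")
	return nil, errors.New("recordingDoer: no network in this sketch")
}

// seenAgent sends one request through the decorator and reports the header
// that reached the inner Doer.
func seenAgent(agent string) string {
	rec := &recordingDoer{}
	d := userAgentDoer{next: rec, agent: agent}
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return ""
	}
	d.Do(req) // the error from the test double is expected and ignored
	return rec.seen
}

func main() {
	fmt.Println(seenAgent("getgo-sketch/0.1"))
}
```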

type ErrorHandler

type ErrorHandler interface {
	HandleError(request *http.Request, err error) error
}

ErrorHandler is used to call back an external error handler when a task fails.

type ErrorHandlerFunc

type ErrorHandlerFunc func(*http.Request, error) error

ErrorHandlerFunc converts a function to an ErrorHandler.

func (ErrorHandlerFunc) HandleError

func (f ErrorHandlerFunc) HandleError(request *http.Request, err error) error

HandleError implements ErrorHandler interface.
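This is the same function-adapter pattern as http.HandlerFunc. The sketch below redeclares both types locally and assumes HandleError simply calls the function. What a runner does with the returned error is not stated here, so treating nil as "ignore this failure" is an assumption, and errNotFound, errTimeout, skipNotFound and handle are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Local copies of the declarations shown above.
type ErrorHandler interface {
	HandleError(request *http.Request, err error) error
}

type ErrorHandlerFunc func(*http.Request, error) error

// HandleError calls f itself (assumed to match the package's behaviour).
func (f ErrorHandlerFunc) HandleError(request *http.Request, err error) error {
	return f(request, err)
}

// Hypothetical sentinel errors for the demonstration.
var (
	errNotFound = errors.New("not found")
	errTimeout  = errors.New("timeout")
)

// skipNotFound returns a handler that swallows errNotFound and annotates
// every other error with the request URL.
func skipNotFound() ErrorHandler {
	return ErrorHandlerFunc(func(req *http.Request, err error) error {
		if errors.Is(err, errNotFound) {
			return nil
		}
		return fmt.Errorf("%s: %w", req.URL, err)
	})
}

// handle feeds one error to the handler using a fixed example request.
func handle(err error) error {
	req, reqErr := http.NewRequest(http.MethodGet, "http://example.invalid/page", nil)
	if reqErr != nil {
		return reqErr
	}
	return skipNotFound().HandleError(req, err)
}

func main() {
	fmt.Println(handle(errNotFound))
	fmt.Println(handle(errTimeout))
}
```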

type HTMLTask

type HTMLTask interface {
	Requester
	Handle(root *query.Node, s Storer) error
}

HTMLTask is an HTML task that should be able to parse an HTML node tree into a slice of objects.

type HTTPLogger

type HTTPLogger struct {
	// contains filtered or unexported fields
}

HTTPLogger wraps an HTTP client and logs the request and network speed.

func NewHTTPLogger

func NewHTTPLogger(client *http.Client) *HTTPLogger

NewHTTPLogger creates an HTTPLogger by inspecting the connection's Read method of an http.Client.

func (*HTTPLogger) Do

func (l *HTTPLogger) Do(req *http.Request) (resp *http.Response, err error)

Do implements the Doer interface.

type Requester

type Requester interface {
	Request() *http.Request
}

Requester is the interface that returns an HTTP request from its Request method. Implementations must allow Request to be called repeatedly.
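One way to honour the repeated-call requirement is to build a new request, including a new body reader, on every call, since a request body can only be read once. The sketch below assumes a hypothetical searchTask that posts a form to a made-up URL. The interface is redeclared locally and bodyOf is a helper for the demonstration only.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// Local copy of the interface shown above.
type Requester interface {
	Request() *http.Request
}

// searchTask is a hypothetical requester that posts a form.
type searchTask struct{ query string }

// Request returns a fresh request with a fresh body reader on each call.
func (t searchTask) Request() *http.Request {
	form := url.Values{"q": {t.query}}.Encode()
	req, err := http.NewRequest(http.MethodPost, "http://example.invalid/search", strings.NewReader(form))
	if err != nil {
		return nil
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req
}

// bodyOf drains a request body so two requests can be compared.
func bodyOf(req *http.Request) string {
	b, err := io.ReadAll(req.Body)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	var r Requester = searchTask{query: "go lang"}
	first, second := r.Request(), r.Request()
	fmt.Println(first != second, bodyOf(first) == bodyOf(second))
}
```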

type RetryDoer

type RetryDoer struct {
	Doer
	RetryTime int
}

RetryDoer wraps a Doer and implements the retry operation for Do method.

func (RetryDoer) Do

func (d RetryDoer) Do(req *http.Request) (resp *http.Response, err error)

Do implements the Doer interface.
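A retry wrapper of this shape might look like the sketch below. It is not getgo's code: the lower-case retryDoer treats RetryTime as the total number of attempts, which is an assumption, and flakyDoer and attempt are hypothetical helpers. Re-sending the same *http.Request is only safe here because the request has no body.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Local copy of the Doer interface.
type Doer interface {
	Do(req *http.Request) (resp *http.Response, err error)
}

// retryDoer mirrors the fields of RetryDoer.
// Assumption: RetryTime is the total number of attempts, with a minimum of one.
type retryDoer struct {
	Doer
	RetryTime int
}

func (d retryDoer) Do(req *http.Request) (resp *http.Response, err error) {
	attempts := d.RetryTime
	if attempts < 1 {
		attempts = 1
	}
	for i := 0; i < attempts; i++ {
		resp, err = d.Doer.Do(req)
		if err == nil {
			return resp, nil
		}
	}
	return nil, err
}

// flakyDoer fails its first `failures` calls and then succeeds.
type flakyDoer struct {
	failures int
	calls    int
}

func (f *flakyDoer) Do(req *http.Request) (*http.Response, error) {
	f.calls++
	if f.calls <= f.failures {
		return nil, errors.New("temporary failure")
	}
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody}, nil
}

// attempt reports how many calls reached the inner Doer and the final error.
func attempt(failures, retryTime int) (int, error) {
	flaky := &flakyDoer{failures: failures}
	d := retryDoer{Doer: flaky, RetryTime: retryTime}
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return 0, err
	}
	_, err = d.Do(req)
	return flaky.calls, err
}

func main() {
	calls, err := attempt(2, 3)
	fmt.Println(calls, err)
	calls, err = attempt(5, 3)
	fmt.Println(calls, err)
}
```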

type Runner

type Runner interface {
	Run(task Task) error // Run runs a task
	Close()              // Close closes the runner
}

Runner runs Tasks. A Runner gets an HTTP request from a Task, fetches the HTTP response and passes the response to the Task's Handle method. When a Runner fails to get a response, it must still pass a nil response to the Handle method to signal that any open transaction must be rolled back.
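The contract above can be sketched as a very small runner. This is not SequentialRunner: the simpleRunner type is hypothetical, the interfaces are local copies, and two stub http.RoundTripper implementations (cannedTransport and failingTransport, also hypothetical) stand in for the network so the program runs offline.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Local copies of the interfaces from this package's documentation.
type Requester interface {
	Request() *http.Request
}

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type Runner interface {
	Run(task Task) error
	Close()
}

// simpleRunner is a hypothetical runner that follows the documented contract.
type simpleRunner struct{ client *http.Client }

func (r simpleRunner) Run(task Task) error {
	resp, err := r.client.Do(task.Request())
	if err != nil {
		// Still call Handle with nil so the task can roll back.
		task.Handle(nil)
		return err
	}
	return task.Handle(resp)
}

func (r simpleRunner) Close() {}

// echoTask keeps the body it receives, or "<failed>" after a nil response.
type echoTask struct{ got string }

func (t *echoTask) Request() *http.Request {
	req, _ := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	return req
}

func (t *echoTask) Handle(resp *http.Response) error {
	if resp == nil {
		t.got = "<failed>"
		return nil
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	t.got = string(b)
	return err
}

// cannedTransport answers every request with a fixed body.
type cannedTransport struct{ body string }

func (c cannedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(c.body)),
		Request:    req,
	}, nil
}

// failingTransport simulates a network failure.
type failingTransport struct{}

func (failingTransport) RoundTrip(*http.Request) (*http.Response, error) {
	return nil, errors.New("network down")
}

// runOnce runs one echoTask through a simpleRunner that uses rt.
func runOnce(rt http.RoundTripper) (string, error) {
	var r Runner = simpleRunner{client: &http.Client{Transport: rt}}
	defer r.Close()
	t := &echoTask{}
	err := r.Run(t)
	return t.got, err
}

func main() {
	fmt.Println(runOnce(cannedTransport{body: "hello"}))
	fmt.Println(runOnce(failingTransport{}))
}
```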

type SequentialRunner

type SequentialRunner struct {
	Client Doer
	ErrorHandler
}

SequentialRunner is a simple single-threaded task runner.

func (SequentialRunner) Close

func (r SequentialRunner) Close()

Close implements the Close method of the Runner interface.

func (SequentialRunner) Run

func (r SequentialRunner) Run(task Task) error

Run implements the Run method of the Runner interface.

type Storable

type Storable struct {
	TextTask
}

Storable is an adapter that converts a TextTask to a StorableTask.

func (Storable) Handle

func (b Storable) Handle(resp *http.Response, s Storer) error

Handle implements the Handle method of StorableTask interface.
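A sketch of how such an adapter could work: pass the response body to the wrapped TextTask as an io.Reader and close it afterwards. The lower-case storable type is an assumption about the behaviour rather than getgo's implementation, and lineTask, countStorer and countLines are hypothetical helpers. The interfaces are redeclared locally.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Local copies of the interfaces from this package's documentation.
type Requester interface {
	Request() *http.Request
}

type Storer interface {
	Store(v interface{}) error
}

type TextTask interface {
	Requester
	Handle(r io.Reader, s Storer) error
}

type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}

// storable mirrors the shape of Storable: it embeds a TextTask and
// redefines Handle to accept a full response.
type storable struct {
	TextTask
}

// Handle forwards only the body to the wrapped TextTask (assumed behaviour).
func (b storable) Handle(resp *http.Response, s Storer) error {
	defer resp.Body.Close()
	return b.TextTask.Handle(resp.Body, s)
}

// lineTask is a hypothetical TextTask that stores every line of the body.
type lineTask struct{}

func (lineTask) Request() *http.Request {
	req, _ := http.NewRequest(http.MethodGet, "http://example.invalid/list.txt", nil)
	return req
}

func (lineTask) Handle(r io.Reader, s Storer) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if err := s.Store(sc.Text()); err != nil {
			return err
		}
	}
	return sc.Err()
}

// countStorer counts how many values were stored.
type countStorer struct{ n int }

func (c *countStorer) Store(v interface{}) error {
	c.n++
	return nil
}

// countLines adapts a lineTask and runs it against a canned response.
func countLines(body string) (int, error) {
	resp := &http.Response{Body: io.NopCloser(strings.NewReader(body))}
	c := &countStorer{}
	var task StorableTask = storable{TextTask: lineTask{}}
	err := task.Handle(resp, c)
	return c.n, err
}

func main() {
	fmt.Println(countLines("alpha\nbeta\ngamma"))
}
```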

type StorableTask

type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}

StorableTask is a task that should be able to store data with a Storer passed to the Handle method.

type Storer

type Storer interface {
	Store(v interface{}) error
}

Storer provides the Store method to store an object parsed from an HTTP response.

type Task

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

Task is an HTTP crawler task. It must provide an HTTP request and a method to handle an HTTP response.

func ToTask

func ToTask(t interface{}, tx Tx) Task

ToTask adapts an HTMLTask, TextTask or Task itself to a Task.

type TaskGroup

type TaskGroup struct {
	Tx
	// contains filtered or unexported fields
}

TaskGroup runs a group of StorableTasks as a single transaction.

func NewTaskGroup

func NewTaskGroup(tx Tx) *TaskGroup

NewTaskGroup creates a TaskGroup from a transaction object.

func (*TaskGroup) Add

func (g *TaskGroup) Add(task StorableTask)

Add adds a StorableTask to the TaskGroup.

func (*TaskGroup) Run

func (g *TaskGroup) Run(runner Runner) error

Run runs all tasks within a TaskGroup.

type Text

type Text struct {
	HTMLTask
}

Text is an adapter that converts an HTMLTask to a TextTask.

func (Text) Handle

func (t Text) Handle(r io.Reader, s Storer) error

Handle implements the Handle method of TextTask interface.

type TextTask

type TextTask interface {
	Requester
	Handle(r io.Reader, s Storer) error
}

TextTask is a task that only retrieves a Response's body.

type Tx

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}

Tx is a transaction interface that provides methods for storing objects and for committing or rolling back changes. Note that no Delete method is defined. Implementations of Tx must allow concurrent use.

Directories

Path Synopsis
db
Package db contains common interface that all implementations under db directory must satisfy.
examples
