spider

package module v0.3.1

Published: Jan 20, 2016 License: MIT Imports: 13 Imported by: 2

README

Spider

This package provides a simple yet extensible way to scrape HTML and JSON pages. It uses spiders around the web, scheduled at configurable intervals, to fetch data. It is written in Go and is MIT licensed.

You can see an example app using this package here: https://github.com/celrenheit/trending-machine

Installation

$ go get -u github.com/celrenheit/spider

Usage

package main

import (
	"fmt"
	"time"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/schedule"
)

// LionelMessiSpider scrapes Wikipedia's page for Lionel Messi.
// It is defined below in the init function.
var LionelMessiSpider spider.Spider

func main() {
	// Create a new scheduler
	scheduler := spider.NewScheduler()

	// Register the spider to be scheduled every 15 seconds
	scheduler.Add(schedule.Every(15*time.Second), LionelMessiSpider)
	// Alternatively, you can choose a cron schedule
	// This will run every minute of every day
	scheduler.Add(schedule.Cron("* * * * *"), LionelMessiSpider)

	// Start the scheduler
	scheduler.Start()

	// Wait 65 seconds to give the scheduled requests time to complete.
	// This depends on your internet connection.
	<-time.After(65 * time.Second)
}

func init() {
	LionelMessiSpider = spider.Get("https://en.wikipedia.org/wiki/Lionel_Messi", func(ctx *spider.Context) error {
		fmt.Println(time.Now())
		// Execute the request
		if _, err := ctx.DoRequest(); err != nil {
			return err
		}

		// Get goquery's html parser
		htmlparser, err := ctx.HTMLParser()
		if err != nil {
			return err
		}
		// Get the first paragraph of the wikipedia page
		summary := htmlparser.Find("#mw-content-text > p").First().Text()

		fmt.Println(summary)
		return nil
	})
}

To create your own spiders, you have to implement the spider.Spider interface. It has two functions, Setup and Spin.

Setup receives a Context and returns a new Context, along with an error if something went wrong. Usually, this is where you create a new http client and http request.

Spin receives a Context, does its work and returns an error if necessary. This is where you do the actual work (execute a request, handle the response, parse HTML or JSON, etc.). It should return an error if something didn't happen correctly.

Documentation

The documentation is hosted on GoDoc.

Examples

$ cd $GOPATH/src/github.com/celrenheit/spider/examples
$ go run wiki.go

Contributing

Contributions are welcome! Feel free to submit a pull request. Improving the documentation and examples is a good place to start. You can also provide spiders and better schedulers.

If you have developed your own spiders or schedulers, I will be pleased to review your code and eventually merge it into the project.

License

MIT License

Inspiration

Dkron, for the new in-memory scheduler (as of v0.3)

Documentation

Overview

Installation:

go get -u github.com/celrenheit/spider

Usage of this package revolves around spiders and the contexts passed between them.

ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)

If you have many spiders you can make use of a scheduler. This package provides a basic scheduler.

scheduler := spider.NewScheduler()

scheduler.Add(schedule.Every(20 * time.Second), spider1)

scheduler.Add(schedule.Every(20 * time.Second),spider2)

scheduler.Start()

This will launch both spiders every 20 seconds.

You can create your own spider by implementing the Spider interface:

package main

import (
	"fmt"

	"github.com/celrenheit/spider"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	ctx, _ := wikiSpider.Setup(nil)
	wikiSpider.Spin(ctx)
}

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spider.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Index

Constants

This section is empty.

Variables

View Source
var (
	ErrNoClient  = errors.New("No client has been set")
	ErrNoRequest = errors.New("No request has been set")
)

Functions

func Add added in v0.3.0

func Add(sched Schedule, spider Spider)

Add adds a spider to the standard scheduler

func AddFunc added in v0.3.0

func AddFunc(sched Schedule, url string, fn func(*Context) error)

AddFunc adds a spider to the standard scheduler using a URL and a closure.

func Delete added in v0.3.0

func Delete(url string, fn spinFunc) *spiderFunc

Delete returns a new DELETE HTTP Spider.

func Get added in v0.3.0

func Get(url string, fn spinFunc) *spiderFunc

Get returns a new GET HTTP Spider.

func NewHTTPSpider added in v0.3.0

func NewHTTPSpider(method, url string, body io.Reader, fn spinFunc) *spiderFunc

NewHTTPSpider creates a new spider from the given HTTP method, URL and body. The last argument is a closure that does the actual work.

func NewKVStore

func NewKVStore() *store

NewKVStore returns a new store.

func Post added in v0.3.0

func Post(url string, body io.Reader, fn spinFunc) *spiderFunc

Post returns a new POST HTTP Spider.

func Put added in v0.3.0

func Put(url string, body io.Reader, fn spinFunc) *spiderFunc

Put returns a new PUT HTTP Spider.

func Start added in v0.3.0

func Start()

Start starts the standard scheduler

func Stop added in v0.3.0

func Stop()

Stop stops the standard scheduler

Types

type BackoffCondition

type BackoffCondition func(*http.Response) error

func ErrorIfStatusCodeIsNot

func ErrorIfStatusCodeIsNot(status int) BackoffCondition

type Context

type Context struct {
	Client *http.Client

	Parent   *Context
	Children []*Context
	// contains filtered or unexported fields
}

Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.

func NewContext

func NewContext() *Context

NewContext returns a new Context.

func NewHTTPContext added in v0.3.0

func NewHTTPContext(method, url string, body io.Reader) (*Context, error)

NewHTTPContext returns a new Context.

It creates a new http.Client and a new http.Request with the provided arguments.

func (*Context) Cookies

func (c *Context) Cookies() []*http.Cookie

Cookies returns a list of cookies for the given request URL.

func (*Context) DoRequest

func (c *Context) DoRequest() (*http.Response, error)

DoRequest makes an http request using the http.Client and http.Request associated with this context.

This will store the response in this context. To access the response you should do:

ctx.Response() // to get the http.Response

func (*Context) DoRequestWithExponentialBackOff

func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)

DoRequestWithExponentialBackOff makes an http request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to know more about backoff. If no BackOff is provided it will use the default exponential BackOff configuration. See also ErrorIfStatusCodeIsNot function that provides a basic condition based on status code.

func (*Context) ExtendWithRequest

func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context

ExtendWithRequest returns a new Context, a child of the provided context, associated with the provided http.Request.

func (*Context) Get

func (c *Context) Get(key string) interface{}

Get returns a value from this context.

func (*Context) HTMLParser

func (c *Context) HTMLParser() (*goquery.Document, error)

HTMLParser returns an HTML parser.

It uses PuerkitoBio's awesome goquery package, which can be found at https://github.com/PuerkitoBio/goquery.

func (*Context) JSONParser

func (c *Context) JSONParser() (*simplejson.Json, error)

JSONParser returns a JSON parser.

It uses Bitly's go-simplejson package, which can be found at https://github.com/bitly/go-simplejson.

func (*Context) NewClient

func (c *Context) NewClient() (*http.Client, error)

NewClient creates a new http.Client.

func (*Context) NewCookieJar

func (c *Context) NewCookieJar() (*cookiejar.Jar, error)

NewCookieJar creates a new *cookiejar.Jar.

func (*Context) RAWContent

func (c *Context) RAWContent() ([]byte, error)

RAWContent returns the raw data of the response's body.

func (*Context) Request

func (c *Context) Request() *http.Request

Request returns the http.Request associated with this context.

func (*Context) ResetClient

func (c *Context) ResetClient() (*http.Client, error)

ResetClient creates a new http.Client and replaces the existing one if there is one.

func (*Context) ResetCookies

func (c *Context) ResetCookies() error

ResetCookies creates a new cookie jar.

Note: all previously stored cookies will be deleted.

func (*Context) Response

func (c *Context) Response() *http.Response

Response returns an http.Response

func (*Context) Set

func (c *Context) Set(key string, value interface{})

Set stores a value in this context.

func (*Context) SetParent

func (c *Context) SetParent(parent *Context)

SetParent sets a parent context on the current context. It also adds the current context to the parent context's list of children.

func (*Context) SetRequest

func (c *Context) SetRequest(req *http.Request)

SetRequest sets an http.Request.

func (*Context) SetResponse

func (c *Context) SetResponse(res *http.Response)

SetResponse sets an http.Response.

type Entries added in v0.3.0

type Entries []*Entry

Entries is a collection of Entry. Sortable by time.

func (Entries) Len added in v0.3.0

func (e Entries) Len() int

func (Entries) Less added in v0.3.0

func (e Entries) Less(i, j int) bool

func (Entries) Swap added in v0.3.0

func (e Entries) Swap(i, j int)

type Entry added in v0.3.0

type Entry struct {
	Spider   Spider
	Schedule Schedule
	Ctx      *Context
	Next     time.Time
}

Entry groups a spider, its root context, a Schedule and the Next time the spider must be launched.

type InMemory added in v0.3.0

type InMemory struct {
	// contains filtered or unexported fields
}

InMemory is the default scheduler

func NewScheduler added in v0.3.0

func NewScheduler() *InMemory

NewScheduler returns a new InMemory scheduler

func (*InMemory) Add added in v0.3.0

func (in *InMemory) Add(sched Schedule, spider Spider)

Add adds a spider using a nil root Context

func (*InMemory) AddFunc added in v0.3.0

func (in *InMemory) AddFunc(sched Schedule, url string, fn func(*Context) error)

AddFunc adds a spider using a URL and a closure. By default, it uses the GET HTTP method.

func (*InMemory) AddWithCtx added in v0.3.0

func (in *InMemory) AddWithCtx(sched Schedule, spider Spider, ctx *Context)

AddWithCtx adds a spider with a root Context passed in the arguments

func (*InMemory) Start added in v0.3.0

func (in *InMemory) Start()

Start launches the scheduler. It runs in its own goroutine, so your code will continue to execute after calling this function.

func (*InMemory) Stop added in v0.3.0

func (in *InMemory) Stop()

Stop stops the scheduler. It should be called after Start.

type Schedule added in v0.2.0

type Schedule interface {
	Next(time.Time) time.Time
}

Schedule is an interface with a single Next method. Next returns the next time the spider should run, given the current time as a parameter.

type Spider

type Spider interface {
	Setup(*Context) (*Context, error)
	Spin(*Context) error
}

Spider is an interface with two methods. It is the primary element of the package.

