spider: github.com/celrenheit/spider

package spider

import "github.com/celrenheit/spider"

Installation:

go get -u github.com/celrenheit/spider

Using this package revolves around spiders and the contexts passed to them.

ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)

If you have many spiders, you can use a scheduler. This package provides a basic one.

scheduler := spider.NewScheduler()

scheduler.Add(schedule.Every(20 * time.Second), spider1)

scheduler.Add(schedule.Every(10 * time.Second), spider2)

scheduler.Start()

This launches two spiders: the first every 20 seconds and the second every 10 seconds.

You can create your own spider by implementing the Spider interface:

package main

import (
	"fmt"

	"github.com/celrenheit/spider"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	ctx, err := wikiSpider.Setup(nil)
	if err != nil {
		fmt.Println(err)
		return
	}
	if err := wikiSpider.Spin(ctx); err != nil {
		fmt.Println(err)
	}
}

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spider.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Package Files

context.go doc.go http_spiders.go inmemory.go spider.go

Variables

var (
    ErrNoClient  = errors.New("No request has been set")
    ErrNoRequest = errors.New("No request has been set")
)

func Add

func Add(sched Schedule, spider Spider)

Add adds a spider to the standard scheduler

func AddFunc

func AddFunc(sched Schedule, url string, fn func(*Context) error)

AddFunc adds a spider to the standard scheduler using a URL and a closure.

func Delete

func Delete(url string, fn spinFunc) *spiderFunc

Delete returns a new DELETE HTTP Spider.

func Get

func Get(url string, fn spinFunc) *spiderFunc

Get returns a new GET HTTP Spider.

func NewHTTPSpider

func NewHTTPSpider(method, url string, body io.Reader, fn spinFunc) *spiderFunc

NewHTTPSpider creates a new spider from the given HTTP method, URL and body. The last argument is a closure that does the actual work.

func NewKVStore

func NewKVStore() *store

NewKVStore returns a new store.

func Post

func Post(url string, body io.Reader, fn spinFunc) *spiderFunc

Post returns a new POST HTTP Spider.

func Put

func Put(url string, body io.Reader, fn spinFunc) *spiderFunc

Put returns a new PUT HTTP Spider.

func Start

func Start()

Start starts the standard scheduler

func Stop

func Stop()

Stop stops the standard scheduler

type BackoffCondition

type BackoffCondition func(*http.Response) error

func ErrorIfStatusCodeIsNot

func ErrorIfStatusCodeIsNot(status int) BackoffCondition

type Context

type Context struct {
    Client *http.Client

    Parent   *Context
    Children []*Context
    // contains filtered or unexported fields
}

Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.

func NewContext

func NewContext() *Context

NewContext returns a new Context.

func NewHTTPContext

func NewHTTPContext(method, url string, body io.Reader) (*Context, error)

NewHTTPContext returns a new Context.

It creates a new http.Client and a new http.Request with the provided arguments.

func (*Context) Cookies

func (c *Context) Cookies() []*http.Cookie

Cookies returns a list of cookies for the given request URL.

func (*Context) DoRequest

func (c *Context) DoRequest() (*http.Response, error)

DoRequest makes an http request using the http.Client and http.Request associated with this context.

This will store the response in this context. To access the response you should do:

ctx.Response() // to get the http.Response
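
As a rough illustration of the "execute, store, then read back" pattern described above, the sketch below uses only the standard library. The miniContext type, its fields and the fetch helper are hypothetical stand-ins invented for this example, not this package's implementation, and the local test server exists only so the example runs without external network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// miniContext is a hypothetical, stripped-down stand-in for Context.
type miniContext struct {
	client   *http.Client
	request  *http.Request
	response *http.Response
}

// doRequest executes the stored request and keeps the response on the
// context so that later steps can read it.
func (c *miniContext) doRequest() (*http.Response, error) {
	res, err := c.client.Do(c.request)
	if err != nil {
		return nil, err
	}
	c.response = res
	return res, nil
}

// fetch starts a local test server, performs one request through a
// miniContext, and returns the status code and body read from the
// stored response.
func fetch() (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()

	req, err := http.NewRequest("GET", srv.URL, nil)
	if err != nil {
		return 0, "", err
	}
	ctx := &miniContext{client: srv.Client(), request: req}
	if _, err := ctx.doRequest(); err != nil {
		return 0, "", err
	}
	defer ctx.response.Body.Close()

	body, err := io.ReadAll(ctx.response.Body)
	if err != nil {
		return 0, "", err
	}
	return ctx.response.StatusCode, string(body), nil
}

func main() {
	status, body, err := fetch()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(status, body)
}
```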

func (*Context) DoRequestWithExponentialBackOff

func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)

DoRequestWithExponentialBackOff makes an HTTP request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to learn more about backoff. If no BackOff is provided, the default exponential BackOff configuration is used. See also the ErrorIfStatusCodeIsNot function, which provides a basic condition based on the status code.
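
To make the idea of a status-code condition combined with a growing delay more concrete, here is a minimal standard-library sketch. It does not use the cenkalti/backoff package and is not how this package implements retries. The errorIfStatusIsNot, retry and demoRetry names are hypothetical, and the "server" is simulated by a closure that fails twice before succeeding.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// BackoffCondition mirrors the type documented above.
type BackoffCondition func(*http.Response) error

// errorIfStatusIsNot is a hypothetical stand-in for a status-code
// condition: it reports an error unless the response has the wanted status.
func errorIfStatusIsNot(status int) BackoffCondition {
	return func(res *http.Response) error {
		if res.StatusCode != status {
			return fmt.Errorf("unexpected status code: got %d, want %d", res.StatusCode, status)
		}
		return nil
	}
}

// retry calls do until cond accepts the response or attempts run out.
// The delay doubles after every failed attempt. Closing the bodies of
// rejected responses is left out to keep the sketch short.
func retry(do func() (*http.Response, error), cond BackoffCondition, attempts int, delay time.Duration) (*http.Response, error) {
	lastErr := errors.New("retry: no attempts made")
	for i := 0; i < attempts; i++ {
		res, err := do()
		if err == nil {
			err = cond(res)
		}
		if err == nil {
			return res, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2 // exponential growth between attempts
		}
	}
	return nil, lastErr
}

// demoRetry simulates a server that answers 503 twice before answering 200.
func demoRetry() (status, calls int, err error) {
	do := func() (*http.Response, error) {
		calls++
		if calls < 3 {
			return &http.Response{StatusCode: http.StatusServiceUnavailable}, nil
		}
		return &http.Response{StatusCode: http.StatusOK}, nil
	}
	res, err := retry(do, errorIfStatusIsNot(http.StatusOK), 5, time.Millisecond)
	if err != nil {
		return 0, calls, err
	}
	return res.StatusCode, calls, nil
}

func main() {
	status, calls, err := demoRetry()
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("status", status, "after", calls, "calls")
}
```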

func (*Context) ExtendWithRequest

func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context

ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.

func (*Context) Get

func (c *Context) Get(key string) interface{}

Get returns a value from this context.

func (*Context) HTMLParser

func (c *Context) HTMLParser() (*goquery.Document, error)

HTMLParser returns an HTML parser.

It uses PuerkitoBio's awesome goquery package. It can be found at this URL: https://github.com/PuerkitoBio/goquery.

func (*Context) JSONParser

func (c *Context) JSONParser() (*simplejson.Json, error)

JSONParser returns a JSON parser.

It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson

func (*Context) NewClient

func (c *Context) NewClient() (*http.Client, error)

NewClient creates a new http.Client.

func (*Context) NewCookieJar

func (c *Context) NewCookieJar() (*cookiejar.Jar, error)

NewCookieJar creates a new *cookiejar.Jar.

func (*Context) RAWContent

func (c *Context) RAWContent() ([]byte, error)

RAWContent returns the raw data of the response's body.

func (*Context) Request

func (c *Context) Request() *http.Request

Request returns an http.Request.

func (*Context) ResetClient

func (c *Context) ResetClient() (*http.Client, error)

ResetClient creates a new http.Client and replaces the existing one, if there is one.

func (*Context) ResetCookies

func (c *Context) ResetCookies() error

ResetCookies creates a new cookie jar.

Note: all previously stored cookies will be deleted.

func (*Context) Response

func (c *Context) Response() *http.Response

Response returns an http.Response

func (*Context) Set

func (c *Context) Set(key string, value interface{})

Set stores a value in this context.

func (*Context) SetParent

func (c *Context) SetParent(parent *Context)

SetParent sets a parent context for the current context. It also adds the current context to the parent context's list of children.
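
The two-way link described above (child points to parent, parent lists child) can be sketched as follows. The node type and its setParent and depth methods are hypothetical stand-ins that carry only the Parent and Children fields, so this shows the general shape of the relationship rather than this package's code.

```go
package main

import "fmt"

// node is a hypothetical stand-in for Context that keeps only the
// fields involved in the parent/child relationship.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// setParent links n to parent in both directions: n records its parent
// and the parent records n among its children.
func (n *node) setParent(parent *node) {
	n.Parent = parent
	parent.Children = append(parent.Children, n)
}

// depth counts how many ancestors n has by walking the Parent links.
func (n *node) depth() int {
	d := 0
	for p := n.Parent; p != nil; p = p.Parent {
		d++
	}
	return d
}

func main() {
	root := &node{Name: "root"}
	child := &node{Name: "child"}
	grandchild := &node{Name: "grandchild"}

	child.setParent(root)
	grandchild.setParent(child)

	fmt.Println(len(root.Children), grandchild.depth())
}
```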

func (*Context) SetRequest

func (c *Context) SetRequest(req *http.Request)

SetRequest sets an http.Request.

func (*Context) SetResponse

func (c *Context) SetResponse(res *http.Response)

SetResponse sets an http.Response.

type Entries

type Entries []*Entry

Entries is a collection of Entry. Sortable by time.

func (Entries) Len

func (e Entries) Len() int

func (Entries) Less

func (e Entries) Less(i, j int) bool

func (Entries) Swap

func (e Entries) Swap(i, j int)
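
Len, Less and Swap together satisfy sort.Interface, which is what makes a collection "sortable by time". The sketch below shows one plausible way to do that. The entry and entries types are simplified stand-ins, and ordering by the Next field, earliest first, is an assumption about the comparison rather than something the documentation above states.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// entry is a simplified stand-in for Entry with only the fields used here.
type entry struct {
	Name string
	Next time.Time
}

// entries implements sort.Interface, ordering by Next (earliest first).
// That ordering is an assumption made for this sketch.
type entries []*entry

func (e entries) Len() int           { return len(e) }
func (e entries) Less(i, j int) bool { return e[i].Next.Before(e[j].Next) }
func (e entries) Swap(i, j int)      { e[i], e[j] = e[j], e[i] }

// runOrder builds three entries with different due times and returns
// their names in the order they would come due.
func runOrder() []string {
	base := time.Date(2017, time.December, 31, 12, 0, 0, 0, time.UTC)
	es := entries{
		{Name: "slow", Next: base.Add(30 * time.Second)},
		{Name: "fast", Next: base.Add(10 * time.Second)},
		{Name: "medium", Next: base.Add(20 * time.Second)},
	}
	sort.Sort(es)

	names := make([]string, 0, len(es))
	for _, e := range es {
		names = append(names, e.Name)
	}
	return names
}

func main() {
	fmt.Println(runOrder())
}
```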

type Entry

type Entry struct {
    Spider   Spider
    Schedule Schedule
    Ctx      *Context
    Next     time.Time
}

Entry groups a spider, its root context, a Schedule and the Next time the spider must be launched

type InMemory

type InMemory struct {
    // contains filtered or unexported fields
}

InMemory is the default scheduler

func NewScheduler

func NewScheduler() *InMemory

NewScheduler returns a new InMemory scheduler

func (*InMemory) Add

func (in *InMemory) Add(sched Schedule, spider Spider)

Add adds a spider using a nil root Context

func (*InMemory) AddFunc

func (in *InMemory) AddFunc(sched Schedule, url string, fn func(*Context) error)

AddFunc adds a spider using a URL and a closure. It uses the GET HTTP method by default.

func (*InMemory) AddWithCtx

func (in *InMemory) AddWithCtx(sched Schedule, spider Spider, ctx *Context)

AddWithCtx adds a spider with a root Context passed in the arguments

func (*InMemory) Start

func (in *InMemory) Start()

Start launches the scheduler. It runs in its own goroutine, so your code continues to execute after calling this function.
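
Because Start returns immediately, the caller has to keep the program alive while the scheduler works and call Stop when done. The sketch below shows that non-blocking pattern with a hypothetical runner type built on the standard library. It is not the InMemory scheduler, and the names newRunner and countRuns are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runner is a hypothetical stand-in for a scheduler that runs one job
// at a fixed interval in its own goroutine.
type runner struct {
	interval time.Duration
	job      func()
	stop     chan struct{}
	done     chan struct{}
}

func newRunner(interval time.Duration, job func()) *runner {
	return &runner{
		interval: interval,
		job:      job,
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start launches the loop in a goroutine and returns immediately.
func (r *runner) Start() {
	go func() {
		defer close(r.done)
		t := time.NewTicker(r.interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				r.job()
			case <-r.stop:
				return
			}
		}
	}()
}

// Stop signals the loop to exit and waits until it has done so.
func (r *runner) Stop() {
	close(r.stop)
	<-r.done
}

// countRuns starts a runner, keeps the calling goroutine busy for a
// short while, stops the runner, and reports how often the job ran.
func countRuns() int {
	var mu sync.Mutex
	count := 0
	r := newRunner(time.Millisecond, func() {
		mu.Lock()
		count++
		mu.Unlock()
	})

	r.Start()                         // returns right away
	time.Sleep(50 * time.Millisecond) // the caller is free to do other work here
	r.Stop()

	mu.Lock()
	defer mu.Unlock()
	return count
}

func main() {
	fmt.Println("job ran", countRuns(), "times")
}
```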

func (*InMemory) Stop

func (in *InMemory) Stop()

Stop stops the scheduler. It should be called after Start.

type Schedule

type Schedule interface {
    Next(time.Time) time.Time
}

Schedule is an interface with only a Next method. Next will return the next time it should run given the current time as a parameter.
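
Since Schedule has a single method, a custom schedule is small to write. The sketch below redeclares the interface locally so that it compiles on its own. The fixedInterval type and the everySeconds, referenceTime and secondsUntilNext helpers are hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule mirrors the interface documented above.
type Schedule interface {
	Next(time.Time) time.Time
}

// fixedInterval is a hypothetical Schedule that fires at a constant interval.
type fixedInterval struct {
	every time.Duration
}

// Next returns the time one interval after the given time.
func (f fixedInterval) Next(now time.Time) time.Time {
	return now.Add(f.every)
}

// everySeconds builds a fixedInterval from a number of seconds.
func everySeconds(n int) Schedule {
	return fixedInterval{every: time.Duration(n) * time.Second}
}

// referenceTime is an arbitrary fixed instant that keeps the example
// deterministic.
func referenceTime() time.Time {
	return time.Date(2017, time.December, 31, 12, 0, 0, 0, time.UTC)
}

// secondsUntilNext reports how many whole seconds separate now from the
// schedule's next run.
func secondsUntilNext(s Schedule, now time.Time) int {
	return int(s.Next(now).Sub(now) / time.Second)
}

func main() {
	s := everySeconds(20)
	now := referenceTime()
	fmt.Println(s.Next(now).Format(time.RFC3339))
	fmt.Println(secondsUntilNext(s, now))
}
```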

type Spider

type Spider interface {
    Setup(*Context) (*Context, error)
    Spin(*Context) error
}

Spider is an interface with two methods. It is the primary element of the package
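
The two methods are meant to be called in sequence: Setup produces a context and Spin does the work with it. The sketch below walks through that sequence without any network access. Context here is a minimal local stand-in (a key-value bag), Spider is redeclared so the example compiles on its own, and the greeter type and run helper are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Context is a minimal local stand-in for the package's Context: it
// only stores key-value pairs.
type Context struct {
	values map[string]interface{}
}

func (c *Context) Set(key string, value interface{}) { c.values[key] = value }
func (c *Context) Get(key string) interface{}        { return c.values[key] }

// Spider mirrors the interface documented above.
type Spider interface {
	Setup(*Context) (*Context, error)
	Spin(*Context) error
}

// greeter is a hypothetical spider that performs no network I/O.
type greeter struct {
	Name   string
	Output string
}

// Setup reuses the given context, or creates one when it is nil, and
// stores the name in it for Spin to pick up.
func (g *greeter) Setup(ctx *Context) (*Context, error) {
	if g.Name == "" {
		return nil, errors.New("greeter: empty name")
	}
	if ctx == nil {
		ctx = &Context{values: make(map[string]interface{})}
	}
	ctx.Set("name", g.Name)
	return ctx, nil
}

// Spin does the actual work using what Setup stored in the context.
func (g *greeter) Spin(ctx *Context) error {
	name, ok := ctx.Get("name").(string)
	if !ok {
		return errors.New("greeter: name missing from context")
	}
	g.Output = "hello, " + name
	return nil
}

// run drives a spider through Setup and then Spin, stopping at the
// first error.
func run(s Spider, root *Context) error {
	ctx, err := s.Setup(root)
	if err != nil {
		return err
	}
	return s.Spin(ctx)
}

func main() {
	g := &greeter{Name: "spider"}
	if err := run(g, nil); err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Println(g.Output)
}
```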

Directories

Path      Synopsis
examples
schedule

Package spider imports 13 packages and is imported by 4 packages. Updated 2017-12-31.