spider

package module
v0.0.0-...-7050784 Latest
Published: Jan 25, 2015 License: MIT Imports: 7 Imported by: 0

README

go-spider

A flexible spider that also works as a general-purpose task runner.

Go Package Dependencies

See Godeps/Godeps.json

Usage

Workflow
package main

import (
    // Import the main package
    "github.com/ddliu/go-spider"

    // Import a lot of useful pipes
    "github.com/ddliu/go-spider/pipes"
)

func main() {
    // Create a spider
    s := spider.NewSpider()

    // Config it
    s.Concurrency = 3

    // Combine pipes together
    s.
        Pipe(pipes.PipeA).
        Pipe(pipes.PipeB)

    // Let's go!
    s.Run()
}
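
PipeA and PipeB above are placeholders. Going by the documented signature, type Pipe func(spider *Spider, task *Task), a custom pipe is just a function of that shape. The following is a minimal sketch under that assumption. Spider and Task here are local stand-ins so the example compiles without the library, and NormalizeUri, TagScheme and process are hypothetical names, not part of go-spider.

```go
package main

import (
	"fmt"
	"strings"
)

// Task and Spider are local stand-ins that mirror only what this sketch
// needs; they are not the library's types.
type Task struct {
	Uri  string
	Data map[string]interface{}
}

type Spider struct {
	pipes []Pipe
}

// Pipe has the same shape as the documented spider.Pipe type.
type Pipe func(spider *Spider, task *Task)

// Pipe appends a stage and returns the receiver so calls can be chained.
func (s *Spider) Pipe(p Pipe) *Spider {
	s.pipes = append(s.pipes, p)
	return s
}

// process runs every registered pipe on the task, in order.
// Hypothetical helper standing in for the spider's run loop.
func (s *Spider) process(task *Task) {
	for _, p := range s.pipes {
		p(s, task)
	}
}

// NormalizeUri is a hypothetical pipe that trims and lowercases the URI.
func NormalizeUri(_ *Spider, task *Task) {
	task.Uri = strings.ToLower(strings.TrimSpace(task.Uri))
}

// TagScheme is a hypothetical pipe that records the URI scheme in Data.
func TagScheme(_ *Spider, task *Task) {
	if i := strings.Index(task.Uri, "://"); i > 0 {
		task.Data["scheme"] = task.Uri[:i]
	}
}

func main() {
	s := &Spider{}
	s.Pipe(NormalizeUri).Pipe(TagScheme)

	task := &Task{Uri: "  HTTP://Example.com/ ", Data: map[string]interface{}{}}
	s.process(task)
	fmt.Println(task.Uri, task.Data["scheme"])
}
```

Because each pipe receives the same *Task, an earlier stage can leave values in Data for a later stage to read.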

RPC Server

There is a built-in RPC server, which makes it easy to integrate with other systems.

s := spider.NewSpider()

// setup
// ...

s.RunAndServe("127.0.0.1:1234")
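
The documented RPC methods, such as Pong(skip bool, message *string) error, have the (args, *reply) error shape that Go's standard net/rpc package expects. The README does not say which transport is used, so treat the following as a sketch of the general pattern and not as the library's implementation. Control, serve and ping are hypothetical names, and the "pong" reply text is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Control is a hypothetical stand-in service. Its method follows the same
// (args, *reply) error shape as the documented RPC methods.
type Control struct{}

// Pong answers a ping. The reply text is an assumption for this sketch.
func (c *Control) Pong(skip bool, message *string) error {
	*message = "pong"
	return nil
}

// serve registers the service and accepts connections in the background.
// It returns the bound address, so "127.0.0.1:0" can be used to pick a free port.
func serve(listen string) (string, error) {
	srv := rpc.NewServer()
	if err := srv.Register(&Control{}); err != nil {
		return "", err
	}
	ln, err := net.Listen("tcp", listen)
	if err != nil {
		return "", err
	}
	go srv.Accept(ln)
	return ln.Addr().String(), nil
}

// ping dials the server and calls Control.Pong once.
func ping(addr string) (string, error) {
	client, err := rpc.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer client.Close()

	var reply string
	err = client.Call("Control.Pong", true, &reply)
	return reply, err
}

func main() {
	addr, err := serve("127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	reply, err := ping(addr)
	fmt.Println(reply, err)
}
```

The unused bool argument mirrors the skip parameter in the documented signatures: net/rpc methods always take exactly one argument value, even when the call needs no input.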

Client

Once the spider and its RPC server are running, you can control them with the CLI client.

Install
go get github.com/ddliu/go-spider/gospider
Usage
$ gospider

NAME:
   gospider - The go-spider client (https://github.com/ddliu/go-spider)

USAGE:
   gospider [global options] command [command options] [arguments...]

VERSION:
   0.1.0

AUTHOR:
  Liu Dong - <ddliuhb@gmail.com>

COMMANDS:
   watch    Keep watching the spider
   info     Show spider info
   add      Add tasks
   pause    Pause the spider
   resume   Resume the spider
   stop     Stop the spider
   ping     Ping the RPC service
   help, h  Shows a list of commands or help for one command
   
GLOBAL OPTIONS:
   --server, -s     Server IP and port to connect(127.0.0.1:1234) [$GO_SPIDER_SERVER]
   --help, -h       show help
   --version, -v    print the version
The Watch Example
$ export GO_SPIDER_SERVER=127.0.0.1:1234
$ gospider watch
Status: Running , time: 15s memory: 264KB
>>>>>>>>........................................................................
Total: 100, pending: 86, working: 3
Done: 7, failed: 4, ignored: 0

Examples

See examples/downloader folder:

cd examples/downloader
mkdir download
go run main.go -depth=3 -max=100 -follow=http://tooling.github.io/book-of-modern-frontend-tooling/* -target=./download/ http://tooling.github.io/book-of-modern-frontend-tooling/

Documentation

Constants

const (
	ON_START = iota
	ON_STOP  = iota
)

Functions

func StartRPCServer

func StartRPCServer(spider *Spider, listen string) error

Types

type Data

type Data map[string]interface{}

func (Data) GetBytes

func (this Data) GetBytes(key string) ([]byte, bool)

func (Data) GetInt

func (this Data) GetInt(key string) (int, bool)

func (Data) GetString

func (this Data) GetString(key string) (string, bool)

func (Data) Has

func (this Data) Has(key string) bool

func (Data) MustGet

func (this Data) MustGet(key string) interface{}

func (Data) MustGetBytes

func (this Data) MustGetBytes(key string) []byte

func (Data) MustGetInt

func (this Data) MustGetInt(key string) int

func (Data) MustGetString

func (this Data) MustGetString(key string) string
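
The getters come in two flavours: Get* returns a value plus a bool, and MustGet* returns only the value. The page does not document what the bool means or what MustGet* does on a miss. The sketch below assumes the bool reports "key present with the expected type" and that MustGet* panics otherwise. It re-declares Data locally and is not the library's source.

```go
package main

import "fmt"

// Data mirrors the documented type. The methods below are a guess at the
// behaviour, written so the example runs without the library.
type Data map[string]interface{}

// GetString returns the value and true only if key holds a string.
func (d Data) GetString(key string) (string, bool) {
	v, ok := d[key].(string)
	return v, ok
}

// GetInt returns the value and true only if key holds an int.
func (d Data) GetInt(key string) (int, bool) {
	v, ok := d[key].(int)
	return v, ok
}

// MustGetString panics when key is missing or not a string (assumed behaviour).
func (d Data) MustGetString(key string) string {
	v, ok := d.GetString(key)
	if !ok {
		panic(fmt.Sprintf("data: %q is missing or not a string", key))
	}
	return v
}

func main() {
	d := Data{"title": "home", "size": 42}

	title, ok := d.GetString("title")
	fmt.Println(title, ok)

	_, ok = d.GetString("size") // wrong type, so ok is false
	fmt.Println(ok)

	fmt.Println(d.MustGetString("title"))
}
```

The comma-ok form suits optional values; a Must* variant is more convenient when an earlier pipe is known to have set the key.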

type Listener

type Listener func(*Spider, *Task)

type Pipe

type Pipe func(spider *Spider, task *Task)

The pipe type: a function that receives the spider and the task to process.

func Parallel

func Parallel(pipes ...Pipe) Pipe

Run pipes in parallel

func Series

func Series(pipes ...Pipe) Pipe

Run pipes in series
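
Both combinators take pipes and return a single Pipe, so they can be nested. A minimal sketch of how such combinators might be written follows. Spider and Task are local stand-ins, countStep is a hypothetical pipe, and whether the library's Parallel waits for all pipes before returning is an assumption made here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Spider and Task are local stand-ins, not the library's types.
type Spider struct{}

type Task struct {
	Uri   string
	Steps int64
}

// Pipe has the same shape as the documented spider.Pipe type.
type Pipe func(spider *Spider, task *Task)

// Series returns a pipe that runs the given pipes one after another.
func Series(pipes ...Pipe) Pipe {
	return func(s *Spider, task *Task) {
		for _, p := range pipes {
			p(s, task)
		}
	}
}

// Parallel returns a pipe that runs the given pipes concurrently and
// returns once all of them have finished (assumed behaviour).
func Parallel(pipes ...Pipe) Pipe {
	return func(s *Spider, task *Task) {
		var wg sync.WaitGroup
		for _, p := range pipes {
			wg.Add(1)
			go func(p Pipe) {
				defer wg.Done()
				p(s, task)
			}(p)
		}
		wg.Wait()
	}
}

// countStep is a hypothetical pipe that atomically counts invocations.
func countStep(_ *Spider, task *Task) {
	atomic.AddInt64(&task.Steps, 1)
}

func main() {
	p := Series(countStep, Parallel(countStep, countStep), countStep)
	task := &Task{Uri: "http://example.com/"}
	p(&Spider{}, task)
	fmt.Println(task.Steps)
}
```

Pipes passed to a parallel combinator share one *Task, so any field they write concurrently needs synchronisation, which is why the counter uses sync/atomic.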

type RPC

type RPC struct {
	// contains filtered or unexported fields
}

func NewRPC

func NewRPC(spider *Spider) *RPC

func (*RPC) Add

func (rpc *RPC) Add(uriList []string, ack *bool) error

func (*RPC) Info

func (rpc *RPC) Info(skip bool, info *SpiderInfo) error

func (*RPC) Pause

func (rpc *RPC) Pause(skip bool, ack *bool) error

func (*RPC) Pong

func (rpc *RPC) Pong(skip bool, message *string) error

func (*RPC) Resume

func (rpc *RPC) Resume(skip bool, ack *bool) error

func (*RPC) Stop

func (rpc *RPC) Stop(skip bool, ack *bool) error

type RPCClient

type RPCClient struct {
	// contains filtered or unexported fields
}

func NewRPCClient

func NewRPCClient(dsn string, timeout time.Duration) (*RPCClient, error)

func (*RPCClient) Add

func (client *RPCClient) Add(uriList ...string) error

func (*RPCClient) Info

func (client *RPCClient) Info() (SpiderInfo, error)

func (*RPCClient) Pause

func (client *RPCClient) Pause() error

func (*RPCClient) Ping

func (client *RPCClient) Ping() error

func (*RPCClient) Resume

func (client *RPCClient) Resume() error

func (*RPCClient) Stop

func (client *RPCClient) Stop() error

type Spider

type Spider struct {
	Concurrency int

	Stats     map[Status]uint64
	IsPaused  bool
	IsStopped bool
	IsDebug   bool
	// contains filtered or unexported fields
}

func NewSpider

func NewSpider() *Spider

Create a spider.

func (*Spider) AddTask

func (this *Spider) AddTask(task *Task) *Spider

Add a task to the queue.

func (*Spider) AddUri

func (this *Spider) AddUri(uris ...string) *Spider

Add tasks from URIs.

func (*Spider) DoneTask

func (this *Spider) DoneTask(task *Task)

Mark a task as done.

func (*Spider) FailTask

func (this *Spider) FailTask(task *Task, reason interface{})

Mark a task as failed.

func (*Spider) IgnoreTask

func (this *Spider) IgnoreTask(task *Task, reason interface{})

Mark a task as ignored.

func (*Spider) IsFinished

func (this *Spider) IsFinished() bool

Check if all tasks have been processed.

func (*Spider) On

func (this *Spider) On(e int, f Listener) *Spider

Register a listener for an event.

func (*Spider) Pause

func (this *Spider) Pause()

func (*Spider) Pipe

func (this *Spider) Pipe(pipe Pipe) *Spider

Chain a pipe.

func (*Spider) Resume

func (this *Spider) Resume()

func (*Spider) Run

func (this *Spider) Run()

Run spider and stop when complete.

func (*Spider) RunAndServe

func (this *Spider) RunAndServe(listen string) error

Run the spider and start an RPC server.

func (*Spider) RunForever

func (this *Spider) RunForever(quit chan bool)

Run spider forever, and accept a quit channel to close it.

Loop through the task list and run each of them with the help of a buffered channel.
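
The "buffered channel" technique mentioned above is commonly used as a counting semaphore: the channel's capacity caps how many tasks run at once, which would correspond to the Concurrency field. The sketch below shows that general pattern only. runAll and its parameters are hypothetical, and the library's actual loop may differ.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runAll calls work for each URI with at most concurrency calls in flight.
// It returns how many calls completed and the highest number of
// simultaneous calls observed.
func runAll(uris []string, concurrency int, work func(uri string)) (done, peak int64) {
	slots := make(chan struct{}, concurrency) // buffered channel as semaphore
	var wg sync.WaitGroup
	var inFlight int64

	for _, uri := range uris {
		slots <- struct{}{} // blocks while all slots are taken
		wg.Add(1)
		go func(uri string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot when finished

			// Track the peak number of concurrent workers.
			n := atomic.AddInt64(&inFlight, 1)
			for {
				old := atomic.LoadInt64(&peak)
				if n <= old || atomic.CompareAndSwapInt64(&peak, old, n) {
					break
				}
			}

			work(uri)
			atomic.AddInt64(&done, 1)
			atomic.AddInt64(&inFlight, -1)
		}(uri)
	}
	wg.Wait()
	return atomic.LoadInt64(&done), atomic.LoadInt64(&peak)
}

func main() {
	uris := []string{"a", "b", "c", "d", "e", "f", "g"}
	done, peak := runAll(uris, 3, func(uri string) {})
	fmt.Println("done:", done, "peak within limit:", peak <= 3)
}
```

Sending into the channel before starting the goroutine means the loop itself stalls when the limit is reached, so no more than concurrency goroutines exist at any moment.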

func (*Spider) StartTask

func (this *Spider) StartTask(task *Task)

Mark a task as started.

func (*Spider) Stop

func (this *Spider) Stop()

func (*Spider) Trigger

func (this *Spider) Trigger(e int, t *Task)

Trigger an event
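
On and Trigger, together with the ON_START and ON_STOP constants and the Listener type, suggest a simple listener registry keyed by event number. The sketch below is one way such a registry could work. The types are local stand-ins, and calling listeners in registration order is an assumption. When the library itself fires each event is not documented on this page.

```go
package main

import "fmt"

// Both constants use iota, so ON_START is 0 and ON_STOP is 1.
const (
	ON_START = iota
	ON_STOP  = iota
)

// Task and Spider are local stand-ins, not the library's types.
type Task struct{ Uri string }

// Listener has the same shape as the documented spider.Listener type.
type Listener func(*Spider, *Task)

type Spider struct {
	listeners map[int][]Listener
}

// On registers f for event e and returns the spider for chaining.
func (s *Spider) On(e int, f Listener) *Spider {
	if s.listeners == nil {
		s.listeners = make(map[int][]Listener)
	}
	s.listeners[e] = append(s.listeners[e], f)
	return s
}

// Trigger calls every listener registered for e, in registration order.
func (s *Spider) Trigger(e int, task *Task) {
	for _, f := range s.listeners[e] {
		f(s, task)
	}
}

func main() {
	var seen []string
	s := &Spider{}
	s.On(ON_START, func(_ *Spider, _ *Task) { seen = append(seen, "start") }).
		On(ON_STOP, func(_ *Spider, _ *Task) { seen = append(seen, "stop") })

	s.Trigger(ON_START, nil)
	s.Trigger(ON_STOP, nil)
	fmt.Println(seen)
}
```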

type SpiderInfo

type SpiderInfo struct {
	StartTime   time.Time
	MemoryUsage uint64
	Stats       map[Status]uint64
	IsStopped   bool
	IsPaused    bool
}

type Status

type Status int
const (
	PENDING Status = iota
	WORKING
	FAILED
	IGNORED
	DONE
)
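
Spider.Stats and SpiderInfo.Stats are both keyed by Status, which is enough to rebuild the counters shown in the watch example. The sketch below assumes the total is the sum of all per-status counters, which matches the sample output (86 + 3 + 7 + 4 + 0 = 100). summary is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Status and its constants are re-declared here as documented.
type Status int

const (
	PENDING Status = iota
	WORKING
	FAILED
	IGNORED
	DONE
)

// summary formats the counters in the style of the watch output.
// The total is assumed to be the sum of all per-status counters.
func summary(stats map[Status]uint64) string {
	var total uint64
	for _, n := range stats {
		total += n
	}
	return fmt.Sprintf(
		"Total: %d, pending: %d, working: %d\nDone: %d, failed: %d, ignored: %d",
		total, stats[PENDING], stats[WORKING],
		stats[DONE], stats[FAILED], stats[IGNORED],
	)
}

func main() {
	stats := map[Status]uint64{PENDING: 86, WORKING: 3, FAILED: 4, DONE: 7}
	fmt.Println(summary(stats))
}
```

A missing key reads as zero from a Go map, so statuses that have not occurred yet (IGNORED above) need no special handling.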

type Task

type Task struct {
	Uri    string
	Status Status
	Depth  uint
	Data   Data
	Spider *Spider
	Parent *Task
}

func NewTask

func NewTask(uri string) *Task

func (*Task) Done

func (this *Task) Done()

func (*Task) Fail

func (this *Task) Fail(reason interface{})

func (*Task) Fork

func (this *Task) Fork(uri string, data Data)

Create a new task from this one.

func (*Task) ForkUri

func (this *Task) ForkUri(uris ...string)
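
Task carries Depth and Parent fields, and the downloader example accepts a -depth flag, so forking presumably produces a child one level deeper that points back at its parent. That is an inference, not documented behaviour. The sketch below uses local stand-in types and two hypothetical helpers, fork and withinDepth. Unlike the documented Fork, which returns nothing, fork returns the child so it can be inspected.

```go
package main

import "fmt"

// Data and Task are local stand-ins with the documented field names.
type Data map[string]interface{}

type Task struct {
	Uri    string
	Depth  uint
	Data   Data
	Parent *Task
}

// fork builds a child task one level below parent (assumed semantics).
func fork(parent *Task, uri string, data Data) *Task {
	return &Task{Uri: uri, Depth: parent.Depth + 1, Data: data, Parent: parent}
}

// withinDepth reports whether task is at most max levels below the root.
func withinDepth(task *Task, max uint) bool {
	return task.Depth <= max
}

func main() {
	root := &Task{Uri: "http://example.com/"}
	child := fork(root, "http://example.com/a", Data{"from": "link"})
	grandchild := fork(child, "http://example.com/a/b", nil)

	fmt.Println(child.Depth, grandchild.Depth, grandchild.Parent == child)
	fmt.Println(withinDepth(grandchild, 1))
}
```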

func (*Task) Ignore

func (this *Task) Ignore(reason interface{})

func (*Task) IsEnded

func (this *Task) IsEnded() bool

func (*Task) Start

func (this *Task) Start()

Directories

Path Synopsis
examples
