crawler

package module
v0.0.0-...-522e6a3
Published: Jul 16, 2016 · License: ISC · Imports: 14 · Imported by: 0

README

crawler

WARNING: This software is new, experimental, and under heavy development. Documentation is sparse where it exists at all, and there are almost no tests.

You have been warned

Package crawler provides a load-balanced, concurrent, and flexible crawler that, in its default configuration, honors robots.txt policies and crawl delays.

Installation

To install, simply run in a terminal:

go get github.com/mars9/crawler
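
For orientation, a minimal usage sketch built only from the signatures documented below. Assumptions are flagged in the comments: that *html.Node comes from golang.org/x/net/html, that unset Worker fields fall back to sensible defaults, and that the seed and sitemap URLs are purely illustrative.

package main

import (
	"log"
	"net/url"
	"os"

	"github.com/mars9/crawler"
	"golang.org/x/net/html"
)

func main() {
	seed, err := url.Parse("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	sitemap, err := url.Parse("https://example.com/sitemap.xml")
	if err != nil {
		log.Fatal(err)
	}

	// Configure a single-host worker; ProcessFunc just logs each page.
	// Concurrent is an arbitrary choice, and relying on defaults for
	// the remaining fields is an assumption.
	w := &crawler.Worker{
		Host:      seed,
		UserAgent: crawler.DefaultUserAgent,
		Delay:     crawler.DefaultDelay,
		ProcessFunc: func(u *url.URL, node *html.Node, body []byte) {
			log.Printf("visited %s (%d bytes)", u, len(body))
		},
		Concurrent: 4,
	}

	c := crawler.New(w, crawler.DefaultTimeToLive, log.New(os.Stderr, "crawler: ", log.LstdFlags))
	if err := c.Start(sitemap, seed); err != nil {
		log.Fatal(err)
	}

	<-c.Done() // block until the crawl finishes
	if err := c.Close(); err != nil {
		log.Fatal(err)
	}
}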

Documentation

Index

Constants

const (
	DefaultUserAgent   = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive  = 3 * DefaultDelay
	DefaultDelay       = 3 * time.Second
)

const (
	ErrNotAbsoluteURL = Error("not an absolute url")
	ErrRejectedURL    = Error("url rejected")

	ErrQueueClosed  = Error("queue is shut down")
	ErrDuplicateURL = Error("duplicate url")
	ErrEmptyURL     = Error("empty url")
	ErrLimitReached = Error("limit reached")
)

Variables

This section is empty.

Functions

func Accept

func Accept(url *url.URL, host string, reject, accept []*regexp.Regexp) bool
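
A hedged sketch of calling Accept; the host and patterns are illustrative, and the exact precedence between the reject and accept lists is an assumption:

reject := []*regexp.Regexp{regexp.MustCompile(`\.(png|jpe?g|css|js)$`)}
accept := []*regexp.Regexp{regexp.MustCompile(`^/blog/`)}

u, _ := url.Parse("https://example.com/blog/post-1")
if crawler.Accept(u, "example.com", reject, accept) {
	// u matches the host and an accept pattern and no reject pattern
	// (assumed semantics), so it is a candidate for the queue.
}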

func Get

func Get(url *url.URL, agent string, robots Robots) (io.ReadCloser, error)

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

func New

func New(w *Worker, ttl time.Duration, log *log.Logger) *Crawler

func (*Crawler) Close

func (c *Crawler) Close() error

func (*Crawler) Done

func (c *Crawler) Done() <-chan struct{}

func (*Crawler) Start

func (c *Crawler) Start(sitemap *url.URL, seeds ...*url.URL) error

type Error

type Error string

func (Error) Error

func (e Error) Error() string
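
Error is a plain string type that implements the error interface, so the exported constants above can be compared directly as sentinel values. A sketch, with q being a *Queue as defined below and u a *url.URL:

switch err := q.Push(u); err {
case nil:
	// enqueued successfully
case crawler.ErrDuplicateURL, crawler.ErrRejectedURL:
	// already seen or filtered out; safe to skip
case crawler.ErrQueueClosed, crawler.ErrLimitReached:
	// stop feeding the queue
default:
	log.Println("push failed:", err)
}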

type Queue

type Queue struct {
	// contains filtered or unexported fields
}

func NewQueue

func NewQueue(limit int64, ttl time.Duration) *Queue

func (*Queue) Close

func (q *Queue) Close() error

func (*Queue) Pop

func (q *Queue) Pop() <-chan *url.URL

func (*Queue) Push

func (q *Queue) Push(url *url.URL) error
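
A minimal sketch of feeding and draining a Queue. The limit of 100 is arbitrary, and it is an assumption that the channel returned by Pop is closed when the queue shuts down or the TTL expires:

q := crawler.NewQueue(100, crawler.DefaultTimeToLive)
defer q.Close()

u, _ := url.Parse("https://example.com/")
if err := q.Push(u); err != nil {
	log.Println("push:", err)
}

for next := range q.Pop() {
	log.Println("dequeued:", next)
}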

type Robots

type Robots interface {
	Test(*url.URL) bool
}
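
Robots is a single-method interface, so any type with a matching Test method satisfies it. Below is a permissive, purely illustrative implementation and how it could be passed to Get; a real implementation would consult the site's robots.txt:

// allowAll is a hypothetical Robots implementation that permits every URL.
type allowAll struct{}

func (allowAll) Test(*url.URL) bool { return true }

// Inside a function: fetch a page with the permissive policy.
body, err := crawler.Get(u, crawler.DefaultUserAgent, allowAll{})
if err != nil {
	log.Println("get:", err)
	return
}
defer body.Close()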

type Worker

type Worker struct {
	// GetFunc issues a GET request to the specified URL and returns the
	// response body and an error if any.
	GetFunc func(*url.URL) (io.ReadCloser, error)

	// IsAcceptedFunc can be used to control which URLs the crawler
	// visits.
	IsAcceptedFunc func(*url.URL) bool

	// ProcessFunc can be used to scrape data from each fetched page.
	ProcessFunc func(*url.URL, *html.Node, []byte)

	// Host defines the hostname to crawl. Worker is a single-host crawler.
	Host *url.URL

	// UserAgent defines the user-agent string to use for URL fetching.
	UserAgent string

	// Accept and Reject hold regular expressions used to filter
	// candidate URLs (see the Accept function).
	Accept []*regexp.Regexp
	Reject []*regexp.Regexp

	// Delay is the time to wait between requests to the same host when
	// robots.txt does not specify a crawl delay. The delay starts as
	// soon as the response is received from the host.
	Delay time.Duration

	// MaxEnqueue is the maximum number of pages visited before stopping
	// the crawl. The Crawler sends its stop signal once this number of
	// visits is reached, but workers may still be visiting other pages,
	// so when crawling stops the number of pages visited will be at
	// least MaxEnqueue, possibly more.
	MaxEnqueue int64

	// Robots decides whether a URL may be fetched under the site's
	// robots.txt policy (see the Robots interface).
	Robots Robots

	// Concurrent sets the number of concurrent fetch workers.
	Concurrent int
}

Worker represents a crawler worker implementation.
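
A sketch of a more fully configured Worker that wires the hook functions; everything beyond the documented field comments (for example, that IsAcceptedFunc replaces the default host check) is an assumption, and allowAll is the hypothetical Robots implementation from the sketch above:

w := &crawler.Worker{
	Host:      seed,
	UserAgent: crawler.DefaultUserAgent,
	Delay:     crawler.DefaultDelay,
	Accept:    []*regexp.Regexp{regexp.MustCompile(`^/docs/`)},
	Reject:    []*regexp.Regexp{regexp.MustCompile(`\.pdf$`)},

	// Restrict the crawl to the seed's host.
	IsAcceptedFunc: func(u *url.URL) bool {
		return u.Host == seed.Host
	},

	// Inspect the parsed HTML tree of every fetched page.
	ProcessFunc: func(u *url.URL, node *html.Node, body []byte) {
		// walk node, extract titles, links, etc.
	},

	Robots:     allowAll{},
	Concurrent: 8,
}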

func (*Worker) Get

func (w *Worker) Get(url *url.URL) (io.ReadCloser, error)

func (*Worker) IsAccepted

func (w *Worker) IsAccepted(url *url.URL) bool

func (*Worker) Process

func (w *Worker) Process(url *url.URL, node *html.Node, data []byte)
