crawler

package module
v0.0.0-...-739935b
Published: Nov 29, 2014 License: ISC Imports: 18 Imported by: 0

Documentation

Overview

Package crawler provides a crawler implementation.

Index

Constants

const (
	DefaultUserAgent       = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsUserAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive      = 2 * DefaultCrawlDelay
	DefaultCrawlDelay      = 3 * time.Second
)

Default options.

Variables

This section is empty.

Functions

func Fetch

func Fetch(url *url.URL, c Crawler, push chan<- *url.URL) error

Fetch issues a GET to the specified URL, following up to 10 redirects. Fetch sends all links it finds to push and afterwards calls the Crawler's Parse method, if it is not nil.
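
A minimal sketch of driving Fetch by hand; the import path and the Crawler value c are assumptions, and in normal use Start calls Fetch for you:

import (
	"log"
	"net/url"

	"example.com/crawler" // hypothetical import path
)

// fetchOne fetches a single page with an existing Crawler and logs the
// links that Fetch discovers.
func fetchOne(c crawler.Crawler, rawurl string) error {
	u, err := url.Parse(rawurl)
	if err != nil {
		return err
	}

	push := make(chan *url.URL, 64)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for link := range push {
			log.Println("discovered:", link)
		}
	}()

	// Fetch follows up to 10 redirects, sends every link it finds to push,
	// and then calls c.Parse on the body if Parse is not nil.
	err = crawler.Fetch(u, c, push)
	close(push)
	<-done
	return err
}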

func Queue

func Queue(push <-chan *url.URL, pop chan<- *url.URL)

Queue creates an infinitely buffered channel: it receives input on push and sends output to pop. Queue should be run in its own goroutine. On termination Queue closes pop.
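
A sketch of using Queue as an unbounded buffer between a producer and a consumer; the import path is hypothetical, and closing push is assumed to be what terminates Queue:

package main

import (
	"fmt"
	"net/url"

	"example.com/crawler" // hypothetical import path
)

func main() {
	push := make(chan *url.URL)
	pop := make(chan *url.URL)

	// Queue buffers an unbounded number of URLs between push and pop, so
	// sends on push never block on a slow consumer of pop.
	go crawler.Queue(push, pop)

	go func() {
		for _, raw := range []string{"https://example.com/a", "https://example.com/b"} {
			if u, err := url.Parse(raw); err == nil {
				push <- u
			}
		}
		close(push) // assumption: Queue terminates and closes pop once push is closed and drained
	}()

	for u := range pop {
		fmt.Println("popped:", u)
	}
}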

func Start

func Start(ctx context.Context, crawler Crawler, crawlers uint8)

Start starts a new crawl. crawlers defines the number of concurrently working crawlers.
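
A sketch of a typical entry point, with hypothetical import paths; the pb.Crawler fields (seeds, delay, limits, ...) are left unset because their names belong to the generated crawlerpb package, and Start is assumed to block until the crawl finishes or the context is cancelled:

package main

import (
	"context"
	"log"
	"net/url"
	"time"

	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

func main() {
	// Configure the crawl; the concrete fields are defined by crawlerpb.
	args := &pb.Crawler{}

	parse := func(u *url.URL, body []byte) error {
		log.Printf("visited %s (%d bytes)", u, len(body))
		return nil
	}

	c, err := crawler.New(args, parse)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Run the crawl with 4 concurrently working crawlers.
	crawler.Start(ctx, c, 4)
}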

Types

type Bucket

type Bucket interface {
	// Put puts the record for the given key. It overwrites any previous
	// record for that key; a Bucket is not a multi-map.
	Put(key []byte, record *pb.Record) (err error)

	// Get returns the record for the given key. It returns errNotFound if
	// the Bucket does not contain the key.
	Get(key []byte, record *pb.Record) (err error)

	// Delete deletes the record for the given key. It returns errNotFound if
	// the Bucket does not contain the key.
	Delete(key []byte) (err error)

	// Exists returns errNotFound if the Bucket does not contain the key.
	Exists(key []byte) (err error)

	// Bucket returns the name and unique identifier for this Bucket.
	Bucket() (name, uuid []byte)

	// List returns all collected records.
	List() (rec <-chan *pb.Record, err error)
}

Bucket represents a Record store.
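
A sketch of exercising a Bucket obtained from Store.Create or Store.Open (see below); the import paths are hypothetical and the pb.Record fields are left untouched because they come from the generated crawlerpb package. The not-found error is unexported, so callers can only test for a non-nil error:

import (
	"log"

	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

func dumpBucket(b crawler.Bucket) error {
	key := []byte("https://example.com/")

	// Put stores the record under key, replacing any previous record.
	if err := b.Put(key, &pb.Record{}); err != nil {
		return err
	}

	// Exists and Get signal a missing key with the package's unexported
	// not-found error; from outside only a nil check is possible.
	if err := b.Exists(key); err != nil {
		return err
	}
	var rec pb.Record
	if err := b.Get(key, &rec); err != nil {
		return err
	}

	// List streams every record collected in this bucket.
	recs, err := b.List()
	if err != nil {
		return err
	}
	n := 0
	for range recs {
		n++
	}
	name, uuid := b.Bucket()
	log.Printf("bucket %s (%x) holds %d records", name, uuid, n)
	return nil
}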

type Crawler

type Crawler interface {
	// Fetch issues a GET to the specified URL and returns the response body
	// and an error if any.
	Fetch(url *url.URL) (rc io.ReadCloser, err error)

	// Parse is called when visiting a page. Parse receives a http response
	// body reader and should return an error, if any. Can be nil.
	Parse(url *url.URL, body []byte) (err error)

	// Domain returns the host to crawl.
	Domain() (domain *url.URL)

	// Accept can be used to control the crawler.
	Accept(url *url.URL) (ok bool)

	// Seeds returns the base URLs used to start crawling.
	Seeds() []*url.URL

	// MaxVisit returns the maximum number of pages visited before stopping
	// the crawl. Note that the Crawler will send its stop signal once this
	// number of visits is reached, but workers may be in the process of
	// visiting other pages, so when the crawling stops, the number of pages
	// visited will be at least MaxVisit, possibly more.
	MaxVisit() (max uint32)

	// Delay returns the time to wait between each request to the same host.
	// The delay starts as soon as the response is received from the host.
	Delay() (delay time.Duration)

	// TTL returns the duration that a crawler goroutine can wait without
	// receiving new commands to fetch. If the idle time-to-live is reached,
	// the crawler goroutine is stopped and its resources are released. This
	// can be especially useful for long-running crawlers.
	TTL() (timeout time.Duration)
}

Crawler represents a crawler implementation.
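
New (below) returns a ready-made implementation, but any type that satisfies the interface can be crawled with Start. A minimal, illustrative sketch for a single host, using a plain http.Get as a stand-in fetcher; the import path is hypothetical:

import (
	"io"
	"net/http"
	"net/url"
	"time"

	"example.com/crawler" // hypothetical import path
)

// siteCrawler crawls a single host.
type siteCrawler struct {
	host *url.URL
}

var _ crawler.Crawler = (*siteCrawler)(nil)

// Fetch retrieves the page body.
func (s *siteCrawler) Fetch(u *url.URL) (io.ReadCloser, error) {
	resp, err := http.Get(u.String()) // stand-in for the package's default fetcher
	if err != nil {
		return nil, err
	}
	return resp.Body, nil
}

// Parse does nothing here; a real implementation would extract data from body.
func (s *siteCrawler) Parse(u *url.URL, body []byte) error { return nil }

// Domain reports the host to crawl.
func (s *siteCrawler) Domain() *url.URL { return s.host }

// Accept restricts the crawl to the configured host.
func (s *siteCrawler) Accept(u *url.URL) bool { return u.Host == s.host.Host }

// Seeds starts the crawl at the configured URL.
func (s *siteCrawler) Seeds() []*url.URL { return []*url.URL{s.host} }

// MaxVisit stops the crawl after roughly 1000 visits.
func (s *siteCrawler) MaxVisit() uint32 { return 1000 }

// Delay waits 3 seconds between requests to the same host.
func (s *siteCrawler) Delay() time.Duration { return 3 * time.Second }

// TTL releases idle crawler goroutines after 6 seconds.
func (s *siteCrawler) TTL() time.Duration { return 6 * time.Second }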

func New

func New(args *pb.Crawler, fn ParseFunc) (Crawler, error)

New returns a default Crawler implementation.

type ParseFunc

type ParseFunc func(url *url.URL, body []byte) (err error)

ParseFunc implements the Crawler Parse method.
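
A sketch of plugging a ParseFunc into New; the import paths are hypothetical and the pb.Crawler fields are left to the generated package:

import (
	"log"
	"net/url"

	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

// logSize satisfies ParseFunc; it is called for every visited page.
func logSize(u *url.URL, body []byte) error {
	log.Printf("%s: %d bytes", u, len(body))
	return nil
}

func newLoggingCrawler(args *pb.Crawler) (crawler.Crawler, error) {
	// Pass nil instead of logSize to skip parsing entirely.
	return crawler.New(args, logSize)
}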

type Store

type Store struct {
	// contains filtered or unexported fields
}

Store implements a boltdb backed record store.

func NewStore

func NewStore(dbpath string, limit int64) (*Store, error)

NewStore returns a boltdb backed record store.
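
A sketch of opening the store and making sure it is closed again; the path is illustrative and limit is assumed to cap the database size:

import (
	"example.com/crawler" // hypothetical import path
)

func withStore(fn func(*crawler.Store) error) error {
	s, err := crawler.NewStore("/var/lib/crawler/records.db", 1<<30)
	if err != nil {
		return err
	}
	defer s.Close() // the store is unusable for I/O after Close
	return fn(s)
}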

func (*Store) Backup

func (s *Store) Backup(w io.Writer) (int64, error)

Backup writes the entire database to a writer. A reader transaction is maintained during the backup so it is safe to continue using the database while a backup is in progress.
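
A sketch of streaming a live backup to a file; per the note above, the crawl can keep writing while the snapshot is taken:

import (
	"log"
	"os"

	"example.com/crawler" // hypothetical import path
)

func backupStore(s *crawler.Store, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Backup holds a read transaction while it copies the database to f.
	n, err := s.Backup(f)
	if err != nil {
		return err
	}
	log.Printf("backup: wrote %d bytes to %s", n, path)
	return nil
}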

func (*Store) Close

func (s *Store) Close() error

Close closes the store, rendering it unusable for I/O.

func (*Store) Create

func (s *Store) Create(name []byte) (Bucket, error)

Create creates the named bucket. Create returns an error if it already exists. If successful, methods on the returned bucket can be used for I/O.
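
A sketch of creating a fresh bucket and storing a first record; the bucket name is arbitrary and the import paths are hypothetical:

import (
	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

func createRun(s *crawler.Store) (crawler.Bucket, error) {
	// Create fails if a bucket named "run-2014-11-29" already exists.
	b, err := s.Create([]byte("run-2014-11-29"))
	if err != nil {
		return nil, err
	}
	// The Record fields come from the generated crawlerpb package.
	if err := b.Put([]byte("https://example.com/"), &pb.Record{}); err != nil {
		return nil, err
	}
	return b, nil
}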

func (*Store) List

func (s *Store) List(name []byte) (<-chan []byte, error)

func (*Store) ListAll

func (s *Store) ListAll() (<-chan []byte, error)

func (*Store) Open

func (s *Store) Open(name, uuid []byte) (Bucket, error)

Open opens the named bucket. If successful, methods on the returned bucket can be used for I/O.
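
A sketch of reopening a bucket later, using the name and uuid that Bucket() reported when it was created; the import path is hypothetical:

import (
	"example.com/crawler" // hypothetical import path
)

func reopen(s *crawler.Store, name, uuid []byte) (crawler.Bucket, error) {
	// name and uuid must have been saved from an earlier call to Bucket().
	return s.Open(name, uuid)
}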

Directories

Path Synopsis
crawlerpb    Package crawlerpb is a generated protocol buffer package.
robotstxt    Package robotstxt implements the robots.txt Exclusion Protocol as specified in http://www.robotstxt.org/wc/robots.html
