crawler

package module
v0.0.0-...-739935b
Published: Nov 29, 2014 License: ISC Imports: 18 Imported by: 0

Documentation

Overview

Package crawler provides a crawler implementation.

Index

Constants

const (
	DefaultUserAgent       = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsUserAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive      = 2 * DefaultCrawlDelay
	DefaultCrawlDelay      = 3 * time.Second
)

Default options.

Variables

This section is empty.

Functions

func Fetch

func Fetch(url *url.URL, c Crawler, push chan<- *url.URL) error

Fetch issues a GET to the specified URL, following up to 10 redirects. Fetch sends all links it finds to push and afterwards calls the Crawler's Parse method, if it is not nil.
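
A minimal sketch of driving Fetch by hand; the import path and the Crawler value c are assumptions, and in normal use Start calls Fetch for you:

import (
	"log"
	"net/url"

	"example.com/crawler" // hypothetical import path
)

// fetchOne fetches a single page with an existing Crawler and logs the
// links that Fetch discovers.
func fetchOne(c crawler.Crawler, rawurl string) error {
	u, err := url.Parse(rawurl)
	if err != nil {
		return err
	}

	push := make(chan *url.URL, 64)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for link := range push {
			log.Println("discovered:", link)
		}
	}()

	// Fetch follows up to 10 redirects, sends every link it finds to push,
	// and then calls c.Parse on the body if Parse is not nil.
	err = crawler.Fetch(u, c, push)
	close(push)
	<-done
	return err
}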

func Queue

func Queue(push <-chan *url.URL, pop chan<- *url.URL)

Queue creates an infinitely buffered channel: it receives input on push and sends output to pop. Queue should be run in its own goroutine. On termination Queue closes pop.
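
A sketch of using Queue as an unbounded buffer between a producer and a consumer; the import path is hypothetical, and closing push is assumed to be what terminates Queue:

package main

import (
	"fmt"
	"net/url"

	"example.com/crawler" // hypothetical import path
)

func main() {
	push := make(chan *url.URL)
	pop := make(chan *url.URL)

	// Queue buffers an unbounded number of URLs between push and pop, so
	// sends on push never block on a slow consumer of pop.
	go crawler.Queue(push, pop)

	go func() {
		for _, raw := range []string{"https://example.com/a", "https://example.com/b"} {
			if u, err := url.Parse(raw); err == nil {
				push <- u
			}
		}
		close(push) // assumption: Queue terminates and closes pop once push is closed and drained
	}()

	for u := range pop {
		fmt.Println("popped:", u)
	}
}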

func Start

func Start(ctx context.Context, crawler Crawler, crawlers uint8)

Start starts a new crawl. crawlers defines the number of concurrently working crawlers.
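
A sketch of a typical entry point, with hypothetical import paths; the pb.Crawler fields (seeds, delay, limits, ...) are left unset because their names belong to the generated crawlerpb package, and Start is assumed to block until the crawl finishes or the context is cancelled:

package main

import (
	"context"
	"log"
	"net/url"
	"time"

	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

func main() {
	// Configure the crawl; the concrete fields are defined by crawlerpb.
	args := &pb.Crawler{}

	parse := func(u *url.URL, body []byte) error {
		log.Printf("visited %s (%d bytes)", u, len(body))
		return nil
	}

	c, err := crawler.New(args, parse)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Run the crawl with 4 concurrently working crawlers.
	crawler.Start(ctx, c, 4)
}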

Types

type Bucket

type Bucket interface {
	// Put puts the record for the given key. It overwrites any previous
	// record for that key; a Bucket is not a multi-map.
	Put(key []byte, record *pb.Record) (err error)

	// Get returns the record for the given key. It returns errNotFound if
	// the Bucket does not contain the key.
	Get(key []byte, record *pb.Record) (err error)

	// Delete deletes the record for the given key. It returns errNotFound if
	// the Bucket does not contain the key.
	Delete(key []byte) (err error)

	// Exists returns errNotFound if the Bucket does not contain the key.
	Exists(key []byte) (err error)

	// Bucket returns the name and unique identifier for this Bucket.
	Bucket() (name, uuid []byte)

	// List returns all collected records.
	List() (rec <-chan *pb.Record, err error)
}

Bucket represents a Record store.
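
A sketch of exercising a Bucket obtained from Store.Create or Store.Open (see below); the import paths are hypothetical and the pb.Record fields are left untouched because they come from the generated crawlerpb package. The not-found error is unexported, so callers can only test for a non-nil error:

import (
	"log"

	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

func dumpBucket(b crawler.Bucket) error {
	key := []byte("https://example.com/")

	// Put stores the record under key, replacing any previous record.
	if err := b.Put(key, &pb.Record{}); err != nil {
		return err
	}

	// Exists and Get signal a missing key with the package's unexported
	// not-found error; from outside only a nil check is possible.
	if err := b.Exists(key); err != nil {
		return err
	}
	var rec pb.Record
	if err := b.Get(key, &rec); err != nil {
		return err
	}

	// List streams every record collected in this bucket.
	recs, err := b.List()
	if err != nil {
		return err
	}
	n := 0
	for range recs {
		n++
	}
	name, uuid := b.Bucket()
	log.Printf("bucket %s (%x) holds %d records", name, uuid, n)
	return nil
}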

type Crawler

type Crawler interface {
	// Fetch issues a GET to the specified URL and returns the response body
	// and an error if any.
	Fetch(url *url.URL) (rc io.ReadCloser, err error)

	// Parse is called when visiting a page. Parse receives a http response
	// body reader and should return an error, if any. Can be nil.
	Parse(url *url.URL, body []byte) (err error)

	// Domain returns the host to crawl.
	Domain() (domain *url.URL)

	// Accept can be used to control the crawler.
	Accept(url *url.URL) (ok bool)

	// Seeds returns the base URLs used to start crawling.
	Seeds() []*url.URL

	// MaxVisit returns the maximum number of pages visited before stopping
	// the crawl. Note that the Crawler will send its stop signal once this
	// number of visits is reached, but workers may be in the process of
	// visiting other pages, so when the crawling stops, the number of pages
	// visited will be at least MaxVisit, possibly more.
	MaxVisit() (max uint32)

	// Delay returns the time to wait between each request to the same host.
	// The delay starts as soon as the response is received from the host.
	Delay() (delay time.Duration)

	// TTL returns the duration that a crawler goroutine can wait without
	// receiving new commands to fetch. If the idle time-to-live is reached,
	// the crawler goroutine is stopped and its resources are released. This
	// can be especially useful for long-running crawlers.
	TTL() (timeout time.Duration)
}

Crawler represents a crawler implementation.
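
New (below) returns a ready-made implementation, but any type that satisfies the interface can be crawled with Start. A minimal, illustrative sketch for a single host, using a plain http.Get as a stand-in fetcher; the import path is hypothetical:

import (
	"io"
	"net/http"
	"net/url"
	"time"

	"example.com/crawler" // hypothetical import path
)

// siteCrawler crawls a single host.
type siteCrawler struct {
	host *url.URL
}

var _ crawler.Crawler = (*siteCrawler)(nil)

// Fetch retrieves the page body.
func (s *siteCrawler) Fetch(u *url.URL) (io.ReadCloser, error) {
	resp, err := http.Get(u.String()) // stand-in for the package's default fetcher
	if err != nil {
		return nil, err
	}
	return resp.Body, nil
}

// Parse does nothing here; a real implementation would extract data from body.
func (s *siteCrawler) Parse(u *url.URL, body []byte) error { return nil }

// Domain reports the host to crawl.
func (s *siteCrawler) Domain() *url.URL { return s.host }

// Accept restricts the crawl to the configured host.
func (s *siteCrawler) Accept(u *url.URL) bool { return u.Host == s.host.Host }

// Seeds starts the crawl at the configured URL.
func (s *siteCrawler) Seeds() []*url.URL { return []*url.URL{s.host} }

// MaxVisit stops the crawl after roughly 1000 visits.
func (s *siteCrawler) MaxVisit() uint32 { return 1000 }

// Delay waits 3 seconds between requests to the same host.
func (s *siteCrawler) Delay() time.Duration { return 3 * time.Second }

// TTL releases idle crawler goroutines after 6 seconds.
func (s *siteCrawler) TTL() time.Duration { return 6 * time.Second }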

func New

func New(args *pb.Crawler, fn ParseFunc) (Crawler, error)

New returns a default Crawler implementation.

type ParseFunc

type ParseFunc func(url *url.URL, body []byte) (err error)

ParseFunc implements the Crawler Parse method.
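
A sketch of plugging a ParseFunc into New; the import paths are hypothetical and the pb.Crawler fields are left to the generated package:

import (
	"log"
	"net/url"

	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

// logSize satisfies ParseFunc; it is called for every visited page.
func logSize(u *url.URL, body []byte) error {
	log.Printf("%s: %d bytes", u, len(body))
	return nil
}

func newLoggingCrawler(args *pb.Crawler) (crawler.Crawler, error) {
	// Pass nil instead of logSize to skip parsing entirely.
	return crawler.New(args, logSize)
}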

type Store

type Store struct {
	// contains filtered or unexported fields
}

Store implements a boltdb backed record store.

func NewStore

func NewStore(dbpath string, limit int64) (*Store, error)

NewStore returns a boltdb backed record store.
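
A sketch of opening the store and making sure it is closed again; the path is illustrative and limit is assumed to cap the database size:

import (
	"example.com/crawler" // hypothetical import path
)

func withStore(fn func(*crawler.Store) error) error {
	s, err := crawler.NewStore("/var/lib/crawler/records.db", 1<<30)
	if err != nil {
		return err
	}
	defer s.Close() // the store is unusable for I/O after Close
	return fn(s)
}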

func (*Store) Backup

func (s *Store) Backup(w io.Writer) (int64, error)

Backup writes the entire database to a writer. A reader transaction is maintained during the backup so it is safe to continue using the database while a backup is in progress.
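
A sketch of streaming a live backup to a file; per the note above, the crawl can keep writing while the snapshot is taken:

import (
	"log"
	"os"

	"example.com/crawler" // hypothetical import path
)

func backupStore(s *crawler.Store, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Backup holds a read transaction while it copies the database to f.
	n, err := s.Backup(f)
	if err != nil {
		return err
	}
	log.Printf("backup: wrote %d bytes to %s", n, path)
	return nil
}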

func (*Store) Close

func (s *Store) Close() error

Close closes the store, rendering it unusable for I/O.

func (*Store) Create

func (s *Store) Create(name []byte) (Bucket, error)

Create creates the named bucket. Create returns an error if it already exists. If successful, methods on the returned bucket can be used for I/O.
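
A sketch of creating a fresh bucket and storing a first record; the bucket name is arbitrary and the import paths are hypothetical:

import (
	"example.com/crawler"              // hypothetical import path
	pb "example.com/crawler/crawlerpb" // hypothetical import path
)

func createRun(s *crawler.Store) (crawler.Bucket, error) {
	// Create fails if a bucket named "run-2014-11-29" already exists.
	b, err := s.Create([]byte("run-2014-11-29"))
	if err != nil {
		return nil, err
	}
	// The Record fields come from the generated crawlerpb package.
	if err := b.Put([]byte("https://example.com/"), &pb.Record{}); err != nil {
		return nil, err
	}
	return b, nil
}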

func (*Store) List

func (s *Store) List(name []byte) (<-chan []byte, error)

func (*Store) ListAll

func (s *Store) ListAll() (<-chan []byte, error)

func (*Store) Open

func (s *Store) Open(name, uuid []byte) (Bucket, error)

Open opens the named bucket. If successful, methods on the returned bucket can be used for I/O.
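
A sketch of reopening a bucket later, using the name and uuid that Bucket() reported when it was created; the import path is hypothetical:

import (
	"example.com/crawler" // hypothetical import path
)

func reopen(s *crawler.Store, name, uuid []byte) (crawler.Bucket, error) {
	// name and uuid must have been saved from an earlier call to Bucket().
	return s.Open(name, uuid)
}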

Directories

Path Synopsis
crawlerpb    Package crawlerpb is a generated protocol buffer package.
robotstxt    Package robotstxt implements the robots.txt Exclusion Protocol as specified in http://www.robotstxt.org/wc/robots.html
