crawler

package module
v0.0.0-...-522e6a3
Published: Jul 16, 2016 · License: ISC · Imports: 14 · Imported by: 0

README

crawler

WARNING: This software is new, experimental, and under heavy development. Documentation is sparse where it exists at all, and there are almost no tests.

You have been warned

Package crawler provides a load-balanced, concurrent, and flexible crawler that, in its default configuration, honors robots.txt policies and crawl delays.

Installation

To install, simply run in a terminal:

go get github.com/mars9/crawler
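
For orientation, a minimal usage sketch built only from the signatures documented below. Assumptions are flagged in the comments: that *html.Node comes from golang.org/x/net/html, that unset Worker fields fall back to sensible defaults, and that the seed and sitemap URLs are purely illustrative.

package main

import (
	"log"
	"net/url"
	"os"

	"github.com/mars9/crawler"
	"golang.org/x/net/html"
)

func main() {
	seed, err := url.Parse("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	sitemap, err := url.Parse("https://example.com/sitemap.xml")
	if err != nil {
		log.Fatal(err)
	}

	// Configure a single-host worker; ProcessFunc just logs each page.
	// Concurrent is an arbitrary choice, and relying on defaults for
	// the remaining fields is an assumption.
	w := &crawler.Worker{
		Host:      seed,
		UserAgent: crawler.DefaultUserAgent,
		Delay:     crawler.DefaultDelay,
		ProcessFunc: func(u *url.URL, node *html.Node, body []byte) {
			log.Printf("visited %s (%d bytes)", u, len(body))
		},
		Concurrent: 4,
	}

	c := crawler.New(w, crawler.DefaultTimeToLive, log.New(os.Stderr, "crawler: ", log.LstdFlags))
	if err := c.Start(sitemap, seed); err != nil {
		log.Fatal(err)
	}

	<-c.Done() // block until the crawl finishes
	if err := c.Close(); err != nil {
		log.Fatal(err)
	}
}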

Documentation

Index

Constants

const (
	DefaultUserAgent   = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive  = 3 * DefaultDelay
	DefaultDelay       = 3 * time.Second
)

const (
	ErrNotAbsoluteURL = Error("not an absolute url")
	ErrRejectedURL    = Error("url rejected")

	ErrQueueClosed  = Error("queue is shut down")
	ErrDuplicateURL = Error("duplicate url")
	ErrEmptyURL     = Error("empty url")
	ErrLimitReached = Error("limit reached")
)

Variables

This section is empty.

Functions

func Accept

func Accept(url *url.URL, host string, reject, accept []*regexp.Regexp) bool
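
A hedged sketch of calling Accept; the host and patterns are illustrative, and the exact precedence between the reject and accept lists is an assumption:

reject := []*regexp.Regexp{regexp.MustCompile(`\.(png|jpe?g|css|js)$`)}
accept := []*regexp.Regexp{regexp.MustCompile(`^/blog/`)}

u, _ := url.Parse("https://example.com/blog/post-1")
if crawler.Accept(u, "example.com", reject, accept) {
	// u matches the host and an accept pattern and no reject pattern
	// (assumed semantics), so it is a candidate for the queue.
}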

func Get

func Get(url *url.URL, agent string, robots Robots) (io.ReadCloser, error)

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

func New

func New(w *Worker, ttl time.Duration, log *log.Logger) *Crawler

func (*Crawler) Close

func (c *Crawler) Close() error

func (*Crawler) Done

func (c *Crawler) Done() <-chan struct{}

func (*Crawler) Start

func (c *Crawler) Start(sitemap *url.URL, seeds ...*url.URL) error

type Error

type Error string

func (Error) Error

func (e Error) Error() string
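
Error is a plain string type that implements the error interface, so the exported constants above can be compared directly as sentinel values. A sketch, with q being a *Queue as defined below and u a *url.URL:

switch err := q.Push(u); err {
case nil:
	// enqueued successfully
case crawler.ErrDuplicateURL, crawler.ErrRejectedURL:
	// already seen or filtered out; safe to skip
case crawler.ErrQueueClosed, crawler.ErrLimitReached:
	// stop feeding the queue
default:
	log.Println("push failed:", err)
}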

type Queue

type Queue struct {
	// contains filtered or unexported fields
}

func NewQueue

func NewQueue(limit int64, ttl time.Duration) *Queue

func (*Queue) Close

func (q *Queue) Close() error

func (*Queue) Pop

func (q *Queue) Pop() <-chan *url.URL

func (*Queue) Push

func (q *Queue) Push(url *url.URL) error
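
A minimal sketch of feeding and draining a Queue. The limit of 100 is arbitrary, and it is an assumption that the channel returned by Pop is closed when the queue shuts down or the TTL expires:

q := crawler.NewQueue(100, crawler.DefaultTimeToLive)
defer q.Close()

u, _ := url.Parse("https://example.com/")
if err := q.Push(u); err != nil {
	log.Println("push:", err)
}

for next := range q.Pop() {
	log.Println("dequeued:", next)
}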

type Robots

type Robots interface {
	Test(*url.URL) bool
}
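
Robots is a single-method interface, so any type with a matching Test method satisfies it. Below is a permissive, purely illustrative implementation and how it could be passed to Get; a real implementation would consult the site's robots.txt:

// allowAll is a hypothetical Robots implementation that permits every URL.
type allowAll struct{}

func (allowAll) Test(*url.URL) bool { return true }

// Inside a function: fetch a page with the permissive policy.
body, err := crawler.Get(u, crawler.DefaultUserAgent, allowAll{})
if err != nil {
	log.Println("get:", err)
	return
}
defer body.Close()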

type Worker

type Worker struct {
	// GetFunc issues a GET request to the specified URL and returns the
	// response body and an error if any.
	GetFunc func(*url.URL) (io.ReadCloser, error)

	// IsAcceptedFunc can be used to control which URLs the crawler
	// visits.
	IsAcceptedFunc func(*url.URL) bool

	// ProcessFunc can be used to scrape data from each fetched page.
	ProcessFunc func(*url.URL, *html.Node, []byte)

	// Host defines the hostname to crawl. Worker is a single-host crawler.
	Host *url.URL

	// UserAgent defines the user-agent string to use for URL fetching.
	UserAgent string

	// Accept and Reject hold regular expressions used to filter
	// candidate URLs (see the Accept function).
	Accept []*regexp.Regexp
	Reject []*regexp.Regexp

	// Delay is the time to wait between requests to the same host when
	// robots.txt does not specify a crawl delay. The delay starts as
	// soon as the response is received from the host.
	Delay time.Duration

	// MaxEnqueue is the maximum number of pages visited before stopping
	// the crawl. The Crawler sends its stop signal once this number of
	// visits is reached, but workers may still be visiting other pages,
	// so when crawling stops the number of pages visited will be at
	// least MaxEnqueue, possibly more.
	MaxEnqueue int64

	// Robots decides whether a URL may be fetched under the site's
	// robots.txt policy (see the Robots interface).
	Robots Robots

	// Concurrent sets the number of concurrent fetch workers.
	Concurrent int
}

Worker represents a crawler worker implementation.
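
A sketch of a more fully configured Worker that wires the hook functions; everything beyond the documented field comments (for example, that IsAcceptedFunc replaces the default host check) is an assumption, and allowAll is the hypothetical Robots implementation from the sketch above:

w := &crawler.Worker{
	Host:      seed,
	UserAgent: crawler.DefaultUserAgent,
	Delay:     crawler.DefaultDelay,
	Accept:    []*regexp.Regexp{regexp.MustCompile(`^/docs/`)},
	Reject:    []*regexp.Regexp{regexp.MustCompile(`\.pdf$`)},

	// Restrict the crawl to the seed's host.
	IsAcceptedFunc: func(u *url.URL) bool {
		return u.Host == seed.Host
	},

	// Inspect the parsed HTML tree of every fetched page.
	ProcessFunc: func(u *url.URL, node *html.Node, body []byte) {
		// walk node, extract titles, links, etc.
	},

	Robots:     allowAll{},
	Concurrent: 8,
}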

func (*Worker) Get

func (w *Worker) Get(url *url.URL) (io.ReadCloser, error)

func (*Worker) IsAccepted

func (w *Worker) IsAccepted(url *url.URL) bool

func (*Worker) Process

func (w *Worker) Process(url *url.URL, node *html.Node, data []byte)
