fetchbot

package module

Published: Jul 4, 2014 License: BSD-3-Clause Imports: 11 Imported by: 0

README

fetchbot

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.

It is very much a rewrite of gocrawl (https://github.com/PuerkitoBio/gocrawl) with a simpler API and fewer built-in features, but at the same time more flexibility. As with Go itself, sometimes less is more!

Installation

To install, simply run in a terminal:

go get github.com/PuerkitoBio/fetchbot

The package has a single external dependency, robotstxt (https://github.com/temoto/robotstxt-go). It also integrates code from the iq package (https://github.com/kylelemons/iq).

The API documentation is available on godoc.org (http://godoc.org/github.com/PuerkitoBio/fetchbot).

Changes

  • 2014-07-04 : change the type of Fetcher.HttpClient from *http.Client to the Doer interface. Low chance of breaking existing code, but it's a possibility if someone used the fetcher's client to run other requests (e.g. f.HttpClient.Get(...)).

Usage

The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	queue := f.Start()
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}

A more complex and complete example can be found in the repository, at /example/full/.

Basically, a Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch, and which HTTP method to use (e.g. "GET", "HEAD", ...).

A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.

Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:

type Command interface {
	URL() *url.URL
	Method() string
}
type Handler interface {
	Handle(*Context, *http.Response, error)
}

A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.

A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.
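
For instance, a Mux can route errors to one Handler and HTML responses to another. The following is a minimal sketch built only from the documented API; the handler bodies and the URL are illustrative.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	mux := fetchbot.NewMux()

	// Fallback for any error without a more specific Handler registered.
	mux.HandleErrors(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		fmt.Printf("error: %s\n", err)
	}))

	// Called only for GET responses with a text/html Content-Type.
	mux.Response().Method("GET").ContentType("text/html").Handler(
		fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
			fmt.Printf("[%d] %s\n", res.StatusCode, ctx.Cmd.URL())
		}))

	// The Mux is itself a Handler, so it can be passed directly to New.
	f := fetchbot.New(mux)
	queue := f.Start()
	queue.SendStringGet("http://golang.org")
	queue.Close()
}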

The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs. If the Command implements the BasicAuthProvider interface, a Basic Authentication header will be put in place with the given credentials to fetch the URL.

Similarly, the CookiesProvider and HeaderProvider interfaces offer the expected features (setting cookies and header values on the request). The ReaderProvider and ValuesProvider interfaces are also supported, although they should be mutually exclusive, as both set the body of the request. If both are implemented, the ReaderProvider interface is used. It sets the body of the request (e.g. for a "POST") using the given io.Reader instance. The ValuesProvider does the same, but using the given url.Values instance, and sets the Content-Type of the body to "application/x-www-form-urlencoded" (unless it is explicitly set by a HeaderProvider).
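
As a concrete illustration, here is a hedged sketch of a form POST using the ValuesProvider interface; the postCmd type and the URL are hypothetical, not part of the package.

package main

import (
	"fmt"
	"net/http"
	"net/url"

	"github.com/PuerkitoBio/fetchbot"
)

// postCmd is a hypothetical Command that also implements ValuesProvider, so
// the Fetcher sends its values as an "application/x-www-form-urlencoded" body.
type postCmd struct {
	*fetchbot.Cmd            // embedded: provides URL() and Method()
	form          url.Values // form values sent as the request body
}

// Values implements the ValuesProvider interface.
func (c *postCmd) Values() url.Values { return c.form }

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		if err != nil {
			fmt.Printf("error: %s\n", err)
			return
		}
		fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
	}))
	queue := f.Start()

	u, err := url.Parse("http://example.com/login") // illustrative URL
	if err != nil {
		panic(err)
	}
	queue.Send(&postCmd{
		Cmd:  &fetchbot.Cmd{U: u, M: "POST"},
		form: url.Values{"user": {"bob"}},
	})
	queue.Close()
}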

Since the Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database), or a depth counter so that the crawling stops at a certain depth, etc. For basic commands that don't require additional information, the package provides the Cmd struct that implements the Command interface. This is the Command implementation used by the various Queue.SendString* methods.
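
For example, a depth-limited Command could be sketched as follows; depthCmd, maxDepth and the URL are illustrative names, not part of the package.

package crawl

import (
	"net/http"
	"net/url"

	"github.com/PuerkitoBio/fetchbot"
)

const maxDepth = 3 // illustrative crawl-depth limit

// depthCmd is a hypothetical Command that carries its crawl depth.
type depthCmd struct {
	*fetchbot.Cmd     // embedded: provides URL() and Method()
	depth         int // number of links away from the seed URL
}

// handler re-enqueues discovered links until the depth limit is reached.
func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		return
	}
	if c, ok := ctx.Cmd.(*depthCmd); ok && c.depth < maxDepth {
		// A real crawler would parse res.Body for links; this fixed URL
		// stands in for an extracted link, enqueued one level deeper.
		if u, perr := url.Parse("http://example.com/next"); perr == nil {
			ctx.Q.Send(&depthCmd{Cmd: &fetchbot.Cmd{U: u, M: "GET"}, depth: c.depth + 1})
		}
	}
}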

The Fetcher has a number of fields that provide further customization; a configuration sketch follows the list:

  • HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.

  • CrawlDelay : This value is used only if the robots.txt of a given host specifies no crawl delay.

  • UserAgent : Sets the user agent string to use for the requests and to validate against the robots.txt entries.

  • WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.
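
As mentioned above, here is a minimal configuration sketch; the specific values are illustrative only.

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		if err != nil {
			fmt.Printf("error: %s\n", err)
			return
		}
		fmt.Printf("[%d] %s\n", res.StatusCode, ctx.Cmd.URL())
	}))

	// All values below are illustrative.
	f.HttpClient = &http.Client{Timeout: 30 * time.Second} // any Doer works here
	f.CrawlDelay = 2 * time.Second                         // used when robots.txt sets no delay
	f.UserAgent = "MyBot (https://example.com/bot)"        // also checked against robots.txt
	f.WorkerIdleTTL = time.Minute                          // reclaim idle per-host goroutines

	queue := f.Start()
	queue.SendStringGet("http://golang.org")
	queue.Close()
}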

What fetchbot doesn't do - especially compared to gocrawl - is keep track of already visited URLs or normalize them. This is outside the scope of this package - all commands sent on the Queue will be fetched. Normalization can easily be done (e.g. using purell, https://github.com/PuerkitoBio/purell) before sending the Command to the Fetcher. How to keep track of visited URLs depends on the use case of the specific crawler, but for an example, see /example/full/main.go.
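
For illustration, a minimal duplicate filter could look like the following sketch. The visited type and its methods are assumed names, not part of the package; /example/full/main.go shows a more complete approach. A crawler would call tryVisit before each Send and skip the URL on false.

package crawl

import "sync"

// visited is an illustrative in-memory set of already-enqueued URLs,
// safe for use from concurrent goroutines.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisited() *visited {
	return &visited{seen: make(map[string]bool)}
}

// tryVisit reports whether rawurl is new, marking it as seen.
func (v *visited) tryVisit(rawurl string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[rawurl] {
		return false
	}
	v.seen[rawurl] = true
	return true
}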

License

The BSD 3-Clause license (http://opensource.org/licenses/BSD-3-Clause), the same as the Go language. The iq_slice.go file, integrated from the iq package, is under the CDDL-1.0 license (details in the source file).

Documentation

Index

Constants

const (
	// DefaultCrawlDelay represents the delay to use if there is no robots.txt
	// specified delay.
	DefaultCrawlDelay = 5 * time.Second

	// DefaultUserAgent is the default user agent string.
	DefaultUserAgent = "Fetchbot (https://github.com/PuerkitoBio/fetchbot)"

	// DefaultWorkerIdleTTL is the default time-to-live of an idle host worker goroutine.
	// If no URL is sent for a given host within this duration, this host's goroutine
	// is disposed of.
	DefaultWorkerIdleTTL = 30 * time.Second
)

Variables

var (
	// ErrEmptyHost is returned if a command to be enqueued has a URL with an empty host.
	ErrEmptyHost = errors.New("fetchbot: invalid empty host")

	// ErrDisallowed is returned when the requested URL is disallowed by the robots.txt
	// policy.
	ErrDisallowed = errors.New("fetchbot: disallowed by robots.txt")

	// ErrQueueClosed is returned when a Send call is made on a closed Queue.
	ErrQueueClosed = errors.New("fetchbot: send on a closed queue")
)

Functions

This section is empty.

Types

type BasicAuthProvider

type BasicAuthProvider interface {
	BasicAuth() (user string, pwd string)
}

The BasicAuthProvider interface gets the credentials to use to perform the request with Basic Authentication.
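
As a hedged sketch, a Command could carry credentials like this; authCmd is an illustrative name, not part of the package.

package crawl

import "github.com/PuerkitoBio/fetchbot"

// authCmd is a hypothetical Command whose requests are sent with
// Basic Authentication.
type authCmd struct {
	*fetchbot.Cmd        // embedded: provides URL() and Method()
	user, pwd     string // credentials returned by BasicAuth
}

// BasicAuth implements the BasicAuthProvider interface.
func (c *authCmd) BasicAuth() (string, string) { return c.user, c.pwd }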

type Cmd

type Cmd struct {
	U *url.URL
	M string
}

The Cmd struct defines a basic Command implementation.

func (*Cmd) Method

func (c *Cmd) Method() string

Method returns the HTTP verb to use to process this command (e.g. "GET", "HEAD", etc.).

func (*Cmd) URL

func (c *Cmd) URL() *url.URL

URL returns the resource targeted by this command.

type Command

type Command interface {
	URL() *url.URL
	Method() string
}

The Command interface defines the methods required by the Fetcher to request a resource.

type Context

type Context struct {
	Cmd Command
	Q   *Queue
}

Context is a Command's fetch context, passed to the Handler. It gives access to the original Command and the associated Queue.

type CookiesProvider

type CookiesProvider interface {
	Cookies() []*http.Cookie
}

The CookiesProvider interface gets the cookies to send with the request.

type DebugInfo

type DebugInfo struct {
	NumHosts int
}

The DebugInfo holds information to introspect the Fetcher's state.

type Doer

type Doer interface {
	Do(*http.Request) (*http.Response, error)
}

Doer defines the method required to use a type as HttpClient. The *http.Client type of the net/http package satisfies this interface.
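
For example, a sketch of a Doer that logs every request before delegating to a wrapped client; loggingDoer is an illustrative name.

package crawl

import (
	"log"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

// loggingDoer is a hypothetical Doer that logs each request before
// delegating to the wrapped client.
type loggingDoer struct {
	wrapped fetchbot.Doer
}

// Do implements the Doer interface.
func (d *loggingDoer) Do(req *http.Request) (*http.Response, error) {
	log.Printf("%s %s", req.Method, req.URL)
	return d.wrapped.Do(req)
}

Since *http.Client satisfies Doer, it could be installed with f.HttpClient = &loggingDoer{wrapped: http.DefaultClient}.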

type Fetcher

type Fetcher struct {
	// The Handler to be called for each request. All successfully enqueued requests
	// produce a Handler call.
	Handler Handler

	// Default delay to use between requests to a same host if there is no robots.txt
	// crawl delay.
	CrawlDelay time.Duration

	// The *http.Client to use for the requests. If nil, defaults to the net/http
	// package's default client. Should be HTTPClient to comply with go lint, but
	// this is a breaking change, won't fix.
	HttpClient Doer

	// The user-agent string to use for robots.txt validation and URL fetching.
	UserAgent string

	// The time a host-dedicated worker goroutine can stay idle, with no Command to enqueue,
	// before it is stopped and cleared from memory.
	WorkerIdleTTL time.Duration

	// AutoClose makes the fetcher close its queue automatically once the number
	// of hosts reach 0. A host is removed once it has been idle for WorkerIdleTTL
	// duration.
	AutoClose bool
	// contains filtered or unexported fields
}

A Fetcher defines the parameters for running a web crawler.
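
A minimal sketch of relying on AutoClose and Block for shutdown, instead of an explicit Close; the URL is illustrative.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		if err != nil {
			fmt.Printf("error: %s\n", err)
			return
		}
		fmt.Printf("[%d] %s\n", res.StatusCode, ctx.Cmd.URL())
	}))
	// Close the queue automatically once all hosts have been idle
	// for WorkerIdleTTL.
	f.AutoClose = true

	queue := f.Start()
	queue.SendStringGet("http://golang.org")
	queue.Block() // wait until the queue closes itself and drains
}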

func New

func New(h Handler) *Fetcher

New returns an initialized Fetcher.

func (*Fetcher) Debug

func (f *Fetcher) Debug() <-chan *DebugInfo

Debug returns the channel to use to receive the debugging information. It is not intended to be used by package users.

func (*Fetcher) Start

func (f *Fetcher) Start() *Queue

Start starts the Fetcher, and returns the Queue to use to send Commands to be fetched.

type Handler

type Handler interface {
	Handle(*Context, *http.Response, error)
}

The Handler interface is used to process the Fetcher's requests. It is similar to the net/http.Handler interface.

type HandlerFunc

type HandlerFunc func(*Context, *http.Response, error)

A HandlerFunc is a function signature that implements the Handler interface. A function with this signature can thus be used as a Handler.

func (HandlerFunc) Handle

func (h HandlerFunc) Handle(ctx *Context, res *http.Response, err error)

Handle is the Handler interface implementation for the HandlerFunc type.

type HeaderProvider

type HeaderProvider interface {
	Header() http.Header
}

The HeaderProvider interface gets the headers to set on the request. If an Authorization header is set, it will be overridden by the BasicAuthProvider, if implemented.

type Mux

type Mux struct {
	DefaultHandler Handler
	// contains filtered or unexported fields
}

Mux is a simple multiplexer for the Handler interface, similar to net/http.ServeMux. It is itself a Handler, and dispatches the calls to the matching Handlers.

For error Handlers, if there is a Handler registered for the same error value, it will be called. Otherwise, if there is a Handler registered for any error, this Handler will be called.

For Response Handlers, a match on a path criterion has higher priority than other matches, and the longest path match gets called.

If multiple Response Handlers with the same path length (or no path criterion) match a response, the actual Handler called is undefined, but one and only one will be called.

In any case, if no Handler matches, the DefaultHandler is called, and it defaults to a no-op.

func NewMux

func NewMux() *Mux

NewMux returns an initialized Mux.

func (*Mux) Handle

func (mux *Mux) Handle(ctx *Context, res *http.Response, err error)

Handle is the Handler interface implementation for Mux. It dispatches the calls to the matching Handler.

func (*Mux) HandleError

func (mux *Mux) HandleError(err error, h Handler)

HandleError registers a Handler for a specific error value. Multiple calls with the same error value override previous calls. As a special case, a nil error value registers a Handler for any error that doesn't have a specific Handler.

func (*Mux) HandleErrors

func (mux *Mux) HandleErrors(h Handler)

HandleErrors registers a Handler for any error that doesn't have a specific Handler.

func (*Mux) Response

func (mux *Mux) Response() *ResponseMatcher

Response initializes an entry for a Response Handler based on various criteria. The Response Handler is not registered until Handle is called.
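
A short sketch of the resulting chaining; the criteria values and URL are illustrative.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	mux := fetchbot.NewMux()

	// Each criteria method returns the same ResponseMatcher, so calls chain;
	// the entry is registered in the Mux only when Handler is called.
	mux.Response().Method("GET").Host("golang.org").Path("/doc").Handler(
		fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
			fmt.Printf("matched: %s\n", ctx.Cmd.URL())
		}))

	f := fetchbot.New(mux)
	queue := f.Start()
	queue.SendStringGet("http://golang.org/doc")
	queue.Close()
}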

type Queue

type Queue struct {
	// contains filtered or unexported fields
}

Queue offers methods to send Commands to the Fetcher, and to Stop the crawling process. It is safe to use from concurrent goroutines.

func (*Queue) Block

func (q *Queue) Block()

Block blocks the current goroutine until the Queue is closed and all pending commands are drained.

func (*Queue) Close

func (q *Queue) Close() error

Close closes the Queue so that no more Commands can be sent. It blocks until the Fetcher drains all pending commands. After the call, the Fetcher is stopped. Attempts to enqueue new URLs after Close has been called will always result in an ErrQueueClosed error.

func (*Queue) Send

func (q *Queue) Send(c Command) error

Send enqueues a Command into the Fetcher. If the Queue has been closed, it returns ErrQueueClosed.

func (*Queue) SendString

func (q *Queue) SendString(method string, rawurl ...string) (int, error)

SendString enqueues a method and some URL strings into the Fetcher. It returns an error if a URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
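
A small sketch of consuming both return values; the URLs are illustrative and the no-op Handler is just a placeholder.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		// A real Handler would inspect res here.
	}))
	queue := f.Start()

	// SendString stops at the first bad URL or on a closed queue; n counts
	// the commands successfully enqueued before that point.
	n, err := queue.SendString("GET", "http://golang.org", "http://golang.org/doc")
	if err != nil {
		fmt.Printf("enqueued %d URLs before error: %s\n", n, err)
	}
	queue.Close()
}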

func (*Queue) SendStringGet

func (q *Queue) SendStringGet(rawurl ...string) (int, error)

SendStringGet enqueues the URL strings to be fetched with a GET method. It returns an error if a URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

func (*Queue) SendStringHead

func (q *Queue) SendStringHead(rawurl ...string) (int, error)

SendStringHead enqueues the URL strings to be fetched with a HEAD method. It returns an error if a URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

type ReaderProvider

type ReaderProvider interface {
	Reader() io.Reader
}

The ReaderProvider interface gets the Reader to use as the Body of the request. It has higher priority than the ValuesProvider interface, so that if both interfaces are implemented, the ReaderProvider is used.

type ResponseMatcher

type ResponseMatcher struct {
	// contains filtered or unexported fields
}

A ResponseMatcher holds the criteria for a response Handler.

func (*ResponseMatcher) ContentType

func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher

ContentType sets a criterion based on the Content-Type header for the Response Handler. Its Handler will only be called if the response has this content type, ignoring any additional parameter in the header value (anything following the semicolon, e.g. the charset in "text/html; charset=utf-8").

func (*ResponseMatcher) Handler

func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher

Handler sets the Handler to be called when this Response Handler is the match for a given response. It registers the Response Handler in its parent Mux.

func (*ResponseMatcher) Host

func (r *ResponseMatcher) Host(host string) *ResponseMatcher

Host sets a criterion based on the host of the URL for the Response Handler. Its Handler will only be called if the host of the URL exactly matches the specified host.

func (*ResponseMatcher) Method

func (r *ResponseMatcher) Method(m string) *ResponseMatcher

Method sets a method criterion for the Response Handler. Its Handler will only be called if the request used this HTTP method (e.g. "GET", "HEAD", ...).

func (*ResponseMatcher) Path

func (r *ResponseMatcher) Path(p string) *ResponseMatcher

Path sets a criterion based on the path of the URL for the Response Handler. Its Handler will only be called if the path of the URL starts with this path. Longer matches have priority over shorter ones.

func (*ResponseMatcher) Status

func (r *ResponseMatcher) Status(code int) *ResponseMatcher

Status sets a criterion based on the status code of the response for the Response Handler. Its Handler will only be called if the response has this status code.

func (*ResponseMatcher) StatusRange

func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher

StatusRange sets a criterion based on the status code of the response for the Response Handler. Its Handler will only be called if the response has a status code between min and max. If min is greater than max, the values are switched.

type ValuesProvider

type ValuesProvider interface {
	Values() url.Values
}

The ValuesProvider interface gets the values to send as the Body of the request. It has lower priority than the ReaderProvider interface, so that if both interfaces are implemented, the ReaderProvider is used. If the request has no explicit Content-Type set, it will be automatically set to "application/x-www-form-urlencoded".

Directories

example
