Documentation ¶
Overview ¶
Package crawler provides a web crawler implementation.
Index ¶
Constants ¶
const (
	DefaultUserAgent       = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsUserAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive      = 2 * DefaultCrawlDelay
	DefaultCrawlDelay      = 3 * time.Second
)
Default options.
Variables ¶
This section is empty.
Functions ¶
func Fetch ¶
Fetch issues a GET request to the specified URL, following up to a maximum of 10 redirects. Fetch sends all links found in the response to push and afterwards calls the Crawler's Parse method, if it is not nil.
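The exact signature of Fetch is not shown above. The sketch below is a hypothetical usage example that assumes a signature of the form func Fetch(c Crawler, u *url.URL, push chan<- *url.URL) error; the parameter list is an assumption, not the documented API.

	// Assumed signature: func Fetch(c Crawler, u *url.URL, push chan<- *url.URL) error
	push := make(chan *url.URL, 64)
	go func() {
		// Drain links as Fetch discovers them.
		for link := range push {
			log.Printf("found link: %s", link)
		}
	}()

	u, err := url.Parse("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	if err := Fetch(c, u, push); err != nil { // c is any Crawler implementation
		log.Fatal(err)
	}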
Types ¶
type Bucket ¶
type Bucket interface {
	// Put puts the record for the given key. It overwrites any previous
	// record for that key; a Bucket is not a multi-map.
	Put(key []byte, record *pb.Record) (err error)

	// Get returns the record for the given key. It returns errNotFound if
	// the Bucket does not contain the key.
	Get(key []byte, record *pb.Record) (err error)

	// Delete deletes the record for the given key. It returns errNotFound
	// if the Bucket does not contain the key.
	Delete(key []byte) (err error)

	// Exists returns errNotFound if the Bucket does not contain the key.
	Exists(key []byte) (err error)

	// Bucket returns the name and unique identifier for this Bucket.
	Bucket() (name, uuid []byte)

	// List returns all collected records.
	List() (rec <-chan *pb.Record, err error)
}
Bucket represents a Record store.
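For illustration, a minimal in-memory Bucket might look like the sketch below. The type memBucket and the constructor newMemBucket are hypothetical, and because errNotFound is unexported the code assumes it lives inside the crawler package (it additionally imports sync).

	// memBucket is a hypothetical in-memory Bucket, useful for tests.
	type memBucket struct {
		mu         sync.Mutex
		name, uuid []byte
		records    map[string]*pb.Record
	}

	func newMemBucket(name, uuid []byte) *memBucket {
		return &memBucket{name: name, uuid: uuid, records: make(map[string]*pb.Record)}
	}

	func (b *memBucket) Put(key []byte, record *pb.Record) error {
		b.mu.Lock()
		defer b.mu.Unlock()
		b.records[string(key)] = record // overwrites any previous record for key
		return nil
	}

	func (b *memBucket) Get(key []byte, record *pb.Record) error {
		b.mu.Lock()
		defer b.mu.Unlock()
		r, ok := b.records[string(key)]
		if !ok {
			return errNotFound
		}
		*record = *r // copy into the caller-supplied record
		return nil
	}

	func (b *memBucket) Delete(key []byte) error {
		b.mu.Lock()
		defer b.mu.Unlock()
		if _, ok := b.records[string(key)]; !ok {
			return errNotFound
		}
		delete(b.records, string(key))
		return nil
	}

	func (b *memBucket) Exists(key []byte) error {
		b.mu.Lock()
		defer b.mu.Unlock()
		if _, ok := b.records[string(key)]; !ok {
			return errNotFound
		}
		return nil
	}

	func (b *memBucket) Bucket() (name, uuid []byte) { return b.name, b.uuid }

	func (b *memBucket) List() (<-chan *pb.Record, error) {
		b.mu.Lock()
		defer b.mu.Unlock()
		ch := make(chan *pb.Record, len(b.records)) // buffered so sends cannot block
		for _, r := range b.records {
			ch <- r
		}
		close(ch)
		return ch, nil
	}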
type Crawler ¶
type Crawler interface {
	// Fetch issues a GET to the specified URL and returns the response body
	// and an error, if any.
	Fetch(url *url.URL) (rc io.ReadCloser, err error)

	// Parse is called when visiting a page. Parse receives the http
	// response body and should return an error, if any. Parse may be nil.
	Parse(url *url.URL, body []byte) (err error)

	// Domain returns the host to crawl.
	Domain() (domain *url.URL)

	// Accept can be used to control which URLs the crawler visits.
	Accept(url *url.URL) (ok bool)

	// Seeds returns the base URLs used to start crawling.
	Seeds() []*url.URL

	// MaxVisit returns the maximum number of pages visited before stopping
	// the crawl. Note that the Crawler will send its stop signal once this
	// number of visits is reached, but workers may be in the process of
	// visiting other pages, so when the crawling stops, the number of pages
	// visited will be at least MaxVisit, possibly more.
	MaxVisit() (max uint32)

	// Delay returns the time to wait between each request to the same host.
	// The delay starts as soon as the response is received from the host.
	Delay() (delay time.Duration)

	// TTL returns the duration that a crawler goroutine can wait without
	// receiving new commands to fetch. If the idle time-to-live is reached,
	// the crawler goroutine is stopped and its resources are released. This
	// can be especially useful for long-running crawlers.
	TTL() (timeout time.Duration)
}
Crawler represents a crawler implementation.
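A minimal single-host Crawler might be implemented as in the sketch below. The type siteCrawler and its limits are hypothetical; the sketch relies only on the interface above, the package's default constants, and the standard io, log, net/http, net/url, and time packages.

	// siteCrawler is a hypothetical Crawler restricted to one host.
	type siteCrawler struct {
		domain *url.URL
		seeds  []*url.URL
	}

	// Fetch downloads the page body; http.Get follows up to 10 redirects by default.
	func (c *siteCrawler) Fetch(u *url.URL) (io.ReadCloser, error) {
		resp, err := http.Get(u.String())
		if err != nil {
			return nil, err
		}
		return resp.Body, nil
	}

	// Parse just logs each visited page.
	func (c *siteCrawler) Parse(u *url.URL, body []byte) error {
		log.Printf("visited %s (%d bytes)", u, len(body))
		return nil
	}

	func (c *siteCrawler) Domain() *url.URL { return c.domain }

	// Accept restricts the crawl to the configured host.
	func (c *siteCrawler) Accept(u *url.URL) bool { return u.Host == c.domain.Host }

	func (c *siteCrawler) Seeds() []*url.URL { return c.seeds }

	// Stop after roughly 1000 visits; in-flight workers may finish a few more.
	func (c *siteCrawler) MaxVisit() uint32 { return 1000 }

	func (c *siteCrawler) Delay() time.Duration { return DefaultCrawlDelay }

	func (c *siteCrawler) TTL() time.Duration { return DefaultTimeToLive }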
type Store ¶
type Store struct {
// contains filtered or unexported fields
}
Store implements a boltdb-backed Record store.
func (*Store) Backup ¶
Backup writes the entire database to a writer. A read-only transaction is maintained during the backup, so it is safe to continue using the database while a backup is in progress.
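A usage sketch: streaming an online backup to a file. The signature Backup(w io.Writer) error is an assumption, as the documented signature is not shown above; the file name is arbitrary.

	// store is a previously opened *Store.
	f, err := os.Create("records.db.bak")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Assumed signature: Backup(w io.Writer) error.
	if err := store.Backup(f); err != nil {
		log.Fatal(err)
	}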
Directories ¶
Path | Synopsis |
---|---|
crawlerpb | Package crawlerpb is a generated protocol buffer package. |
robotstxt | Package robotstxt implements the robots.txt Exclusion Protocol as specified in http://www.robotstxt.org/wc/robots.html |