gocrawl

package module
v0.0.0-...-a62c0e3
Published: Jan 16, 2015 License: BSD-3-Clause Imports: 15 Imported by: 0

README

crawler
=======

crawler aims to provide a Chinese-friendly web crawling system. It is based on gocrawl, a basic, lightweight, concurrent crawling library by PuerkitoBio. Project goals:

    1. Implement distributed crawling
    2. Add machine-learning algorithms
    3. Improve handling of Chinese text encodings
    4. Complete the documentation


Features
========
*    Full control over the URLs to visit, inspect and query (using a pre-initialized [goquery][] document)
*    Crawl delays applied per host
*    Obedience to robots.txt rules (using the [robotstxt.go][robots] library)
*    Concurrent execution using goroutines
*    Configurable logging
*    Open, customizable design providing hooks into the execution logic

Installation and dependencies
=============================

crawl depends on the following userland libraries:

*    [goquery][]
*    [purell][]
*    [robotstxt.go][robots]

To install:

   *go get github.com/zchking/crawl*

To install a previous version, you have to `git clone https://github.com/zchking/crawl` into your `$GOPATH/src/github.com/zchking/crawl/` directory, and then run (for example) `git checkout v0.3.2` to check out a specific version, and `go install` to build and install the Go package.

Changelog
=========

**2015.01.16** Started from PuerkitoBio's gocrawl. Thanks.


Example
=======

From `example_test.go`::

    package gocrawl

    import (
      "github.com/PuerkitoBio/goquery"
      "net/http"
      "regexp"
      "time"
    )

    // Only enqueue the root and paths beginning with an "a"
    var rxOk = regexp.MustCompile(`https?://duckduckgo\.com(/a.*)?$`)

    // Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
    // because we don't want/need to override all methods.
    type ExampleExtender struct {
      DefaultExtender // Will use the default implementation of all but Visit() and Filter()
    }

    // Override Visit for our need.
    func (this *ExampleExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
      // Use the goquery document or res.Body to manipulate the data
      // ...

      // Return nil and true - let gocrawl find the links
      return nil, true
    }

    // Override Filter for our need.
    func (this *ExampleExtender) Filter(ctx *URLContext, isVisited bool) bool {
      return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
    }

    func ExampleCrawl() {
      // Set custom options
      opts := NewOptions(new(ExampleExtender))
      opts.CrawlDelay = 1 * time.Second
      opts.LogFlags = LogAll

      // Play nice with ddgo when running the test!
      opts.MaxVisits = 2

      // Create crawler and start at root of duckduckgo
      c := NewCrawlerWithOptions(opts)
      c.Run("https://duckduckgo.com/")

      // Remove "x" before Output: to activate the example (will run on go test)

      // xOutput: voluntarily fail to see log output
    }

Documentation
=============
@TODO: write the crawler documentation.

For now, refer to PuerkitoBio/gocrawl.


Thanks
======
    
    - PuerkitoBio
    - Richard Penman
    - Dmitry Bondarenko
    - Markus Sonderegger

License
=======

The [BSD 3-Clause license][bsd].

[bsd]: http://opensource.org/licenses/BSD-3-Clause

[goquery]: https://github.com/PuerkitoBio/goquery

[robots]: https://github.com/temoto/robotstxt.go

[purell]: https://github.com/PuerkitoBio/purell

[robprot]: http://www.robotstxt.org/robotstxt.html

[robspec]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

Documentation

Overview

gocrawl is a polite, slim and concurrent web crawler written in Go.

Constants

const (
	DefaultUserAgent          string                    = `Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2`
	DefaultRobotUserAgent     string                    = `Googlebot (gocrawl v0.4)`
	DefaultEnqueueChanBuffer  int                       = 100
	DefaultHostBufferFactor   int                       = 10
	DefaultCrawlDelay         time.Duration             = 5 * time.Second
	DefaultIdleTTL            time.Duration             = 10 * time.Second
	DefaultNormalizationFlags purell.NormalizationFlags = purell.FlagsAllGreedy
)

Default options

Variables

var (
	// The error returned when a redirection is requested, so that the
	// worker knows that this is not an actual Fetch error, but a request to
	// enqueue the redirect-to URL.
	ErrEnqueueRedirect = errors.New("redirection not followed")

	// The error returned when the maximum number of visits, as specified by the
	// Options field MaxVisits, is reached.
	ErrMaxVisits = errors.New("the maximum number of visits is reached")

	ErrInterrupted = errors.New("interrupted")
)

var HttpClient = &http.Client{CheckRedirect: func(req *http.Request, via []*http.Request) error {

	if isRobotsURL(req.URL) {
		if len(via) >= 10 {
			return errors.New("stopped after 10 redirects")
		}
		return nil
	}

	return ErrEnqueueRedirect
}}

The default HTTP client used by DefaultExtender's fetch requests (this is thread-safe). The client's fields can be customized (i.e. for a different redirection strategy, a different Transport object, ...). It should be done prior to starting the crawler.
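
For illustration, here is a minimal sketch of that kind of customization, written like the README example as code inside the package itself; the timeout and transport values are arbitrary choices, not library defaults:

    package gocrawl

    import (
      "net/http"
      "time"
    )

    func customizeHTTPClient() {
      // Tune the shared client before any crawler is started. The
      // CheckRedirect function installed by the package is left untouched,
      // so the redirect handling described below still applies.
      HttpClient.Timeout = 30 * time.Second
      HttpClient.Transport = &http.Transport{
        MaxIdleConnsPerHost: 4,
      }
    }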

Functions

This section is empty.

Types

type CrawlError

type CrawlError struct {
	Ctx  *URLContext
	Err  error
	Kind CrawlErrorKind
	// contains filtered or unexported fields
}

Crawl error information.

func (CrawlError) Error

func (this CrawlError) Error() string

Implementation of the error interface.

type CrawlErrorKind

type CrawlErrorKind uint8

Enum indicating the kind of the crawling error.

const (
	CekFetch CrawlErrorKind = iota
	CekParseRobots
	CekHttpStatusCode
	CekReadBody
	CekParseBody
	CekParseURL
	CekProcessLinks
	CekParseRedirectURL
)

func (CrawlErrorKind) String

func (this CrawlErrorKind) String() string
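
As a sketch of how these types can be used, an extender embedding DefaultExtender (as in the README example) may override the Error() hook and branch on the error kind; the ReportingExtender name and the reactions shown are illustrative only:

    package gocrawl

    import "log"

    type ReportingExtender struct {
      DefaultExtender
    }

    // Error reports crawl errors differently depending on their kind.
    func (this *ReportingExtender) Error(err *CrawlError) {
      switch err.Kind {
      case CekFetch:
        log.Printf("fetch failed for %s: %v", err.Ctx.URL(), err.Err)
      case CekHttpStatusCode:
        log.Printf("non-2xx status for %s: %v", err.Ctx.URL(), err.Err)
      default:
        log.Printf("%s error: %v", err.Kind, err.Err)
      }
    }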

type Crawler

type Crawler struct {
	Options *Options
	// contains filtered or unexported fields
}

The crawler itself, the master of the whole process

func NewCrawler

func NewCrawler(ext Extender) *Crawler

Crawler constructor with the specified extender object.

func NewCrawlerWithOptions

func NewCrawlerWithOptions(opts *Options) *Crawler

Crawler constructor with a pre-initialized Options object.

func (*Crawler) Run

func (this *Crawler) Run(seeds interface{}) error

Run starts the crawling process, based on the given seeds and the current Options settings. Execution stops either when MaxVisits is reached (if specified) or when no more URLs need visiting. If an error occurs, it is returned (if MaxVisits is reached, the error ErrMaxVisits is returned).
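
A short sketch of checking Run's return value, using only identifiers documented on this page (the seed URL and visit limit are arbitrary):

    package gocrawl

    import "log"

    func runWithLimit() {
      opts := NewOptions(new(DefaultExtender))
      opts.MaxVisits = 10

      c := NewCrawlerWithOptions(opts)
      err := c.Run("https://example.com/")
      if err == ErrMaxVisits {
        // Stopping because the visit budget was spent is expected here.
        log.Println("stopped after reaching MaxVisits")
      } else if err != nil {
        log.Println("crawl ended with error:", err)
      }
    }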

func (*Crawler) Stop

func (this *Crawler) Stop()
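
A sketch of stopping a crawl from another goroutine after a fixed duration; whether Run then returns ErrInterrupted is not stated on this page, so the example only checks for a non-nil error:

    package gocrawl

    import (
      "log"
      "time"
    )

    func runWithDeadline() {
      c := NewCrawler(new(DefaultExtender))

      // Ask the crawler to stop after one minute, whatever state it is in.
      timer := time.AfterFunc(1*time.Minute, c.Stop)
      defer timer.Stop()

      if err := c.Run("https://example.com/"); err != nil {
        log.Println("crawl ended:", err)
      }
    }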

type DefaultExtender

type DefaultExtender struct {
	EnqueueChan chan<- interface{}
}

Default working implementation of an extender.
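
The EnqueueChan field is not documented on this page; in the upstream PuerkitoBio/gocrawl library it is set by the crawler and lets an extender push additional URLs to be enqueued, in the same forms accepted as seeds. Under that assumption, a hedged sketch:

    package gocrawl

    import (
      "net/http"

      "github.com/PuerkitoBio/goquery"
    )

    type EnqueuingExtender struct {
      DefaultExtender
    }

    // Visit pushes one extra, hard-coded URL onto the enqueue channel
    // (assumed behaviour of EnqueueChan, based on upstream gocrawl) and
    // still lets the crawler harvest the page's own links.
    func (this *EnqueuingExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
      if this.EnqueueChan != nil {
        this.EnqueueChan <- "https://example.com/extra-page"
      }
      return nil, true
    }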

func (*DefaultExtender) ComputeDelay

func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration

ComputeDelay returns the delay specified in the Crawler's Options, unless a crawl-delay is specified in the robots.txt file, which has precedence.
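
To change that policy, an extender can override ComputeDelay. The sketch below (hypothetical ThrottledExtender, arbitrary two-second cap) prefers the robots.txt delay when one is present, otherwise uses the Options delay, and then caps the result:

    package gocrawl

    import "time"

    type ThrottledExtender struct {
      DefaultExtender
    }

    // ComputeDelay prefers the robots.txt delay, falls back to the Options
    // delay, and never waits more than two seconds between fetches on a host.
    func (this *ThrottledExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration {
      delay := di.OptsDelay
      if di.RobotsDelay > 0 {
        delay = di.RobotsDelay
      }
      if delay > 2*time.Second {
        delay = 2 * time.Second
      }
      return delay
    }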

func (*DefaultExtender) Disallowed

func (this *DefaultExtender) Disallowed(ctx *URLContext)

Disallowed is a no-op.

func (*DefaultExtender) End

func (this *DefaultExtender) End(err error)

End is a no-op.

func (*DefaultExtender) Enqueued

func (this *DefaultExtender) Enqueued(ctx *URLContext)

Enqueued is a no-op.

func (*DefaultExtender) Error

func (this *DefaultExtender) Error(err *CrawlError)

Error is a no-op (logging is done automatically, regardless of the implementation of the Error() hook).

func (*DefaultExtender) Fetch

func (this *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)

Fetch requests the specified URL using the given user agent string. It uses a custom http Client instance that doesn't follow redirections. Instead, the redirected-to URL is enqueued so that it goes through the same Filter() and Fetch() process as any other URL.

Two options were considered for the default Fetch() implementation:

1. Not following any redirections, and enqueuing the redirect-to URL, failing the current call with the 3xx status code.

2. Following all redirections, enqueuing only the last one (where redirection stops), and returning the response of the next-to-last request.

Ultimately, 1) was implemented, as it is the most generic solution that makes sense as a default for the library. It involves no "magic" and gives full control over what can happen, with the disadvantage that Filter() must be aware of all possible intermediary URLs before the final destination of a redirection is reached (i.e. if A redirects to B, which redirects to C, Filter has to allow A, B, and C to be fetched, while solution 2 would only have required Filter to allow A and C).

Solution 2) also has the disadvantage of fetching twice the final URL (once while processing the original URL, so that it knows that there is no more redirection HTTP code, and another time when the actual destination URL is fetched to be visited).
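
In practice this means a Filter() implementation must be permissive enough to let intermediary redirect URLs through. A hedged sketch (hypothetical SameHostExtender; the host check is just one possible policy):

    package gocrawl

    import "strings"

    type SameHostExtender struct {
      DefaultExtender
    }

    // Filter accepts any not-yet-visited URL whose host ends with the target
    // domain, so every hop of a redirect chain within that domain (A -> B -> C)
    // is allowed to be fetched.
    func (this *SameHostExtender) Filter(ctx *URLContext, isVisited bool) bool {
      return !isVisited && strings.HasSuffix(ctx.NormalizedURL().Host, "example.com")
    }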

func (*DefaultExtender) FetchedRobots

func (this *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)

FetchedRobots is a no-op.

func (*DefaultExtender) Filter

func (this *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool

Enqueue the URL if it hasn't been visited yet.

func (*DefaultExtender) Log

func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)

Log prints to the standard error by default, based on the requested log verbosity.

func (*DefaultExtender) RequestGet

func (this *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool

Ask the worker to actually request the URL's body (issue a GET), unless the status code is not 2xx.
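
When the HeadBeforeGet option is enabled, this hook receives the HEAD response and can decide whether the GET should follow. A sketch that skips the GET for responses that do not look like HTML (hypothetical HTMLOnlyExtender; the content-type check is an illustrative policy, not library behaviour):

    package gocrawl

    import (
      "net/http"
      "strings"
    )

    type HTMLOnlyExtender struct {
      DefaultExtender
    }

    // RequestGet only issues the GET when the HEAD response succeeded and
    // advertises an HTML body.
    func (this *HTMLOnlyExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool {
      if headRes.StatusCode < 200 || headRes.StatusCode >= 300 {
        return false
      }
      return strings.Contains(headRes.Header.Get("Content-Type"), "text/html")
    }

    func crawlHTMLOnly() {
      opts := NewOptions(new(HTMLOnlyExtender))
      opts.HeadBeforeGet = true // issue a HEAD request before each GET
      c := NewCrawlerWithOptions(opts)
      c.Run("https://example.com/")
    }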

func (*DefaultExtender) RequestRobots

func (this *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)

Ask the worker to actually request (fetch) the Robots.txt document.

func (*DefaultExtender) Start

func (this *DefaultExtender) Start(seeds interface{}) interface{}

Return the same seeds as those received (those that were passed to Run() initially).

func (*DefaultExtender) Visit

func (this *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)

Ask the worker to harvest the links in this page.

func (*DefaultExtender) Visited

func (this *DefaultExtender) Visited(ctx *URLContext, harvested interface{})

Visited is a no-op.

type DelayInfo

type DelayInfo struct {
	OptsDelay   time.Duration
	RobotsDelay time.Duration
	LastDelay   time.Duration
}

Delay information: the Options delay, the Robots.txt delay, and the last delay used.

type Extender

type Extender interface {
	// Start, End, Error and Log are not related to a specific URL, so they don't
	// receive a URLContext struct.
	Start(interface{}) interface{}
	End(error)
	Error(*CrawlError)
	Log(LogFlags, LogFlags, string)

	// ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
	// is related to a URLContext (holds a ctx field).
	ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

	// All other extender methods are executed in the context of an URL, and thus
	// receive an URLContext struct as first argument.
	Fetch(*URLContext, string, bool) (*http.Response, error)
	RequestGet(*URLContext, *http.Response) bool
	RequestRobots(*URLContext, string) ([]byte, bool)
	FetchedRobots(*URLContext, *http.Response)
	Filter(*URLContext, bool) bool
	Enqueued(*URLContext)
	Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
	Visited(*URLContext, interface{})
	Disallowed(*URLContext)
}

Extension methods required to provide an extender instance.

type FetchInfo

type FetchInfo struct {
	Ctx           *URLContext
	Duration      time.Duration
	StatusCode    int
	IsHeadRequest bool
}

Fetch information: the URL context of the request, the duration of the fetch, the returned status code, and whether or not it was a HEAD request. Whether it was a robots.txt request can be checked via the context's IsRobotsURL() method.

type LogFlags

type LogFlags uint
const (
	LogError LogFlags = 1 << iota
	LogInfo
	LogEnqueued
	LogIgnored
	LogTrace
	LogNone LogFlags = 0
	LogAll  LogFlags = LogError | LogInfo | LogEnqueued | LogIgnored | LogTrace
)

Log levels for the library's logger
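
A sketch combining the flags with a custom Log() hook: the LogFlags option selects the requested verbosity, and the overridden hook decides where messages go (the QuietExtender name and the use of the standard logger are arbitrary choices):

    package gocrawl

    import "log"

    type QuietExtender struct {
      DefaultExtender
    }

    // Log forwards only the requested levels to the standard logger instead
    // of printing to standard error.
    func (this *QuietExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string) {
      if logFlags&msgLevel != 0 {
        log.Println(msg)
      }
    }

    func crawlWithSelectiveLogging() {
      opts := NewOptions(new(QuietExtender))
      opts.LogFlags = LogError | LogEnqueued // errors plus enqueued URLs
      c := NewCrawlerWithOptions(opts)
      c.Run("https://example.com/")
    }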

type Options

type Options struct {
	UserAgent             string
	RobotUserAgent        string
	MaxVisits             int
	EnqueueChanBuffer     int
	HostBufferFactor      int
	CrawlDelay            time.Duration // Applied per host
	WorkerIdleTTL         time.Duration
	SameHostOnly          bool
	HeadBeforeGet         bool
	URLNormalizationFlags purell.NormalizationFlags
	LogFlags              LogFlags
	Extender              Extender
}

The Options available to control and customize the crawling process.

func NewOptions

func NewOptions(ext Extender) *Options
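
NewOptions presumably returns an Options value pre-filled with the defaults listed under Constants (they are labelled "Default options"). A short sketch of adjusting a few fields before creating the crawler; the specific values are arbitrary:

    package gocrawl

    import "time"

    func configuredCrawler() *Crawler {
      opts := NewOptions(new(DefaultExtender))
      opts.UserAgent = "mycrawler/1.0 (+https://example.com/bot)" // hypothetical agent string
      opts.RobotUserAgent = "mycrawler"
      opts.CrawlDelay = 2 * time.Second // applied per host
      opts.SameHostOnly = true          // do not follow links to other hosts
      opts.MaxVisits = 100
      return NewCrawlerWithOptions(opts)
    }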

type S

type S map[string]interface{}

type U

type U map[*url.URL]interface{}
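
Their role is not spelled out on this page; in the upstream gocrawl library, maps of these shapes can be passed to Run() as seeds, with each map value becoming the State carried by that URL's URLContext. Under that assumption, a hedged sketch:

    package gocrawl

    func runWithSeedState() {
      // Each key is a seed URL; each value is assumed to become the State
      // of the corresponding URLContext (based on upstream gocrawl).
      seeds := S{
        "https://example.com/":      "landing",
        "https://example.com/docs/": "docs",
      }

      c := NewCrawler(new(DefaultExtender))
      c.Run(seeds)
    }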

type URLContext

type URLContext struct {
	HeadBeforeGet bool
	State         interface{}
	// contains filtered or unexported fields
}

func (*URLContext) IsRobotsURL

func (this *URLContext) IsRobotsURL() bool

func (*URLContext) NormalizedSourceURL

func (this *URLContext) NormalizedSourceURL() *url.URL

func (*URLContext) NormalizedURL

func (this *URLContext) NormalizedURL() *url.URL

func (*URLContext) SourceURL

func (this *URLContext) SourceURL() *url.URL

func (*URLContext) URL

func (this *URLContext) URL() *url.URL
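
A sketch of reading a URLContext from a hook, here an Enqueued() override that logs which page produced each enqueued URL (the TracingExtender name is illustrative):

    package gocrawl

    import "log"

    type TracingExtender struct {
      DefaultExtender
    }

    // Enqueued logs the normalized URL together with the page it was found on.
    func (this *TracingExtender) Enqueued(ctx *URLContext) {
      if src := ctx.SourceURL(); src != nil {
        log.Printf("enqueued %s (found on %s)", ctx.NormalizedURL(), src)
      } else {
        // Seeds have no source page; guard against a nil source URL.
        log.Printf("enqueued seed %s", ctx.NormalizedURL())
      }
    }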

Directories

Path Synopsis
cmd
