package colly

import "github.com/asciimoo/colly"

Package colly implements an HTTP scraping framework.

Package Files

colly.go http_backend.go

type Collector

type Collector struct {
    // UserAgent is the User-Agent string used by HTTP requests
    UserAgent string
    // MaxDepth limits the recursion depth of visited URLs.
    // Set it to 0 for infinite recursion (default).
    MaxDepth int
    // AllowedDomains is a domain whitelist.
    // Leave it blank to allow any domains to be visited
    AllowedDomains []string
    // DisallowedDomains is a domain blacklist.
    DisallowedDomains []string
    // URLFilters is a list of regular expressions that restricts which
    // URLs are visited. A request is allowed only if at least one of
    // the expressions matches its URL.
    // Leave it blank to allow any URLs to be visited
    URLFilters []*regexp.Regexp
    // AllowURLRevisit allows multiple downloads of the same URL
    AllowURLRevisit bool
    // MaxBodySize is the limit of the retrieved response body in bytes.
    // `0` means unlimited.
    // The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
    MaxBodySize int
    // CacheDir specifies a location where GET requests are cached as files.
    // When it's not defined, caching is disabled.
    CacheDir string
    // contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job

func NewCollector

func NewCollector() *Collector

NewCollector creates a new Collector instance with default configuration

func (*Collector) Cookies

func (c *Collector) Cookies(URL string) []*http.Cookie

Cookies returns the cookies to send in a request for the given URL.

func (*Collector) DisableCookies

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling for this collector

func (*Collector) Init

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new `LimitRule` to the collector

func (*Collector) Limits

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new `LimitRule`s to the collector

func (*Collector) OnError

func (c *Collector) OnError(f ErrorCallback)

OnError registers a function. Function will be executed if an error occurs during the HTTP request.

func (*Collector) OnHTML

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function. Function will be executed on every HTML element matched by the `goquerySelector` parameter. `goquerySelector` is a selector used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnHTMLDetach

func (c *Collector) OnHTMLDetach(goquerySelector string)

OnHTMLDetach deregisters the function registered for `goquerySelector`. The function will not be executed after it is detached.
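The register and detach pair can be pictured as a registry keyed by selector string. The following is a minimal sketch under that assumption; `registry`, `elementCallback`, `on`, `detach` and `dispatch` are hypothetical names, and this page does not say how the library stores callbacks or what happens when one selector is registered twice.

```go
package main

import "fmt"

// elementCallback is a hypothetical callback type taking the matched text.
type elementCallback func(text string)

// registry maps a selector string to its callback.
type registry struct {
	callbacks map[string]elementCallback
}

func newRegistry() *registry {
	return &registry{callbacks: make(map[string]elementCallback)}
}

// on registers f for selector, replacing any earlier callback (assumption).
func (r *registry) on(selector string, f elementCallback) {
	r.callbacks[selector] = f
}

// detach removes the callback for selector; later dispatches skip it.
func (r *registry) detach(selector string) {
	delete(r.callbacks, selector)
}

// dispatch runs the callback for selector, if any, and reports whether one ran.
func (r *registry) dispatch(selector, text string) bool {
	f, ok := r.callbacks[selector]
	if ok {
		f(text)
	}
	return ok
}

func main() {
	r := newRegistry()
	r.on("a[href]", func(text string) { fmt.Println("link:", text) })
	fmt.Println(r.dispatch("a[href]", "Home")) // callback runs, then prints true
	r.detach("a[href]")
	fmt.Println(r.dispatch("a[href]", "Home")) // prints false
}
```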

func (*Collector) OnRequest

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function. Function will be executed on every request made by the Collector

func (*Collector) OnResponse

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function. Function will be executed on every response

func (*Collector) Post

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collector job by creating a POST request. Post also calls the previously provided callbacks
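Post takes its payload as a `map[string]string`. This page does not state how the map is encoded on the wire; assuming the usual `application/x-www-form-urlencoded` form body, the encoding step could look like the sketch below. `encodeForm` is a hypothetical helper built on the standard library.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeForm turns a string map into an
// application/x-www-form-urlencoded body. Keys are emitted in sorted order.
func encodeForm(data map[string]string) string {
	form := url.Values{}
	for k, v := range data {
		form.Add(k, v)
	}
	return form.Encode()
}

func main() {
	body := encodeForm(map[string]string{"user": "alice", "q": "go colly"})
	fmt.Println(body) // q=go+colly&user=alice
}
```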

func (*Collector) PostMultipart

func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks
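To illustrate what a multipart body built from a `map[string][]byte` might contain, here is a minimal standard-library sketch. `buildMultipart` and `parseMultipart` are hypothetical helpers, and writing each entry as a plain form field (rather than a file part) is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"mime"
	"mime/multipart"
)

// buildMultipart writes each map entry as one form field and returns the
// body together with the Content-Type value, which carries the boundary.
func buildMultipart(data map[string][]byte) ([]byte, string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	for name, value := range data {
		part, err := w.CreateFormField(name)
		if err != nil {
			return nil, "", err
		}
		if _, err := part.Write(value); err != nil {
			return nil, "", err
		}
	}
	// Close writes the terminating boundary; the body is invalid without it.
	if err := w.Close(); err != nil {
		return nil, "", err
	}
	return buf.Bytes(), w.FormDataContentType(), nil
}

// parseMultipart reads the body back into a field map. It is only here to
// show that the encoded body round-trips.
func parseMultipart(body []byte, contentType string) (map[string]string, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return nil, err
	}
	r := multipart.NewReader(bytes.NewReader(body), params["boundary"])
	form, err := r.ReadForm(1 << 20)
	if err != nil {
		return nil, err
	}
	fields := make(map[string]string)
	for name, values := range form.Value {
		if len(values) > 0 {
			fields[name] = values[0]
		}
	}
	return fields, nil
}

func main() {
	body, contentType, err := buildMultipart(map[string][]byte{
		"name": []byte("gopher"),
		"blob": {0x00, 0x01, 0x02},
	})
	if err != nil {
		panic(err)
	}
	fields, err := parseMultipart(body, contentType)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["name"]) // gopher
}
```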

func (*Collector) PostRaw

func (c *Collector) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw also calls the previously provided callbacks

func (*Collector) SetCookies

func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error

SetCookies handles the receipt of the cookies in a reply for the given URL

func (*Collector) SetRequestTimeout

func (c *Collector) SetRequestTimeout(timeout time.Duration)

SetRequestTimeout overrides the default timeout (10 seconds) for this collector

func (*Collector) Visit

func (c *Collector) Visit(URL string) error

Visit starts the Collector's collecting job by creating a request to the URL given in the `URL` parameter. Visit also calls the previously provided callbacks

func (*Collector) Wait

func (c *Collector) Wait()

Wait returns when the collector jobs are finished

func (*Collector) WithTransport

func (c *Collector) WithTransport(transport *http.Transport)

WithTransport allows you to set a custom http.Transport for this collector.

type Context

type Context struct {
    // contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) Get

func (c *Context) Get(key string) string

Get retrieves a value from Context. If no value is found for `key`, Get returns an empty string

func (*Context) MarshalBinary

func (c *Context) MarshalBinary() (_ []byte, _ error)

MarshalBinary encodes the Context value. This function is used by request caching

func (*Context) Put

func (c *Context) Put(key, value string)

Put stores a value in Context
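The Put and Get contract (string keys, string values, empty string for a missing key) can be sketched as a small map wrapper. `kvContext` is a hypothetical type; guarding the map with a mutex is an assumption made because callbacks may run concurrently, not something this page states.

```go
package main

import (
	"fmt"
	"sync"
)

// kvContext is a hypothetical string store mirroring the documented
// Put/Get behaviour; it is not the library's Context type.
type kvContext struct {
	mu   sync.RWMutex
	data map[string]string
}

func newKVContext() *kvContext {
	return &kvContext{data: make(map[string]string)}
}

// Put stores value under key, overwriting any previous value.
func (c *kvContext) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// Get returns the value for key, or "" when the key is absent.
func (c *kvContext) Get(key string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key] // a missing map key yields the zero value ""
}

func main() {
	ctx := newKVContext()
	ctx.Put("referrer", "http://example.com/")
	fmt.Println(ctx.Get("referrer"))      // http://example.com/
	fmt.Println(ctx.Get("missing") == "") // true
}
```

Because a missing key and a key stored with an empty value both return "", callers that need to tell the two apart would have to track that separately.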

func (*Context) UnmarshalBinary

func (c *Context) UnmarshalBinary(_ []byte) error

UnmarshalBinary decodes the Context value to nil. This function is used by request caching

type ErrorCallback

type ErrorCallback func(*Response, error)

ErrorCallback is a type alias for OnError callback functions

type HTMLCallback

type HTMLCallback func(*HTMLElement)

HTMLCallback is a type alias for OnHTML callback functions

type HTMLElement

type HTMLElement struct {
    // Name is the name of the tag
    Name string
    Text string

    // Request is the request object of the element's HTML document
    Request *Request
    // Response is the Response object of the element's HTML document
    Response *Response
    // DOM is the goquery parsed DOM object of the page. DOM is relative
    // to the current HTMLElement
    DOM *goquery.Selection
    // contains filtered or unexported fields
}

HTMLElement is the representation of an HTML tag.

func (*HTMLElement) Attr

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found

type LimitRule

type LimitRule struct {
    // DomainRegexp is a regular expression to match against domains
    DomainRegexp string
    // DomainGlob is a glob pattern to match against domains
    DomainGlob string
    // Delay is the duration to wait before creating a new request to the matching domains
    Delay time.Duration
    // Parallelism is the maximum number of concurrent requests allowed to the matching domains
    Parallelism int
    // contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. There are two kinds of limitation:

- Parallelism: Set limit for the number of concurrent requests to a domain
- Delay: Set rate limit for a domain (this means no parallelism on the matching domains)

func (*LimitRule) Init

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match

func (r *LimitRule) Match(domain string) bool

Match checks that the domain parameter triggers the rule
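A rule carries both a regular expression and a glob pattern, so a domain check has to consult whichever of them is set. The sketch below shows one way to do that with the standard library. `domainMatches` is a hypothetical function, `path.Match` stands in for whatever glob implementation the library uses, and treating the two patterns as alternatives (either one may match) is an assumption.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// domainMatches reports whether domain satisfies the regular expression or
// the glob pattern. An empty pattern is skipped.
func domainMatches(domainRegexp, domainGlob, domain string) (bool, error) {
	if domainRegexp != "" {
		re, err := regexp.Compile(domainRegexp)
		if err != nil {
			return false, err
		}
		if re.MatchString(domain) {
			return true, nil
		}
	}
	if domainGlob != "" {
		// path.Match treats "*" as any run of non-"/" characters, which
		// is enough for host names.
		return path.Match(domainGlob, domain)
	}
	return false, nil
}

func main() {
	ok, _ := domainMatches("", "*.example.com", "www.example.com")
	fmt.Println(ok) // true
	ok, _ = domainMatches(`^api\.`, "", "api.example.org")
	fmt.Println(ok) // true
	ok, _ = domainMatches("", "*.example.com", "example.org")
	fmt.Println(ok) // false
}
```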

type Request

type Request struct {
    // URL is the parsed URL of the HTTP request
    URL *url.URL
    // Headers contains the Request's HTTP headers
    Headers *http.Header
    // Ctx is a context between a Request and a Response
    Ctx *Context
    // Depth is the number of parents of this request
    Depth int
    // contains filtered or unexported fields
}

Request is the representation of an HTTP request made by a Collector

func (*Request) AbsoluteURL

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. It returns an empty string if the URL chunk is a fragment or cannot be parsed

func (*Request) Post

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request and preserves the Context of the previous request. Post also calls the previously provided callbacks

func (*Request) PostMultipart

func (r *Request) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks

func (*Request) PostRaw

func (r *Request) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw preserves the Context of the previous request and calls the previously provided callbacks

func (*Request) Visit

func (r *Request) Visit(URL string) error

Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks

type RequestCallback

type RequestCallback func(*Request)

RequestCallback is a type alias for OnRequest callback functions

type Response

type Response struct {
    // StatusCode is the status code of the Response
    StatusCode int
    // Body is the content of the Response
    Body []byte
    // Ctx is a context between a Request and a Response
    Ctx *Context
    // Request is the Request object of the response
    Request *Request
    // Headers contains the Response's HTTP headers
    Headers *http.Header
}

Response is the representation of an HTTP response received by a Collector

type ResponseCallback

type ResponseCallback func(*Response)

ResponseCallback is a type alias for OnResponse callback functions

Directories

Path
examples/basic
examples/coursera_courses
examples/error_handling
examples/max_depth
examples/multipart
examples/parallel
examples/rate_limit
examples/request_context
examples/url_filter

Package colly imports 22 packages and is imported by 9 packages. Updated 2017-10-21.