colly: github.com/gocolly/colly

package colly

import "github.com/gocolly/colly"

Package colly implements an HTTP scraping framework.

Index

Package Files

colly.go context.go htmlelement.go http_backend.go request.go response.go unmarshal.go xmlelement.go

Variables

var (
    // ErrForbiddenDomain is the error thrown if visiting
    // a domain which is not allowed in AllowedDomains
    ErrForbiddenDomain = errors.New("Forbidden domain")
    // ErrMissingURL is the error type for missing URL errors
    ErrMissingURL = errors.New("Missing URL")
    // ErrMaxDepth is the error type for exceeding max depth
    ErrMaxDepth = errors.New("Max depth limit reached")
    // ErrNoURLFiltersMatch is the error thrown if visiting
    // a URL which is not allowed by URLFilters
    ErrNoURLFiltersMatch = errors.New("No URLFilters match")
    // ErrAlreadyVisited is the error type for already visited URLs
    ErrAlreadyVisited = errors.New("URL already visited")
    // ErrRobotsTxtBlocked is the error type for robots.txt errors
    ErrRobotsTxtBlocked = errors.New("URL blocked by robots.txt")
    // ErrNoCookieJar is the error type for missing cookie jar
    ErrNoCookieJar = errors.New("Cookie jar is not available")
    // ErrNoPattern is the error type for LimitRules without patterns
    ErrNoPattern = errors.New("No pattern defined in LimitRule")
)

func AllowURLRevisit

func AllowURLRevisit() func(*Collector)

AllowURLRevisit instructs the Collector to allow multiple downloads of the same URL

func AllowedDomains

func AllowedDomains(domains ...string) func(*Collector)

AllowedDomains sets the domain whitelist used by the Collector.

func Async

func Async(a bool) func(*Collector)

Async turns on asynchronous network requests.

func CacheDir

func CacheDir(path string) func(*Collector)

CacheDir specifies the location where GET requests are cached as files.

func Debugger

func Debugger(d debug.Debugger) func(*Collector)

Debugger sets the debugger used by the Collector.

func DetectCharset

func DetectCharset() func(*Collector)

DetectCharset enables character encoding detection for non-utf8 response bodies without explicit charset declaration. This feature uses https://github.com/saintfish/chardet

func DisallowedDomains

func DisallowedDomains(domains ...string) func(*Collector)

DisallowedDomains sets the domain blacklist used by the Collector.

func ID

func ID(id uint32) func(*Collector)

ID sets the unique identifier of the Collector.

func IgnoreRobotsTxt

func IgnoreRobotsTxt() func(*Collector)

IgnoreRobotsTxt instructs the Collector to ignore any restrictions set by the target host's robots.txt file.

func MaxBodySize

func MaxBodySize(sizeInBytes int) func(*Collector)

MaxBodySize sets the limit of the retrieved response body in bytes.

func MaxDepth

func MaxDepth(depth int) func(*Collector)

MaxDepth limits the recursion depth of visited URLs.

func ParseHTTPErrorResponse

func ParseHTTPErrorResponse() func(*Collector)

ParseHTTPErrorResponse allows parsing responses with HTTP errors

func SanitizeFileName

func SanitizeFileName(fileName string) string

SanitizeFileName replaces dangerous characters in a string so the return value can be used as a safe file name.

func URLFilters

func URLFilters(filters ...*regexp.Regexp) func(*Collector)

URLFilters sets the list of regular expressions which restricts visiting URLs. If any of the rules matches a URL, the request will not be stopped.
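
The matching rule can be sketched with the standard regexp package alone. This is an illustrative stand-in, not colly's implementation: the helper allowedByFilters and the sample patterns in sampleFilters are assumptions made up for the example, and the empty-list case follows the Collector.URLFilters field comment ("Leave it blank to allow any URLs to be visited").

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleFilters is made-up example data, not part of colly.
var sampleFilters = []*regexp.Regexp{
	regexp.MustCompile(`^https://example\.com/docs/`),
	regexp.MustCompile(`\.html$`),
}

// allowedByFilters is a hypothetical helper. It reports whether rawURL passes
// a URLFilters-style list: an empty list allows every URL, otherwise one
// matching pattern is enough.
func allowedByFilters(filters []*regexp.Regexp, rawURL string) bool {
	if len(filters) == 0 {
		return true
	}
	for _, re := range filters {
		if re.MatchString(rawURL) {
			return true
		}
	}
	return false
}

func main() {
	for _, u := range []string{
		"https://example.com/docs/intro",
		"https://example.org/page.html",
		"https://example.org/image.png",
	} {
		fmt.Printf("%-34s allowed=%v\n", u, allowedByFilters(sampleFilters, u))
	}
}
```

With the real package, the same patterns would be passed as colly.URLFilters(...) to NewCollector.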

func UnmarshalHTML

func UnmarshalHTML(v interface{}, s *goquery.Selection) error

UnmarshalHTML declaratively extracts text or attributes from an HTML response into a struct, using struct tags composed of CSS selectors. Allowed struct tags:

- "selector" (required): CSS (goquery) selector of the desired data
- "attr" (optional): Selects the matching element's attribute's value.
   Leave it blank or omit to get the text of the element.

Example struct declaration:

type Nested struct {
	String  string   `selector:"div > p"`
	Classes []string `selector:"li" attr:"class"`
	Struct  *Nested  `selector:"div > div"`
}

Supported types: struct, *struct, string, []string
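
To make the tag convention concrete, the sketch below reads the "selector" and "attr" tags from the example struct using only the reflect package. It does not parse any HTML and is not colly's unmarshalling code; fieldRule and rulesFor are hypothetical names, and skipping fields without a "selector" tag is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"reflect"
)

// Nested repeats the example struct declaration from the text above.
type Nested struct {
	String  string   `selector:"div > p"`
	Classes []string `selector:"li" attr:"class"`
	Struct  *Nested  `selector:"div > div"`
}

// fieldRule is a hypothetical record of what one field asks for.
type fieldRule struct {
	Field    string
	Selector string
	Attr     string // empty means "use the element's text"
}

// rulesFor is a hypothetical helper that reads the "selector" and "attr"
// struct tags of v, which must be a struct or a pointer to one.
func rulesFor(v interface{}) []fieldRule {
	t := reflect.TypeOf(v)
	for t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	var rules []fieldRule
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		sel, ok := f.Tag.Lookup("selector")
		if !ok {
			continue // assumption: fields without a selector tag are skipped
		}
		rules = append(rules, fieldRule{Field: f.Name, Selector: sel, Attr: f.Tag.Get("attr")})
	}
	return rules
}

func main() {
	for _, r := range rulesFor(&Nested{}) {
		fmt.Printf("%-8s selector=%q attr=%q\n", r.Field, r.Selector, r.Attr)
	}
}
```

An empty Attr corresponds to the "leave it blank or omit to get the text of the element" case described in the tag list.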

func UserAgent

func UserAgent(ua string) func(*Collector)

UserAgent sets the user agent used by the Collector.

type Collector

type Collector struct {
    // UserAgent is the User-Agent string used by HTTP requests
    UserAgent string
    // MaxDepth limits the recursion depth of visited URLs.
    // Set it to 0 for infinite recursion (default).
    MaxDepth int
    // AllowedDomains is a domain whitelist.
    // Leave it blank to allow any domains to be visited
    AllowedDomains []string
    // DisallowedDomains is a domain blacklist.
    DisallowedDomains []string
    // URLFilters is a list of regular expressions which restricts
    // visiting URLs. If any of the rules matches a URL, the
    // request will not be stopped.
    // Leave it blank to allow any URLs to be visited
    URLFilters []*regexp.Regexp
    // AllowURLRevisit allows multiple downloads of the same URL
    AllowURLRevisit bool
    // MaxBodySize is the limit of the retrieved response body in bytes.
    // 0 means unlimited.
    // The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
    MaxBodySize int
    // CacheDir specifies a location where GET requests are cached as files.
    // When it's not defined, caching is disabled.
    CacheDir string
    // IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
    // the target host's robots.txt file.  See http://www.robotstxt.org/ for more
    // information.
    IgnoreRobotsTxt bool
    // Async turns on asynchronous network communication. Use Collector.Wait() to
    // be sure all requests have been finished.
    Async bool
    // ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
    // By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
    // to true to enable it.
    ParseHTTPErrorResponse bool
    // ID is the unique identifier of a collector
    ID  uint32
    // DetectCharset can enable character encoding detection for non-utf8 response bodies
    // without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
    DetectCharset bool
    // RedirectHandler allows control on how a redirect will be managed
    RedirectHandler func(req *http.Request, via []*http.Request) error
    // contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job

func NewCollector

func NewCollector(options ...func(*Collector)) *Collector

NewCollector creates a new Collector instance with default configuration
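
The options accepted by NewCollector all share the func(*Collector) shape, which is the functional options pattern. The sketch below shows that pattern with a hypothetical stand-in type; collector, userAgent, maxDepth, newCollector and the placeholder default value are assumptions for illustration and are not colly's own code or defaults.

```go
package main

import "fmt"

// collector is a hypothetical stand-in carrying two of the documented fields.
type collector struct {
	UserAgent string
	MaxDepth  int
}

// userAgent returns an option that sets the UserAgent field.
func userAgent(ua string) func(*collector) {
	return func(c *collector) { c.UserAgent = ua }
}

// maxDepth returns an option that sets the MaxDepth field.
func maxDepth(depth int) func(*collector) {
	return func(c *collector) { c.MaxDepth = depth }
}

// newCollector starts from defaults and then applies each option in order,
// so later options override earlier ones.
func newCollector(options ...func(*collector)) *collector {
	c := &collector{UserAgent: "stand-in-agent"} // placeholder default, not colly's
	for _, apply := range options {
		apply(c)
	}
	return c
}

func main() {
	c := newCollector(userAgent("my-bot/1.0"), maxDepth(2))
	fmt.Println(c.UserAgent, c.MaxDepth)
}
```

The design keeps the constructor signature stable: adding a new setting means adding a new option function rather than a new parameter.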

func (*Collector) Appengine

func (c *Collector) Appengine(req *http.Request)

Appengine replaces the Collector's backend http.Client with an http.Client provided by appengine/urlfetch. Use this function when the scraper is initiated by an http.Request to Google App Engine.

func (*Collector) Clone

func (c *Collector) Clone() *Collector

Clone creates an exact copy of a Collector without callbacks. HTTP backend, robots.txt cache and cookie jar are shared between collectors.

func (*Collector) Cookies

func (c *Collector) Cookies(URL string) []*http.Cookie

Cookies returns the cookies to send in a request for the given URL.

func (*Collector) DisableCookies

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling

func (*Collector) Init

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new LimitRule to the collector

func (*Collector) Limits

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new LimitRules to the collector

func (*Collector) OnError

func (c *Collector) OnError(f ErrorCallback)

OnError registers a function. Function will be executed if an error occurs during the HTTP request.

func (*Collector) OnHTML

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function. Function will be executed on every HTML element matched by the GoQuery Selector parameter. GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnHTMLDetach

func (c *Collector) OnHTMLDetach(goquerySelector string)

OnHTMLDetach deregisters a function. The function will not be executed after it is detached.

func (*Collector) OnRequest

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function. Function will be executed on every request made by the Collector

func (*Collector) OnResponse

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function. Function will be executed on every response

func (*Collector) OnScraped

func (c *Collector) OnScraped(f ScrapedCallback)

OnScraped registers a function. Function will be executed after OnHTML, as a final part of the scraping.

func (*Collector) OnXML

func (c *Collector) OnXML(xpathQuery string, f XMLCallback)

OnXML registers a function. Function will be executed on every XML element matched by the xpath Query parameter. xpath Query is used by https://github.com/antchfx/xmlquery

func (*Collector) OnXMLDetach

func (c *Collector) OnXMLDetach(xpathQuery string)

OnXMLDetach deregisters a function. The function will not be executed after it is detached.

func (*Collector) Post

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collector job by creating a POST request. Post also calls the previously provided callbacks

func (*Collector) PostMultipart

func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks

func (*Collector) PostRaw

func (c *Collector) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw also calls the previously provided callbacks

func (*Collector) Request

func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error

Request starts a collector job by creating a custom HTTP request where method, context, headers and request data can be specified. Set requestData, ctx, hdr parameters to nil if you don't want to use them. Valid methods:

- "GET"
- "POST"
- "PUT"
- "DELETE"
- "PATCH"
- "OPTIONS"

func (*Collector) SetCookieJar

func (c *Collector) SetCookieJar(j *cookiejar.Jar)

SetCookieJar overrides the previously set cookie jar

func (*Collector) SetCookies

func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error

SetCookies handles the receipt of the cookies in a reply for the given URL

func (*Collector) SetDebugger

func (c *Collector) SetDebugger(d debug.Debugger)

SetDebugger attaches a debugger to the collector

func (*Collector) SetProxy

func (c *Collector) SetProxy(proxyURL string) error

SetProxy sets a proxy for the collector. This method overrides the previously used http.Transport if the type of the transport is not http.RoundTripper. The proxy type is determined by the URL scheme. "http" and "socks5" are supported. If the scheme is empty, "http" is assumed.

func (*Collector) SetProxyFunc

func (c *Collector) SetProxyFunc(p ProxyFunc)

SetProxyFunc sets a custom proxy setter/switcher function. See built-in ProxyFuncs for more details. This method overrides the previously used http.Transport if the type of the transport is not http.RoundTripper. The proxy type is determined by the URL scheme. "http" and "socks5" are supported. If the scheme is empty, "http" is assumed.

func (*Collector) SetRequestTimeout

func (c *Collector) SetRequestTimeout(timeout time.Duration)

SetRequestTimeout overrides the default timeout (10 seconds) for this collector

func (*Collector) SetStorage

func (c *Collector) SetStorage(s storage.Storage) error

SetStorage overrides the default in-memory storage. Storage stores scraping related data like cookies and visited urls

func (*Collector) String

func (c *Collector) String() string

String is the text representation of the collector. It contains useful debug information about the collector's internals

func (*Collector) UnmarshalRequest

func (c *Collector) UnmarshalRequest(r []byte) (*Request, error)

UnmarshalRequest creates a Request from serialized data

func (*Collector) Visit

func (c *Collector) Visit(URL string) error

Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks

func (*Collector) Wait

func (c *Collector) Wait()

Wait returns when the collector jobs are finished
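
Async and Wait follow the usual start-then-wait pattern: jobs are launched without blocking and Wait blocks until all of them are done. The sketch below illustrates that pattern with sync.WaitGroup. asyncRunner and its methods are hypothetical stand-ins that perform no HTTP requests, and nothing here should be read as a statement about how colly implements Wait.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncRunner is a hypothetical stand-in for a collector in async mode.
type asyncRunner struct {
	wg   sync.WaitGroup
	mu   sync.Mutex
	done []string
}

// visit starts a job in the background and returns immediately.
func (r *asyncRunner) visit(u string) {
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		// A real collector would perform the HTTP request here.
		r.mu.Lock()
		r.done = append(r.done, u)
		r.mu.Unlock()
	}()
}

// wait blocks until every job started by visit has finished.
func (r *asyncRunner) wait() { r.wg.Wait() }

// finished returns how many jobs have completed so far.
func (r *asyncRunner) finished() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.done)
}

func main() {
	r := &asyncRunner{}
	for i := 0; i < 5; i++ {
		r.visit(fmt.Sprintf("https://example.com/page/%d", i))
	}
	r.wait()
	fmt.Println("finished jobs:", r.finished())
}
```

Without the wait call, main could return while jobs are still running, which is why the Async field comment recommends calling Collector.Wait().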

func (*Collector) WithTransport

func (c *Collector) WithTransport(transport http.RoundTripper)

WithTransport allows you to set a custom http.RoundTripper (transport)

type Context

type Context struct {
    // contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) ForEach

func (c *Context) ForEach(fn func(k string, v interface{}) interface{}) []interface{}

ForEach iterates over the Context

func (*Context) Get

func (c *Context) Get(key string) string

Get retrieves a string value from Context. Get returns an empty string if key not found

func (*Context) GetAny

func (c *Context) GetAny(key string) interface{}

GetAny retrieves a value from Context. GetAny returns nil if key not found

func (*Context) MarshalBinary

func (c *Context) MarshalBinary() (_ []byte, _ error)

MarshalBinary encodes the Context value. This function is used by request caching.

func (*Context) Put

func (c *Context) Put(key string, value interface{})

Put stores a value of any type in Context
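
The Put, Get and GetAny semantics can be sketched as a small mutex-protected map. kvContext is a hypothetical stand-in written for illustration, not colly's Context. In particular, returning an empty string from Get when the stored value is not a string is an assumption of this sketch; the documentation above only specifies the missing-key case.

```go
package main

import (
	"fmt"
	"sync"
)

// kvContext is a hypothetical stand-in for a Context-like key/value store.
type kvContext struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newKVContext() *kvContext {
	return &kvContext{data: make(map[string]interface{})}
}

// Put stores a value of any type under key.
func (c *kvContext) Put(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// GetAny returns the stored value, or nil if key is not present.
func (c *kvContext) GetAny(key string) interface{} {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key]
}

// Get returns the stored string, or "" if key is not present.
// Assumption of this sketch: non-string values also yield "".
func (c *kvContext) Get(key string) string {
	s, _ := c.GetAny(key).(string)
	return s
}

func main() {
	ctx := newKVContext()
	ctx.Put("title", "Home")
	ctx.Put("depth", 3)
	fmt.Printf("%q %q %v\n", ctx.Get("title"), ctx.Get("missing"), ctx.GetAny("depth"))
}
```

In colly the same idea lets an OnRequest callback store a value that a later OnResponse or OnHTML callback reads back through the shared Ctx field.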

func (*Context) UnmarshalBinary

func (c *Context) UnmarshalBinary(_ []byte) error

UnmarshalBinary decodes the Context value to nil. This function is used by request caching.

type ErrorCallback

type ErrorCallback func(*Response, error)

ErrorCallback is a type alias for OnError callback functions

type HTMLCallback

type HTMLCallback func(*HTMLElement)

HTMLCallback is a type alias for OnHTML callback functions

type HTMLElement

type HTMLElement struct {
    // Name is the name of the tag
    Name string
    Text string

    // Request is the request object of the element's HTML document
    Request *Request
    // Response is the Response object of the element's HTML document
    Response *Response
    // DOM is the goquery parsed DOM object of the page. DOM is relative
    // to the current HTMLElement
    DOM *goquery.Selection
    // contains filtered or unexported fields
}

HTMLElement is the representation of an HTML tag.

func NewHTMLElementFromSelectionNode

func NewHTMLElementFromSelectionNode(resp *Response, s *goquery.Selection, n *html.Node) *HTMLElement

NewHTMLElementFromSelectionNode creates an HTMLElement from a goquery.Selection Node.

func (*HTMLElement) Attr

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found

func (*HTMLElement) ChildAttr

func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string

ChildAttr returns the stripped text content of the first matching element's attribute.

func (*HTMLElement) ChildAttrs

func (h *HTMLElement) ChildAttrs(goquerySelector, attrName string) []string

ChildAttrs returns the stripped text content of all the matching elements' attributes.

func (*HTMLElement) ChildText

func (h *HTMLElement) ChildText(goquerySelector string) string

ChildText returns the concatenated and stripped text content of the matching elements.

func (*HTMLElement) ForEach

func (h *HTMLElement) ForEach(goquerySelector string, callback func(int, *HTMLElement))

ForEach iterates over the elements matched by the first argument and calls the callback function on every HTMLElement match.

func (*HTMLElement) Unmarshal

func (h *HTMLElement) Unmarshal(v interface{}) error

Unmarshal is a shorthand for colly.UnmarshalHTML

type LimitRule

type LimitRule struct {
    // DomainRegexp is a regular expression to match against domains
    DomainRegexp string
    // DomainGlob is a glob pattern to match against domains
    DomainGlob string
    // Delay is the duration to wait before creating a new request to the matching domains
    Delay time.Duration
    // RandomDelay is the extra randomized duration to wait added to Delay before creating a new request
    RandomDelay time.Duration
    // Parallelism is the number of the maximum allowed concurrent requests of the matching domains
    Parallelism int
    // contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. Both DomainRegexp and DomainGlob can be used to specify the included domain patterns, but at least one is required. There are two kinds of limitations:

- Parallelism: Set limit for the number of concurrent requests to matching domains
- Delay: Wait specified amount of time between requests (parallelism is 1 in this case)
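
The two kinds of limitation can be sketched with the standard library: a buffered channel acts as a counting semaphore for Parallelism, and a sleep before releasing the slot provides the Delay. This is a stand-in for illustration only. The names limiter, newLimiter, match, do and maxConcurrent are hypothetical, and matching the glob with path.Match is an assumption of this sketch rather than a statement about colly's glob implementation.

```go
package main

import (
	"fmt"
	"path"
	"sync"
	"time"
)

// limiter is a hypothetical stand-in for the two LimitRule behaviours.
type limiter struct {
	glob  string        // domain glob, matched with path.Match in this sketch
	delay time.Duration // wait applied before a slot is released
	slots chan struct{} // buffered channel used as a counting semaphore
}

func newLimiter(glob string, parallelism int, delay time.Duration) *limiter {
	if parallelism < 1 {
		parallelism = 1 // a delay-only rule behaves like parallelism 1
	}
	return &limiter{glob: glob, delay: delay, slots: make(chan struct{}, parallelism)}
}

// match reports whether domain is covered by the rule's glob.
func (l *limiter) match(domain string) bool {
	ok, err := path.Match(l.glob, domain)
	return err == nil && ok
}

// do runs fn while holding one slot, then waits delay before releasing it.
func (l *limiter) do(fn func()) {
	l.slots <- struct{}{}
	defer func() {
		time.Sleep(l.delay)
		<-l.slots
	}()
	fn()
}

// maxConcurrent pushes n short jobs through l and returns the peak
// number of jobs that were running at the same time.
func maxConcurrent(l *limiter, n int) int {
	var (
		mu            sync.Mutex
		wg            sync.WaitGroup
		current, peak int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				mu.Lock()
				current++
				if current > peak {
					peak = current
				}
				mu.Unlock()
				time.Sleep(5 * time.Millisecond) // simulated request
				mu.Lock()
				current--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	l := newLimiter("*.example.com", 2, 0)
	fmt.Println("matches api.example.com:", l.match("api.example.com"))
	fmt.Println("peak concurrency:", maxConcurrent(l, 8))
}
```

With the real package, the equivalent configuration would be a LimitRule value with DomainGlob, Parallelism and Delay set, passed to Collector.Limit.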

func (*LimitRule) Init

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match

func (r *LimitRule) Match(domain string) bool

Match checks that the domain parameter triggers the rule

type ProxyFunc

type ProxyFunc func(*http.Request) (*url.URL, error)

ProxyFunc is a type alias for proxy setter functions.

type Request

type Request struct {
    // URL is the parsed URL of the HTTP request
    URL *url.URL
    // Headers contains the Request's HTTP headers
    Headers *http.Header
    // Ctx is a context between a Request and a Response
    Ctx *Context
    // Depth is the number of the parents of the request
    Depth int
    // Method is the HTTP method of the request
    Method string
    // Body is the request body which is used on POST/PUT requests
    Body io.Reader
    // ResponseCharacterEncoding is the character encoding of the response body.
    // Leave it blank to allow automatic character encoding detection of the response body.
    // It is empty by default and it can be set in OnRequest callback.
    ResponseCharacterEncoding string
    // ID is the Unique identifier of the request
    ID  uint32
    // contains filtered or unexported fields
}

Request is the representation of an HTTP request made by a Collector

func (*Request) Abort

func (r *Request) Abort()

Abort cancels the HTTP request when called in an OnRequest callback

func (*Request) AbsoluteURL

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. AbsoluteURL returns an empty string if the URL chunk is a fragment or could not be parsed
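
The described behaviour maps closely onto url.URL.ResolveReference from the standard library. The sketch below is a hypothetical stand-alone helper, not the method itself: absoluteURL takes the base URL as a string for convenience, and treating any reference that starts with "#" as a fragment is an assumption about what "fragment" means here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// absoluteURL is a hypothetical helper that resolves ref against base.
// It returns "" for fragment-only references and for unparsable input.
func absoluteURL(base, ref string) string {
	if strings.HasPrefix(ref, "#") {
		return ""
	}
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	r, err := url.Parse(ref)
	if err != nil {
		return ""
	}
	return b.ResolveReference(r).String()
}

func main() {
	base := "https://example.com/docs/intro.html"
	for _, ref := range []string{"../img/logo.png", "/about", "https://other.test/x", "#top"} {
		fmt.Printf("%-22q -> %q\n", ref, absoluteURL(base, ref))
	}
}
```

In a callback, the typical use is resolving an href attribute against the page it was found on before passing it to Visit.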

func (*Request) Marshal

func (r *Request) Marshal() ([]byte, error)

Marshal serializes the Request

func (*Request) New

func (r *Request) New(method, URL string, body io.Reader) (*Request, error)

New creates a new request with the context of the original request

func (*Request) Post

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request and preserves the Context of the previous request. Post also calls the previously provided callbacks

func (*Request) PostMultipart

func (r *Request) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks

func (*Request) PostRaw

func (r *Request) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw preserves the Context of the previous request and calls the previously provided callbacks

func (*Request) Retry

func (r *Request) Retry() error

Retry submits HTTP request again with the same parameters

func (*Request) Visit

func (r *Request) Visit(URL string) error

Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks

type RequestCallback

type RequestCallback func(*Request)

RequestCallback is a type alias for OnRequest callback functions

type Response

type Response struct {
    // StatusCode is the status code of the Response
    StatusCode int
    // Body is the content of the Response
    Body []byte
    // Ctx is a context between a Request and a Response
    Ctx *Context
    // Request is the Request object of the response
    Request *Request
    // Headers contains the Response's HTTP headers
    Headers *http.Header
}

Response is the representation of an HTTP response received by a Collector

func (*Response) FileName

func (r *Response) FileName() string

FileName returns the sanitized file name parsed from "Content-Disposition" header or from URL
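
The header-then-URL lookup can be sketched with mime.ParseMediaType and path.Base from the standard library. fileNameFor is a hypothetical helper, not the method's implementation: it leaves out the sanitizing step, and the "index" fallback for URLs without a final path element is a placeholder chosen for this sketch.

```go
package main

import (
	"fmt"
	"mime"
	"net/url"
	"path"
)

// fileNameFor is a hypothetical helper. It prefers the filename parameter of
// a Content-Disposition header value and falls back to the last element of
// the URL path. No sanitizing is done in this sketch.
func fileNameFor(contentDisposition, rawURL string) string {
	if _, params, err := mime.ParseMediaType(contentDisposition); err == nil {
		if name := params["filename"]; name != "" {
			return name
		}
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	base := path.Base(u.Path)
	if base == "." || base == "/" {
		return "index" // placeholder fallback, an assumption of this sketch
	}
	return base
}

func main() {
	fmt.Println(fileNameFor(`attachment; filename="report.pdf"`, "https://example.com/download?id=1"))
	fmt.Println(fileNameFor("", "https://example.com/files/data.csv"))
	fmt.Println(fileNameFor("", "https://example.com/"))
}
```

A name obtained this way would still need to go through something like SanitizeFileName before being passed to Save.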

func (*Response) Save

func (r *Response) Save(fileName string) error

Save writes response body to disk

type ResponseCallback

type ResponseCallback func(*Response)

ResponseCallback is a type alias for OnResponse callback functions

type ScrapedCallback

type ScrapedCallback func(*Response)

ScrapedCallback is a type alias for OnScraped callback functions

type XMLCallback

type XMLCallback func(*XMLElement)

XMLCallback is a type alias for OnXML callback functions

type XMLElement

type XMLElement struct {
    // Name is the name of the tag
    Name string
    Text string

    // Request is the request object of the element's HTML document
    Request *Request
    // Response is the Response object of the element's HTML document
    Response *Response
    // DOM is the DOM object of the page. DOM is relative
    // to the current XMLElement and is either a html.Node or xmlquery.Node
    // based on how the XMLElement was created.
    DOM interface{}
    // contains filtered or unexported fields
}

XMLElement is the representation of an XML tag.

func NewXMLElementFromHTMLNode

func NewXMLElementFromHTMLNode(resp *Response, s *html.Node) *XMLElement

NewXMLElementFromHTMLNode creates an XMLElement from a html.Node.

func NewXMLElementFromXMLNode

func NewXMLElementFromXMLNode(resp *Response, s *xmlquery.Node) *XMLElement

NewXMLElementFromXMLNode creates an XMLElement from a xmlquery.Node.

func (*XMLElement) Attr

func (h *XMLElement) Attr(k string) string

Attr returns the selected attribute of an XMLElement, or an empty string if the attribute is not found

func (*XMLElement) ChildAttr

func (h *XMLElement) ChildAttr(xpathQuery, attrName string) string

ChildAttr returns the stripped text content of the first matching element's attribute.

func (*XMLElement) ChildAttrs

func (h *XMLElement) ChildAttrs(xpathQuery, attrName string) []string

ChildAttrs returns the stripped text content of all the matching elements' attributes.

func (*XMLElement) ChildText

func (h *XMLElement) ChildText(xpathQuery string) string

ChildText returns the concatenated and stripped text content of the matching elements.

Directories

Path          Synopsis
debug
extensions    Package extensions implements various helper addons for Colly
proxy
queue
storage

Package colly imports 41 packages and is imported by 24 packages. Updated 2018-04-17.