Documentation ¶
Overview ¶
Package colly implements an HTTP scraping framework.
Index ¶
- func SanitizeFileName(fileName string) string
- type Collector
- func (c *Collector) Clone() *Collector
- func (c *Collector) Cookies(URL string) []*http.Cookie
- func (c *Collector) DisableCookies()
- func (c *Collector) Init()
- func (c *Collector) Limit(rule *LimitRule) error
- func (c *Collector) Limits(rules []*LimitRule) error
- func (c *Collector) OnError(f ErrorCallback)
- func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
- func (c *Collector) OnHTMLDetach(goquerySelector string)
- func (c *Collector) OnRequest(f RequestCallback)
- func (c *Collector) OnResponse(f ResponseCallback)
- func (c *Collector) Post(URL string, requestData map[string]string) error
- func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error
- func (c *Collector) PostRaw(URL string, requestData []byte) error
- func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error
- func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error
- func (c *Collector) SetProxy(proxyURL string) error
- func (c *Collector) SetRequestTimeout(timeout time.Duration)
- func (c *Collector) String() string
- func (c *Collector) Visit(URL string) error
- func (c *Collector) Wait()
- func (c *Collector) WithTransport(transport http.RoundTripper)
- type Context
- type ErrorCallback
- type HTMLCallback
- type HTMLElement
- type LimitRule
- type Request
- func (r *Request) AbsoluteURL(u string) string
- func (r *Request) Post(URL string, requestData map[string]string) error
- func (r *Request) PostMultipart(URL string, requestData map[string][]byte) error
- func (r *Request) PostRaw(URL string, requestData []byte) error
- func (r *Request) Visit(URL string) error
- type RequestCallback
- type Response
- type ResponseCallback
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SanitizeFileName ¶
SanitizeFileName replaces dangerous characters in a string so the return value can be used as a safe file name.
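The set of characters that SanitizeFileName treats as dangerous is not listed here. As a rough illustration only, the sketch below shows one way such a function could work using the standard library; `sanitize`, its allowed character set, and the handling of leading dots are all assumptions, not colly's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize is a hypothetical stand-in for SanitizeFileName.
// Assumption: only ASCII letters, digits, '.', '-' and '_' are kept.
func sanitize(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9',
			r == '.', r == '-', r == '_':
			b.WriteRune(r)
		default:
			// Path separators, spaces and everything else become "_".
			b.WriteRune('_')
		}
	}
	// Drop leading dots so the result can never be "." or "..".
	out := strings.TrimLeft(b.String(), ".")
	if out == "" {
		return "_"
	}
	return out
}

func main() {
	fmt.Println(sanitize("../etc/passwd?x=1"))
	fmt.Println(sanitize("report-2024.txt"))
}
```

An allow-list is used here instead of a deny-list because it is easier to reason about: anything not explicitly permitted is replaced.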
Types ¶
type Collector ¶
type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// DisallowedDomains is a domain blacklist.
	DisallowedDomains []string
	// URLFilters is a list of regular expressions which restricts
	// visiting URLs. If any of the rules matches to a URL the
	// request won't be stopped.
	// Leave it blank to allow any URLs to be visited
	URLFilters []*regexp.Regexp
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize is the limit of the retrieved response body in bytes.
	// 0 means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
	MaxBodySize int
	// CacheDir specifies a location where GET requests are cached as files.
	// When it's not defined, caching is disabled.
	CacheDir string
	// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
	// the target host's robots.txt file. See http://www.robotstxt.org/ for more
	// information.
	IgnoreRobotsTxt bool
	// Use this to identify the same request if AllowURLRevisit is not enabled.
	RequestIdentifier func(*Request) string
	// contains filtered or unexported fields
}
Collector provides the scraper instance for a scraping job
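The AllowedDomains, DisallowedDomains and URLFilters fields together decide whether a URL may be visited. The following sketch shows one plausible way to combine them, based only on the field comments above. The `rules` type, the `permitted` method, the `exampleRules` helper and the order of the checks are assumptions made for illustration, not colly's actual logic.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// rules mirrors three Collector fields by name; the type is hypothetical.
type rules struct {
	AllowedDomains    []string
	DisallowedDomains []string
	URLFilters        []*regexp.Regexp
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// permitted reports whether rawURL passes the rules.
// Assumption: the blacklist is checked first, then the whitelist, then filters.
func (r rules) permitted(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if contains(r.DisallowedDomains, host) {
		return false
	}
	if len(r.AllowedDomains) > 0 && !contains(r.AllowedDomains, host) {
		return false
	}
	if len(r.URLFilters) == 0 {
		return true // no filters: any URL may be visited
	}
	for _, re := range r.URLFilters {
		if re.MatchString(rawURL) {
			return true // one matching filter is enough
		}
	}
	return false
}

// exampleRules returns a sample configuration used in main.
func exampleRules() rules {
	return rules{
		AllowedDomains: []string{"example.com"},
		URLFilters:     []*regexp.Regexp{regexp.MustCompile(`/docs/`)},
	}
}

func main() {
	r := exampleRules()
	for _, u := range []string{
		"http://example.com/docs/intro",
		"http://example.com/blog/post",
		"http://other.org/docs/intro",
	} {
		fmt.Println(u, r.permitted(u))
	}
}
```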
func NewCollector ¶
func NewCollector() *Collector
NewCollector creates a new Collector instance with default configuration
func (*Collector) Clone ¶
Clone creates an exact copy of a Collector without callbacks. HTTP backend, robots.txt cache and cookie jar are shared between collectors.
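To make the sharing semantics concrete, here is a minimal sketch of a clone that copies configuration, shares the HTTP client (and therefore its cookie jar), and drops callbacks. The `scraper` type and its fields are hypothetical stand-ins for Collector internals, which this document does not show.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
)

// scraper is a hypothetical, much-reduced stand-in for Collector.
type scraper struct {
	UserAgent string
	MaxDepth  int
	client    *http.Client // shared backend; holds the cookie jar
	onRequest []func(url string)
}

func newScraper() *scraper {
	jar, err := cookiejar.New(nil)
	if err != nil {
		panic(err)
	}
	return &scraper{UserAgent: "sketch", client: &http.Client{Jar: jar}}
}

// clone copies the configuration and shares the client pointer,
// but deliberately leaves the callback list empty.
func (s *scraper) clone() *scraper {
	return &scraper{
		UserAgent: s.UserAgent,
		MaxDepth:  s.MaxDepth,
		client:    s.client,
	}
}

func main() {
	a := newScraper()
	a.onRequest = append(a.onRequest, func(u string) { fmt.Println("visiting", u) })
	b := a.clone()
	fmt.Println("same client:", a.client == b.client)
	fmt.Println("callbacks on clone:", len(b.onRequest))
}
```

Sharing the client by pointer is what lets cookies set through one collector be visible to its clones in this sketch.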
func (*Collector) DisableCookies ¶
func (c *Collector) DisableCookies()
DisableCookies turns off cookie handling for this collector
func (*Collector) Init ¶
func (c *Collector) Init()
Init initializes the Collector's private variables and sets default configuration for the Collector
func (*Collector) OnError ¶
func (c *Collector) OnError(f ErrorCallback)
OnError registers a function. The function will be executed if an error occurs during the HTTP request.
func (*Collector) OnHTML ¶
func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
OnHTML registers a function. The function will be executed on every HTML element matched by the goquerySelector parameter. A GoQuery selector is a selector used by https://github.com/PuerkitoBio/goquery
func (*Collector) OnHTMLDetach ¶
OnHTMLDetach deregisters a function. The function will not be executed after it is detached.
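The register/detach pairing can be pictured as a list of callbacks keyed by selector string. The sketch below is an illustration under that assumption; `registry`, `entry` and `element` are hypothetical names, and matching is done by exact selector string rather than by goquery as in the real package.

```go
package main

import "fmt"

// element is a hypothetical stand-in for HTMLElement.
type element struct{ Text string }

type entry struct {
	selector string
	fn       func(*element)
}

// registry keeps callbacks in registration order.
type registry struct{ entries []entry }

// onHTML registers fn under the given selector.
func (r *registry) onHTML(selector string, fn func(*element)) {
	r.entries = append(r.entries, entry{selector, fn})
}

// detach removes every callback registered under selector.
func (r *registry) detach(selector string) {
	kept := r.entries[:0]
	for _, e := range r.entries {
		if e.selector != selector {
			kept = append(kept, e)
		}
	}
	r.entries = kept
}

// dispatch runs the callbacks registered under selector.
func (r *registry) dispatch(selector string, el *element) {
	for _, e := range r.entries {
		if e.selector == selector {
			e.fn(el)
		}
	}
}

func main() {
	r := &registry{}
	r.onHTML("a[href]", func(e *element) { fmt.Println("link:", e.Text) })
	r.onHTML("title", func(e *element) { fmt.Println("title:", e.Text) })
	r.dispatch("a[href]", &element{Text: "Home"})
	r.detach("a[href]")
	r.dispatch("a[href]", &element{Text: "not printed"})
	r.dispatch("title", &element{Text: "Example"})
}
```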
func (*Collector) OnRequest ¶
func (c *Collector) OnRequest(f RequestCallback)
OnRequest registers a function. The function will be executed on every request made by the Collector.
func (*Collector) OnResponse ¶
func (c *Collector) OnResponse(f ResponseCallback)
OnResponse registers a function. The function will be executed on every response.
func (*Collector) Post ¶
Post starts a collector job by creating a POST request. Post also calls the previously provided callbacks
func (*Collector) PostMultipart ¶
PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks
func (*Collector) PostRaw ¶
PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw also calls the previously provided callbacks
func (*Collector) Request ¶
func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error
Request starts a collector job by creating a custom HTTP request where method, context, headers and request data can be specified. Set requestData, ctx, hdr parameters to nil if you don't want to use them. Valid methods:
- "GET"
- "POST"
- "PUT"
- "DELETE"
- "PATCH"
- "OPTIONS"
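The sketch below shows how a request with a custom method, body and headers can be assembled with the standard net/http package, which is roughly what a call like Request has to do before sending. `buildRequest` is a hypothetical helper, and treating the list above as a strict allow-list is an assumption about behavior, not something this document states.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// validMethods mirrors the list of methods given above.
var validMethods = map[string]bool{
	"GET": true, "POST": true, "PUT": true,
	"DELETE": true, "PATCH": true, "OPTIONS": true,
}

// buildRequest is a hypothetical helper: it validates the method and
// copies the optional headers onto a new *http.Request.
// body and hdr may be nil, matching the "set to nil" convention above.
func buildRequest(method, rawURL string, body io.Reader, hdr http.Header) (*http.Request, error) {
	if !validMethods[method] {
		return nil, fmt.Errorf("unsupported method %q", method)
	}
	req, err := http.NewRequest(method, rawURL, body)
	if err != nil {
		return nil, err
	}
	for k, vs := range hdr {
		for _, v := range vs {
			req.Header.Add(k, v)
		}
	}
	return req, nil
}

func main() {
	hdr := http.Header{}
	hdr.Set("Content-Type", "application/json")
	req, err := buildRequest("PUT", "http://example.com/items/1",
		strings.NewReader(`{"n":1}`), hdr)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Content-Type"))
}
```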
func (*Collector) SetCookies ¶
SetCookies handles the receipt of the cookies in a reply for the given URL
func (*Collector) SetProxy ¶
SetProxy sets a proxy for the collector. This overrides the previously used http.Transport if the type of the transport is not http.RoundTripper
func (*Collector) SetRequestTimeout ¶
SetRequestTimeout overrides the default timeout (10 seconds) for this collector
func (*Collector) String ¶
String is the text representation of the collector. It contains useful debug information about the collector's internals
func (*Collector) Visit ¶
Visit starts the Collector's job by creating a request to the URL given in the URL parameter. Visit also calls the previously provided callbacks
func (*Collector) Wait ¶
func (c *Collector) Wait()
Wait returns when the collector jobs are finished
func (*Collector) WithTransport ¶
func (c *Collector) WithTransport(transport http.RoundTripper)
WithTransport allows you to set a custom http.RoundTripper (transport) for this collector.
type Context ¶
type Context struct {
// contains filtered or unexported fields
}
Context provides a tiny layer for passing data between callbacks
func (*Context) Get ¶
Get retrieves a value from Context. Get returns an empty string if the key is not found.
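A data-passing layer with this Get behavior can be sketched as a mutex-guarded string map. The `kv` type and its `put` method are hypothetical (only Get is documented above), and the use of a lock is an assumption about what a concurrent scraper would need.

```go
package main

import (
	"fmt"
	"sync"
)

// kv is a hypothetical stand-in for Context.
type kv struct {
	mu   sync.RWMutex
	data map[string]string
}

func newKV() *kv {
	return &kv{data: make(map[string]string)}
}

// put stores a value; hypothetical counterpart to Get.
func (c *kv) put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// get returns the stored value, or "" when the key is absent,
// matching the documented behavior of Get.
func (c *kv) get(key string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key] // a missing key yields the zero value ""
}

func main() {
	ctx := newKV()
	ctx.put("referrer", "http://example.com/")
	fmt.Printf("%q\n", ctx.get("referrer"))
	fmt.Printf("%q\n", ctx.get("missing"))
}
```

Because a missing key and an empty stored value both return "", callers in this sketch cannot tell the two apart.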
func (*Context) MarshalBinary ¶
MarshalBinary encodes the Context value. This function is used by request caching.
func (*Context) UnmarshalBinary ¶
UnmarshalBinary decodes the Context value to nil. This function is used by request caching.
type ErrorCallback ¶
ErrorCallback is a type alias for OnError callback functions
type HTMLCallback ¶
type HTMLCallback func(*HTMLElement)
HTMLCallback is a type alias for OnHTML callback functions
type HTMLElement ¶
type HTMLElement struct {
	// Name is the name of the tag
	Name string
	Text string
	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// contains filtered or unexported fields
}
HTMLElement is the representation of an HTML tag.
func (*HTMLElement) Attr ¶
func (h *HTMLElement) Attr(k string) string
Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found.
func (*HTMLElement) ChildAttr ¶
func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string
ChildAttr returns the stripped text content of the first matching element's attribute.
func (*HTMLElement) ChildText ¶
func (h *HTMLElement) ChildText(goquerySelector string) string
ChildText returns the concatenated and stripped text content of the matching elements.
type LimitRule ¶
type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// Parallelism is the number of the maximum allowed concurrent requests of the matching domains
	Parallelism int
	// contains filtered or unexported fields
}
LimitRule provides connection restrictions for domains. There are two kinds of limitations:
- Parallelism: Set limit for the number of concurrent requests to a domain
- Delay: Set rate limit for a domain (this means no parallelism on the matching domains)
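The two kinds of limits can be sketched with a buffered channel used as a semaphore plus a sleep before each slot is released. This is an illustration only: `limit`, `newLimit`, `matches`, `do` and `maxConcurrent` are hypothetical names, `path.Match` stands in for whatever glob matcher the package uses, and forcing a single slot when Delay is set is one reading of the "no parallelism" note above.

```go
package main

import (
	"fmt"
	"path"
	"sync"
	"time"
)

// limit is a hypothetical stand-in for LimitRule.
type limit struct {
	DomainGlob string
	Delay      time.Duration
	slots      chan struct{} // capacity = allowed concurrent requests
}

func newLimit(glob string, delay time.Duration, parallelism int) *limit {
	if parallelism < 1 || delay > 0 {
		parallelism = 1 // assumption: a delay implies one request at a time
	}
	return &limit{DomainGlob: glob, Delay: delay, slots: make(chan struct{}, parallelism)}
}

// matches reports whether the rule applies to domain.
func (l *limit) matches(domain string) bool {
	ok, err := path.Match(l.DomainGlob, domain)
	return err == nil && ok
}

// do runs fn while holding a slot, then waits Delay before freeing it.
func (l *limit) do(fn func()) {
	l.slots <- struct{}{}
	defer func() {
		time.Sleep(l.Delay)
		<-l.slots
	}()
	fn()
}

// maxConcurrent pushes n short jobs through l and returns the peak
// number that were running at the same moment.
func maxConcurrent(l *limit, n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	running, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()
				time.Sleep(5 * time.Millisecond) // simulated request
				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	l := newLimit("*.example.com", 0, 2)
	fmt.Println("applies to api.example.com:", l.matches("api.example.com"))
	fmt.Println("peak within limit:", maxConcurrent(l, 8) <= 2)
}
```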
type Request ¶
type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of the parents of this request
	Depth int
	// contains filtered or unexported fields
}
Request is the representation of an HTTP request made by a Collector
func (*Request) AbsoluteURL ¶
AbsoluteURL returns the resolved absolute URL of a URL chunk. AbsoluteURL returns an empty string if the URL chunk is a fragment or could not be parsed.
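The described behavior maps closely onto `ResolveReference` from the standard net/url package. The sketch below reproduces it under that assumption; `absoluteURL` and `resolve` are hypothetical helpers that take the base URL explicitly instead of reading it from a Request.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// absoluteURL resolves ref against base. It returns "" for a pure
// fragment ("#...") or when ref cannot be parsed, as described above.
func absoluteURL(base *url.URL, ref string) string {
	if strings.HasPrefix(ref, "#") {
		return ""
	}
	u, err := url.Parse(ref)
	if err != nil {
		return ""
	}
	return base.ResolveReference(u).String()
}

// resolve is a convenience wrapper that also parses the base URL.
func resolve(base, ref string) string {
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	return absoluteURL(b, ref)
}

func main() {
	base := "http://example.com/dir/page.html"
	for _, ref := range []string{"img.png", "/a?b=1", "#top", ":bad"} {
		fmt.Printf("%q -> %q\n", ref, resolve(base, ref))
	}
}
```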
func (*Request) Post ¶
Post continues a collector job by creating a POST request and preserves the Context of the previous request. Post also calls the previously provided callbacks
func (*Request) PostMultipart ¶
PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks
type RequestCallback ¶
type RequestCallback func(*Request)
RequestCallback is a type alias for OnRequest callback functions
type Response ¶
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
}
Response is the representation of an HTTP response made by a Collector
type ResponseCallback ¶
type ResponseCallback func(*Response)
ResponseCallback is a type alias for OnResponse callback functions