Documentation ¶
Overview ¶
Package wander is a scraping library for Go. It aims to provide an easy-to-use API while also exposing tools for advanced use cases.
Index ¶
- func FollowRobotRules(s *Spider, req *request.Request) error
- func IgnoreRobotRules(s *Spider, req *request.Request) error
- type AlreadyVisited
- type RobotLimitFunction
- type Spider
- func (s *Spider) AddLimits(limits ...limits.RequestFilter)
- func (s *Spider) CheckResponseStatus(res *request.Response)
- func (s *Spider) DownloadRobotLimits(req *request.Request) (*robots.RobotFile, error)
- func (s *Spider) Follow(url *url.URL, res *request.Response, priority int) error
- func (s *Spider) OnError(f func(err error))
- func (s *Spider) OnHTML(selector string, f func(res *request.Response, el *goquery.Selection))
- func (s *Spider) OnPipelineFinished(f func())
- func (s *Spider) OnRequest(f func(req *request.Request) *request.Request)
- func (s *Spider) OnResponse(f func(res *request.Response))
- func (s *Spider) RemoveLimits(limits ...limits.RequestFilter)
- func (s *Spider) Resume(ctx context.Context, state *SpiderState)
- func (s *Spider) RoundTrip(req *http.Request) (*http.Response, error)
- func (s *Spider) SetAllowedDomains(paths ...string) error
- func (s *Spider) SetProxyFunc(proxyFunc func(r *http.Request) (*url.URL, error))
- func (s *Spider) SetThrottles(def *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle)
- func (s *Spider) Start()
- func (s *Spider) Stop(ctx context.Context) *SpiderState
- func (s *Spider) Visit(url *url.URL) error
- func (s *Spider) VisitNow(url *url.URL) (*request.Response, error)
- func (s *Spider) Wait()
- type SpiderConstructorOption
- func AllowedDomains(domains ...string) SpiderConstructorOption
- func Cache(cache request.Cache) SpiderConstructorOption
- func IgnoreRobots() SpiderConstructorOption
- func Ingestors(n int) SpiderConstructorOption
- func MaxDepth(max int) SpiderConstructorOption
- func ProxyFunc(f func(r *http.Request) (*url.URL, error)) SpiderConstructorOption
- func Queue(queue request.Queue) SpiderConstructorOption
- func RobotLimits(limits *robots.RobotRules) SpiderConstructorOption
- func Threads(n int) SpiderConstructorOption
- func Throttle(defaultThrottle *limits.DefaultThrottle, ...) SpiderConstructorOption
- func UserAgent(agentFunction UserAgentFunction) SpiderConstructorOption
- type SpiderParameters
- type SpiderState
- type UserAgentFunction
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func FollowRobotRules ¶
FollowRobotRules fetches and follows the limitations imposed by the robots.txt file. Implementation of RobotLimitFunction.
Types ¶
type AlreadyVisited ¶
AlreadyVisited is returned when a request's URL has already been visited by the spider.
func (AlreadyVisited) Error ¶
func (e AlreadyVisited) Error() string
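Callers will usually want to tell a repeated visit apart from a genuine failure. The fields of AlreadyVisited are not shown on this page, so the following is only a sketch using a hypothetical stand-in type (visitedError, with an assumed URL field) to show how a value-receiver error type can be detected with errors.As:

```go
package main

import (
	"errors"
	"fmt"
)

// visitedError is a hypothetical stand-in for wander.AlreadyVisited.
// The URL field is an assumption made for this sketch.
type visitedError struct {
	URL string
}

// Error implements the error interface with a value receiver,
// the same receiver style as AlreadyVisited.Error.
func (e visitedError) Error() string {
	return "already visited: " + e.URL
}

// visit records rawURL in seen and returns a wrapped visitedError on repeats.
func visit(seen map[string]bool, rawURL string) error {
	if seen[rawURL] {
		return fmt.Errorf("visit: %w", visitedError{URL: rawURL})
	}
	seen[rawURL] = true
	return nil
}

// isAlreadyVisited reports whether err is, or wraps, a visitedError.
func isAlreadyVisited(err error) bool {
	var v visitedError
	return errors.As(err, &v)
}

func main() {
	seen := map[string]bool{}
	fmt.Println(isAlreadyVisited(visit(seen, "https://example.com/"))) // first visit: false
	fmt.Println(isAlreadyVisited(visit(seen, "https://example.com/"))) // repeat visit: true
}
```

With the real package, the same check would presumably target wander.AlreadyVisited instead of the stand-in.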
type RobotLimitFunction ¶
RobotLimitFunction determines how a spider acts upon robots.txt limitations. The default is FollowRobotRules; IgnoreRobotRules is also provided. You can define your own RobotLimitFunction in order to, e.g., ignore only certain limitations.
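As an illustration of a custom limit function, the sketch below ignores robots.txt rules for one path prefix and defers to the regular rules everywhere else. It assumes that a RobotLimitFunction has the same shape as FollowRobotRules in the index (spider and request in, error out). The spider and request types, followRules, and ignoreUnder are hypothetical stand-ins so that the example compiles on its own; they are not part of wander.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// spider and request are hypothetical stand-ins for wander's Spider and
// request.Request.
type spider struct {
	disallowed []string // path prefixes disallowed by robots.txt
}

type request struct {
	Path string
}

// robotLimitFunc mirrors the shape of FollowRobotRules in the index:
// it returns a non-nil error to block the request.
type robotLimitFunc func(s *spider, req *request) error

var errDisallowed = errors.New("blocked by robots.txt")

// followRules blocks every request whose path has a disallowed prefix.
func followRules(s *spider, req *request) error {
	for _, prefix := range s.disallowed {
		if strings.HasPrefix(req.Path, prefix) {
			return errDisallowed
		}
	}
	return nil
}

// ignoreUnder returns a robotLimitFunc that skips the rules for paths
// under exempt and applies followRules to everything else.
func ignoreUnder(exempt string) robotLimitFunc {
	return func(s *spider, req *request) error {
		if strings.HasPrefix(req.Path, exempt) {
			return nil
		}
		return followRules(s, req)
	}
}

func main() {
	s := &spider{disallowed: []string{"/private", "/search"}}
	limit := ignoreUnder("/search")
	fmt.Println(limit(s, &request{Path: "/search?q=go"})) // exempt, so allowed
	fmt.Println(limit(s, &request{Path: "/private/a"}))   // still blocked
}
```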
type Spider ¶
type Spider struct {
	SpiderState
	SpiderParameters

	RobotLimits    *robots.RobotRules
	AllowedDomains []string
	// contains filtered or unexported fields
}
Spider provides a parallelized scraper.
func NewSpider ¶
func NewSpider(options ...SpiderConstructorOption) (*Spider, error)
NewSpider instantiates a new spider.
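Going by the signatures in the index, a call would look something like wander.NewSpider(wander.Threads(4), wander.MaxDepth(3)). The definition of SpiderConstructorOption is not shown on this page; the sketch below assumes the common functional-options form, a function that mutates the value under construction and may return an error. The crawler type, its defaults, and the lower-case option names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// crawler is a hypothetical stand-in for Spider.
type crawler struct {
	threads  int
	maxDepth int
}

// option is the assumed shape of a constructor option.
type option func(*crawler) error

// threads sets the number of worker goroutines and rejects values below 1.
func threads(n int) option {
	return func(c *crawler) error {
		if n < 1 {
			return errors.New("threads must be at least 1")
		}
		c.threads = n
		return nil
	}
}

// maxDepth sets the maximum request depth.
func maxDepth(depth int) option {
	return func(c *crawler) error {
		c.maxDepth = depth
		return nil
	}
}

// newCrawler applies the options in order on top of illustrative defaults
// and stops at the first option that fails.
func newCrawler(opts ...option) (*crawler, error) {
	c := &crawler{threads: 1, maxDepth: 10}
	for _, opt := range opts {
		if err := opt(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}

func main() {
	c, err := newCrawler(threads(4), maxDepth(3))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.threads, c.maxDepth) // 4 3
}
```

Returning an error from the constructor, as NewSpider does, lets an invalid option surface at construction time rather than in the middle of a crawl.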
func (*Spider) AddLimits ¶
func (s *Spider) AddLimits(limits ...limits.RequestFilter)
AddLimits adds limits to the spider. Duplicate limits are not added.
func (*Spider) CheckResponseStatus ¶
CheckResponseStatus checks the response for any non-standard status codes. It will apply additional throttling when it encounters a 429 or 503 status code, according to the spider parameters.
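How the extra wait is derived is not spelled out here. One plausible reading of the DefaultWaitTime and MaxWaitTime parameters is sketched below: fall back to the default when there is no usable Retry-After header, and otherwise honor the header up to the maximum. The function name retryWait and the capping behavior are assumptions, not the library's confirmed logic.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryWait returns how long to pause after a response with the given
// status and Retry-After header value. It returns 0 for any status other
// than 429 and 503.
func retryWait(status int, retryAfter string, defaultWait, maxWait time.Duration) time.Duration {
	if status != http.StatusTooManyRequests && status != http.StatusServiceUnavailable {
		return 0
	}
	if retryAfter == "" {
		return defaultWait
	}

	var wait time.Duration
	if secs, err := strconv.Atoi(retryAfter); err == nil {
		// Retry-After given as a number of seconds.
		wait = time.Duration(secs) * time.Second
	} else if at, err := http.ParseTime(retryAfter); err == nil {
		// Retry-After given as an HTTP date.
		wait = time.Until(at)
	} else {
		// Unparseable header: treat it as missing.
		return defaultWait
	}

	if wait < 0 {
		wait = 0
	}
	if wait > maxWait {
		wait = maxWait
	}
	return wait
}

func main() {
	res := &http.Response{
		StatusCode: http.StatusTooManyRequests,
		Header:     http.Header{"Retry-After": []string{"120"}},
	}
	wait := retryWait(res.StatusCode, res.Header.Get("Retry-After"), 5*time.Second, time.Minute)
	fmt.Println(wait) // the 120s from the header is capped at one minute
}
```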
func (*Spider) DownloadRobotLimits ¶
DownloadRobotLimits downloads and parses the robots.txt file for a domain. Respects the spider throttles.
func (*Spider) Follow ¶
Follow follows a link by adding its URL to the queue. It blocks while the queue is full until space becomes available. Unlike Visit, this method also accepts a response, which allows the URL parser to convert relative URLs into absolute ones and to keep track of depth.
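The role of the response can be illustrated with the standard library alone: the URL of the page a link was found on serves as the base for resolving a relative href, and the page's depth gives the depth of the new request. The link type and the follow helper are hypothetical, and incrementing the depth by one per hop is an assumption about how depth is counted.

```go
package main

import (
	"fmt"
	"net/url"
)

// link is a hypothetical record of a queued request.
type link struct {
	URL   string
	Depth int
}

// follow resolves href against the URL of the page it was found on and
// assigns the result a depth one greater than that page's depth.
func follow(pageURL string, pageDepth int, href string) (link, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return link{}, err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return link{}, err
	}
	return link{URL: base.ResolveReference(ref).String(), Depth: pageDepth + 1}, nil
}

func main() {
	page := "https://example.com/docs/intro.html"
	for _, href := range []string{"page2.html", "../about", "https://other.example/x"} {
		l, err := follow(page, 1, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(l.URL, l.Depth)
	}
}
```

Without a base URL, as in Visit, a relative href such as "page2.html" cannot be turned into a fetchable address, which is why Visit is reserved for absolute starting points.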
func (*Spider) OnError ¶
OnError is called when an error is encountered. This will overwrite any previous callbacks set by this method.
func (*Spider) OnPipelineFinished ¶
func (s *Spider) OnPipelineFinished(f func())
OnPipelineFinished is called when a pipeline (all callbacks and selectors) finishes. This will overwrite any previous callbacks set by this method.
func (*Spider) OnRequest ¶
OnRequest is called when a request is about to be made. The callback should return a request, which allows it to mutate the request before it is sent. If nil is returned, the request is not made. This will overwrite any previous callbacks set by this method.
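The mutate-or-cancel contract can be sketched as follows. The callback shape matches the OnRequest signature in the index, but the request type, the dispatch helper, and the skipPDFs callback are hypothetical stand-ins rather than wander code.

```go
package main

import (
	"fmt"
	"strings"
)

// request is a hypothetical stand-in for wander's request.Request.
type request struct {
	URL     string
	Headers map[string]string
}

// onRequestFunc has the callback shape accepted by OnRequest.
type onRequestFunc func(req *request) *request

// dispatch runs cb, if set, and reports whether the request should still
// be sent. A nil result from cb cancels the request.
func dispatch(req *request, cb onRequestFunc) (*request, bool) {
	if cb == nil {
		return req, true
	}
	out := cb(req)
	return out, out != nil
}

// skipPDFs cancels requests for PDF files and adds a header to all others.
func skipPDFs(req *request) *request {
	if strings.HasSuffix(req.URL, ".pdf") {
		return nil
	}
	if req.Headers == nil {
		req.Headers = map[string]string{}
	}
	req.Headers["Accept-Language"] = "en"
	return req
}

func main() {
	for _, u := range []string{"https://example.com/a", "https://example.com/b.pdf"} {
		req, ok := dispatch(&request{URL: u}, skipPDFs)
		if !ok {
			fmt.Println("skipped", u)
			continue
		}
		fmt.Println("sending", req.URL, req.Headers["Accept-Language"])
	}
}
```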
func (*Spider) OnResponse ¶
OnResponse is called when a response has been received and tokenized. This will overwrite any previous callbacks set by this method.
func (*Spider) RemoveLimits ¶
func (s *Spider) RemoveLimits(limits ...limits.RequestFilter)
RemoveLimits removes the given limits (if present).
func (*Spider) Resume ¶
func (s *Spider) Resume(ctx context.Context, state *SpiderState)
Resume resumes a crawl from the given spider state. This method is idempotent and returns without doing anything if the spider is already running.
func (*Spider) RoundTrip ¶
RoundTrip implements the http.RoundTripper interface. It will wait for any throttles before making requests.
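A throttling http.RoundTripper can be sketched with a ticker: each request waits for the next tick before being handed to the underlying transport. This is an illustration of the general technique, not wander's implementation; throttledTransport, roundTripperFunc, and timeRequests are hypothetical names, and the base transport returns a canned response so that no network access is needed.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// throttledTransport waits for a tick before delegating to base.
type throttledTransport struct {
	base http.RoundTripper
	tick <-chan time.Time
}

// RoundTrip implements http.RoundTripper. It gives up early if the
// request's context is cancelled while waiting for the throttle.
func (t *throttledTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	select {
	case <-t.tick:
		return t.base.RoundTrip(req)
	case <-req.Context().Done():
		return nil, req.Context().Err()
	}
}

// roundTripperFunc adapts a function to http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(req *http.Request) (*http.Response, error) {
	return f(req)
}

// timeRequests sends n requests through a throttled client and returns
// the total elapsed time.
func timeRequests(n int, interval time.Duration) (time.Duration, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	// canned answers every request locally with an empty 200 response.
	canned := roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       http.NoBody,
			Request:    req,
		}, nil
	})
	client := &http.Client{Transport: &throttledTransport{base: canned, tick: ticker.C}}

	start := time.Now()
	for i := 0; i < n; i++ {
		res, err := client.Get("http://example.invalid/")
		if err != nil {
			return 0, err
		}
		res.Body.Close()
	}
	return time.Since(start), nil
}

func main() {
	elapsed, err := timeRequests(3, 50*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("three requests took", elapsed)
}
```

Because Spider itself satisfies http.RoundTripper, it can presumably be assigned to the Transport field of an http.Client in the same way as throttledTransport above.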
func (*Spider) SetAllowedDomains ¶
SetAllowedDomains sets the allowed domains.
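The matching rules for allowed domains are not documented on this page. The sketch below shows one possible policy, in which a host is allowed if it equals a listed domain or is a subdomain of one. Both the allowed helper and the policy are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allowed reports whether the host of rawURL equals one of domains or is
// a subdomain of one. Matching is case-insensitive.
func allowed(rawURL string, domains []string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	host := strings.ToLower(u.Hostname())
	for _, d := range domains {
		d = strings.ToLower(d)
		if host == d || strings.HasSuffix(host, "."+d) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	domains := []string{"example.com"}
	for _, u := range []string{
		"https://example.com/a",
		"https://docs.example.com/b",
		"https://notexample.com/c",
	} {
		ok, err := allowed(u, domains)
		fmt.Println(u, ok, err)
	}
}
```

Comparing against "."+d rather than d alone keeps look-alike hosts such as notexample.com from matching example.com.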
func (*Spider) SetProxyFunc ¶
SetProxyFunc sets the proxy function to be used.
func (*Spider) SetThrottles ¶
func (s *Spider) SetThrottles(def *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle)
SetThrottles sets or replaces the default and custom throttles for the spider.
func (*Spider) Start ¶
func (s *Spider) Start()
Start starts the spider. This method is idempotent and returns without doing anything if the spider is already running.
func (*Spider) Stop ¶
func (s *Spider) Stop(ctx context.Context) *SpiderState
Stop stops the spider if it is currently running and returns a SpiderState to allow a later call to Resume. It accepts a context and forcibly stops the spider if the context is cancelled, regardless of status. This method is idempotent and returns without doing anything if the spider is not running.
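The Start, Stop, and Resume lifecycle can be sketched with a mutex-guarded running flag. Everything below is a hypothetical stand-in: the crawler and state types, the single placeholder worker, returning nil from Stop when the spider is not running, and a Resume without a context parameter are all assumptions made to keep the example short.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// state is a hypothetical stand-in for SpiderState.
type state struct {
	Pending []string
}

// crawler is a hypothetical stand-in for Spider.
type crawler struct {
	mu      sync.Mutex
	running bool
	pending []string
	quit    chan struct{} // closed to ask the worker to exit
	done    chan struct{} // closed by the worker when it has exited
}

// Start launches the worker. It does nothing if already running.
func (c *crawler) Start() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.running {
		return
	}
	c.running = true
	c.quit = make(chan struct{})
	c.done = make(chan struct{})
	go func(quit <-chan struct{}, done chan<- struct{}) {
		defer close(done)
		<-quit // a real worker would process requests until asked to quit
	}(c.quit, c.done)
}

// Stop asks the worker to exit and waits for it, or for ctx to be
// cancelled, whichever comes first. It returns nil if not running.
func (c *crawler) Stop(ctx context.Context) *state {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.running {
		return nil
	}
	close(c.quit)
	select {
	case <-c.done: // worker exited cleanly
	case <-ctx.Done(): // stop waiting once the context is cancelled
	}
	c.running = false
	st := &state{Pending: c.pending}
	c.pending = nil
	return st
}

// Resume restores st and starts the crawler. It does nothing if the
// crawler is already running.
func (c *crawler) Resume(st *state) {
	c.mu.Lock()
	if c.running {
		c.mu.Unlock()
		return
	}
	c.pending = st.Pending
	c.mu.Unlock()
	c.Start()
}

// stopWithin calls Stop with a context that expires after d.
func stopWithin(c *crawler, d time.Duration) *state {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return c.Stop(ctx)
}

func main() {
	c := &crawler{pending: []string{"https://example.com/"}}
	c.Start()
	c.Start() // second call has no effect
	st := stopWithin(c, time.Second)
	fmt.Println(len(st.Pending), stopWithin(c, time.Second) == nil)
	c.Resume(st)
	fmt.Println(len(stopWithin(c, time.Second).Pending))
}
```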
func (*Spider) Visit ¶
Visit adds a request for the given URL to the queue with maximum priority. It blocks while the queue is full until space becomes available. This method is meant solely for setting the starting points of a crawl before calling Start.
type SpiderConstructorOption ¶
SpiderConstructorOption is used for chaining constructor options.
func AllowedDomains ¶
func AllowedDomains(domains ...string) SpiderConstructorOption
AllowedDomains sets the allowed domains. It is a utility function for SetAllowedDomains.
func Cache ¶
func Cache(cache request.Cache) SpiderConstructorOption
Cache sets the RequestCache. Allows request caches to be shared between spiders.
func IgnoreRobots ¶
func IgnoreRobots() SpiderConstructorOption
IgnoreRobots sets the spider's RobotExclusionFunction to IgnoreRobotRules, ignoring robots.txt.
func Ingestors ¶
func Ingestors(n int) SpiderConstructorOption
Ingestors sets the number of goroutines for ingestors.
func MaxDepth ¶
func MaxDepth(max int) SpiderConstructorOption
MaxDepth sets the maximum request depth.
func Queue ¶
func Queue(queue request.Queue) SpiderConstructorOption
Queue sets the RequestQueue. Allows request queues to be shared between spiders.
func RobotLimits ¶
func RobotLimits(limits *robots.RobotRules) SpiderConstructorOption
RobotLimits sets the robot exclusion cache.
func Threads ¶
func Threads(n int) SpiderConstructorOption
Threads sets the number of ingestor goroutines.
func Throttle ¶
func Throttle(defaultThrottle *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle) SpiderConstructorOption
Throttle is a constructor function for SetThrottles.
func UserAgent ¶
func UserAgent(agentFunction UserAgentFunction) SpiderConstructorOption
UserAgent sets the spider's User-Agent.
type SpiderParameters ¶
type SpiderParameters struct {
	UserAgent              UserAgentFunction
	RobotExclusionFunction RobotLimitFunction

	// DefaultWaitTime for 429 & 503 responses without a Retry-After header
	DefaultWaitTime time.Duration
	// MaxWaitTime for 429 & 503 responses with a Retry-After header
	MaxWaitTime time.Duration
	// IgnoreTimeouts if true, the bot will ignore 429 response timeouts.
	// Defaults to false.
	IgnoreTimeouts bool
}
SpiderParameters holds the crawling parameters for a spider.
type SpiderState ¶
SpiderState holds a spider's state, such as the request queue and cache. It is returned by the Stop method, allowing the Resume method to resume a previously stopped crawl.
type UserAgentFunction ¶
UserAgentFunction determines what User-Agent the spider will use.