wander

package module
v0.14.0-alpha
Published: Jun 23, 2020 License: MIT Imports: 12 Imported by: 0

README

Wander

Overview

Convenient scraping library for Gophers.

Based on Colly and Scrapy, Wander aims to provide an easy-to-use API while also exposing the tools for advanced use cases.

Features

  • Prioritized request queueing.
  • Redis support for distributed scraping.
  • Easy parallelization of crawlers and pipelines.
  • Stop, save and resume crawls.
  • Global and per-domain throttling.
  • Proxy switching.
  • Support for robots.txt, including non-standard directives and custom filter functions (e.g. ignore certain rules).
  • Sitemap support.

Example

package main

import (
	"context"
	"log"
	"net/url"
	"time"

	"github.com/PuerkitoBio/goquery"

	"github.com/KillianMeersman/wander"
	"github.com/KillianMeersman/wander/limits"
	"github.com/KillianMeersman/wander/limits/robots" // robots.txt/sitemap helpers (assumed subpackage path)
	"github.com/KillianMeersman/wander/request"
)

func main() {
	spid, err := wander.NewSpider(
		wander.AllowedDomains("localhost:8080"),
		wander.MaxDepth(10),
		wander.Throttle(limits.NewDefaultThrottle(200*time.Millisecond)),
		wander.Threads(2),
	)
	if err != nil {
		log.Fatal(err)
	}

	spid.OnRequest(func(req *request.Request) *request.Request {
		log.Printf("visiting %s", req.URL.String())
		return req
	})

	spid.OnResponse(func(res *request.Response) {
		log.Printf("response from %s", res.Request.URL.String())
	})

	spid.OnError(func(err error) {
		log.Fatal(err)
	})

	spid.OnHTML("a[href]", func(res *request.Response, el *goquery.Selection) {
		href, _ := el.Attr("href")
		link, err := url.Parse(href)
		if err != nil {
			log.Fatal(err)
		}

		if err := spid.Follow(link, res, 10-res.Request.Depth); err != nil {
			log.Println(err)
		}
	})

	root := &url.URL{
		Scheme: "http",
		Host:   "localhost:8080",
	}
	spid.Visit(root)

	// Fetch the robots.txt file to get the sitemap.
	// Copy the root URL by value so changing the path does not mutate root.
	robotsURL := *root
	robotsURL.Path = "/robots.txt"
	robotFile, err := robots.NewRobotFileFromURL(&robotsURL, spid)
	if err != nil {
		log.Fatal(err)
	}

	sitemap, err := robotFile.GetSitemap("wander", spid)
	if err != nil {
		log.Fatal(err)
	}

	locations, err := sitemap.GetLocations(spid, 10000)
	if err != nil {
		log.Fatal(err)
	}
	for _, location := range locations {
		u, err := url.Parse(location.Loc)
		if err != nil {
			log.Fatal(err)
		}
		spid.Visit(u)
	}

	go func() {
		<-time.After(5 * time.Second)
		spid.Stop(context.Background())
	}()

	spid.Start()
	spid.Wait()
}

Installation

go get -u github.com/KillianMeersman/wander

Documentation

Overview

Package wander is a scraping library for Go. It aims to provide an easy-to-use API while also exposing the tools for advanced use cases.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FollowRobotRules

func FollowRobotRules(s *Spider, req *request.Request) error

FollowRobotRules fetches and follows the limitations imposed by the robots.txt file. Implementation of RobotLimitFunction.

func IgnoreRobotRules

func IgnoreRobotRules(s *Spider, req *request.Request) error

IgnoreRobotRules ignores the robots.txt file. Implementation of RobotLimitFunction.

Types

type AlreadyVisited

type AlreadyVisited struct {
	URL url.URL
}

AlreadyVisited is returned when a request's URL has already been visited by the spider.

func (AlreadyVisited) Error

func (e AlreadyVisited) Error() string

type RobotLimitFunction

type RobotLimitFunction func(spid *Spider, req *request.Request) error

RobotLimitFunction determines how a spider acts upon robots.txt limitations. The default is FollowRobotRules; IgnoreRobotRules is also provided. It's possible to define your own RobotLimitFunction in order to, e.g., ignore only certain limitations.
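
For illustration, a minimal sketch of a custom RobotLimitFunction that obeys robots.txt everywhere except a hypothetical /press/ section. It delegates to the exported FollowRobotRules and assumes the promoted RobotExclusionFunction field (from the embedded SpiderParameters) can be assigned directly.

// allowPress is a hypothetical custom RobotLimitFunction: it skips the
// robots.txt check for /press/ pages and defers to FollowRobotRules otherwise.
func allowPress(s *wander.Spider, req *request.Request) error {
	if strings.HasPrefix(req.URL.Path, "/press/") {
		return nil // crawl this section regardless of robots.txt
	}
	return wander.FollowRobotRules(s, req)
}

// Spider embeds SpiderParameters, so the field can presumably be set directly.
spid.RobotExclusionFunction = allowPress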

type Spider

type Spider struct {
	SpiderState
	SpiderParameters
	RobotLimits    *robots.RobotRules
	AllowedDomains []string
	// contains filtered or unexported fields
}

Spider provides a parallelized scraper.

func NewSpider

func NewSpider(options ...SpiderConstructorOption) (*Spider, error)

NewSpider instantiates a new spider.

func (*Spider) AddLimits

func (s *Spider) AddLimits(limits ...limits.RequestFilter)

AddLimits adds limits to the spider; it will not add duplicate limits.

func (*Spider) CheckResponseStatus

func (s *Spider) CheckResponseStatus(res *request.Response)

CheckResponseStatus checks the response for any non-standard status codes. It will apply additional throttling when it encounters a 429 or 503 status code, according to the spider parameters.

func (*Spider) DownloadRobotLimits

func (s *Spider) DownloadRobotLimits(req *request.Request) (*robots.RobotFile, error)

DownloadRobotLimits downloads and parses the robots.txt file for a domain. Respects the spider throttles.

func (*Spider) Follow

func (s *Spider) Follow(url *url.URL, res *request.Response, priority int) error

Follow a link by adding it to the queue; blocks when the queue is full until there is free space. Unlike Visit, this method also accepts a response, allowing the URL parser to convert relative URLs into absolute ones and to keep track of depth.

func (*Spider) OnError

func (s *Spider) OnError(f func(err error))

OnError is called when an error is encountered. This will overwrite any previous callbacks set by this method.

func (*Spider) OnHTML

func (s *Spider) OnHTML(selector string, f func(res *request.Response, el *goquery.Selection))

OnHTML is called for each element matching the selector in a response body.

func (*Spider) OnPipelineFinished

func (s *Spider) OnPipelineFinished(f func())

OnPipelineFinished is called when a pipeline (all callbacks and selectors) finishes. This will overwrite any previous callbacks set by this method.

func (*Spider) OnRequest

func (s *Spider) OnRequest(f func(req *request.Request) *request.Request)

OnRequest is called when a request is about to be made. This function should return a request, allowing the callback to mutate the request. If nil is returned, no request is made. This will overwrite any previous callbacks set by this method.
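
As a sketch of the nil-return behaviour, the callback below (assuming the Request type exposes URL as in the README example) drops requests for PDF files and lets everything else through unchanged:

spid.OnRequest(func(req *request.Request) *request.Request {
	if strings.HasSuffix(req.URL.Path, ".pdf") {
		return nil // returning nil cancels the request
	}
	return req
})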

func (*Spider) OnResponse

func (s *Spider) OnResponse(f func(res *request.Response))

OnResponse is called when a response has been received and tokenized. This will overwrite any previous callbacks set by this method.

func (*Spider) RemoveLimits

func (s *Spider) RemoveLimits(limits ...limits.RequestFilter)

RemoveLimits removes the given limits (if present).

func (*Spider) Resume

func (s *Spider) Resume(ctx context.Context, state *SpiderState)

Resume from spider state. This method is idempotent and will return without doing anything if the spider is already running.

func (*Spider) RoundTrip

func (s *Spider) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip implements the http.RoundTripper interface. It will wait for any throttles before making requests.
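
Because Spider satisfies http.RoundTripper, it can in principle be plugged into a standard http.Client so that one-off requests share the spider's throttles; a minimal sketch:

client := &http.Client{Transport: spid}
res, err := client.Get("http://localhost:8080/health") // waits for the spider's throttles
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()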

func (*Spider) SetAllowedDomains

func (s *Spider) SetAllowedDomains(paths ...string) error

SetAllowedDomains sets the allowed domains.

func (*Spider) SetProxyFunc

func (s *Spider) SetProxyFunc(proxyFunc func(r *http.Request) (*url.URL, error))

SetProxyFunc sets the proxy function to be used.
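
The proxy function has the same shape as net/http proxy selectors, so rotating between proxies can look roughly like this (the proxy hosts are placeholders):

proxies := []*url.URL{
	{Scheme: "http", Host: "proxy1.example.com:8080"},
	{Scheme: "http", Host: "proxy2.example.com:8080"},
}
var n uint64
spid.SetProxyFunc(func(r *http.Request) (*url.URL, error) {
	i := atomic.AddUint64(&n, 1)
	return proxies[i%uint64(len(proxies))], nil
})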

func (*Spider) SetThrottles

func (s *Spider) SetThrottles(def *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle)

SetThrottles sets or replaces the default and custom throttles for the spider.
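
A sketch of combining a global throttle with a slower per-domain throttle; NewDefaultThrottle appears in the README example, while NewDomainThrottle is an assumed constructor name in the limits package:

spid.SetThrottles(
	limits.NewDefaultThrottle(500*time.Millisecond),
	limits.NewDomainThrottle("example.com", 2*time.Second), // assumed signature
)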

func (*Spider) Start

func (s *Spider) Start()

Start the spider. This method is idempotent and will return without doing anything if the spider is already running.

func (*Spider) Stop

func (s *Spider) Stop(ctx context.Context) *SpiderState

Stop the spider if it is currently running and return a SpiderState to allow a later call to Resume. Accepts a context and will forcibly stop the spider if the context is cancelled, regardless of status. This method is idempotent and will return without doing anything if the spider is not running.

func (*Spider) Visit

func (s *Spider) Visit(url *url.URL) error

Visit adds a request with the given path to the queue with maximum priority. Blocks when the queue is full until there is free space. This method is meant to be used solely for setting the starting points of crawls before calling Start.

func (*Spider) VisitNow

func (s *Spider) VisitNow(url *url.URL) (*request.Response, error)

VisitNow visits the given url without adding it to the queue. It will still wait for any throttling.

func (*Spider) Wait

func (s *Spider) Wait()

Wait blocks until the spider has been stopped.

type SpiderConstructorOption

type SpiderConstructorOption func(s *Spider) error

SpiderConstructorOption is used for chaining constructor options.

func AllowedDomains

func AllowedDomains(domains ...string) SpiderConstructorOption

AllowedDomains sets the allowed domains, utility function for SetAllowedDomains.

func Cache

Cache sets the RequestCache. Allows request caches to be shared between spiders.

func IgnoreRobots

func IgnoreRobots() SpiderConstructorOption

IgnoreRobots sets the spider's RobotExclusionFunction to IgnoreRobotRules, ignoring robots.txt.

func Ingestors

func Ingestors(n int) SpiderConstructorOption

Ingestors sets the number of ingestor goroutines.

func MaxDepth

func MaxDepth(max int) SpiderConstructorOption

MaxDepth sets the maximum request depth.

func ProxyFunc

func ProxyFunc(f func(r *http.Request) (*url.URL, error)) SpiderConstructorOption

ProxyFunc sets the proxy function, utility function for SetProxyFunc.

func Queue

Queue sets the RequestQueue. Allows request queues to be shared between spiders.

func RobotLimits

func RobotLimits(limits *robots.RobotRules) SpiderConstructorOption

RobotLimits sets the robot exclusion cache.

func Threads

func Threads(n int) SpiderConstructorOption

Threads sets the number of ingestor goroutines.

func Throttle

func Throttle(defaultThrottle *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle) SpiderConstructorOption

Throttle is a constructor function for SetThrottles.

func UserAgent

func UserAgent(agentFunction UserAgentFunction) SpiderConstructorOption

UserAgent sets the spider's User-Agent.

type SpiderParameters

type SpiderParameters struct {
	UserAgent              UserAgentFunction
	RobotExclusionFunction RobotLimitFunction
	// DefaultWaitTime for 429 & 503 responses without a Retry-After header
	DefaultWaitTime time.Duration
	// MaxWaitTime for 429 & 503 responses with a Retry-After header
	MaxWaitTime time.Duration
	// IgnoreTimeouts if true, the bot will ignore 429 response timeouts.
	// Defaults to false.
	IgnoreTimeouts bool
}

SpiderParameters holds the crawling parameters for a spider.
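
Since Spider embeds SpiderParameters, the backoff behaviour for 429 and 503 responses can presumably be tuned after construction; a hedged sketch:

spid.DefaultWaitTime = 30 * time.Second // used when a 429/503 carries no Retry-After header
spid.MaxWaitTime = 5 * time.Minute      // cap on server-provided Retry-After values
spid.IgnoreTimeouts = false             // honour 429 timeouts (the default)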

type SpiderState

type SpiderState struct {
	Queue request.Queue
	Cache request.Cache
}

SpiderState holds a spider's state, such as the request queue and cache. It is returned by the Start and Resume methods, allowing the Resume method to resume a previously stopped crawl.
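
Based on the documented signatures, a stop-and-resume cycle looks roughly like this:

// Stop the crawl and keep the state (request queue and cache).
state := spid.Stop(context.Background())

// Later, pick up where the crawl left off.
spid.Resume(context.Background(), state)
spid.Wait()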

type UserAgentFunction

type UserAgentFunction func(req *request.Request) string

UserAgentFunction determines what User-Agent the spider will use.
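
For example, a UserAgentFunction can return a fixed bot identifier (the string below is a placeholder), passed through the UserAgent constructor option:

spid, err := wander.NewSpider(
	wander.UserAgent(func(req *request.Request) string {
		return "wander-example-bot/1.0 (+http://localhost:8080/bot-info)"
	}),
)
if err != nil {
	log.Fatal(err)
}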

Directories

Path    Synopsis
limits  Package limits provides request filters and throttles.
