wander

package module
v0.14.0-alpha
Published: Jun 23, 2020 License: MIT Imports: 12 Imported by: 0

README

Wander

Overview

Convenient scraping library for Gophers.

Based on Colly and Scrapy, Wander aims to provide an easy-to-use API while also exposing the tools for advanced use cases.

Features

  • Prioritized request queueing.
  • Redis support for distributed scraping.
  • Easy parallelization of crawlers and pipelines.
  • Stop, save and resume crawls.
  • Global and per-domain throttling.
  • Proxy switching.
  • Support for robots.txt, including non-standard directives and custom filter functions (e.g. ignore certain rules).
  • Sitemap support.

Example

package main

import (
	"context"
	"log"
	"net/url"
	"time"

	"github.com/PuerkitoBio/goquery"

	"github.com/KillianMeersman/wander"
	"github.com/KillianMeersman/wander/limits"
	"github.com/KillianMeersman/wander/limits/robots" // robots.txt/sitemap helpers (assumed subpackage path)
	"github.com/KillianMeersman/wander/request"
)

func main() {
	spid, err := wander.NewSpider(
		wander.AllowedDomains("localhost:8080"),
		wander.MaxDepth(10),
		wander.Throttle(limits.NewDefaultThrottle(200*time.Millisecond)),
		wander.Threads(2),
	)
	if err != nil {
		log.Fatal(err)
	}

	spid.OnRequest(func(req *request.Request) *request.Request {
		log.Printf("visiting %s", req.URL.String())
		return req
	})

	spid.OnResponse(func(res *request.Response) {
		log.Printf("response from %s", res.Request.URL.String())
	})

	spid.OnError(func(err error) {
		log.Fatal(err)
	})

	spid.OnHTML("a[href]", func(res *request.Response, el *goquery.Selection) {
		href, _ := el.Attr("href")
		link, err := url.Parse(href)
		if err != nil {
			log.Fatal(err)
		}

		if err := spid.Follow(link, res, 10-res.Request.Depth); err != nil {
			log.Println(err)
		}
	})

	root := &url.URL{
		Scheme: "http",
		Host:   "localhost:8080",
	}
	spid.Visit(root)

	// Fetch the robots.txt file to get the sitemap.
	// Copy the root URL by value so changing the path does not mutate root.
	robotsURL := *root
	robotsURL.Path = "/robots.txt"
	robotFile, err := robots.NewRobotFileFromURL(&robotsURL, spid)
	if err != nil {
		log.Fatal(err)
	}

	sitemap, err := robotFile.GetSitemap("wander", spid)
	if err != nil {
		log.Fatal(err)
	}

	locations, err := sitemap.GetLocations(spid, 10000)
	if err != nil {
		log.Fatal(err)
	}
	for _, location := range locations {
		u, err := url.Parse(location.Loc)
		if err != nil {
			log.Fatal(err)
		}
		spid.Visit(u)
	}

	go func() {
		<-time.After(5 * time.Second)
		spid.Stop(context.Background())
	}()

	spid.Start()
	spid.Wait()
}

Installation

go get -u github.com/KillianMeersman/wander

Documentation

Overview

Package wander is a scraping library for Go. It aims to provide an easy-to-use API while also exposing the tools for advanced use cases.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FollowRobotRules

func FollowRobotRules(s *Spider, req *request.Request) error

FollowRobotRules fetches and follows the limitations imposed by the robots.txt file. Implementation of RobotLimitFunction.

func IgnoreRobotRules

func IgnoreRobotRules(s *Spider, req *request.Request) error

IgnoreRobotRules ignores the robots.txt file. Implementation of RobotLimitFunction.

Types

type AlreadyVisited

type AlreadyVisited struct {
	URL url.URL
}

AlreadyVisited is returned when a request's URL has already been visited by the spider.

func (AlreadyVisited) Error

func (e AlreadyVisited) Error() string

type RobotLimitFunction

type RobotLimitFunction func(spid *Spider, req *request.Request) error

RobotLimitFunction determines how a spider acts upon robots.txt limitations. The default is FollowRobotRules; IgnoreRobotRules is also provided. It's possible to define your own RobotLimitFunction in order to, e.g., ignore only certain limitations.
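
For illustration, a minimal sketch of a custom RobotLimitFunction that obeys robots.txt everywhere except a hypothetical /press/ section. It delegates to the exported FollowRobotRules and assumes the promoted RobotExclusionFunction field (from the embedded SpiderParameters) can be assigned directly.

// allowPress is a hypothetical custom RobotLimitFunction: it skips the
// robots.txt check for /press/ pages and defers to FollowRobotRules otherwise.
func allowPress(s *wander.Spider, req *request.Request) error {
	if strings.HasPrefix(req.URL.Path, "/press/") {
		return nil // crawl this section regardless of robots.txt
	}
	return wander.FollowRobotRules(s, req)
}

// Spider embeds SpiderParameters, so the field can presumably be set directly.
spid.RobotExclusionFunction = allowPress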

type Spider

type Spider struct {
	SpiderState
	SpiderParameters
	RobotLimits    *robots.RobotRules
	AllowedDomains []string
	// contains filtered or unexported fields
}

Spider provides a parallelized scraper.

func NewSpider

func NewSpider(options ...SpiderConstructorOption) (*Spider, error)

NewSpider instantiates a new spider.

func (*Spider) AddLimits

func (s *Spider) AddLimits(limits ...limits.RequestFilter)

AddLimits adds limits to the spider; it will not add duplicate limits.

func (*Spider) CheckResponseStatus

func (s *Spider) CheckResponseStatus(res *request.Response)

CheckResponseStatus checks the response for any non-standard status codes. It will apply additional throttling when it encounters a 429 or 503 status code, according to the spider parameters.

func (*Spider) DownloadRobotLimits

func (s *Spider) DownloadRobotLimits(req *request.Request) (*robots.RobotFile, error)

DownloadRobotLimits downloads and parses the robots.txt file for a domain. Respects the spider throttles.

func (*Spider) Follow

func (s *Spider) Follow(url *url.URL, res *request.Response, priority int) error

Follow a link by adding it to the queue; blocks when the queue is full until there is free space. Unlike Visit, this method also accepts a response, allowing the URL parser to convert relative URLs into absolute ones and to keep track of depth.

func (*Spider) OnError

func (s *Spider) OnError(f func(err error))

OnError is called when an error is encountered. This will overwrite any previous callbacks set by this method.

func (*Spider) OnHTML

func (s *Spider) OnHTML(selector string, f func(res *request.Response, el *goquery.Selection))

OnHTML is called for each element matching the selector in a response body.

func (*Spider) OnPipelineFinished

func (s *Spider) OnPipelineFinished(f func())

OnPipelineFinished is called when a pipeline (all callbacks and selectors) finishes. This will overwrite any previous callbacks set by this method.

func (*Spider) OnRequest

func (s *Spider) OnRequest(f func(req *request.Request) *request.Request)

OnRequest is called when a request is about to be made. This function should return a request, allowing the callback to mutate the request. If nil is returned, no request is made. This will overwrite any previous callbacks set by this method.
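
As a sketch of the nil-return behaviour, the callback below (assuming the Request type exposes URL as in the README example) drops requests for PDF files and lets everything else through unchanged:

spid.OnRequest(func(req *request.Request) *request.Request {
	if strings.HasSuffix(req.URL.Path, ".pdf") {
		return nil // returning nil cancels the request
	}
	return req
})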

func (*Spider) OnResponse

func (s *Spider) OnResponse(f func(res *request.Response))

OnResponse is called when a response has been received and tokenized. This will overwrite any previous callbacks set by this method.

func (*Spider) RemoveLimits

func (s *Spider) RemoveLimits(limits ...limits.RequestFilter)

RemoveLimits removes the given limits (if present).

func (*Spider) Resume

func (s *Spider) Resume(ctx context.Context, state *SpiderState)

Resume from spider state. This method is idempotent and will return without doing anything if the spider is already running.

func (*Spider) RoundTrip

func (s *Spider) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip implements the http.RoundTripper interface. It will wait for any throttles before making requests.
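
Because Spider satisfies http.RoundTripper, it can in principle be plugged into a standard http.Client so that one-off requests share the spider's throttles; a minimal sketch:

client := &http.Client{Transport: spid}
res, err := client.Get("http://localhost:8080/health") // waits for the spider's throttles
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()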

func (*Spider) SetAllowedDomains

func (s *Spider) SetAllowedDomains(paths ...string) error

SetAllowedDomains sets the allowed domains.

func (*Spider) SetProxyFunc

func (s *Spider) SetProxyFunc(proxyFunc func(r *http.Request) (*url.URL, error))

SetProxyFunc sets the proxy function to be used.
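
The proxy function has the same shape as net/http proxy selectors, so rotating between proxies can look roughly like this (the proxy hosts are placeholders):

proxies := []*url.URL{
	{Scheme: "http", Host: "proxy1.example.com:8080"},
	{Scheme: "http", Host: "proxy2.example.com:8080"},
}
var n uint64
spid.SetProxyFunc(func(r *http.Request) (*url.URL, error) {
	i := atomic.AddUint64(&n, 1)
	return proxies[i%uint64(len(proxies))], nil
})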

func (*Spider) SetThrottles

func (s *Spider) SetThrottles(def *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle)

SetThrottles sets or replaces the default and custom throttles for the spider.
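
A sketch of combining a global throttle with a slower per-domain throttle; NewDefaultThrottle appears in the README example, while NewDomainThrottle is an assumed constructor name in the limits package:

spid.SetThrottles(
	limits.NewDefaultThrottle(500*time.Millisecond),
	limits.NewDomainThrottle("example.com", 2*time.Second), // assumed signature
)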

func (*Spider) Start

func (s *Spider) Start()

Start the spider. This method is idempotent and will return without doing anything if the spider is already running.

func (*Spider) Stop

func (s *Spider) Stop(ctx context.Context) *SpiderState

Stop the spider if it is currently running and return a SpiderState to allow a later call to Resume. Accepts a context and will forcibly stop the spider if the context is cancelled, regardless of status. This method is idempotent and will return without doing anything if the spider is not running.

func (*Spider) Visit

func (s *Spider) Visit(url *url.URL) error

Visit adds a request with the given path to the queue with maximum priority. Blocks when the queue is full until there is free space. This method is meant to be used solely for setting the starting points of crawls before calling Start.

func (*Spider) VisitNow

func (s *Spider) VisitNow(url *url.URL) (*request.Response, error)

VisitNow visits the given url without adding it to the queue. It will still wait for any throttling.

func (*Spider) Wait

func (s *Spider) Wait()

Wait blocks until the spider has been stopped.

type SpiderConstructorOption

type SpiderConstructorOption func(s *Spider) error

SpiderConstructorOption is used for chaining constructor options.

func AllowedDomains

func AllowedDomains(domains ...string) SpiderConstructorOption

AllowedDomains sets the allowed domains, utility function for SetAllowedDomains.

func Cache

Cache sets the RequestCache. Allows request caches to be shared between spiders.

func IgnoreRobots

func IgnoreRobots() SpiderConstructorOption

IgnoreRobots sets the spider's RobotExclusionFunction to IgnoreRobotRules, ignoring robots.txt.

func Ingestors

func Ingestors(n int) SpiderConstructorOption

Ingestors sets the number of ingestor goroutines.

func MaxDepth

func MaxDepth(max int) SpiderConstructorOption

MaxDepth sets the maximum request depth.

func ProxyFunc

func ProxyFunc(f func(r *http.Request) (*url.URL, error)) SpiderConstructorOption

ProxyFunc sets the proxy function, utility function for SetProxyFunc.

func Queue

Queue sets the RequestQueue. Allows request queues to be shared between spiders.

func RobotLimits

func RobotLimits(limits *robots.RobotRules) SpiderConstructorOption

RobotLimits sets the robot exclusion cache.

func Threads

func Threads(n int) SpiderConstructorOption

Threads sets the number of ingestor goroutines.

func Throttle

func Throttle(defaultThrottle *limits.DefaultThrottle, domainThrottles ...*limits.DomainThrottle) SpiderConstructorOption

Throttle is a constructor function for SetThrottles.

func UserAgent

func UserAgent(agentFunction UserAgentFunction) SpiderConstructorOption

UserAgent sets the spider's User-Agent.

type SpiderParameters

type SpiderParameters struct {
	UserAgent              UserAgentFunction
	RobotExclusionFunction RobotLimitFunction
	// DefaultWaitTime for 429 & 503 responses without a Retry-After header
	DefaultWaitTime time.Duration
	// MaxWaitTime for 429 & 503 responses with a Retry-After header
	MaxWaitTime time.Duration
	// IgnoreTimeouts if true, the bot will ignore 429 response timeouts.
	// Defaults to false.
	IgnoreTimeouts bool
}

SpiderParameters holds the crawling parameters for a spider.
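
Since Spider embeds SpiderParameters, the backoff behaviour for 429 and 503 responses can presumably be tuned after construction; a hedged sketch:

spid.DefaultWaitTime = 30 * time.Second // used when a 429/503 carries no Retry-After header
spid.MaxWaitTime = 5 * time.Minute      // cap on server-provided Retry-After values
spid.IgnoreTimeouts = false             // honour 429 timeouts (the default)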

type SpiderState

type SpiderState struct {
	Queue request.Queue
	Cache request.Cache
}

SpiderState holds a spider's state, such as the request queue and cache. It is returned by the Start and Resume methods, allowing the Resume method to resume a previously stopped crawl.
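
Based on the documented signatures, a stop-and-resume cycle looks roughly like this:

// Stop the crawl and keep the state (request queue and cache).
state := spid.Stop(context.Background())

// Later, pick up where the crawl left off.
spid.Resume(context.Background(), state)
spid.Wait()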

type UserAgentFunction

type UserAgentFunction func(req *request.Request) string

UserAgentFunction determines what User-Agent the spider will use.
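
For example, a UserAgentFunction can return a fixed bot identifier (the string below is a placeholder), passed through the UserAgent constructor option:

spid, err := wander.NewSpider(
	wander.UserAgent(func(req *request.Request) string {
		return "wander-example-bot/1.0 (+http://localhost:8080/bot-info)"
	}),
)
if err != nil {
	log.Fatal(err)
}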

Directories

Path    Synopsis
limits  Package limits provides request filters and throttles.
