crawler

package module
v0.0.0-...-67712f9
Published: Oct 5, 2018 License: MIT Imports: 12 Imported by: 0

README

crawler

A simple package to quickly build programs that require crawling websites.

go get github.com/ernesto-jimenez/crawler

Usage

func Example() {
	startURL := "https://godoc.org"

	cr, err := crawler.New()
	if err != nil {
		panic(err)
	}

	err = cr.Crawl(startURL, func(url string, res *crawler.Response, err error) error {
		if err != nil {
			fmt.Printf("error: %s", err.Error())
			return nil
		}
		fmt.Printf("%s - Links: %d Assets: %d\n", url, len(res.Links), len(res.Assets))
		return crawler.ErrSkipURL
	})
	if err != nil {
		panic(err)
	}
	// Output:
	// https://godoc.org/ - Links: 39 Assets: 5
}

Documentation

Overview

Example
package main

import (
	"fmt"

	"github.com/ernesto-jimenez/crawler"
)

func main() {
	startURL := "https://godoc.org"

	cr, err := crawler.New()
	if err != nil {
		panic(err)
	}

	err = cr.Crawl(startURL, func(url string, res *crawler.Response, err error) error {
		if err != nil {
			fmt.Printf("error: %s", err.Error())
			return nil
		}
		fmt.Printf("%s - Links: %d Assets: %d\n", url, len(res.Links), len(res.Assets))
		return crawler.ErrSkipURL
	})
	if err != nil {
		panic(err)
	}
}
Output:

https://godoc.org/ - Links: 39 Assets: 5

Index

Examples

Constants

This section is empty.

Variables

var ErrSkipURL = errors.New("skip URL")

ErrSkipURL can be returned by a CrawlFunc to avoid crawling the links from the given URL
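
For example, a CrawlFunc might return ErrSkipURL only for pages outside the host the crawl started from; a minimal sketch (a fragment, assuming the usual net/url import and the cr crawler from the example above):

err = cr.Crawl("https://godoc.org", func(u string, res *crawler.Response, err error) error {
	if err != nil {
		return nil
	}
	parsed, perr := url.Parse(u)
	if perr != nil || parsed.Host != "godoc.org" {
		// Don't queue up this page's links, so the crawl stays within godoc.org.
		return crawler.ErrSkipURL
	}
	return nil
})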

Functions

func ReadResponse

func ReadResponse(base *url.URL, r io.Reader, res *Response) error

ReadResponse extracts links and assets from the HTML read from the given io.Reader and fills them into the given Response
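
ReadResponse can also be used on its own, without running a crawl; a minimal sketch, assuming the usual fmt, net/url and strings imports (the HTML snippet and base URL are made up for illustration):

base, err := url.Parse("https://example.com/")
if err != nil {
	panic(err)
}
page := `<a href="/about">About</a><img src="/logo.png">`
var res crawler.Response
if err := crawler.ReadResponse(base, strings.NewReader(page), &res); err != nil {
	panic(err)
}
// res.Links and res.Assets now hold the extracted references.
fmt.Println(len(res.Links), len(res.Assets))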

Types

type Asset

type Asset struct {
	// Tag used to link the asset
	Tag string `json:"tag"`

	// URL of the asset
	URL string `json:"url"`

	// Rel contains the text of the rel attribute
	Rel string `json:"rel,omitempty"`

	// Type contains the text of the type attribute
	Type string `json:"type,omitempty"`
}

Asset represents linked assets such as link, script and img tags

type CheckFetchFunc

type CheckFetchFunc func(*Request) bool

CheckFetchFunc is used to check whether a page should be fetched during the crawl or not
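
Any function matching this signature can be used, for example with WithCheckFetch below; a short sketch that only fetches HTTPS pages:

httpsOnly := crawler.CheckFetchFunc(func(req *crawler.Request) bool {
	return req.URL.Scheme == "https"
})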

type CheckFetchStack

type CheckFetchStack []CheckFetchFunc

CheckFetchStack is a stack of CheckFetchFunc types where all have to pass for the fetch to happen.

func (CheckFetchStack) CheckFetch

func (s CheckFetchStack) CheckFetch(req *Request) bool

CheckFetch returns true if all funcs in the stack return true, and false otherwise.
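
A stack can be built from individual checks and its CheckFetch method passed to WithCheckFetch; a minimal sketch, assuming a strings import (the individual checks are arbitrary):

checks := crawler.CheckFetchStack{
	func(req *crawler.Request) bool { return req.URL.Scheme == "https" },
	func(req *crawler.Request) bool { return strings.HasPrefix(req.URL.Path, "/docs") },
}
cr, err := crawler.New(crawler.WithCheckFetch(checks.CheckFetch))
if err != nil {
	panic(err)
}
// cr will now only fetch HTTPS pages under /docs.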

type CrawlFunc

type CrawlFunc func(url string, res *Response, err error) error

CrawlFunc is the type of the function called for each webpage visited by Crawl. The incoming url specifies which URL was fetched, while res contains the response of the fetched URL if it was successful. If the fetch failed, the incoming error will specify the reason and res will be nil.

Returning ErrSkipURL will avoid queueing up the resource's links to be crawled.

Returning any other error from the function will immediately stop the crawl.
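
For example, a sketch of a CrawlFunc that keeps going despite individual fetch errors but stops the whole crawl after a fixed number of pages (the limit and the sentinel error are made up for illustration, and the fragment assumes an errors import):

var errEnough = errors.New("crawled enough pages")

visited := 0
err = cr.Crawl(startURL, func(u string, res *crawler.Response, err error) error {
	if err != nil {
		// res is nil here; log and keep crawling.
		fmt.Printf("error fetching %s: %s\n", u, err)
		return nil
	}
	visited++
	if visited >= 20 {
		// Any error other than ErrSkipURL stops the crawl immediately.
		return errEnough
	}
	return nil
})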

type InMemoryQueue

type InMemoryQueue struct {
	// contains filtered or unexported fields
}

InMemoryQueue holds a queue of items to be crawled in memory

func NewInMemoryQueue

func NewInMemoryQueue(ctx context.Context) *InMemoryQueue

NewInMemoryQueue returns an in-memory queue ready to be used by different workers

func (*InMemoryQueue) PopFront

func (q *InMemoryQueue) PopFront() (*Request, error)

PopFront gets the next request from the queue. It will return a nil request and a nil error if the queue is empty.

func (*InMemoryQueue) PushBack

func (q *InMemoryQueue) PushBack(req *Request) error

PushBack adds a request to the queue
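
A minimal sketch of filling and draining the queue by hand, assuming context and fmt imports:

q := crawler.NewInMemoryQueue(context.Background())

req, err := crawler.NewRequest("https://godoc.org")
if err != nil {
	panic(err)
}
if err := q.PushBack(req); err != nil {
	panic(err)
}

next, err := q.PopFront()
if err != nil {
	panic(err)
}
if next != nil {
	// A nil request would have meant the queue was empty.
	fmt.Println(next.URL)
}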

type Link struct {
	// URL contains the href attribute of the link. e.g: <a href="{href}">...</a>
	URL string `json:"url"`
}

Link contains the information from a single `a` tag

type Option

type Option func(*options) error

Option is used to provide optional configuration to a crawler

func WithAllowedHosts

func WithAllowedHosts(hosts ...string) Option

WithAllowedHosts adds a check to only allow URLs with the given hosts

func WithCheckFetch

func WithCheckFetch(fn CheckFetchFunc) Option

WithCheckFetch takes a CheckFetchFunc that will be run before fetching each page to check whether it should be fetched or not

func WithConcurrentRequests

func WithConcurrentRequests(n int) Option

WithConcurrentRequests sets how many concurrent requests to allow

func WithExcludedHosts

func WithExcludedHosts(hosts ...string) Option

WithExcludedHosts adds a check to only allow URLs with hosts other than the given ones

func WithHTTPTransport

func WithHTTPTransport(rt http.RoundTripper) Option

WithHTTPTransport sets the optional http.RoundTripper used to make HTTP requests

func WithMaxDepth

func WithMaxDepth(depth int) Option

WithMaxDepth sets the max depth of the crawl. It must be greater than zero or the call will panic.

func WithOneRequestPerURL

func WithOneRequestPerURL() Option

WithOneRequestPerURL adds a check to only request each URL once
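
Options are passed to New or NewWorker and can be combined freely; a minimal sketch using a few of the options above (the host, depth and concurrency values are arbitrary):

cr, err := crawler.New(
	crawler.WithAllowedHosts("godoc.org"),
	crawler.WithMaxDepth(2),
	crawler.WithConcurrentRequests(4),
)
if err != nil {
	panic(err)
}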

type Queue

type Queue interface {
	PushBack(*Request) error
	PopFront() (*Request, error)
}

Queue is used by workers to keep track of the urls that need to be fetched. Queue must be safe to use concurrently.
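
Any type implementing these two methods can be used as the queue; a minimal, mutex-guarded sketch that follows InMemoryQueue's convention of returning a nil request when the queue is empty (assuming a sync import):

type sliceQueue struct {
	mu   sync.Mutex
	reqs []*crawler.Request
}

func (q *sliceQueue) PushBack(req *crawler.Request) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.reqs = append(q.reqs, req)
	return nil
}

func (q *sliceQueue) PopFront() (*crawler.Request, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.reqs) == 0 {
		return nil, nil
	}
	req := q.reqs[0]
	q.reqs = q.reqs[1:]
	return req, nil
}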

type Request

type Request struct {
	URL *url.URL
	// contains filtered or unexported fields
}

Request is used to fetch a page and information about its resources

func NewRequest

func NewRequest(uri string) (*Request, error)

NewRequest initialises a new crawling request to extract information from a single URL

func (*Request) Finish

func (r *Request) Finish()

Finish should be called once the request has been completed

type Response

type Response struct {
	URL        string  `json:"url"`
	RedirectTo string  `json:"redirect_to,omitempty"`
	Links      []Link  `json:"links"`
	Assets     []Asset `json:"assets"`
	// contains filtered or unexported fields
}

Response has the details from crawling a single URL
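
Inside a CrawlFunc the response fields can be inspected directly; a short sketch:

err = cr.Crawl(startURL, func(u string, res *crawler.Response, err error) error {
	if err != nil {
		return nil
	}
	if res.RedirectTo != "" {
		fmt.Printf("%s redirects to %s\n", res.URL, res.RedirectTo)
	}
	for _, link := range res.Links {
		fmt.Println("link:", link.URL)
	}
	for _, asset := range res.Assets {
		fmt.Printf("asset: %s (%s)\n", asset.URL, asset.Tag)
	}
	return crawler.ErrSkipURL
})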

type Runner

type Runner interface {
	Run(context.Context, Queue) error
}

Runner defines the interface required to run a crawl

type Simple

type Simple struct {
	// contains filtered or unexported fields
}

Simple is responsible for running a crawl, allowing you to queue new URLs and build the requests to be crawled.

func New

func New(opts ...Option) (*Simple, error)

New initialises a new crawl runner

func (*Simple) Crawl

func (s *Simple) Crawl(startURL string, crawlFn CrawlFunc) error

Crawl will fetch all the linked websites starting from startURL, invoking crawlFn for each fetched URL with either the response or the error.

It will return an error if the crawl was prematurely stopped or could not be started.

Crawl will always add WithOneRequestPerURL to the options of the worker to avoid infinite loops.

type Worker

type Worker struct {
	// contains filtered or unexported fields
}

Worker is used to run a crawl on a single goroutine

func NewWorker

func NewWorker(fn CrawlFunc, opts ...Option) (*Worker, error)

NewWorker initialises a worker that runs the crawl on a single goroutine

func (*Worker) Run

func (w *Worker) Run(ctx context.Context, q Queue) error

Run starts processing requests from the queue
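
A sketch of wiring a Worker to an InMemoryQueue by hand; how exactly Run terminates (when the queue drains or only when the context is done) is not spelled out here, so the sketch bounds the crawl with a timeout (assuming context, fmt and time imports):

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

q := crawler.NewInMemoryQueue(ctx)

w, err := crawler.NewWorker(func(u string, res *crawler.Response, err error) error {
	if err != nil {
		return nil
	}
	fmt.Println(u)
	return crawler.ErrSkipURL
})
if err != nil {
	panic(err)
}

req, err := crawler.NewRequest("https://godoc.org")
if err != nil {
	panic(err)
}
if err := q.PushBack(req); err != nil {
	panic(err)
}

if err := w.Run(ctx, q); err != nil {
	fmt.Println("crawl stopped:", err)
}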

Directories

Path Synopsis
examples
