crawler

package module
v0.0.0-...-deb2a4f Latest
Warning

This package is not in the latest version of its module.

Published: May 31, 2018 License: MIT Imports: 12 Imported by: 0

README

Crawler

Web crawler PoC

Build / Deploy

# Clone
git clone https://github.com/disq/crawler.git
cd crawler

# Fetch dependencies
dep ensure

# Build
go build ./cmd/crawler

Usage

Usage: ./crawler [options] [start url] [additional hosts to include...]
  -l string
        Log level (default "info")
  -t duration
        HTTP timeout (default 5s)
  -w int
        Number of worker goroutines. Negative numbers mean multiples of the CPU core count. (default 256)
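The rule for negative -w values can be sketched as below. This is only an illustration of the documented behaviour, not the crawler's own code: resolveWorkers and its cores parameter are hypothetical names, and the handling of zero is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveWorkers is a hypothetical helper, not part of the crawler package.
// A positive n is used as given; a negative n is read as |n| workers per
// CPU core. Zero is returned unchanged (an assumption; the usage text does
// not say what zero means).
func resolveWorkers(n, cores int) int {
	if n < 0 {
		return -n * cores
	}
	return n
}

func main() {
	cores := runtime.NumCPU()
	fmt.Println(resolveWorkers(256, cores)) // the documented default, used as is
	fmt.Println(resolveWorkers(-2, cores))  // two workers per CPU core
}
```

Under this reading, running with -w -2 on an 8-core machine would start 16 workers.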

License

MIT.

Documentation

Constants

This section is empty.

Variables

var (
	ErrUnsupportedScheme = fmt.Errorf("Unsupported scheme")
	ErrFilteredOut       = fmt.Errorf("Filtered out")
	ErrAlreadyInList     = fmt.Errorf("Already in visit list")
)

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler is our main struct

func New

func New(ctx context.Context, logger yolo.Logger, client *http.Client, filter FilterFunc, mapper Mapper) *Crawler

New creates a new crawler

func (*Crawler) Add

func (c *Crawler) Add(source *url.URL, uri ...*url.URL) []error

Add adds one or more previously un-added URLs for the crawler to visit. source can be nil to indicate the root. Returns a list of errors if any occurred.

func (*Crawler) Run

func (c *Crawler) Run(numWorkers int)

Run launches the worker pool and blocks until they all finish.

func (*Crawler) Stats

func (c *Crawler) Stats() (uint64, uint64, uint64)

type FilterFunc

type FilterFunc func(*url.URL) bool

FilterFunc is used to exclude urls from getting crawled
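A host allow-list is one plausible filter, in the spirit of the "additional hosts to include" arguments in the usage text. The sketch below redeclares FilterFunc so that it compiles standalone. It assumes that a return value of true means "crawl this URL", which the documentation does not confirm. hostFilter and allows are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// FilterFunc mirrors the documented type so this sketch is self-contained.
type FilterFunc func(*url.URL) bool

// hostFilter is a hypothetical constructor that builds a filter from a list
// of host names. ASSUMPTION: returning true means the URL should be crawled;
// invert the result if the package treats true as "exclude".
func hostFilter(hosts ...string) FilterFunc {
	allowed := make(map[string]struct{}, len(hosts))
	for _, h := range hosts {
		allowed[strings.ToLower(h)] = struct{}{}
	}
	return func(u *url.URL) bool {
		if u == nil {
			return false
		}
		// Hostname strips any port, so example.com:8080 matches example.com.
		_, ok := allowed[strings.ToLower(u.Hostname())]
		return ok
	}
}

// allows parses raw and applies f; input that fails to parse is rejected.
func allows(f FilterFunc, raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return f(u)
}

func main() {
	f := hostFilter("example.com", "www.example.com")
	fmt.Println(allows(f, "https://example.com/about")) // true
	fmt.Println(allows(f, "https://other.test/page"))   // false
}
```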

type Mapper

type Mapper interface {
	Add(string, ...string)
}

Mapper is used to map a site structure
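A minimal in-memory implementation might look like the sketch below. It assumes that the first argument to Add is the page being mapped and that the remaining arguments are the links found on it; the documentation does not spell this out. It also guards its state with a mutex on the assumption that worker goroutines may call Add concurrently. memoryMapper, newMemoryMapper and Children are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Mapper mirrors the documented interface so this sketch is self-contained.
type Mapper interface {
	Add(string, ...string)
}

// memoryMapper is a hypothetical Mapper that records, per page, the set of
// links reported for it. ASSUMPTION: Add(page, links...) is the intended
// meaning of the two parameters.
type memoryMapper struct {
	mu    sync.Mutex
	links map[string]map[string]struct{}
}

func newMemoryMapper() *memoryMapper {
	return &memoryMapper{links: make(map[string]map[string]struct{})}
}

// Add records children under parent, ignoring duplicates.
func (m *memoryMapper) Add(parent string, children ...string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	set, ok := m.links[parent]
	if !ok {
		set = make(map[string]struct{})
		m.links[parent] = set
	}
	for _, c := range children {
		set[c] = struct{}{}
	}
}

// Children returns the links recorded for parent in sorted order.
func (m *memoryMapper) Children(parent string) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]string, 0, len(m.links[parent]))
	for c := range m.links[parent] {
		out = append(out, c)
	}
	sort.Strings(out)
	return out
}

// Compile-time check that memoryMapper satisfies Mapper.
var _ Mapper = (*memoryMapper)(nil)

func main() {
	m := newMemoryMapper()
	m.Add("https://example.com/", "https://example.com/b", "https://example.com/a")
	m.Add("https://example.com/", "https://example.com/a") // duplicate is ignored
	fmt.Println(m.Children("https://example.com/"))
}
```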

Directories

Path Synopsis
cmd