package crawl

v0.0.0-...-106735c
Published: Mar 4, 2017 License: MIT Imports: 10 Imported by: 0

Documentation


Constants

const (
	WAITING uint8 = 0
	STOPPED uint8 = 1
	RUNNING uint8 = 2
)

Possible Worker states

Variables

This section is empty.

Functions

This section is empty.

Types

type AsyncHTTPCrawler

type AsyncHTTPCrawler struct {
	// contains filtered or unexported fields
}

AsyncHTTPCrawler is an implementation of the Crawler interface. It contains a fetcher that initiates the crawling and zero or more workers that perform the processing.

func NewAsyncHTTPCrawler

func NewAsyncHTTPCrawler(seedURL *url.URL) *AsyncHTTPCrawler

NewAsyncHTTPCrawler is a constructor. It takes the seed URL the crawl will start from and sets up the Fetcher and the workers that process responses and build the Sitemap.

func (*AsyncHTTPCrawler) Crawl

func (c *AsyncHTTPCrawler) Crawl() (sitemap.Sitemapper, error)

Crawl is the main entrypoint to crawling a domain (URL). Crawl returns a Sitemapper that can later be used to create a representation of the crawled site. It returns an error in case the crawl URL is invalid.

type AsyncHTTPFetcher

type AsyncHTTPFetcher struct {
	//AsyncHTTPFetcher is an Asynchronous Worker
	*AsyncWorker
	// contains filtered or unexported fields
}

AsyncHTTPFetcher implements Fetcher

func NewAsyncHTTPFetcher

func NewAsyncHTTPFetcher() *AsyncHTTPFetcher

NewAsyncHTTPFetcher is a constructor for an AsyncHTTPFetcher. It does not start the Fetcher; that is done with the Run method.

func (*AsyncHTTPFetcher) Fetch

func (a *AsyncHTTPFetcher) Fetch(url *url.URL) error

Fetch places a request for a URL onto the request queue. It returns nil on success and an error in case the URL is not valid.

func (*AsyncHTTPFetcher) ResponseChannel

func (a *AsyncHTTPFetcher) ResponseChannel() (responseQueue *FetchResponseQueue)

ResponseChannel is a getter returning the Fetcher's channel from which consumers should receive results.

func (*AsyncHTTPFetcher) Run

func (a *AsyncHTTPFetcher) Run() error

Run starts a loop that waits for requests or the quit signal. Run is interrupted once the Stop method is called.

func (*AsyncHTTPFetcher) Worker

func (a *AsyncHTTPFetcher) Worker() Worker

Worker returns the embedded AsyncWorker struct, which is used to Run and Stop the fetcher worker.

type AsyncHTTPParser

type AsyncHTTPParser struct {
	//AsyncHTTPParser is an Asynchronous Worker
	*AsyncWorker
	// contains filtered or unexported fields
}

func NewAsyncHTTPParser

func NewAsyncHTTPParser(seedURL *url.URL, fetcher Fetcher) *AsyncHTTPParser

func (*AsyncHTTPParser) ResponseChannel

func (p *AsyncHTTPParser) ResponseChannel() *parserResponseQueue

func (*AsyncHTTPParser) Run

func (p *AsyncHTTPParser) Run() error

Run starts a loop that waits for requests or the quit signal. Run is interrupted once the Stop method is called.

func (*AsyncHTTPParser) Worker

func (p *AsyncHTTPParser) Worker() Worker

Worker returns the embedded AsyncWorker struct, which is used to Run and Stop the parser worker.

type AsyncHttpTracker

type AsyncHttpTracker struct {
	//Tracker is an Asynchronous Worker
	*AsyncWorker
	// contains filtered or unexported fields
}

An AsyncHttpTracker is an Asynchronous worker struct that is responsible for receiving URLs from a Parser and passing the uncrawled URLs to the Fetcher

func NewAsyncHttpTracker

func NewAsyncHttpTracker(fetcher Fetcher, parser Parser) *AsyncHttpTracker

func (*AsyncHttpTracker) Run

func (t *AsyncHttpTracker) Run() error

func (*AsyncHttpTracker) SetSitemapper

func (t *AsyncHttpTracker) SetSitemapper(s sitemap.Sitemapper)

SetSitemapper provides the Tracker with a Sitemapper. The Tracker is responsible for providing the Sitemapper with new URL data.

func (*AsyncHttpTracker) Worker

func (t *AsyncHttpTracker) Worker() Worker

type AsyncWorker

type AsyncWorker struct {
	RunFunc func() error

	Quit chan uint8
	Name string
	// contains filtered or unexported fields
}

AsyncWorker implements the Worker interface. It is meant to be embedded in another struct, such as AsyncHTTPFetcher.

func NewAsyncWorker

func NewAsyncWorker(name string) *AsyncWorker

NewAsyncWorker is a constructor for an AsyncWorker.

func (*AsyncWorker) Run

func (w *AsyncWorker) Run() error

Run calls the encapsulating struct's RunFunc.

func (*AsyncWorker) SetState

func (w *AsyncWorker) SetState(state uint8)

SetState setter (See interface definition)

func (*AsyncWorker) State

func (w *AsyncWorker) State() uint8

State getter (See interface definition)

func (*AsyncWorker) Stop

func (w *AsyncWorker) Stop()

Stop notifies the quit channel. The encapsulating struct's RunFunc needs to receive from the quit channel in order to stop.
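The Run/Stop contract can be sketched as follows. Only the exported AsyncWorker fields documented above are used; the work and processed channels are stand-ins for the real request queues, and the loop body is an assumed shape of a RunFunc that honors the quit channel:

```go
package main

import "fmt"

const (
	WAITING uint8 = 0
	STOPPED uint8 = 1
	RUNNING uint8 = 2
)

// Minimal local copy of the AsyncWorker shape for the sketch.
type AsyncWorker struct {
	RunFunc func() error
	Quit    chan uint8
	Name    string
	state   uint8
}

func (w *AsyncWorker) SetState(s uint8) { w.state = s }
func (w *AsyncWorker) State() uint8     { return w.state }
func (w *AsyncWorker) Run() error       { return w.RunFunc() }

// Stop notifies the quit channel; RunFunc must receive from it.
func (w *AsyncWorker) Stop() { w.Quit <- STOPPED }

func main() {
	work := make(chan string, 1)
	processed := make(chan string, 1)
	w := &AsyncWorker{Quit: make(chan uint8), Name: "Fetcher"}
	w.RunFunc = func() error {
		for {
			w.SetState(WAITING)
			select {
			case item := <-work:
				w.SetState(RUNNING)
				processed <- item
			case <-w.Quit:
				w.SetState(STOPPED)
				return nil
			}
		}
	}

	done := make(chan error)
	go func() { done <- w.Run() }()

	work <- "https://example.com"
	fmt.Println("processed:", <-processed)
	w.Stop()
	<-done
	fmt.Println("final state:", w.State()) // STOPPED
}
```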

func (*AsyncWorker) Type

func (w *AsyncWorker) Type() string

Type returns the Name given to the Worker at initialisation.

type Crawler

type Crawler interface {
	//Crawl is the main entrypoint to
	//crawling a domain (url)
	Crawl(url string) (sitemap.Sitemapper, error)
}

A Crawler crawls a domain and returns a representation of the crawled domain

type FetchMessage

type FetchMessage struct {
	Request  *url.URL
	Response *http.Response
	Error    error
}

FetchMessage is a struct used to pass the result of a Fetch request back to the requester. It includes the original Request (for tracking), the Response, and an Error set when the request could not complete successfully.

type FetchResponseQueue

type FetchResponseQueue chan *FetchMessage

FetchResponseQueue queue is used for outgoing responses from the Fetcher

type Fetcher

type Fetcher interface {
	// Fetch provides work to the Fetcher, in the
	// form of a URL to process
	Fetch(url *url.URL) error

	// ResponseChannel is a Getter returning
	// the Fetcher's Channel  that consumers
	// should be receiving results from
	ResponseChannel() (responseQueue *FetchResponseQueue)

	// Retrieve Worker that manages Fetcher Service
	Worker() Worker
}

Fetcher is an Asynchronous Worker interface that is responsible for Fetching URLs and exposing a ResponseChannel where the results of type FetchMessage are passed to the consumers

type HTPPClient

type HTPPClient interface {
	// At the moment, response is of type http.Response which locks
	// in implementation!
	Get(url string) (resp *http.Response, err error)
}

HTPPClient is an interface that wraps around the http.Client struct and can be replaced by any other client implementation.
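Because HTPPClient only requires a Get method with the same signature as (*http.Client).Get, tests can swap in a canned stub with no network I/O. The stubClient and fetchBody names below are illustrative, not part of the package (the interface spelling is the package's own):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// HTPPClient as declared in the package.
type HTPPClient interface {
	Get(url string) (resp *http.Response, err error)
}

// stubClient satisfies HTPPClient by returning a canned response.
type stubClient struct{ body string }

func (s *stubClient) Get(url string) (*http.Response, error) {
	return &http.Response{
		StatusCode: 200,
		Body:       io.NopCloser(strings.NewReader(s.body)),
	}, nil
}

// fetchBody shows code written against the interface, which works
// with *http.Client and stubClient alike.
func fetchBody(c HTPPClient, url string) (string, error) {
	resp, err := c.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	body, err := fetchBody(&stubClient{body: "<html></html>"}, "https://example.com")
	fmt.Println(body, err)
}
```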

type ParseMessage

type ParseMessage struct {
	Request  *url.URL
	Response *url.URL
}

type Parser

type Parser interface {
	// ResponseChannel is a Getter returning
	// the Parser's Channel  that consumers
	// should be receiving results from
	ResponseChannel() (responseQueue *parserResponseQueue)

	// Retrieve Worker
	Worker() Worker
}

Parser is an Asynchronous Worker interface that exposes a ResponseChannel from which consumers receive parsed results.

type RequestQueue

type RequestQueue chan url.URL

RequestQueue is used for incoming requests to the fetcher

type Tracker

type Tracker interface {
	// SetSitemapper provides the Tracker with
	// a Sitemapper. The Tracker is responsible for
	// providing the Sitemapper with
	// new URL data.
	SetSitemapper(sitemap.Sitemapper)

	// Retrieve Worker
	Worker() Worker
}

A Tracker is an Asynchronous worker interface that is responsible for receiving URLs from the Parser and passing the uncrawled URLs to the Fetcher.

type Worker

type Worker interface {
	// Run starts the Asynchronous worker
	Run() error

	// Returns worker name
	// Example names are:
	// - Fetcher
	// - Parser
	// - Tracker
	// - Sitemapper
	Type() string

	// State returns the state the worker is in:
	// RUNNING - processing work
	// WAITING - Waits for work
	// STOPPED - Not running
	State() uint8
	SetState(state uint8)
}

Worker is an interface that can be used to manage agents that perform work in a different thread.
