octopus

package
v1.2.2
Published: Mar 25, 2019 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package octopus implements a concurrent web crawler. It uses a pipeline of channels to keep crawling non-blocking, and it provides user-configurable options that can be used to customize the behaviour of the crawler.

Features

Current Features of the crawler include:

  1. User-specifiable depth-limited crawling.
  2. User-specified valid protocols.
  3. User-buildable adapters that the crawler feeds its output to.
  4. Filtering of duplicate URLs.
  5. Filtering of URLs that fail a HEAD request.
  6. User-specifiable max timeout between two successive URL requests.
  7. User-specifiable maximum number of links to be crawled.

Pipeline Overview

The overview of the Pipeline is given below:

  1. Ingest
  2. Link Absolution
  3. Protocol Filter
  4. Duplicate Filter
  5. Invalid Url Filter (URLs whose HEAD request fails)
  (5x) (Optional) Crawl Rate Limiter
  [6]. Make GET Request
  7a. Send to Output Adapter
  7b. Check for Timeout (gap between two outputs on this channel)
  8. Max Links Crawled Limit Filter
  9. Depth Limit Filter
  10. Parse Page for more URLs

Note: The output from 7b. is fed to 8.

1 -> 2 -> 3 -> 4 -> 5 -> (5x) -> [6] -> 7b -> 8 -> 9 -> 10 -> 1
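
The stages themselves are not exported. Purely as an illustration of the channel-pipeline pattern described above, a single filter stage could look roughly like the sketch below; this is not the package's internal code, and it assumes the package is imported as octopus:

// filterStage is an illustrative pipeline stage: it reads nodes from in,
// forwards those accepted by keep to the returned channel, and exits when
// the quit channel is signalled or in is closed.
func filterStage(in <-chan *octopus.Node, quit <-chan int, keep func(*octopus.Node) bool) <-chan *octopus.Node {
	out := make(chan *octopus.Node)
	go func() {
		defer close(out)
		for {
			select {
			case n, ok := <-in:
				if !ok {
					return
				}
				if keep(n) {
					out <- n
				}
			case <-quit:
				return
			}
		}
	}()
	return out
}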

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func New

func New(opt *CrawlOptions) *octopus

New - Create an Instance of the Octopus with the given CrawlOptions.

func NewWithDefaultOptions

func NewWithDefaultOptions() *octopus

NewWithDefaultOptions - Create an Instance of the Octopus with the default CrawlOptions.

Types

type CrawlOptions

type CrawlOptions struct {
	MaxCrawlDepth         int64
	MaxCrawledUrls        int64
	StayWithinBaseHost    bool
	CrawlRatePerSec       int64
	CrawlBurstLimitPerSec int64
	RespectRobots         bool
	IncludeBody           bool
	OpAdapter             OutputAdapter
	ValidProtocols        []string
	TimeToQuit            int64
}

CrawlOptions is used to house options for crawling.

You can specify the depth of exploration for each link and whether the crawler should ignore host names other than the base host.

MaxCrawlDepth - Indicates the maximum depth that will be crawled,
for each new link.

MaxCrawledUrls - Specifies the Maximum Number of Unique Links that will be crawled.
Note: when combined with MaxCrawlDepth, both limits apply.
Use -1 to indicate infinite links to be crawled (only bounded by depth of traversal).

StayWithinBaseHost - (unimplemented) Ensures crawler stays within the
level 1 link's hostname.

CrawlRatePerSec - is the rate at which requests will be made (per second).
If this is negative, the crawl rate limiter is disabled. The default is negative.

CrawlBurstLimitPerSec - Represents the max burst capacity with which requests
can be made. This must be greater than or equal to the CrawlRatePerSec.

RespectRobots - (unimplemented) Choose whether or not to respect robots.txt.

IncludeBody - (unimplemented) Include the response Body in the crawled
NodeInfo (for further processing).

OpAdapter is a user specified concrete implementation of an Output Adapter. The crawler
will pump output onto the implementation's channel returned by its Consume method.

ValidProtocols - The list of URL protocols that should be crawled.

TimeToQuit - The total time, in seconds, to wait between two new nodes being
generated before the crawler quits.
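
For reference, a fully populated options value might look like the sketch below. The numeric values are arbitrary examples, and myAdapter stands in for a user-written OutputAdapter implementation (see the OutputAdapter type further down):

opt := &octopus.CrawlOptions{
	MaxCrawlDepth:         3,   // crawl links up to 3 levels deep
	MaxCrawledUrls:        -1,  // no cap on unique links; bounded by depth only
	StayWithinBaseHost:    false, // unimplemented at this version
	CrawlRatePerSec:       5,   // at most 5 requests per second
	CrawlBurstLimitPerSec: 10,  // burst capacity; must be >= CrawlRatePerSec
	RespectRobots:         false, // unimplemented at this version
	IncludeBody:           false, // unimplemented at this version
	OpAdapter:             myAdapter, // user-supplied OutputAdapter (hypothetical)
	ValidProtocols:        []string{"http", "https"},
	TimeToQuit:            30,  // quit after 30 seconds without a new node
}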

func GetDefaultCrawlOptions

func GetDefaultCrawlOptions() *CrawlOptions

Returns an instance of CrawlOptions with the values set to sensible defaults.
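
A common pattern, again assuming a hypothetical user adapter value named myAdapter, is to start from the defaults and override only what is needed before constructing the crawler:

opt := octopus.GetDefaultCrawlOptions()
opt.MaxCrawlDepth = 2
opt.TimeToQuit = 10       // give up after 10 seconds with no new nodes
opt.OpAdapter = myAdapter // hypothetical user OutputAdapter
crawler := octopus.New(opt)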

type Node

type Node struct {
	*NodeInfo
	Body io.ReadCloser
}

Node encloses a NodeInfo and its Body (HTML) Content.
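
As an illustration of how a consumer might handle a Node, the hypothetical helper below prints the node's metadata and drains the Body when one is present (IncludeBody is unimplemented at this version, so Body may be nil); it assumes the standard fmt, io and log packages are imported:

// handleNode prints a crawled node's metadata and, if a body is present,
// drains and closes it.
func handleNode(n *octopus.Node) {
	fmt.Printf("crawled %s (depth %d, parent %s)\n", n.UrlString, n.Depth, n.ParentUrlString)
	if n.Body == nil {
		return
	}
	defer n.Body.Close()
	data, err := io.ReadAll(n.Body)
	if err != nil {
		log.Println("reading body:", err)
		return
	}
	fmt.Printf("fetched %d bytes\n", len(data))
}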

type NodeChSet

type NodeChSet struct {
	NodeCh chan<- *Node
	*StdChannels
}

NodeChSet is the standard set of channels used to build the concurrency pipelines in the crawler.

func MakeDefaultNodeChSet

func MakeDefaultNodeChSet() (*NodeChSet, chan *Node, chan int)

Utility to create a NodeChSet and get full access to the Quit & Node Channel.

func MakeNodeChSet

func MakeNodeChSet(nodeCh chan<- *Node, quitCh chan<- int) *NodeChSet

Utility function to create a NodeChSet from existing Node and Quit channels.
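
For illustration, the two helpers differ only in who allocates the channels. With MakeNodeChSet the caller keeps the original bidirectional channels to receive on, which appears to be what MakeDefaultNodeChSet packages into a single call; newAdapterChannels is a hypothetical helper name:

// newAdapterChannels allocates the node and quit channels an adapter will
// listen on and wraps their send-only ends into a NodeChSet for the crawler.
func newAdapterChannels() (*octopus.NodeChSet, chan *octopus.Node, chan int) {
	nodeCh := make(chan *octopus.Node)
	quitCh := make(chan int)
	return octopus.MakeNodeChSet(nodeCh, quitCh), nodeCh, quitCh
}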

type NodeInfo

type NodeInfo struct {
	ParentUrlString string
	UrlString       string
	Depth           int64
}

NodeInfo is used to represent each crawled link and its associated crawl depth.

type OutputAdapter

type OutputAdapter interface {
	Consume() *NodeChSet
}

OutputAdapter is the interface that has to be implemented in order to handle outputs from the octopus crawler.

The octopus will call the OutputAdapter.Consume() method and deliver all relevant output and quit signals on the channels included in the received NodeChSet.

This implies that it is the responsibility of the user implementing OutputAdapter to process the crawler output delivered on NodeChSet.NodeCh.

Implementers of the interface should listen to the included channels in the output of Consume() for output from the crawler.
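
A minimal sketch of a concrete adapter is given below, assuming fmt is imported; the type name printAdapter and its printing behaviour are illustrative, not part of the package:

// printAdapter is an illustrative OutputAdapter that prints every crawled
// link delivered by the crawler until a quit signal arrives.
type printAdapter struct{}

func (p *printAdapter) Consume() *octopus.NodeChSet {
	chSet, nodeCh, quitCh := octopus.MakeDefaultNodeChSet()
	go func() {
		for {
			select {
			case node := <-nodeCh:
				fmt.Printf("depth %d: %s (from %s)\n", node.Depth, node.UrlString, node.ParentUrlString)
			case <-quitCh:
				return
			}
		}
	}()
	return chSet
}

Such an adapter is then assigned to CrawlOptions.OpAdapter before calling New, and the crawler delivers crawled Nodes and the final quit signal on the channels created here.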

type StdChannels

type StdChannels struct {
	QuitCh chan<- int
}

StdChannels holds the standard set of channels that are used for special operations. Channels for logging, statistics, etc. will be added in the future.
