crawler

package
v0.0.0-...-7b67181
Published: Jun 5, 2023 License: MIT Imports: 23 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// A PrivateNetworkDetector instance for detecting links that resolve to
	// private network addresses.
	PrivateNetworkDetector PrivateNetworkDetector

	// A URLGetter instance for fetching links.
	URLGetter URLGetter

	// A Graph instance for adding new links to the link graph.
	Graph Graph

	// A TextIndexer instance for indexing the content of each retrieved link.
	Indexer Indexer

	// The number of concurrent workers used for retrieving links.
	FetchWorkers int
}

Config encapsulates the configuration options for creating a new Crawler.

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler implements a web-page crawling pipeline consisting of the following stages:

  • Given a URL, retrieve the web-page contents from the remote server.
  • Extract and resolve absolute and relative links from the retrieved page.
  • Extract page title and text content from the retrieved page.
  • Update the link graph: add new links and create edges between the crawled page and the links within it.
  • Index crawled page title and text content.

func NewCrawler

func NewCrawler(cfg Config) *Crawler

NewCrawler returns a new crawler instance.

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context, linkIt graph.LinkIterator) (int, error)

Crawl iterates linkIt and sends each link through the crawler pipeline, returning the total count of links that went through the pipeline. Calls to Crawl block until the link iterator is exhausted, an error occurs, or the context is cancelled.
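
For illustration, a minimal sketch of wiring the pieces together and running a single crawl pass. It assumes the crawler and graph packages are imported along with context and log, and that detector, fetcher, graphStore, indexStore and linkIt are caller-supplied values; none of these names come from the package itself.

func runCrawlPass(ctx context.Context, detector crawler.PrivateNetworkDetector,
	fetcher crawler.URLGetter, graphStore crawler.Graph, indexStore crawler.Indexer,
	linkIt graph.LinkIterator) error {

	// Assemble the pipeline configuration and build the crawler.
	c := crawler.NewCrawler(crawler.Config{
		PrivateNetworkDetector: detector,
		URLGetter:              fetcher,
		Graph:                  graphStore,
		Indexer:                indexStore,
		FetchWorkers:           4, // number of concurrent fetch workers
	})

	// Crawl blocks until linkIt is exhausted, an error occurs, or ctx is cancelled.
	processed, err := c.Crawl(ctx, linkIt)
	if err != nil {
		return err
	}
	log.Printf("crawl pass processed %d links", processed)
	return nil
}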

type Graph

type Graph interface {
	// UpsertLink creates a new link or updates an existing link.
	UpsertLink(link *graph.Link) error

	// UpsertEdge creates a new edge or updates an existing edge.
	UpsertEdge(edge *graph.Edge) error

	// RemoveStaleEdges removes any edge that originates from the specified
	// link ID and was updated before the specified timestamp.
	RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}

Graph is implemented by objects that can upsert links and edges into a link graph instance.
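
As an illustration, a minimal in-memory Graph sketch that merely records the calls made by the crawler, which can be handy as a test double. The inMemoryGraph name is hypothetical, and the sync, time, uuid and graph packages are assumed to be imported; a real store would also honour RemoveStaleEdges.

type inMemoryGraph struct {
	mu    sync.Mutex
	links []*graph.Link
	edges []*graph.Edge
}

func (g *inMemoryGraph) UpsertLink(link *graph.Link) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.links = append(g.links, link)
	return nil
}

func (g *inMemoryGraph) UpsertEdge(edge *graph.Edge) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.edges = append(g.edges, edge)
	return nil
}

func (g *inMemoryGraph) RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error {
	// A real implementation would drop edges originating from fromID that were
	// last updated before updatedBefore; this sketch intentionally does nothing.
	return nil
}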

type GraphAPI

type GraphAPI interface {
	UpsertLink(link *graph.Link) error
	UpsertEdge(edge *graph.Edge) error
	RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
	Links(fromID, toID uuid.UUID, retrievedBefore time.Time) (graph.LinkIterator, error)
}

GraphAPI defines a set of API methods for accessing the link graph.

type IndexAPI

type IndexAPI interface {
	Index(doc *index.Document) error
}

IndexAPI defines a set of API methods for indexing crawled documents.

type Indexer

type Indexer interface {
	// Index inserts a new document into the index or updates the index entry
	// for an existing document.
	Index(doc *index.Document) error
}

Indexer is implemented by objects that can index the contents of web-pages retrieved by the crawler pipeline.
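
For example, a minimal Indexer sketch that only counts the documents handed to it; a real implementation would persist each document to a full-text index. The countingIndexer name is hypothetical, and the sync and index packages are assumed to be imported.

type countingIndexer struct {
	mu      sync.Mutex
	indexed int
}

func (i *countingIndexer) Index(doc *index.Document) error {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.indexed++ // a real indexer would write doc to its backing store here
	return nil
}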

type PrivateNetworkDetector

type PrivateNetworkDetector interface {
	IsPrivate(host string) (bool, error)
}

PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address.
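
One possible implementation, sketched here using only the standard library, resolves the host with net.LookupIP and reports whether any returned address is a loopback or private address (net.IP.IsPrivate covers the RFC 1918 and RFC 4193 ranges). The lookupDetector name is hypothetical, and this is not necessarily how the package's default detector behaves.

type lookupDetector struct{}

func (lookupDetector) IsPrivate(host string) (bool, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return false, err
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsPrivate() {
			return true, nil
		}
	}
	return false, nil
}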

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service implements the web-crawler component for the Links 'R' Us project.

func NewService

func NewService(cfg ServiceConfig) (*Service, error)

NewService creates a new crawler service instance with the specified config.

func (*Service) Name

func (svc *Service) Name() string

Name implements service.Service

func (*Service) Run

func (svc *Service) Run(ctx context.Context) error

Run implements service.Service

type ServiceConfig

type ServiceConfig struct {
	// An API for managing and iterating links and edges in the link graph.
	GraphAPI GraphAPI

	// An API for indexing documents.
	IndexAPI IndexAPI

	// An API for detecting private network addresses. If not specified,
	// a default implementation that handles the private network ranges
	// defined in RFC1918 will be used instead.
	PrivateNetworkDetector PrivateNetworkDetector

	// An API for performing HTTP requests. If not specified,
	// http.DefaultClient will be used instead.
	URLGetter URLGetter

	// An API for detecting the partition assignments for this service.
	PartitionDetector partition.Detector

	// A clock instance for generating time-related events. If not specified,
	// the default wall-clock will be used instead.
	Clock clock.Clock

	// The number of concurrent workers used for retrieving links.
	FetchWorkers int

	// The time between subsequent crawler passes.
	UpdateInterval time.Duration

	// The minimum amount of time before re-indexing an already-crawled link.
	ReIndexThreshold time.Duration

	// The logger to use. If not defined, an output-discarding logger will
	// be used instead.
	Logger *logrus.Entry
}

ServiceConfig encapsulates the settings for configuring the web-crawler service. Not to be confused with the Config for the crawler itself.
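
For illustration, a sketch of configuring and starting the service. The graphStore, indexStore, partDetector and ctx values are assumed to be supplied by the caller, and the optional fields are left unset so that the documented defaults (RFC 1918 detector, http.DefaultClient, wall clock, output-discarding logger) apply.

svc, err := crawler.NewService(crawler.ServiceConfig{
	GraphAPI:          graphStore,
	IndexAPI:          indexStore,
	PartitionDetector: partDetector,
	FetchWorkers:      8,
	UpdateInterval:    5 * time.Minute,
	ReIndexThreshold:  24 * time.Hour,
})
if err != nil {
	return err
}
// Run is assumed to block, driving periodic crawl passes until ctx is
// cancelled or an error occurs.
return svc.Run(ctx)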

type URLGetter

type URLGetter interface {
	Get(url string) (*http.Response, error)
}

URLGetter is implemented by objects that can perform HTTP GET requests.
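
Any type with a matching Get method satisfies the interface. For example, a hypothetical wrapper around an http.Client that enforces an explicit request timeout instead of relying on http.DefaultClient:

type timeoutGetter struct {
	client *http.Client
}

func newTimeoutGetter(timeout time.Duration) *timeoutGetter {
	return &timeoutGetter{client: &http.Client{Timeout: timeout}}
}

func (g *timeoutGetter) Get(url string) (*http.Response, error) {
	return g.client.Get(url)
}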

Directories

Path	Synopsis
mocks	Package mocks is a generated GoMock package.
