crawler

package
v0.0.0-...-8b501b0
Published: Jul 6, 2023 License: MIT Imports: 14 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// An API for managing and iterating over the links and edges in the link graph.
	GraphAPI GraphAPI

	// An API for indexing documents.
	IndexAPI IndexAPI

	// An API for detecting private network addresses. If not specified,
	// a default implementation that handles the private network ranges
	// defined in RFC1918 will be used instead.
	PrivateNetworkDetector crawler_pipeline.PrivateNetworkDetector

	// An API for performing HTTP requests. If not specified,
	// http.DefaultClient will be used instead.
	URLGetter crawler_pipeline.URLGetter

	// An API for detecting the partition assignments for this service.
	PartitionDetector partition.Detector

	// A clock instance for generating time-related events. If not specified,
	// the default wall-clock will be used instead.
	Clock clock.Clock

	// The number of concurrent workers used for retrieving links.
	FetchWorkers int

	// The time between subsequent crawler passes.
	UpdateInterval time.Duration

	// The minimum amount of time before re-indexing an already-crawled link.
	ReIndexThreshold time.Duration

	// The logger to use. If not defined an output-discarding logger will
	// be used instead.
	Logger *logrus.Entry
}

Config encapsulates the settings for configuring the web-crawler service.

type GraphAPI

type GraphAPI interface {
	UpsertLink(link *graph.Link) error
	UpsertEdge(edge *graph.Edge) error
	RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
	Links(fromID, toID uuid.UUID, retrievedBefore time.Time) (graph.LinkIterator, error)
}

GraphAPI defines a set of API methods for accessing the link graph.

type IndexAPI

type IndexAPI interface {
	Index(doc *index.Document) error
}

IndexAPI defines a set of API methods for indexing crawled documents.

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service implements the web-crawler component for the Links 'R' Us project.

func NewService

func NewService(cfg Config) (*Service, error)

NewService creates a new crawler service instance with the specified config.

func (*Service) Name

func (svc *Service) Name() string

Name implements service.Service

func (*Service) Run

func (svc *Service) Run(ctx context.Context) error

Run implements service.Service

Directories

Path Synopsis
Package mocks is a generated GoMock package.
