crawler

package
v0.0.0-...-32be1cf
Published: Sep 13, 2019 License: Apache-2.0 Imports: 7 Imported by: 0

Documentation

Overview

Package crawler provides helper methods and defines an interface for launching source repository crawlers that retrieve files from a source and forward them to a channel for indexing and retrieval.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CrawlFromSeed

func CrawlFromSeed(ctx context.Context, seed CrawlerSeed,
	crawlers []Crawler, conv Converter, indx IndexFunc)

CrawlFromSeed is a cleaner, more efficient, and more extensible crawler implementation. The seed must include the IDs of each document in the index.
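
As a hedged sketch, a caller might wire CrawlFromSeed together as follows. The import paths are assumptions (this module's real paths may differ), and the Converter and IndexFunc bodies are stubs for illustration only:

package main

import (
	"context"

	"example.com/search/crawler"     // assumed import path, not the real one
	"example.com/search/crawler/doc" // assumed import path, not the real one
)

func main() {
	ctx := context.Background()

	// Seed with the IDs of every document already in the index; the
	// document data itself does not need to be populated.
	seed := crawler.CrawlerSeed{ /* previously indexed *doc.Document values */ }

	var crawlers []crawler.Crawler // e.g. the github sub-package's crawler

	// Converter: wrap each raw document for the indexer (stubbed here).
	conv := func(d *doc.Document) (crawler.CrawlerDocument, error) {
		return nil, nil
	}

	// IndexFunc: persist each converted document (stubbed here).
	indx := func(cd crawler.CrawlerDocument, c crawler.Crawler) error {
		return nil
	}

	crawler.CrawlFromSeed(ctx, seed, crawlers, conv, indx)
}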

func CrawlerRunner

func CrawlerRunner(ctx context.Context,
	output chan<- CrawlerDocument, crawlers []Crawler) []error

CrawlerRunner is a blocking function; it returns only once all of the crawlers have finished executing.

This function uses the output channel to forward kustomization documents from a list of crawlers. The output is to be consumed by a database/search indexer for later retrieval.

The return value is a slice of errors in which each element corresponds to the crawler at the same index in the crawlers slice. Individual errors may be nil, but the slice is always exactly the length of the crawlers slice.

CrawlerRunner takes in a seed, which represents the documents already stored in an index somewhere. The document data is not required to be populated, which is preferable when there are many documents. The order of iteration over the seed is not guaranteed, but CrawlerRunner does guarantee that every element from the seed will be processed before any other documents from the crawlers.
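
Because CrawlerRunner blocks, the output channel is usually consumed concurrently. A minimal sketch, assuming the ctx and crawlers values from the sketch above plus the standard library log package:

output := make(chan crawler.CrawlerDocument)

// Run the crawlers in the background and close the channel once all
// of them have finished.
go func() {
	defer close(output)
	for i, err := range crawler.CrawlerRunner(ctx, output, crawlers) {
		if err != nil {
			log.Printf("crawler %d: %v", i, err)
		}
	}
}()

// Consume documents as they arrive, e.g. to feed a search indexer.
for cd := range output {
	log.Printf("indexing %s (cached: %v)", cd.ID(), cd.WasCached())
}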

Types

type Converter

type Converter func(*doc.Document) (CrawlerDocument, error)
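As a hedged sketch, a Converter might simply wrap the raw document in a concrete CrawlerDocument implementation; newKustomizationDoc here is a hypothetical constructor (sketched under CrawlerDocument below):

var conv crawler.Converter = func(d *doc.Document) (crawler.CrawlerDocument, error) {
	// Wrap the raw document in a concrete CrawlerDocument implementation.
	return newKustomizationDoc(d)
}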

type Crawler

type Crawler interface {
	// Crawl returns when it is done processing. This method does not take
	// ownership of the channel. The channel is write only, and it
	// designates where the crawler should forward the documents.
	Crawl(ctx context.Context, output chan<- CrawlerDocument) error

	// Get the document data given the FilePath, Repo, and Ref/Tag/Branch.
	FetchDocument(context.Context, *doc.Document) error
	// Write to the document what the created time is.
	SetCreated(context.Context, *doc.Document) error

	Match(*doc.Document) bool
}

Crawler forwards documents from source repositories to an index that stores them for searching. Each crawler is responsible for querying its source of information and for forwarding files that have not been seen before or that need updating.
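
A minimal sketch of a type satisfying the interface, with all source-specific logic stubbed out; fsCrawler and its behavior are assumptions for illustration, not part of this package:

// fsCrawler is a hypothetical Crawler over a local directory tree.
type fsCrawler struct {
	root string
}

// Crawl walks the tree under root, forwarding each discovered document.
func (c *fsCrawler) Crawl(ctx context.Context, output chan<- crawler.CrawlerDocument) error {
	// Walk c.root, convert each file, send it on output, and stop
	// early if ctx is cancelled. Stubbed for brevity.
	return nil
}

// FetchDocument reads the file identified by the document's path fields.
func (c *fsCrawler) FetchDocument(ctx context.Context, d *doc.Document) error {
	return nil // stub
}

// SetCreated records the file's creation time on the document.
func (c *fsCrawler) SetCreated(ctx context.Context, d *doc.Document) error {
	return nil // stub
}

// Match reports whether this crawler is responsible for the document's source.
func (c *fsCrawler) Match(d *doc.Document) bool {
	return true // stub
}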

type CrawlerDocument

type CrawlerDocument interface {
	ID() string
	GetDocument() *doc.Document
	GetResources() ([]*doc.Document, error)
	WasCached() bool
}
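
A hedged sketch of a concrete implementation; kustomizationDoc and its fields are hypothetical:

// kustomizationDoc is a hypothetical CrawlerDocument implementation.
type kustomizationDoc struct {
	id       string
	document *doc.Document
	cached   bool
}

func (k *kustomizationDoc) ID() string                 { return k.id }
func (k *kustomizationDoc) GetDocument() *doc.Document { return k.document }
func (k *kustomizationDoc) WasCached() bool            { return k.cached }

// GetResources parses the document body for any resources it references.
func (k *kustomizationDoc) GetResources() ([]*doc.Document, error) {
	return nil, nil // stub
}

// newKustomizationDoc wraps a raw document; used by the Converter sketch above.
func newKustomizationDoc(d *doc.Document) (crawler.CrawlerDocument, error) {
	return &kustomizationDoc{document: d}, nil
}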

type CrawlerSeed

type CrawlerSeed []*doc.Document

type IndexFunc

type IndexFunc func(CrawlerDocument, Crawler) error
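A hedged sketch of an IndexFunc; searchIndex and its Put method are hypothetical stand-ins for a real indexing client:

var indx crawler.IndexFunc = func(cd crawler.CrawlerDocument, c crawler.Crawler) error {
	// Index the document together with any resources it contains.
	resources, err := cd.GetResources()
	if err != nil {
		return err
	}
	docs := append([]*doc.Document{cd.GetDocument()}, resources...)
	return searchIndex.Put(cd.ID(), docs) // searchIndex is hypothetical
}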

Directories

Path	Synopsis
github	Package github implements the crawler.Crawler interface, getting data from the GitHub search API.
