crawler

package
v0.0.0-...-e86fd7f
Published: Apr 23, 2021 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Overview

Package crawler provides helper methods and defines an interface for launching source repository crawlers that retrieve files from a source and forward them to a channel for indexing and retrieval.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CrawlFromSeed

func CrawlFromSeed(ctx context.Context, seed CrawlSeed, crawlers []Crawler,
	conv Converter, indx IndexFunc, seen utils.SeenMap)

CrawlFromSeed updates all the documents in the seed and crawls all the new documents referred to in the seed.
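
A minimal sketch of wiring CrawlFromSeed (imports omitted, since the doc, index, and utils package paths depend on this module's layout). The converter and index function bodies are placeholders, and the zero-value SeenMap is used only for illustration; construct a real one as the utils package prescribes.

func crawlFromSeedSketch(ctx context.Context, crawlers []crawler.Crawler) {
	var seen utils.SeenMap // zero value for illustration; build a real SeenMap per the utils package

	// The seed lists documents already known to the index; their data
	// does not need to be populated.
	seed := crawler.CrawlSeed{&doc.Document{}}

	conv := func(d *doc.Document) (crawler.CrawledDocument, error) {
		// Wrap the raw document in a type that implements CrawledDocument.
		return nil, nil
	}

	indx := func(cd crawler.CrawledDocument, mode index.Mode) error {
		// Persist cd to the search index under the given mode.
		return nil
	}

	crawler.CrawlFromSeed(ctx, seed, crawlers, conv, indx, seen)
}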

func CrawlFromSeedIterator

func CrawlFromSeedIterator(ctx context.Context, it *index.KustomizeIterator, crawlers []Crawler,
	conv Converter, indx IndexFunc, seen utils.SeenMap)

CrawlFromSeedIterator iterates over all the documents in the index and calls CrawlFromSeed for each document.

func CrawlGithub

func CrawlGithub(ctx context.Context, crawlers []Crawler, conv Converter,
	indx IndexFunc, seen utils.SeenMap)

CrawlGithub crawls all the kustomization files on Github.

func CrawlGithubRunner

func CrawlGithubRunner(ctx context.Context, output chan<- CrawledDocument,
	crawlers []Crawler, seen utils.SeenMap) []error

CrawlGithubRunner is a blocking function and only returns once all of the crawlers have finished executing.

This function uses the output channel to forward kustomization documents from a list of crawlers. The output is to be consumed by a database/search indexer for later retrieval.

The return value is a slice of errors in which each element corresponds to the crawler at the same index. Individual errors can be nil, but the slice always has exactly the same length as the crawlers slice.

CrawlGithubRunner takes in a seed, which represents the documents already stored in an index somewhere. The document data is not required to be populated, which is preferable when there are many documents. The order of iteration over the seed is not guaranteed, but CrawlGithub does guarantee that every element from the seed will be processed before any other documents from the crawlers.
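
A short sketch of calling CrawlGithubRunner and consuming its per-crawler error slice (imports omitted). The draining goroutine stands in for a real database/search indexer, and closing the output channel from the caller is an assumption based on the Crawl contract, under which crawlers never take ownership of the channel.

func runCrawlersSketch(ctx context.Context, crawlers []crawler.Crawler, seen utils.SeenMap) {
	output := make(chan crawler.CrawledDocument)

	// Drain the channel concurrently; CrawlGithubRunner blocks until all
	// crawlers have finished writing to it.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for cd := range output {
			_ = cd // hand each document off to a database/search indexer here
		}
	}()

	errs := crawler.CrawlGithubRunner(ctx, output, crawlers, seen)
	close(output) // assumed: the caller owns and closes output, per the Crawl contract
	<-done

	// errs[i] reports the outcome of crawlers[i]; nil means that crawler succeeded.
	for i, err := range errs {
		if err != nil {
			log.Printf("crawler %d failed: %v", i, err)
		}
	}
}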

Types

type Converter

type Converter func(*doc.Document) (CrawledDocument, error)

type CrawlSeed

type CrawlSeed []*doc.Document

type CrawledDocument

type CrawledDocument interface {
	ID() string
	GetDocument() *doc.Document
	// Get all the Documents directly referred to in a Document.
	// For a Document representing a non-kustomization file, an empty slice will be returned.
	// For a Document representing a kustomization file:
	// the `includeResources` parameter determines whether the documents referred to in the `resources` field are returned;
	// the `includeTransformers` parameter determines whether the documents referred to in the `transformers` field are returned;
	// the `includeGenerators` parameter determines whether the documents referred to in the `generators` field are returned.
	GetResources(includeResources, includeTransformers, includeGenerators bool) ([]*doc.Document, error)
	WasCached() bool
}
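
To illustrate how Converter and CrawledDocument fit together, here is a hypothetical minimal CrawledDocument implementation and a Converter returning it (imports omitted). The type, its fields, and the ID scheme are assumptions for illustration, not this package's own implementation.

// myDoc is a hypothetical CrawledDocument implementation used for illustration.
type myDoc struct {
	d      *doc.Document
	cached bool
}

func (m *myDoc) ID() string                 { return "some-unique-id" } // derive from repo and path in practice
func (m *myDoc) GetDocument() *doc.Document { return m.d }

func (m *myDoc) GetResources(includeResources, includeTransformers, includeGenerators bool) ([]*doc.Document, error) {
	// Pretend this is a non-kustomization file, which refers to nothing.
	return []*doc.Document{}, nil
}

func (m *myDoc) WasCached() bool { return m.cached }

// A Converter that wraps every raw document in the hypothetical type above.
var conv crawler.Converter = func(d *doc.Document) (crawler.CrawledDocument, error) {
	return &myDoc{d: d}, nil
}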

type Crawler

type Crawler interface {
	// Crawl returns when it is done processing. This method does not take
	// ownership of the channel. The channel is write only, and it
	// designates where the crawler should forward the documents.
	Crawl(ctx context.Context, output chan<- CrawledDocument, seen utils.SeenMap) error

	// Get the document data given the FilePath, Repo, and Ref/Tag/Branch.
	FetchDocument(context.Context, *doc.Document) error
	// Write the creation time to the document.
	SetCreated(context.Context, *doc.Document) error

	SetDefaultBranch(*doc.Document)

	Match(*doc.Document) bool
}

Crawler forwards documents from source repositories to be indexed and stored for searching. Each crawler is responsible for querying its source of information and forwarding files that have not been seen before or that need updating.
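
A skeletal, hypothetical Crawler implementation showing how the interface methods divide the work (imports omitted); every method body is a placeholder.

// stubCrawler is a hypothetical Crawler used only to illustrate the interface.
type stubCrawler struct{}

func (s stubCrawler) Crawl(ctx context.Context, output chan<- crawler.CrawledDocument, seen utils.SeenMap) error {
	// Query the source, skip documents already in seen, and forward the rest
	// on output. The channel is not closed here; the caller owns it.
	return nil
}

func (s stubCrawler) FetchDocument(ctx context.Context, d *doc.Document) error {
	// Fill in d's data from its FilePath, Repo, and Ref/Tag/Branch.
	return nil
}

func (s stubCrawler) SetCreated(ctx context.Context, d *doc.Document) error {
	// Write the creation time to d.
	return nil
}

func (s stubCrawler) SetDefaultBranch(d *doc.Document) {
	// Record the repository's default branch on d.
}

func (s stubCrawler) Match(d *doc.Document) bool {
	// Report whether this crawler is responsible for d's source.
	return false
}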

type IndexFunc

type IndexFunc func(CrawledDocument, index.Mode) error
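
A small sketch of building an IndexFunc (imports omitted); the indexClient type and its Put method are hypothetical stand-ins for whatever the index package exposes, and the index.Mode value is passed straight through.

// indexClient is a hypothetical handle to the backing search index.
type indexClient struct{}

func (c *indexClient) Put(d *doc.Document, mode index.Mode) error { return nil }

// newIndexFunc adapts the hypothetical client to the IndexFunc signature.
func newIndexFunc(c *indexClient) crawler.IndexFunc {
	return func(cd crawler.CrawledDocument, mode index.Mode) error {
		// Store the underlying document under the requested mode.
		return c.Put(cd.GetDocument(), mode)
	}
}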

Directories

Path	Synopsis
github	Package github implements the crawler.Crawler interface, getting data from the Github search API.
