crawler

package v1.1.0

Published: Sep 23, 2020 License: MIT Imports: 14 Imported by: 0

Documentation

Overview

Package crawler provides a website crawler.
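
A minimal end-to-end sketch of the API documented below. The import path (example.com/crawler) is a placeholder, configuring the Crawler through a struct literal is an assumption (the type has unexported fields, so a constructor may be the intended entry point), and the sketch also assumes the channel returned by Resources is closed when crawling finishes:

package main

import (
	"context"
	"fmt"
	"log"
	"net/url"

	"example.com/crawler" // placeholder; substitute the real module path
)

func main() {
	root, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	c := &crawler.Crawler{
		URL:         root,
		Concurrency: 4,
	}

	// Consume results as they arrive. This assumes Resources is closed
	// once the crawl completes.
	go func() {
		for res := range c.Resources() {
			if res.Body != nil {
				res.Body.Close()
			}
			fmt.Println(res.StatusCode, res.URL)
		}
	}()

	// Run blocks until all pending targets have been crawled.
	if err := c.Run(context.Background()); err != nil {
		log.Fatal(err)
	}
}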

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	URL         *url.URL
	Concurrency int
	Allow404    bool
	HTTPClient  *http.Client
	// contains filtered or unexported fields
}

A Crawler is in charge of visiting, or "crawling," all pages and assets reachable from a particular URL.
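
The exported fields suggest the crawler can be tuned directly. A sketch of one plausible configuration; the field semantics noted in the comments are assumptions, as is the struct-literal construction:

package example

import (
	"log"
	"net/http"
	"net/url"
	"time"

	"example.com/crawler" // placeholder import path
)

func newCrawler() *crawler.Crawler {
	root, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	return &crawler.Crawler{
		URL:         root,                                    // root URL to crawl
		Concurrency: 8,                                       // assumed: number of concurrent workers
		Allow404:    true,                                    // assumed: 404 responses are not treated as failures
		HTTPClient:  &http.Client{Timeout: 10 * time.Second}, // custom client for timeouts, proxies, etc.
	}
}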

func (*Crawler) Queue

func (c *Crawler) Queue(u *url.URL)

Queue a given URL. This method is non-blocking.
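
Because Queue is non-blocking, it is safe to call from any goroutine while the crawl is running, for example to enqueue URLs discovered out of band. A sketch:

package example

import (
	"log"
	"net/url"

	"example.com/crawler" // placeholder import path
)

// queueExtra enqueues one additional URL on a running crawler.
func queueExtra(c *crawler.Crawler, raw string) {
	u, err := url.Parse(raw)
	if err != nil {
		log.Printf("skipping %q: %v", raw, err)
		return
	}
	c.Queue(u) // returns immediately; a worker picks the URL up later
}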

func (*Crawler) Resources

func (c *Crawler) Resources() <-chan Resource

Resources returns a channel of resources visited by the crawler.
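
A typical consumer ranges over the channel in its own goroutine. Whether the channel is closed when crawling completes is not documented here; the loop below assumes it is, and would otherwise block after the crawl ends:

package example

import (
	"fmt"

	"example.com/crawler" // placeholder import path
)

// drain logs every visited resource and releases response bodies.
func drain(c *crawler.Crawler) {
	for res := range c.Resources() {
		if res.Body != nil {
			res.Body.Close() // always release the body
		}
		if res.Error != nil {
			fmt.Println("error:", res.URL, res.Error)
			continue
		}
		fmt.Println(res.StatusCode, res.URL, res.Duration)
	}
}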

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context) error

Run starts the crawling process and waits for completion.
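
Since Run blocks until the crawl finishes, a context deadline or cancellation is the natural way to bound it. How Run reports cancellation (which error it returns) is not documented here:

package example

import (
	"context"
	"time"

	"example.com/crawler" // placeholder import path
)

// runWithTimeout bounds a blocking crawl with a five-minute deadline.
func runWithTimeout(c *crawler.Crawler) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	return c.Run(ctx)
}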

func (*Crawler) Start

func (c *Crawler) Start(ctx context.Context) error

Start crawling workers asynchronously. Use Wait() to block until completion.

func (*Crawler) Wait

func (c *Crawler) Wait() error

Wait for all pending targets to be crawled.
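
Start and Wait split Run into two phases, leaving room to consume Resources or queue more URLs in between. A sketch, again assuming the Resources channel is closed on completion:

package example

import (
	"context"
	"fmt"

	"example.com/crawler" // placeholder import path
)

// startThenWait launches the workers, consumes results concurrently,
// and blocks until all pending targets have been crawled.
func startThenWait(ctx context.Context, c *crawler.Crawler) error {
	if err := c.Start(ctx); err != nil {
		return err
	}
	go func() {
		for res := range c.Resources() {
			if res.Body != nil {
				res.Body.Close()
			}
			fmt.Println(res.StatusCode, res.URL)
		}
	}()
	return c.Wait()
}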

type Resource

type Resource struct {
	Target
	StatusCode int
	Duration   time.Duration
	Body       io.ReadCloser
	Error      error
}

A Resource is a representation of the response to a Target request, for a particular page or asset.
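
A consumer should check Error before reading and always release Body. That Body may be nil on failed requests (and non-nil on successful ones) is an assumption, not documented behavior:

package example

import (
	"fmt"
	"io"

	"example.com/crawler" // placeholder import path
)

// handle inspects one crawled resource defensively.
func handle(res crawler.Resource) {
	if res.Body != nil {
		defer res.Body.Close()
	}
	if res.Error != nil {
		fmt.Printf("failed %s: %v\n", res.URL, res.Error)
		return
	}
	// Drain the body while measuring its size, as an example of use.
	// Assumes Body is non-nil whenever Error is nil.
	n, _ := io.Copy(io.Discard, res.Body)
	fmt.Printf("%d %s (%d bytes in %s)\n", res.StatusCode, res.URL, n, res.Duration)
}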

type Target

type Target struct {
	Parent *url.URL
	URL    *url.URL
}

A Target is a target URL to crawl, with an optional Parent page URL.
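
Because Target is embedded in Resource, its fields are promoted, and Parent makes it possible to attribute a result to the page that linked to it:

package example

import (
	"fmt"
	"net/http"

	"example.com/crawler" // placeholder import path
)

// reportBrokenLink attributes a 404 to the page that linked to it,
// using the Target fields promoted into Resource.
func reportBrokenLink(res crawler.Resource) {
	if res.StatusCode != http.StatusNotFound || res.Parent == nil {
		return
	}
	fmt.Printf("broken link: %s (found on %s)\n", res.URL, res.Parent)
}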
