crawler

package
v0.0.0-...-0a64c4a
Published: Oct 31, 2018 License: MIT Imports: 15 Imported by: 0

Documentation

Constants

const (
	// DefaultOutputFileDot is the .dot file location to save the sitemap graph information to.
	DefaultOutputFileDot = "sitemap.dot"

	// DefaultOutputFileSvg is the .svg file location to save the sitemap graph to.
	DefaultOutputFileSvg = "sitemap.svg"
)
const FetchTimeout = 5 * time.Second

FetchTimeout defines the maximum amount of time the parser will spend fetching a given page.

Variables

var (
	// ErrExternalDomain is returned when the given URL redirects to a domain outside the starting domain
	ErrExternalDomain = errors.New("URL is outside the starting domain, ignoring")

	// ErrTooManyRedirects is returned after 10 consecutive redirects from a given URL
	ErrTooManyRedirects = errors.New("stopped after 10 redirects")
)

Functions

func Graph

func Graph(s Sitemap) error

Graph renders the given sitemap as a graph saved to an SVG file. The graph is generated using dot, a Graphviz tool. The dot command is invoked via the os/exec package, so dot must already be installed. The sitemap data is first saved as a .dot file, which is then passed as the source to the dot command.
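
A minimal usage sketch, assuming a placeholder import path ("example.com/crawler") and a small hand-built Sitemap standing in for a real crawl result; dot must be installed for the SVG to be produced:

package main

import (
	"log"

	"example.com/crawler" // placeholder import path; substitute the real module path
)

func main() {
	// A tiny hand-built sitemap; in practice this comes from (*Crawler).Crawl.
	s := crawler.Sitemap{
		"https://example.com/":      {"https://example.com/about"},
		"https://example.com/about": {"https://example.com/"},
	}

	// Graph writes sitemap.dot and renders sitemap.svg; the dot tool must be on PATH.
	if err := crawler.Graph(s); err != nil {
		log.Fatalf("rendering graph: %v", err)
	}
}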

func Text

func Text(s Sitemap) (string, error)

Text renders the given sitemap as a list of pages and links found.
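
A short sketch of Text, again with a placeholder import path and a hand-built Sitemap; it renders the sitemap and prints the result:

package main

import (
	"fmt"
	"log"

	"example.com/crawler" // placeholder import path
)

func main() {
	s := crawler.Sitemap{
		"https://example.com/": {"https://example.com/about", "https://example.com/contact"},
	}

	out, err := crawler.Text(s)
	if err != nil {
		log.Fatalf("rendering sitemap as text: %v", err)
	}
	fmt.Println(out)
}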

Types

type CanonicalURL

type CanonicalURL string

CanonicalURL represents the normalised page URL (a full URL with no query params or fragments).

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler is used to crawl a given starting URL, up to a max depth.

func NewCrawler

func NewCrawler(start *url.URL, depth int) Crawler

NewCrawler returns an instance of the Crawler with all its required properties initialised.

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context) Sitemap

Crawl starts crawling from the URL given to the Crawler as the starting URL. Once the maximum depth is reached or no new pages are found, a Sitemap is returned with the results. Crawl accepts a cancellable context and stops crawling when the context is cancelled, returning the results gathered so far.
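
A sketch of a full crawl using NewCrawler and Crawl, assuming a placeholder import path; the depth of 3 and the 30-second timeout are arbitrary example values:

package main

import (
	"context"
	"fmt"
	"log"
	"net/url"
	"time"

	"example.com/crawler" // placeholder import path; substitute the real module path
)

func main() {
	start, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatalf("parsing start URL: %v", err)
	}

	// Crawl up to 3 links deep from the start URL.
	c := crawler.NewCrawler(start, 3)

	// The context bounds the whole crawl; cancelling it returns the results gathered so far.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	sitemap := c.Crawl(ctx)

	out, err := crawler.Text(sitemap)
	if err != nil {
		log.Fatalf("rendering sitemap: %v", err)
	}
	fmt.Println(out)
}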

type Links

type Links []string

Links is a slice containing links found on a given page.

type Page

type Page struct {
	Addr  CanonicalURL
	Links Links
}

Page defines the data structure representing a single web page. Addr is the full URL of the page with no query params or fragments. Links is a collection of links found on the page.
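
A small illustration of the Page structure, using made-up example URLs and a placeholder import path:

package main

import (
	"fmt"

	"example.com/crawler" // placeholder import path
)

func main() {
	p := crawler.Page{
		Addr:  crawler.CanonicalURL("https://example.com/about"),
		Links: crawler.Links{"https://example.com/", "https://example.com/contact"},
	}
	fmt.Printf("%s links to %d page(s)\n", p.Addr, len(p.Links))
}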

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser parses the DOM of a single web page.

func NewParser

func NewParser(domainScheme string, domainHost string) Parser

NewParser returns an instance of the Parser with all its required properties initialised. The given domain scheme and host values are used as the scheme and host values of any relative URLs found on the page.
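
A sketch showing where the scheme and host arguments would typically come from, assuming the parser is built for the same domain the crawl starts on (placeholder import path):

package main

import (
	"log"
	"net/url"

	"example.com/crawler" // placeholder import path
)

func main() {
	start, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Relative links such as "/about" found on a page will be resolved
	// against this scheme and host.
	p := crawler.NewParser(start.Scheme, start.Host)
	_ = p // construction only; the parsing methods are not exported on this page
}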

type Sitemap

type Sitemap map[CanonicalURL]Links

Sitemap is the data structure holding the current sitemap information. It maps a page's canonical URL to the links found on that page.
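
A brief sketch of iterating over a Sitemap, again with a hand-built value and a placeholder import path:

package main

import (
	"fmt"

	"example.com/crawler" // placeholder import path
)

func main() {
	s := crawler.Sitemap{
		"https://example.com/":      {"https://example.com/about"},
		"https://example.com/about": {},
	}

	for page, links := range s {
		fmt.Printf("%s: %d link(s)\n", page, len(links))
	}
}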
