gowebcrawler

Version: v0.0.0-...-77334f8 Published: Jun 1, 2015 License: MIT Imports: 5 Imported by: 0

Repository

github.com/cgenuity/gowebcrawler

README ¶

gowebcrawler

gowebcrawler is a concurrent web crawler that generates a JSON sitemap for a given root URL.

TODO

Better logging and error handling

USAGE

See the examples directory for example usage.

Documentation ¶

Overview ¶

gowebcrawler is a concurrent web crawler that generates a JSON sitemap for a given root URL.

Index ¶

func GetAttributesFromDocument(doc *goquery.Document) (links []string, assets []string)
type Crawler
type Page
type PageMessage
type Parser
type UrlParser
- func (u UrlParser) Parse(url string) (links []string, assets []string, err error)
type WebCrawler
- func (w WebCrawler) Crawl(url string) ([]byte, error)

Functions ¶

func GetAttributesFromDocument ¶

func GetAttributesFromDocument(doc *goquery.Document) (links []string, assets []string)

GetAttributesFromDocument returns slices of the links and assets found in a goquery.Document.

Types ¶

type Crawler ¶

type Crawler interface {
	Crawl(string, parser Parser) ([]byte, error)
}

type Page ¶

type Page struct {
	Url      string
	Assets   []string
	Links    []string
	Children map[string]*Page
	// contains filtered or unexported fields
}

A Page represents a web page's relationship to other pages, along with the data needed to build a site map showing the assets the page depends on.

type PageMessage ¶

type PageMessage struct {
	Page  *Page
	Error error
	Url   string
}

type Parser ¶

type Parser interface {
	Parse(string) (links []string, assets []string, err error)
}

type UrlParser ¶

type UrlParser struct{}

UrlParser implements Parser to extract relevant data from a page at a given URL.

func (UrlParser) Parse ¶

func (u UrlParser) Parse(url string) (links []string, assets []string, err error)

Parse returns the links and assets found on the page at the given URL.

type WebCrawler ¶

type WebCrawler struct {
	Parser     *UrlParser
	RootUrl    string
	FetchLimit int
}

WebCrawler implements Crawler and generates a JSON site map from a starting domain and path. It does not crawl other domains or fetch the same page more than once. The FetchLimit field caps the total number of fetches made.
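
The three rules described above (stay on the starting domain, never fetch a page twice, stop at a fetch limit) can be sketched as a simple breadth-first loop. This is not the library's implementation: it is sequential rather than concurrent, and crawlOrder and mapParser are hypothetical names introduced only for illustration. The Parser interface is redeclared locally so the block is self-contained.

```go
package main

import "net/url"

// Parser is redeclared here with the same method signature as documented above.
type Parser interface {
	Parse(string) (links []string, assets []string, err error)
}

// mapParser is a hypothetical in-memory parser so the sketch runs offline.
// It maps a page URL to the links found on that page and reports no assets.
type mapParser map[string][]string

func (m mapParser) Parse(u string) ([]string, []string, error) {
	return m[u], nil, nil
}

// crawlOrder visits pages breadth-first from root and returns the URLs that
// were parsed successfully, in order. It skips links to other hosts, never
// queues the same URL twice, and makes at most fetchLimit Parse calls.
func crawlOrder(root string, p Parser, fetchLimit int) ([]string, error) {
	base, err := url.Parse(root)
	if err != nil {
		return nil, err
	}
	start := base.String()
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string
	fetches := 0

	for len(queue) > 0 && fetches < fetchLimit {
		current := queue[0]
		queue = queue[1:]

		curURL, err := url.Parse(current)
		if err != nil {
			continue
		}
		fetches++
		links, _, err := p.Parse(current)
		if err != nil {
			continue // a real crawler would record the error
		}
		order = append(order, current)

		for _, link := range links {
			ref, err := url.Parse(link)
			if err != nil {
				continue
			}
			abs := curURL.ResolveReference(ref) // resolve relative links
			if abs.Host != base.Host {
				continue // stay on the starting domain
			}
			abs.Fragment = "" // "/a" and "/a#top" are the same page
			key := abs.String()
			if !visited[key] {
				visited[key] = true
				queue = append(queue, key)
			}
		}
	}
	return order, nil
}
```

Marking a URL as visited when it is queued, rather than when it is fetched, keeps duplicates out of the queue. A concurrent crawler would additionally need to guard the visited set and the fetch counter against simultaneous access.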

func (WebCrawler) Crawl ¶

func (w WebCrawler) Crawl(url string) ([]byte, error)

Crawl starts crawling from the given URL or path.

Source Files ¶

crawler.go

Directories ¶

Path	Synopsis
examples
