gowebcrawler

Version: v0.0.0-...-77334f8
Published: Jun 1, 2015 License: MIT Imports: 5 Imported by: 0

README

gowebcrawler

gowebcrawler is a concurrent web crawler that generates a JSON sitemap for a given root URL.

TODO

  • Better logging and error handling

USAGE

See example usage here

Documentation

Functions

func GetAttributesFromDocument

func GetAttributesFromDocument(doc *goquery.Document) (links []string, assets []string)

GetAttributesFromDocument returns the links and assets found in a goquery.Document, each as a slice of strings.

Types

type Crawler

type Crawler interface {
	Crawl(string, parser Parser) ([]byte, error)
}

type Page

type Page struct {
	Url      string
	Assets   []string
	Links    []string
	Children map[string]*Page
	// contains filtered or unexported fields
}

A Page represents a web page's relation to other pages, together with the data needed to build a site map showing the assets the page depends on.

type PageMessage

type PageMessage struct {
	Page  *Page
	Error error
	Url   string
}
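The documentation does not say how PageMessage is used. Its fields suggest that it carries the result of fetching one URL, either a Page or an Error, between goroutines. The sketch below shows one way such a message could travel over a channel. The helpers fetchPage and crawlAll are hypothetical, and the types are trimmed local copies rather than the package's own.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Page is a trimmed local copy of the documented type.
type Page struct {
	Url string
}

// PageMessage mirrors the documented struct.
type PageMessage struct {
	Page  *Page
	Error error
	Url   string
}

// fetchPage is a hypothetical stand-in for a real fetch-and-parse step.
func fetchPage(url string) (*Page, error) {
	if url == "/broken" {
		return nil, errors.New("fetch failed")
	}
	return &Page{Url: url}, nil
}

// crawlAll fetches every URL in its own goroutine and collects
// exactly one PageMessage per URL, keyed by that URL.
func crawlAll(urls []string) map[string]PageMessage {
	messages := make(chan PageMessage)
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			page, err := fetchPage(u)
			messages <- PageMessage{Page: page, Error: err, Url: u}
		}(u)
	}
	// Close the channel once every worker has sent its message.
	go func() {
		wg.Wait()
		close(messages)
	}()

	results := make(map[string]PageMessage)
	for msg := range messages {
		results[msg.Url] = msg
	}
	return results
}

func main() {
	for url, msg := range crawlAll([]string{"/", "/about", "/broken"}) {
		if msg.Error != nil {
			fmt.Println(url, "error:", msg.Error)
			continue
		}
		fmt.Println(url, "ok")
	}
}
```

Carrying the Url alongside the Error lets the receiver tell which fetch failed even when Page is nil.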

type Parser

type Parser interface {
	Parse(string) (links []string, assets []string, err error)
}
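Since Parser is an interface, a crawler can be exercised without network access by substituting an in-memory implementation. The sketch below is one such stand-in. StubParser, pageData, newStub, and the sample paths are hypothetical names introduced for this example and are not part of the package.

```go
package main

import "fmt"

// Parser mirrors the documented interface.
type Parser interface {
	Parse(string) (links []string, assets []string, err error)
}

// pageData holds the canned result for one URL.
type pageData struct {
	links  []string
	assets []string
}

// StubParser is a hypothetical in-memory Parser, useful in tests.
type StubParser struct {
	pages map[string]pageData
}

// Parse returns the canned links and assets for url,
// or an error when no entry exists for it.
func (s StubParser) Parse(url string) (links []string, assets []string, err error) {
	data, ok := s.pages[url]
	if !ok {
		return nil, nil, fmt.Errorf("no stub page for %q", url)
	}
	return data.links, data.assets, nil
}

// newStub builds a StubParser describing a tiny two-page site.
func newStub() Parser {
	return StubParser{pages: map[string]pageData{
		"/":      {links: []string{"/about"}, assets: []string{"/site.css"}},
		"/about": {links: []string{"/"}, assets: nil},
	}}
}

func main() {
	p := newStub()
	links, assets, err := p.Parse("/")
	fmt.Println(links, assets, err)
}
```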

type UrlParser

type UrlParser struct{}

UrlParser implements Parser to extract relevant data from a page at a given URL

func (UrlParser) Parse

func (u UrlParser) Parse(url string) (links []string, assets []string, err error)

Parse grabs the links and assets from the page at the given URL.

type WebCrawler

type WebCrawler struct {
	Parser     *UrlParser
	RootUrl    string
	FetchLimit int
}

WebCrawler implements Crawler and generates a JSON site map from a starting domain and path. It takes care not to crawl other domains or fetch the same page more than once. It also supports a FetchLimit that caps the total number of fetches made.
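The three constraints described here (stay on the starting domain, visit each page once, stop at a fetch limit) can be sketched with the standard library alone. The code below is a sequential, breadth-first illustration, not the package's concurrent implementation. The names resolveSameHost, crawlOrder, and sampleLinks are hypothetical, and treating the limit as a cap on pages fetched is an assumption about what FetchLimit means.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveSameHost resolves link against base, drops any fragment, and
// reports whether the result stays on base's host.
func resolveSameHost(base *url.URL, link string) (string, bool) {
	ref, err := url.Parse(link)
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)
	abs.Fragment = ""
	return abs.String(), abs.Host == base.Host
}

// crawlOrder walks the site breadth-first from rootURL and returns the
// pages it would fetch, skipping other hosts and already-seen pages,
// and stopping after fetchLimit pages.
func crawlOrder(rootURL string, fetchLimit int, linksOf func(string) []string) ([]string, error) {
	root, err := url.Parse(rootURL)
	if err != nil {
		return nil, err
	}
	visited := map[string]bool{root.String(): true}
	queue := []string{root.String()}
	var fetched []string
	for len(queue) > 0 && len(fetched) < fetchLimit {
		current := queue[0]
		queue = queue[1:]
		fetched = append(fetched, current)
		base, err := url.Parse(current)
		if err != nil {
			continue
		}
		for _, link := range linksOf(current) {
			abs, ok := resolveSameHost(base, link)
			if !ok || visited[abs] {
				continue
			}
			visited[abs] = true
			queue = append(queue, abs)
		}
	}
	return fetched, nil
}

// sampleLinks is a hypothetical link table standing in for a real parser.
func sampleLinks(page string) []string {
	switch page {
	case "http://example.com/":
		return []string{"/about", "http://other.org/x", "/about#team", "/contact"}
	case "http://example.com/about":
		return []string{"/", "/contact"}
	}
	return nil
}

func main() {
	pages, err := crawlOrder("http://example.com/", 10, sampleLinks)
	fmt.Println(pages, err)
}
```

In this sample the link to other.org is skipped because its host differs, and /about#team collapses into /about once the fragment is removed, so only three pages are visited.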

func (WebCrawler) Crawl

func (w WebCrawler) Crawl(url string) ([]byte, error)

Crawl starts crawling from a given URL or path.
