Documentation ¶
Types ¶
type Crawler ¶
type Crawler interface {
	Crawl() (ResourceGraph, error)
}
Crawler produces a ResourceGraph from within a single domain.
func NewCrawler ¶
NewCrawler creates a single-threaded Crawler that respects robots.txt. The crawl starts at domain and also follows the links listed in the sitemap that robots.txt reports.
The Crawler writes to the standard log as it progresses.
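The single-threaded, single-domain traversal described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the package's implementation: `sameHost` and `crawlOrder` are hypothetical names, the `links` map stands in for fetching a page and extracting its links, the second seed stands in for a URL taken from the sitemap, and robots.txt handling is left out entirely.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether link points at the given host.
// Hypothetical helper, not part of the documented package.
func sameHost(host, link string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	return u.Host == host
}

// crawlOrder visits pages breadth-first, one at a time, starting from the
// seeds (the start page plus any sitemap links). The links map is an
// assumption that replaces real HTTP fetching and HTML parsing.
func crawlOrder(host string, seeds []string, links map[string][]string) []string {
	seen := make(map[string]bool)
	var queue, order []string

	// enqueue schedules a link once, and only if it stays on the same host.
	enqueue := func(l string) {
		if sameHost(host, l) && !seen[l] {
			seen[l] = true
			queue = append(queue, l)
		}
	}

	for _, s := range seeds {
		enqueue(s)
	}
	for len(queue) > 0 {
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)
		for _, l := range links[page] {
			enqueue(l)
		}
	}
	return order
}

func main() {
	links := map[string][]string{
		"http://example.com/":  {"http://example.com/a", "http://other.org/"},
		"http://example.com/a": {"http://example.com/", "http://example.com/b"},
	}
	seeds := []string{"http://example.com/", "http://example.com/from-sitemap"}
	for _, p := range crawlOrder("example.com", seeds, links) {
		fmt.Println(p)
	}
}
```

Links to other hosts are dropped before they enter the queue, which is one simple way to keep a crawl within a single domain.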
type ResourceGraph ¶
type ResourceGraph interface {
	// if a URL is reachable on the website.
	Contains(string) bool
	// the number of resources in the graph.
	ResourceCount() int
	// the number of links interconnecting the resources.
	LinkCount() int
	// walks over the graph as long as the func returns true.
	Walk(func(link string, status int, refersTo, referedBy []string) bool)
	// can be marshalled to JSON.
	json.Marshaler
}
ResourceGraph represents a website as a directed graph where resources are connected by URLs.
Directories ¶
Path | Synopsis |
---|---|
Godeps | |
_workspace/src/code.google.com/p/cascadia | The cascadia package is an implementation of CSS selectors. |
_workspace/src/code.google.com/p/go.net/html | Package html implements an HTML5-compliant tokenizer and parser. |
_workspace/src/code.google.com/p/go.net/html/atom | Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id". |
_workspace/src/code.google.com/p/go.net/html/charset | Package charset provides common text encodings for HTML documents. |
_workspace/src/github.com/PuerkitoBio/goquery | Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document (the modification functions of jQuery are not included). |
_workspace/src/github.com/PuerkitoBio/purell | Package purell offers URL normalization as described on the wikipedia page: http://en.wikipedia.org/wiki/URL_normalization |
_workspace/src/github.com/gorilla/context | Package gorilla/context stores values shared during a request lifetime. |
_workspace/src/github.com/gorilla/mux | Package gorilla/mux implements a request router and dispatcher. |
_workspace/src/github.com/temoto/robotstxt-go | The robots.txt Exclusion Protocol is implemented as specified in http://www.robotstxt.org/wc/robots.html with various extensions. |
cmd | |