Documentation ¶
Index ¶
- Constants
- func CrawlPage(site *url.URL, timeout int, ok, ignore, fail bool)
- func ExtractTagAttribute(node *html.Node, tagName, attrName string) []string
- func FetchDocument(url string, c *http.Client) (*html.Node, error)
- func ProcessLeaf(c *http.Client, l *Link, res resSink, done doneSink, t chan struct{})
- func ProcessNode(c *http.Client, l *Link, links linkSink, res resSink, done doneSink, ...)
- func QualifyInternalURL(page, link *url.URL) *url.URL
- type Link
- type Result
Constants ¶
const (
	// Parallelism is the max. amount of HTTP requests open at any given time.
	Parallelism = 64

	// UserAgent defines a value used for the "User-Agent" header to avoid being blocked.
	UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0"
)
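One common way to enforce a cap like Parallelism is a buffered channel used as a counting semaphore. The sketch below illustrates that technique only; whether the package does it this way is an assumption (the `t chan struct{}` parameter of ProcessLeaf and ProcessNode hints at it), and runLimited is a hypothetical helper that is not part of the package.

```go
package main

import (
	"fmt"
	"sync"
)

// Parallelism mirrors the documented constant.
const Parallelism = 64

// runLimited is a hypothetical helper: it starts n tasks, lets at most
// limit of them run at the same time, and returns the highest number of
// tasks that were observed running concurrently.
func runLimited(n, limit int) int {
	tokens := make(chan struct{}, limit) // buffered channel as counting semaphore
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		current int
		peak    int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tokens <- struct{}{}        // acquire: blocks while limit tasks hold a token
			defer func() { <-tokens }() // release the token when the task ends

			mu.Lock()
			current++
			if current > peak {
				peak = current
			}
			mu.Unlock()

			// A real crawler would issue its HTTP request here.

			mu.Lock()
			current--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runLimited(1000, Parallelism)
	fmt.Println("peak within limit:", peak <= Parallelism)
}
```

Because a goroutine counts as running only while it holds a token, the observed peak can never exceed the channel's capacity.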
Variables ¶
This section is empty.
Functions ¶
func CrawlPage ¶
CrawlPage crawls the given site's URL and reports successfully checked links, ignored links, and failed links, depending on the flags ok, ignore, and fail, respectively. The given timeout limits how long the HTTP client waits for a request.
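As a rough illustration of how a timeout and the UserAgent constant might be applied with the standard library, consider the sketch below. The unit of the timeout is not stated in this documentation, so seconds are an assumption here, and newClient and newGetRequest are hypothetical helpers rather than functions of the package.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// UserAgent mirrors the documented constant.
const UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0"

// newClient is a hypothetical helper. It ASSUMES the timeout is given in
// seconds and applies it to the whole request, including reading the body.
func newClient(timeoutSeconds int) *http.Client {
	return &http.Client{Timeout: time.Duration(timeoutSeconds) * time.Second}
}

// newGetRequest is a hypothetical helper that builds a GET request
// carrying the User-Agent header.
func newGetRequest(address string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, address, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", UserAgent)
	return req, nil
}

func main() {
	c := newClient(5)
	req, err := newGetRequest("https://example.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// No request is sent here; c.Do(req) would perform it.
	fmt.Println(c.Timeout, req.Header.Get("User-Agent") == UserAgent)
}
```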
func ExtractTagAttribute ¶
ExtractTagAttribute traverses the given node's tree, searches it for nodes with the given tag name, and extracts the value of the given attribute from each of them.
func FetchDocument ¶
FetchDocument gets the document indicated by the given url using the given client, and returns its root (document) node. An error is returned if the document cannot be fetched or parsed as HTML.
func ProcessLeaf ¶
ProcessLeaf uses the given http.Client to fetch the given link using a GET request, and reports the result of that request. A message is sent to the given done channel when the node has been processed.
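The pattern described above (fetch with GET, report a result, then signal completion) might look roughly like the following sketch. The result struct, the channel types, and the helpers checkLeaf and statusOf are all hypothetical stand-ins; the package's own Result, resSink, and doneSink types are not shown in this documentation. A local httptest server is used so the example does not depend on the network.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// result is a hypothetical stand-in for the package's Result type.
type result struct {
	URL    string
	Status int // HTTP status code; 0 if the request itself failed
	Err    error
}

// checkLeaf is a hypothetical sketch: it fetches address with a GET
// request, reports the outcome on res, and finally signals on done.
func checkLeaf(c *http.Client, address string, res chan<- result, done chan<- struct{}) {
	defer func() { done <- struct{}{} }() // always signal completion
	resp, err := c.Get(address)
	if err != nil {
		res <- result{URL: address, Err: err}
		return
	}
	resp.Body.Close() // only the status matters for a leaf
	res <- result{URL: address, Status: resp.StatusCode}
}

// statusOf is a hypothetical helper for this example: it starts a local
// server that answers every request with code and checks it via checkLeaf.
func statusOf(code int) int {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	}))
	defer srv.Close()

	// Buffered channels let the synchronous call below send without blocking.
	res := make(chan result, 1)
	done := make(chan struct{}, 1)
	checkLeaf(srv.Client(), srv.URL, res, done)
	<-done
	return (<-res).Status
}

func main() {
	fmt.Println(statusOf(http.StatusOK), statusOf(http.StatusNotFound))
}
```

Deferring the send on done means completion is signalled on every path, including the error path, which is what a caller counting outstanding work would rely on.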
func ProcessNode ¶
func ProcessNode(c *http.Client, l *Link, links linkSink, res resSink, done doneSink, t chan struct{})
ProcessNode uses the given http.Client to fetch the given link, and reports the extracted links on the page (indicated by <a href="...">). Links unsuitable for further crawling and malformed links are reported. A message is sent to the given done channel when the node has been processed.
Types ¶
type Link ¶
Link represents a link (URL) in the context of a web site (Site).
func NewLink ¶
NewLink creates a Link from the given address. An error is returned if the address cannot be parsed.
func (*Link) IsCrawlable ¶
IsCrawlable returns true if the URL of the link has http(s) as the protocol, or no protocol at all (which indicates an internal link), and false otherwise.
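The rule above can be expressed as a check on the parsed URL's scheme. This is a minimal sketch using net/url, not the package's actual code; isCrawlable and crawlable are hypothetical free functions that stand in for the method.

```go
package main

import (
	"fmt"
	"net/url"
)

// isCrawlable reports whether u uses http or https, or has no scheme at
// all (a relative reference, which indicates an internal link).
func isCrawlable(u *url.URL) bool {
	switch u.Scheme {
	case "http", "https", "":
		return true
	}
	return false
}

// crawlable is a hypothetical convenience wrapper: it parses raw and
// applies isCrawlable, treating unparsable input as not crawlable.
func crawlable(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return isCrawlable(u)
}

func main() {
	for _, raw := range []string{
		"https://example.com/a",
		"/about",
		"mailto:me@example.com",
		"ftp://example.com/file",
	} {
		fmt.Println(raw, crawlable(raw))
	}
}
```

With these inputs, the first two addresses are reported as crawlable and the mailto and ftp addresses are not.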
func (*Link) IsInternal ¶
IsInternal returns true if the link's URL points to the same domain as its site, and false otherwise.
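A same-domain check could be sketched as a host comparison, as below. The link struct, its field names, and the helpers newLink and internal are hypothetical; the real Link type's fields are not shown in this documentation. The sketch also assumes that relative links are resolved against the site first, and that "same domain" means an exact, case-insensitive host match, so a subdomain such as www.example.com would count as external to example.com.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// link is a hypothetical stand-in for the package's Link type.
type link struct {
	Site *url.URL // the site being crawled
	URL  *url.URL // the link found on that site
}

// newLink is a hypothetical constructor that parses both addresses.
func newLink(site, address string) (*link, error) {
	s, err := url.Parse(site)
	if err != nil {
		return nil, fmt.Errorf("parse site %q: %w", site, err)
	}
	u, err := url.Parse(address)
	if err != nil {
		return nil, fmt.Errorf("parse link %q: %w", address, err)
	}
	return &link{Site: s, URL: u}, nil
}

// isInternal resolves the link against its site (so relative links
// inherit the site's host) and compares host names case-insensitively.
func (l *link) isInternal() bool {
	target := l.Site.ResolveReference(l.URL)
	return strings.EqualFold(target.Hostname(), l.Site.Hostname())
}

// internal is a hypothetical convenience wrapper used in the example.
func internal(site, address string) bool {
	l, err := newLink(site, address)
	if err != nil {
		return false
	}
	return l.isInternal()
}

func main() {
	site := "https://example.com/"
	for _, address := range []string{"/about", "https://EXAMPLE.com/x", "https://other.org/"} {
		fmt.Println(address, internal(site, address))
	}
}
```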