Documentation ¶
Overview ¶
Package crawler implements a web crawler.
A web crawler is a program that automatically traverses the web by downloading the pages and following the links from page to page.
This package contains a web crawler that can be configured to use different strategies to crawl the web. The strategies are implemented in the strategies.go file.
The crawler is configured with a strategy and a set of limits: the maximum run time and the maximum number of HTTP requests it may make. The crawler stops once it reaches either limit. These are soft limits rather than hard ones; the crawler may exceed them by a small amount, but it stops as soon as it can.
Types of strategies ¶
The crawler can be configured to use one of the following strategies:
## Recursive
This strategy crawls URLs recursively, discovering new URLs at each level and continuing until there are no more unvisited URLs.
## Recursive with limits
This strategy crawls URLs recursively, discovering new URLs at each level and continuing until there are no more unvisited URLs or the limits are reached.
## Parallel
This strategy crawls URLs in parallel, discovering new URLs at each level and continuing until there are no more unvisited URLs.
## Parallel with limits
This strategy crawls URLs in parallel, discovering new URLs at each level and continuing until there are no more unvisited URLs or the limits are reached.
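Each strategy can be constructed directly (see the type documentation below), or selected by name through the package-level Run function together with a set of limits. A minimal sketch, assuming Run accepts the strategy names listed above (the exact strings it recognizes are not documented here):

// "Recursive" is an assumed strategy name; a zero Limits value is
// assumed to mean "no limits".
visited, err := crawler.Run("https://www.example.com", "Recursive", crawler.Limits{})
if err != nil {
	log.Fatal(err)
}
fmt.Println(visited)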
Usage ¶
The following example shows how to use the crawler package to crawl a website using the Recursive strategy:
package main

import (
	"fmt"
	"log"
	"net/url"

	"github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Create the Recursive strategy rooted at the URL.
	s := crawler.NewRecursive(root)

	// Start crawling and print the visited URLs.
	visited := s.Run()
	fmt.Println(visited)
}
The following example shows how to use the crawler package to crawl a website using the Recursive with limits strategy:
package main

import (
	"fmt"
	"log"
	"net/url"

	"github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Create the RecursiveWithLimits strategy with soft limits of
	// roughly one second and 100 requests.
	s := crawler.NewRecursiveWithLimits(root, crawler.Limits{
		Milliseconds: 1000,
		Requests:     100,
	})

	// Start crawling and print the visited URLs.
	visited := s.Run()
	fmt.Println(visited)
}
Pipeline ¶
The crawler package uses a pipeline to crawl the web. The pipeline is composed of the following stages:
## Download
The Download stage downloads the content of each URL and emits the HTTP responses as they arrive.
## Parse
The Parse stage parses each downloaded response into a tree of HTML nodes.
## Extract
The Extract stage extracts the URLs that match the root URL and emits them as strings.
## Collect
The Collect stage collects the extracted URLs into a map, deduplicating repeated links.
## MapToList
The MapToList stage converts the map of collected URLs into a slice.
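Composed end to end, the stages crawl one level of links. The sketch below wires the exported stage functions by hand, assuming the import path github.com/paconte/gocrawler (package crawler) used in the Usage section:

package main

import (
	"fmt"
	"log"
	"net/url"

	"github.com/paconte/gocrawler"
)

func main() {
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Each stage consumes the channel produced by the previous one.
	responses := crawler.Download(root.String()) // Download: fetch the pages
	nodes := crawler.Parse(responses)            // Parse: decode HTML into node trees
	links := crawler.Extract(nodes, root)        // Extract: keep links matching the root URL
	collected := crawler.CollectMap(links)       // Collect: deduplicate into a map
	visited := crawler.MapToList(collected)      // MapToList: flatten into a slice

	fmt.Println(visited)
}

This is a single pass; the recursive strategies repeat it with the newly discovered URLs until no unvisited URLs remain.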
Index ¶
- func CollectMap(links <-chan string) map[string]bool
- func Download(url ...string) <-chan *http.Response
- func Extract(nodes <-chan *html.Node, url *url.URL) <-chan string
- func GetSubdomains(node *html.Node, domain *url.URL) map[string]bool
- func IsSubdomain(link string, domain *url.URL) bool
- func MapToList(links map[string]bool) []string
- func Parse(nodes <-chan *http.Response) <-chan *html.Node
- func Run(rootUrl string, strategy string, limits Limits) ([]string, error)
- type Limits
- type OneLevel
- type Recursive
- type RecursiveParallel
- type RecursiveParallelWithLimits
- type RecursiveWithLimits
- type Strategy
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CollectMap ¶
func CollectMap(links <-chan string) map[string]bool
CollectMap reads strings from the input channel and collects them into a map. It returns a map containing the collected strings as keys, each with the value true.
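For instance, duplicate links collapse into a single key (imports as in the Usage examples):

links := make(chan string, 3)
links <- "https://example.com/a"
links <- "https://example.com/b"
links <- "https://example.com/a" // duplicate of the first link
close(links)

seen := crawler.CollectMap(links)
// len(seen) == 2; both URLs are keys with the value true.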
func Download ¶
func Download(url ...string) <-chan *http.Response
Download asynchronously downloads the specified URLs and returns a channel of *http.Response. Each response will be sent on the channel as it becomes available. The returned channel will be closed once all downloads are complete.
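A short sketch of draining the channel; how failed requests are reported is not specified here, so the body handling below is an assumption:

responses := crawler.Download("https://example.com", "https://example.org")
for resp := range responses { // the channel is closed once all downloads finish
	fmt.Println(resp.Request.URL, resp.Status)
	resp.Body.Close() // assumption: the caller closes the response bodies
}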
func Extract ¶
func Extract(nodes <-chan *html.Node, url *url.URL) <-chan string
Extract asynchronously extracts the links from the *html.Node trees received on the input channel, keeping those that match the given root URL. It returns a channel of strings containing the extracted URLs. The returned channel will be closed once all extraction is complete.
func GetSubdomains ¶
func GetSubdomains(node *html.Node, domain *url.URL) map[string]bool
GetSubdomains recursively extracts subdomains from an HTML node. It returns a map of the subdomains found in the HTML node.
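A sketch using golang.org/x/net/html to build the node tree (the HTML snippet and the expected result are illustrative):

doc, err := html.Parse(strings.NewReader(
	`<p><a href="https://blog.example.com/post">blog</a></p>`))
if err != nil {
	log.Fatal(err)
}
root, _ := url.Parse("https://example.com")
subs := crawler.GetSubdomains(doc, root)
// subs is expected to contain an entry for the blog link, per the
// IsSubdomain rules below (whether the full URL or only the host is
// stored is not documented here).
fmt.Println(crawler.MapToList(subs))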
func IsSubdomain ¶
func IsSubdomain(link string, domain *url.URL) bool
IsSubdomain checks whether a given link is a subdomain of the specified domain. It returns true if the link is a subdomain, false otherwise. If the link is exactly the same as the domain, it returns false.
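For example (the second result follows directly from the contract above; the first assumes ordinary subdomain matching):

root, _ := url.Parse("https://example.com")
fmt.Println(crawler.IsSubdomain("https://blog.example.com", root)) // true: blog. is a subdomain
fmt.Println(crawler.IsSubdomain("https://example.com", root))      // false: identical to the domain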
func MapToList ¶
func MapToList(links map[string]bool) []string
MapToList converts a map of strings to a slice of strings. It returns a slice containing all the keys of the input map.
func Parse ¶
func Parse(nodes <-chan *http.Response) <-chan *html.Node
Parse asynchronously parses the HTML in the *http.Response objects received on the input channel. It returns a channel of *html.Node containing the parsed nodes. The returned channel will be closed once all parsing is complete.
func Run ¶
func Run(rootUrl string, strategy string, limits Limits) ([]string, error)
Run crawls rootUrl using the strategy identified by name, applying the given limits, and returns the list of visited URLs.
Types ¶
type Limits ¶
Limits represents the limits applied to a strategy: the maximum run time and the maximum number of HTTP requests.
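The field names below follow the Usage example above; that Milliseconds bounds the total run time is an assumption consistent with the Overview:

limits := crawler.Limits{
	Milliseconds: 1000, // soft cap on total run time (~1 second, assumed)
	Requests:     100,  // soft cap on the number of HTTP requests
}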
type OneLevel ¶
type OneLevel struct {
// contains filtered or unexported fields
}
OneLevel crawls the root URL and collects the URLs found up to one level deep, returning a list of the collected URLs.
func NewOneLevel ¶
NewOneLevel creates a new instance of the OneLevel strategy.
type Recursive ¶
type Recursive struct {
// contains filtered or unexported fields
}
Recursive crawls URLs recursively, discovering new URLs at each level and continuing until there are no more unvisited URLs.
func NewRecursive ¶
NewRecursive creates a new instance of the Recursive strategy.
type RecursiveParallel ¶
type RecursiveParallel struct {
// contains filtered or unexported fields
}
RecursiveParallel implements a parallelized version of the Recursive strategy.
func NewRecursiveParallel ¶
func NewRecursiveParallel(url *url.URL) *RecursiveParallel
NewRecursiveParallel creates a new instance of the RecursiveParallel strategy.
func (*RecursiveParallel) Run ¶
func (s *RecursiveParallel) Run() []string
Run starts the web crawling process using the RecursiveParallel strategy, beginning at the root URL passed to NewRecursiveParallel. It returns a list of visited URLs.
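For example, with the imports from the Usage section:

root, err := url.Parse("https://www.example.com")
if err != nil {
	log.Fatal(err)
}
visited := crawler.NewRecursiveParallel(root).Run()
fmt.Println(visited)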
type RecursiveParallelWithLimits ¶
type RecursiveParallelWithLimits struct {
// contains filtered or unexported fields
}
RecursiveParallelWithLimits implements a parallelized version of the Recursive strategy, with limits on the number of HTTP requests and the elapsed time.
func NewRecursiveParallelWithLimits ¶
func NewRecursiveParallelWithLimits(url *url.URL, limits Limits) *RecursiveParallelWithLimits
NewRecursiveParallelWithLimits creates a new instance of the RecursiveParallelWithLimits strategy.
func (*RecursiveParallelWithLimits) Run ¶
func (s *RecursiveParallelWithLimits) Run() []string
Run starts the web crawling process using the RecursiveParallelWithLimits strategy, beginning at the root URL passed to NewRecursiveParallelWithLimits. It returns a list of visited URLs.
type RecursiveWithLimits ¶
type RecursiveWithLimits struct {
// contains filtered or unexported fields
}
RecursiveWithLimits implements the same strategy as Recursive, but adds limits on the number of requests and the elapsed time.
func NewRecursiveWithLimits ¶
func NewRecursiveWithLimits(url *url.URL, limits Limits) *RecursiveWithLimits
NewRecursiveWithLimits creates a new instance of the RecursiveWithLimits strategy.
func (*RecursiveWithLimits) Run ¶
func (s *RecursiveWithLimits) Run() []string
Run starts the web crawling process using the RecursiveWithLimits strategy. It returns a list of visited URLs.