crawler

package
v0.0.0-...-151a24d Latest

Warning: This package is not in the latest version of its module.
Published: Jul 17, 2023 License: GPL-3.0 Imports: 8 Imported by: 0

Documentation

Overview

Package crawler implements a web crawler.

A web crawler is a program that automatically traverses the web by downloading the pages and following the links from page to page.

This package contains a web crawler that can be configured to use different strategies to crawl the web. The strategies are implemented in the strategies.go file.

The crawler can be configured to use a strategy and a set of limits. The limits cap the crawl duration, in milliseconds, and the number of requests the crawler can make. The crawler stops when it reaches either limit. The limits are not hard limits: the crawler may exceed them by a small amount, but it stops as soon as it can.

Types of strategies

The crawler can be configured to use one of the following strategies:

## Recursive

This strategy implements a search approach to crawl URLs recursively, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs.

## Recursive with limits

This strategy implements a search approach to crawl URLs recursively, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs or the limits are reached.

## Parallel

This strategy implements a search approach to crawl URLs in parallel, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs.

## Parallel with limits

This strategy implements a search approach to crawl URLs in parallel, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs or the limits are reached.

Usage

The following example shows how to use the crawler package to crawl a website using the Recursive strategy:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Create the Recursive strategy for the root URL.
	s := crawler.NewRecursive(rootURL)

	// Start crawling and print the visited URLs.
	visited := s.Run()
	fmt.Println(visited)
}

The following example shows how to use the crawler package to crawl a website using the Recursive with limits strategy:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Create the Recursive with limits strategy: stop after roughly
	// 1000 milliseconds or 100 requests, whichever comes first.
	s := crawler.NewRecursiveWithLimits(rootURL, crawler.Limits{
		Milliseconds: 1000,
		Requests:     100,
	})

	// Start crawling and print the visited URLs.
	visited := s.Run()
	fmt.Println(visited)
}
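
The parallel variants are used in the same way. The following sketch assumes that the Parallel and Parallel with limits strategies described above correspond to the RecursiveParallel and RecursiveParallelWithLimits types documented below:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Crawl in parallel, stopping after roughly one second or
	// 100 requests, whichever comes first.
	s := crawler.NewRecursiveParallelWithLimits(rootURL, crawler.Limits{
		Milliseconds: 1000,
		Requests:     100,
	})

	visited := s.Run()
	fmt.Println(visited)
}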

Pipeline

The crawler package uses a pipeline to crawl the web. The pipeline is composed of the following stages; a sketch of how they chain together follows the stage descriptions:

## Download

The Download stage downloads the content of each URL and sends the resulting *http.Response values on a channel.

## Parse

The Parse stage parses each downloaded response into an HTML document and sends the resulting *html.Node values on a channel.

## Extract

The Extract stage extracts the links that belong to the root URL and sends them on a channel of strings.

## Collect

The Collect stage collects the extracted links into a map of unique URLs.

## MapToList

The MapToList stage converts the map of collected URLs into a slice.
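
Each stage takes the previous stage's channel as input. Below is a minimal sketch of how the stages might be chained by hand, using only the functions documented in the index below:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Download the root page, parse the HTML, extract the links that
	// belong to the root URL, collect them into a set and flatten the
	// set into a slice.
	responses := crawler.Download(root.String())
	nodes := crawler.Parse(responses)
	links := crawler.Extract(nodes, root)
	collected := crawler.CollectMap(links)
	fmt.Println(crawler.MapToList(collected))
}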

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CollectMap

func CollectMap(links <-chan string) map[string]bool

CollectMap reads strings from the input channel and collects them into a map. It returns a map containing the collected strings as keys, with a value of true.

func Download

func Download(url ...string) <-chan *http.Response

Download asynchronously downloads the specified URLs and returns a channel of *http.Response. Each response will be sent on the channel as it becomes available. The returned channel will be closed once all downloads are complete.

func Extract

func Extract(nodes <-chan *html.Node, url *url.URL) <-chan string

Extract asynchronously extracts, from the *html.Node values received on the input channel, the links that belong to the given root URL. It returns a channel of strings containing the extracted links. The returned channel will be closed once all extraction is complete.

func GetSubdomains

func GetSubdomains(node *html.Node, domain *url.URL) map[string]bool

GetSubdomains recursively extracts subdomains from an HTML node. It returns a map of subdomains found in the HTML node.

func IsSubdomain

func IsSubdomain(link string, domain *url.URL) bool

IsSubdomain checks if a given link is a subdomain of the specified domain. It returns true if the link is a subdomain, false otherwise. If the domain is exactly the same as the link, it returns false.
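
A minimal sketch of how these helpers might be used together, assuming the *html.Node type comes from golang.org/x/net/html and that GetSubdomains walks anchor elements; the HTML snippet and link values below are illustrative:

package main

import (
	"fmt"
	"log"
	"net/url"
	"strings"

	crawler "github.com/paconte/gocrawler"
	"golang.org/x/net/html"
)

func main() {
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Parse a small HTML snippet and collect the links that belong to
	// the root domain.
	doc, err := html.Parse(strings.NewReader(
		`<a href="https://blog.example.com/post">post</a>` +
			`<a href="https://other.org/">other</a>`))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(crawler.MapToList(crawler.GetSubdomains(doc, root)))

	// IsSubdomain can also be called on a single link.
	fmt.Println(crawler.IsSubdomain("https://blog.example.com/post", root))
}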

func MapToList

func MapToList(links map[string]bool) []string

MapToList converts a map of strings to a slice of strings. It returns a slice containing all the keys from the input map.

func Parse

func Parse(nodes <-chan *http.Response) <-chan *html.Node

Parse asynchronously parses the HTML nodes in the *http.Response objects received on the input channel. It returns a channel of *html.Node containing the parsed nodes. The returned channel will be closed once all parsing is complete.

func Run

func Run(rootUrl string, strategy string, limits Limits) ([]string, error)

Run starts the web crawling process with the specified root URL, using the named strategy and the given limits. It returns the list of visited URLs, or an error if any occurred.

Types

type Limits

type Limits struct {
	Milliseconds int
	Requests     int
}

Limits represents the limits of a strategy. It contains the maximum number of milliseconds and HTTP requests.

type OneLevel

type OneLevel struct {
	// contains filtered or unexported fields
}

OneLevel crawls the root URL and collects URLs up to one level deep. Its Run method returns a list of collected URLs.

func NewOneLevel

func NewOneLevel(url *url.URL) *OneLevel

NewOneLevel creates a new instance of the OneLevel strategy.

func (*OneLevel) Run

func (s *OneLevel) Run() []string

Run starts the web crawling process using the OneLevel strategy, beginning at the root URL supplied to NewOneLevel. It returns a list of collected URLs.

type Recursive

type Recursive struct {
	// contains filtered or unexported fields
}

Recursive implements a search approach to crawl URLs recursively, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs.

func NewRecursive

func NewRecursive(url *url.URL) *Recursive

NewRecursive creates a new instance of the Recursive strategy.

func (*Recursive) Run

func (s *Recursive) Run() []string

Run starts the web crawling process using the Recursive strategy. It returns a list of visited URLs.

type RecursiveParallel

type RecursiveParallel struct {
	// contains filtered or unexported fields
}

RecursiveParallel implements a parallelized version of the Recursive strategy.

func NewRecursiveParallel

func NewRecursiveParallel(url *url.URL) *RecursiveParallel

NewRecursiveParallel creates a new instance of the RecursiveParallel strategy.

func (*RecursiveParallel) Run

func (s *RecursiveParallel) Run() []string

Run starts the web crawling process using the RecursiveParallel strategy. It returns a list of visited URLs.

type RecursiveParallelWithLimits

type RecursiveParallelWithLimits struct {
	// contains filtered or unexported fields
}

RecursiveParallelWithLimits implements a parallelized version of the Recursive strategy. It also limits the number of HTTP requests and the elapsed time.

func NewRecursiveParallelWithLimits

func NewRecursiveParallelWithLimits(url *url.URL, limits Limits) *RecursiveParallelWithLimits

NewRecursiveParallelWithLimits creates a new instance of the RecursiveParallelWithLimits strategy.

func (*RecursiveParallelWithLimits) Run

func (s *RecursiveParallelWithLimits) Run() []string

Run starts the web crawling process using the RecursiveParallelWithLimits strategy. It returns a list of visited URLs.

type RecursiveWithLimits

type RecursiveWithLimits struct {
	// contains filtered or unexported fields
}

RecursiveWithLimits implements the same strategy as Recursive, but adds limits on the number of requests and the elapsed time.

func NewRecursiveWithLimits

func NewRecursiveWithLimits(url *url.URL, limits Limits) *RecursiveWithLimits

NewRecursiveWithLimits creates a new instance of the RecursiveWithLimits strategy.

func (*RecursiveWithLimits) Run

func (s *RecursiveWithLimits) Run() []string

Run starts the web crawling process using the RecursiveWithLimits strategy. It returns a list of visited URLs.

type Strategy

type Strategy interface {
	Run() []string
}

Strategy represents a web crawling strategy.
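
Because every documented strategy exposes Run() []string, callers can hold any of them behind the Strategy interface. A minimal sketch:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Any of the documented strategies satisfies the Strategy interface.
	var s crawler.Strategy = crawler.NewRecursiveParallel(rootURL)
	fmt.Println(s.Run())
}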
