crawler

package
v0.0.0-...-151a24d Latest

Warning: This package is not in the latest version of its module.
Published: Jul 17, 2023 License: GPL-3.0 Imports: 8 Imported by: 0

Documentation

Overview

Package crawler implements a web crawler.

A web crawler is a program that automatically traverses the web by downloading the pages and following the links from page to page.

This package contains a web crawler that can be configured to use different strategies to crawl the web. The strategies are implemented in the strategies.go file.

The crawler can be configured to use a strategy and a set of limits. The limits cap the crawl duration, in milliseconds, and the number of requests the crawler can make. The crawler stops when it reaches either limit. The limits are not hard limits: the crawler may exceed them by a small amount, but it stops as soon as it can.

Types of strategies

The crawler can be configured to use one of the following strategies:

## Recursive

This strategy implements a search approach to crawl URLs recursively, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs.

## Recursive with limits

This strategy implements a search approach to crawl URLs recursively, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs or the limits are reached.

## Parallel

This strategy implements a search approach to crawl URLs in parallel, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs.

## Parallel with limits

This strategy implements a search approach to crawl URLs in parallel, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs or the limits are reached.

Usage

The following example shows how to use the crawler package to crawl a website using the Recursive strategy:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Create the Recursive strategy for the root URL.
	s := crawler.NewRecursive(rootURL)

	// Start crawling and print the visited URLs.
	visited := s.Run()
	fmt.Println(visited)
}

The following example shows how to use the crawler package to crawl a website using the Recursive with limits strategy:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Create the Recursive with limits strategy: stop after roughly
	// 1000 milliseconds or 100 requests, whichever comes first.
	s := crawler.NewRecursiveWithLimits(rootURL, crawler.Limits{
		Milliseconds: 1000,
		Requests:     100,
	})

	// Start crawling and print the visited URLs.
	visited := s.Run()
	fmt.Println(visited)
}
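
The parallel variants are used in the same way. The following sketch assumes that the Parallel and Parallel with limits strategies described above correspond to the RecursiveParallel and RecursiveParallelWithLimits types documented below:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	// Parse the root URL to crawl.
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Crawl in parallel, stopping after roughly one second or
	// 100 requests, whichever comes first.
	s := crawler.NewRecursiveParallelWithLimits(rootURL, crawler.Limits{
		Milliseconds: 1000,
		Requests:     100,
	})

	visited := s.Run()
	fmt.Println(visited)
}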

Pipeline

The crawler package uses a pipeline to crawl the web. The pipeline is composed of the following stages; a sketch of how they chain together follows the stage descriptions:

## Download

The Download stage downloads the content of each URL and sends the resulting *http.Response values on a channel.

## Parse

The Parse stage parses each downloaded response into an HTML document and sends the resulting *html.Node values on a channel.

## Extract

The Extract stage extracts the links that belong to the root URL and sends them on a channel of strings.

## Collect

The Collect stage collects the extracted links into a map of unique URLs.

## MapToList

The MapToList stage converts the map of collected URLs into a slice.
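
Each stage takes the previous stage's channel as input. Below is a minimal sketch of how the stages might be chained by hand, using only the functions documented in the index below:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Download the root page, parse the HTML, extract the links that
	// belong to the root URL, collect them into a set and flatten the
	// set into a slice.
	responses := crawler.Download(root.String())
	nodes := crawler.Parse(responses)
	links := crawler.Extract(nodes, root)
	collected := crawler.CollectMap(links)
	fmt.Println(crawler.MapToList(collected))
}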

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CollectMap

func CollectMap(links <-chan string) map[string]bool

CollectMap reads strings from the input channel and collects them into a map. It returns a map containing the collected strings as keys, with a value of true.

func Download

func Download(url ...string) <-chan *http.Response

Download asynchronously downloads the specified URLs and returns a channel of *http.Response. Each response will be sent on the channel as it becomes available. The returned channel will be closed once all downloads are complete.

func Extract

func Extract(nodes <-chan *html.Node, url *url.URL) <-chan string

Extract asynchronously extracts, from the *html.Node values received on the input channel, the links that belong to the given root URL. It returns a channel of strings containing the extracted links. The returned channel will be closed once all extraction is complete.

func GetSubdomains

func GetSubdomains(node *html.Node, domain *url.URL) map[string]bool

GetSubdomains recursively extracts subdomains from an HTML node. It returns a map of subdomains found in the HTML node.

func IsSubdomain

func IsSubdomain(link string, domain *url.URL) bool

IsSubdomain checks if a given link is a subdomain of the specified domain. It returns true if the link is a subdomain, false otherwise. If the domain is exactly the same as the link, it returns false.
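
A minimal sketch of how these helpers might be used together, assuming the *html.Node type comes from golang.org/x/net/html and that GetSubdomains walks anchor elements; the HTML snippet and link values below are illustrative:

package main

import (
	"fmt"
	"log"
	"net/url"
	"strings"

	crawler "github.com/paconte/gocrawler"
	"golang.org/x/net/html"
)

func main() {
	root, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Parse a small HTML snippet and collect the links that belong to
	// the root domain.
	doc, err := html.Parse(strings.NewReader(
		`<a href="https://blog.example.com/post">post</a>` +
			`<a href="https://other.org/">other</a>`))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(crawler.MapToList(crawler.GetSubdomains(doc, root)))

	// IsSubdomain can also be called on a single link.
	fmt.Println(crawler.IsSubdomain("https://blog.example.com/post", root))
}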

func MapToList

func MapToList(links map[string]bool) []string

MapToList converts a map of strings to a slice of strings. It returns a slice containing all the keys from the input map.

func Parse

func Parse(nodes <-chan *http.Response) <-chan *html.Node

Parse asynchronously parses the HTML nodes in the *http.Response objects received on the input channel. It returns a channel of *html.Node containing the parsed nodes. The returned channel will be closed once all parsing is complete.

func Run

func Run(rootUrl string, strategy string, limits Limits) ([]string, error)

Run starts the web crawling process with the specified root URL, using the named strategy and the given limits. It returns the list of visited URLs, or an error if any occurred.

Types

type Limits

type Limits struct {
	Milliseconds int
	Requests     int
}

Limits represents the limits of a strategy. It contains the maximum number of milliseconds and HTTP requests.

type OneLevel

type OneLevel struct {
	// contains filtered or unexported fields
}

OneLevel crawls the root URL and collects URLs up to one level deep. Its Run method returns a list of collected URLs.

func NewOneLevel

func NewOneLevel(url *url.URL) *OneLevel

NewOneLevel creates a new instance of the OneLevel strategy.

func (*OneLevel) Run

func (s *OneLevel) Run() []string

Run starts the web crawling process using the OneLevel strategy, beginning at the root URL supplied to NewOneLevel. It returns a list of collected URLs.

type Recursive

type Recursive struct {
	// contains filtered or unexported fields
}

Recursive implements a search approach to crawl URLs recursively, discovering new URLs at each level and continuing the crawling process until there are no more unvisited URLs.

func NewRecursive

func NewRecursive(url *url.URL) *Recursive

NewRecursive creates a new instance of the Recursive strategy.

func (*Recursive) Run

func (s *Recursive) Run() []string

Run starts the web crawling process using the Recursive strategy. It returns a list of visited URLs.

type RecursiveParallel

type RecursiveParallel struct {
	// contains filtered or unexported fields
}

RecursiveParallel implements a parallelized version of the Recursive strategy.

func NewRecursiveParallel

func NewRecursiveParallel(url *url.URL) *RecursiveParallel

NewRecursiveParallel creates a new instance of the RecursiveParallel strategy.

func (*RecursiveParallel) Run

func (s *RecursiveParallel) Run() []string

Run starts the web crawling process using the RecursiveParallel strategy. It returns a list of visited URLs.

type RecursiveParallelWithLimits

type RecursiveParallelWithLimits struct {
	// contains filtered or unexported fields
}

RecursiveParallelWithLimits implements a parallelized version of the Recursive strategy. It also limits the number of HTTP requests and the elapsed time.

func NewRecursiveParallelWithLimits

func NewRecursiveParallelWithLimits(url *url.URL, limits Limits) *RecursiveParallelWithLimits

NewRecursiveParallelWithLimits creates a new instance of the RecursiveParallelWithLimits strategy.

func (*RecursiveParallelWithLimits) Run

func (s *RecursiveParallelWithLimits) Run() []string

Run starts the web crawling process using the RecursiveParallelWithLimits strategy. It returns a list of visited URLs.

type RecursiveWithLimits

type RecursiveWithLimits struct {
	// contains filtered or unexported fields
}

RecursiveWithLimits implements the same strategy as Recursive, but adds limits on the number of requests and the elapsed time.

func NewRecursiveWithLimits

func NewRecursiveWithLimits(url *url.URL, limits Limits) *RecursiveWithLimits

NewRecursiveWithLimits creates a new instance of the RecursiveWithLimits strategy.

func (*RecursiveWithLimits) Run

func (s *RecursiveWithLimits) Run() []string

Run starts the web crawling process using the RecursiveWithLimits strategy. It returns a list of visited URLs.

type Strategy

type Strategy interface {
	Run() []string
}

Strategy represents a web crawling strategy.
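
Because every documented strategy exposes Run() []string, callers can hold any of them behind the Strategy interface. A minimal sketch:

package main

import (
	"fmt"
	"log"
	"net/url"

	crawler "github.com/paconte/gocrawler"
)

func main() {
	rootURL, err := url.Parse("https://www.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Any of the documented strategies satisfies the Strategy interface.
	var s crawler.Strategy = crawler.NewRecursiveParallel(rootURL)
	fmt.Println(s.Run())
}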
