gocrawler

package module
v0.0.3
Published: Nov 1, 2023 License: MIT Imports: 16 Imported by: 0

README

gocrawler

A simple concurrent webcrawler package written in Go.

Packages

Package Description
gocrawler (main) Main crawler logic, with a customisable LinkExtractor that lets users determine how links are extracted and a customisable ResponseMatcher to filter out unwanted responses.
logger (internal) Sets up charmbracelet/log to make logging less boring
rhttp (internal) Wrapper over net/http with provided backoff and retry policies that can be customised

Usage

Examples of how to use the crawler package can be found in the example directory.
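A minimal end-to-end sketch follows; the import path "example.com/gocrawler" is a placeholder (use the module's real path), and the seed URL, config values, and the depth-0/empty-parent convention for starting the crawl are illustrative assumptions rather than the package's documented defaults.

package main

import (
	"context"
	"time"

	gocrawler "example.com/gocrawler" // placeholder import path
)

func main() {
	ctx := context.Background()

	cfg := &gocrawler.Config{
		MaxDepth:   2,
		MaxRetries: 3,
		MaxRPS:     5,
		SeedURLs:   []string{"https://example.com"},
		Timeout:    10 * time.Second,
	}

	// Only process responses that are 200 and look like HTML.
	matchers := []gocrawler.ResponseMatcher{
		gocrawler.IsOkResponse,
		gocrawler.IsHtmlContent,
	}

	c := gocrawler.New(ctx, cfg, matchers, gocrawler.DefaultLinkExtractor)

	// Start from each seed at depth 0 with no parent (assumed convention).
	for _, seed := range cfg.SeedURLs {
		c.Crawl(ctx, 0, seed, "")
	}
}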

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultLinkExtractor

func DefaultLinkExtractor(c *Client, currLink string, resp []byte) []string

DefaultLinkExtractor looks for <a href="..."> tags and extracts the link if the host is not blacklisted. This function assumes that if the href value is a relative path, it is relative to the current URL.

func IsClientErrorResponse

func IsClientErrorResponse(resp *http.Response) bool

This matches all responses that return a 4xx status code

func IsHtmlContent

func IsHtmlContent(resp *http.Response) bool

This matches all responses that return a 2xx status code and have a Content-Type header that contains "text/html"

func IsNoopResponse

func IsNoopResponse(resp *http.Response) bool

This matches all responses unconditionally (i.e. no filtering is applied)

func IsOkResponse

func IsOkResponse(resp *http.Response) bool

This matches all responses that return a 200 status code

func IsServerErrorResponse

func IsServerErrorResponse(resp *http.Response) bool

This matches all responses that return a 5xx status code

Types

type Client

type Client struct {
	MaxDepth        int
	NetMutex        sync.RWMutex
	PageMutex       sync.RWMutex
	HostBlacklist   map[string]struct{}
	VisitedNetInfo  map[string][]NetworkInfo
	VisitedPageInfo map[string]PageInfo
	// contains filtered or unexported fields
}

func New

func New(ctx context.Context, config *Config, rm []ResponseMatcher, le LinkExtractor) *Client

New creates a new crawler client. It takes a context to allow for cancellation, the crawler config, a list of response matchers used to filter out responses, and a LinkExtractor that controls how links are extracted.

Note that the ordering of the response matchers matters: the first matcher to return false will cause the link to be skipped.
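For example, in the chain below (a sketch, assuming the package identifiers above), a response that is not a 200 is rejected by IsOkResponse and IsHtmlContent is never consulted:

// The first matcher to return false short-circuits the chain and the
// response is skipped.
matchers := []gocrawler.ResponseMatcher{
	gocrawler.IsOkResponse,  // checked first: status must be 200
	gocrawler.IsHtmlContent, // checked second: body must be HTML
}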

func (*Client) Crawl

func (c *Client) Crawl(ctx context.Context, currDepth int, currLink, parent string)

Crawl is called recursively to crawl the supplied URL and all outgoing links extracted by the supplied LinkExtractor. The crawl stops when MaxDepth is reached or when the context is cancelled.
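Because Crawl honours context cancellation, a deadline on the context bounds the whole crawl. The sketch below assumes a client c returned by New and the seed-at-depth-0, empty-parent convention used in the Usage sketch:

ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()

// Stops early when the deadline passes, otherwise when MaxDepth is reached.
c.Crawl(ctx, 0, "https://example.com", "")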

type Config

type Config struct {
	BlacklistHosts map[string]struct{} // hosts to blacklist
	MaxDepth       int                 // max depth from seed
	MaxRetries     int                 // max retries for HTTP requests
	MaxRPS         float64             // max requests per second
	ProxyURL       *url.URL            // proxy URL, if any. useful to avoid IP bans
	SeedURLs       []string            // where to start crawling from
	Timeout        time.Duration       // timeout for HTTP requests
}
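A fuller configuration than the Usage sketch might blacklist hosts and route requests through a proxy. The sketch below additionally assumes the net/url import; all values are illustrative only.

proxy, _ := url.Parse("http://127.0.0.1:8080") // check the error in real code

cfg := &gocrawler.Config{
	BlacklistHosts: map[string]struct{}{
		"ads.example.com": {}, // never follow links into this host
	},
	MaxDepth:   3,
	MaxRetries: 5,
	MaxRPS:     2.5,
	ProxyURL:   proxy,
	SeedURLs:   []string{"https://example.com", "https://example.org"},
	Timeout:    15 * time.Second,
}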

type IPInfo

type IPInfo struct {
	IP       string `json:"ip"`
	Location string `json:"location"`
	ASNumber string `json:"as_number"`
}

type LinkExtractor

type LinkExtractor func(c *Client, currLink string, resp []byte) []string

A LinkExtractor takes the crawler client (which carries the host blacklist), the current link, and the response body, and returns a slice of links to crawl next.
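Any function with this signature can be passed to New in place of DefaultLinkExtractor. The sketch below is not part of the package: a hypothetical extractor that keeps only absolute http(s) hrefs whose host is not in the client's blacklist (assumes the regexp and net/url imports alongside the placeholder gocrawler import).

var hrefRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

// absoluteLinkExtractor is a hypothetical custom LinkExtractor.
func absoluteLinkExtractor(c *gocrawler.Client, currLink string, resp []byte) []string {
	var links []string
	for _, m := range hrefRe.FindAllSubmatch(resp, -1) {
		link := string(m[1])
		u, err := url.Parse(link)
		if err != nil {
			continue
		}
		if _, blocked := c.HostBlacklist[u.Host]; blocked {
			continue // skip blacklisted hosts, mirroring DefaultLinkExtractor
		}
		links = append(links, link)
	}
	return links
}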

type NetworkInfo

type NetworkInfo struct {
	RemoteIPInfo  []IPInfo `json:"remote_ip_info"`
	AvgResponseMs int64    `json:"avg_response_ms"`
	PathCount     int      `json:"path_count"`
	VisitedPaths  []string `json:"visited_paths"`

	// These values are not exported to JSON
	TotalResponseTimeMs int64               `json:"-"`
	VisitedPathSet      map[string]struct{} `json:"-"`
}

type PageInfo

type PageInfo struct {
	Depth  int      `json:"depth"`
	Parent string   `json:"parent"`
	Links  []string `json:"links"`

	// These values are not exported to JSON
	Content []byte `json:"-"`
}

type ResponseMatcher

type ResponseMatcher func(resp *http.Response) bool

ResponseMatcher is a function that takes an *http.Response and returns a boolean indicating whether the contents of the URL should be processed (e.g. to extract links).
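Custom matchers are plain predicates over *http.Response and can be appended to the slice passed to New. The sketch below is not part of the package: a hypothetical matcher that skips responses whose declared Content-Length exceeds 1 MiB (unknown lengths, reported as -1, pass through).

// isSmallResponse is a hypothetical ResponseMatcher.
func isSmallResponse(resp *http.Response) bool {
	return resp.ContentLength <= 1<<20
}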
