crawler

package
v0.0.0-...-a75fe09
Published: Oct 14, 2022 License: GPL-3.0 Imports: 14 Imported by: 0

Documentation

Index

Constants

const (
	DiscoverRequestType string = "DISCOVER"
	ExtractRequestType  string = "EXTRACT"
)

Variables

This section is empty.

Functions

func NewRequest

func NewRequest(method string, url string, body io.Reader) *http.Request

Types

type Call

type Call struct {
	*http.Request
	RequestType string
}

Call wraps an http.Request and adds a RequestType, which should be either DiscoverRequestType or ExtractRequestType. A discovery request is used only to find new URLs, while the response to an extraction request is stored locally for further processing.

func NewCall

func NewCall(r *http.Request, RequestType string) *Call
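
A minimal sketch of building both kinds of Call. The module's import path is truncated on this page, so the path below is hypothetical, and the URLs are illustrative:

import (
	"fmt"
	"net/http"

	"example.com/yourmodule/crawler" // hypothetical import path
)

func ExampleNewCall() {
	// Discovery call: its response is only scanned for new URLs.
	discover := crawler.NewCall(
		crawler.NewRequest(http.MethodGet, "https://example.com/", nil),
		crawler.DiscoverRequestType,
	)

	// Extraction call: its response is stored for further processing.
	extract := crawler.NewCall(
		crawler.NewRequest(http.MethodGet, "https://example.com/article/1", nil),
		crawler.ExtractRequestType,
	)

	fmt.Println(discover.RequestType, extract.RequestType)
	// Output: DISCOVER EXTRACT
}

The snippets below assume the same imports.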

type Crawler

type Crawler interface {
	attribute.Taggable
	Crawl(c *Call) *Data
}

Crawler crawls any URL and returns Data containing what it has found. It also embeds attribute.Taggable, allowing implementations to tag the Data they produce.
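
A sketch of a custom implementation. It assumes attribute.Taggable is satisfied by a SetTag method alone, as the SetTag methods on HtmlCrawler and RestCrawler suggest; check the attribute package for the actual interface.

type noopCrawler struct {
	tag *attribute.Tag
}

func (n *noopCrawler) SetTag(t *attribute.Tag) { n.tag = t }

func (n *noopCrawler) Crawl(c *crawler.Call) *crawler.Data {
	// A real implementation would execute c and collect data and
	// follow-up calls; this one returns an empty result.
	return crawler.NewData(n.tag, c, "", nil, nil)
}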

type Data

type Data struct {
	*attribute.Tag
	Call       *Call
	Data       string
	FoundCalls []*Call
	Error      error
}

Data holds everything produced by a Crawler.Crawl call: the originating Call, the raw data that was fetched, the collection of newly found Calls, and any error that occurred.
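
A sketch of consuming a result, where c is any Crawler, call is a previously built *Call, and the standard log package is imported:

d := c.Crawl(call)
if d.Error != nil {
	log.Fatal("crawl failed: ", d.Error)
}
log.Printf("fetched %d bytes from %s", len(d.Data), d.Call.URL)
for _, fc := range d.FoundCalls {
	log.Println("found:", fc.RequestType, fc.URL)
}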

func NewData

func NewData(t *attribute.Tag, call *Call, data string, foundCalls []*Call, err error) *Data

type HtmlCrawler

type HtmlCrawler struct {
	*attribute.Tag
	// contains filtered or unexported fields
}

HtmlCrawler crawls HTTP(S) URLs and returns their raw data. Because it uses http.Client.Do, nothing is rendered, so data hidden behind client-side API calls will not be fetched. HtmlCrawler is concurrency safe and keeps a registry of all URLs it has found.

func NewHtmlCrawler

func NewHtmlCrawler(c *http.Client) *HtmlCrawler

func (*HtmlCrawler) AddDiscoveryUrlRegex

func (hc *HtmlCrawler) AddDiscoveryUrlRegex(expr string)

AddDiscoveryUrlRegex registers a regular expression used to match URLs that should be collected for discovery.

func (*HtmlCrawler) AddExtractUrlRegex

func (hc *HtmlCrawler) AddExtractUrlRegex(expr string)

AddExtractUrlRegex registers a regular expression used to match URLs that should be collected for extraction.

func (*HtmlCrawler) Crawl

func (hc *HtmlCrawler) Crawl(c *Call) *Data

Crawl crawls the given Call and returns the data and URLs it has found while doing so.
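
End to end, a HtmlCrawler might be used as follows. The patterns and timeout are illustrative; "time" is assumed to be imported alongside the imports from the Call example:

hc := crawler.NewHtmlCrawler(&http.Client{Timeout: 10 * time.Second})

// Follow category pages only to find links; store matching article pages.
hc.AddDiscoveryUrlRegex(`^https://example\.com/category/`)
hc.AddExtractUrlRegex(`^https://example\.com/article/\d+$`)

seed := crawler.NewCall(
	crawler.NewRequest(http.MethodGet, "https://example.com/", nil),
	crawler.DiscoverRequestType,
)
data := hc.Crawl(seed)

data.FoundCalls then holds the discovered Calls, ready to be fed back into Crawl or handed to a Manager.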

func (*HtmlCrawler) SetTag

func (hc *HtmlCrawler) SetTag(t *attribute.Tag)

type Manager

type Manager struct {
	// contains filtered or unexported fields
}

Manager oversees all registered Crawler instances.

func NewManager

func NewManager(db *database.Db) *Manager

func (*Manager) RegisterCrawler

func (m *Manager) RegisterCrawler(c Crawler, calls []*Call)

func (*Manager) RegisterCrawlers

func (m *Manager) RegisterCrawlers(crawlers map[Crawler][]*Call)

func (*Manager) Start

func (m *Manager) Start(amountOfWorkers int)

Start begins crawling with the registered Crawler and Call instances, using a supervisor.Supervisor to crawl concurrently with the given number of workers.
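
A sketch of wiring the pieces together, reusing hc and seed from the HtmlCrawler example. The construction of *database.Db is not documented on this page, so db is assumed to be opened elsewhere, and the worker count is arbitrary:

m := crawler.NewManager(db)
m.RegisterCrawler(hc, []*crawler.Call{seed})
m.Start(4) // crawl with four concurrent workers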

type RestCrawler

type RestCrawler struct {
	*attribute.Tag
	// contains filtered or unexported fields
}

RestCrawler crawls REST APIs using the provided Call instance.

func NewRestCrawler

func NewRestCrawler(c *http.Client) *RestCrawler

NewRestCrawler returns a new instance of RestCrawler.

func (*RestCrawler) Crawl

func (rc *RestCrawler) Crawl(c *Call) *Data

Crawl executes the given Call and returns a Data instance containing the response body as a string, along with any other relevant data found along the way.
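
A sketch of fetching a JSON endpoint (the API URL is made up):

rc := crawler.NewRestCrawler(http.DefaultClient)
call := crawler.NewCall(
	crawler.NewRequest(http.MethodGet, "https://api.example.com/v1/items", nil),
	crawler.ExtractRequestType,
)
if d := rc.Crawl(call); d.Error == nil {
	fmt.Println(d.Data) // raw response body, e.g. JSON
}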

func (*RestCrawler) SetTag

func (rc *RestCrawler) SetTag(t *attribute.Tag)
