crawler

package
v0.0.0-...-a75fe09
Published: Oct 14, 2022 License: GPL-3.0 Imports: 14 Imported by: 0

Documentation

Index

Constants

const (
	DiscoverRequestType string = "DISCOVER"
	ExtractRequestType  string = "EXTRACT"
)

Variables

This section is empty.

Functions

func NewRequest

func NewRequest(method string, url string, body io.Reader) *http.Request

Types

type Call

type Call struct {
	*http.Request
	RequestType string
}

Call wraps an http.Request and adds a RequestType, which should be either DiscoverRequestType or ExtractRequestType. A discovery request is used only to find new URLs, while the response to an extraction request is stored locally for further processing.

func NewCall

func NewCall(r *http.Request, RequestType string) *Call
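
A minimal sketch of building both kinds of Call. The module's import path is truncated on this page, so the path below is hypothetical, and the URLs are illustrative:

import (
	"fmt"
	"net/http"

	"example.com/yourmodule/crawler" // hypothetical import path
)

func ExampleNewCall() {
	// Discovery call: its response is only scanned for new URLs.
	discover := crawler.NewCall(
		crawler.NewRequest(http.MethodGet, "https://example.com/", nil),
		crawler.DiscoverRequestType,
	)

	// Extraction call: its response is stored for further processing.
	extract := crawler.NewCall(
		crawler.NewRequest(http.MethodGet, "https://example.com/article/1", nil),
		crawler.ExtractRequestType,
	)

	fmt.Println(discover.RequestType, extract.RequestType)
	// Output: DISCOVER EXTRACT
}

The snippets below assume the same imports.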

type Crawler

type Crawler interface {
	attribute.Taggable
	Crawl(c *Call) *Data
}

Crawler crawls any URL and returns Data containing what it has found. It also embeds attribute.Taggable, allowing implementations to tag the Data they produce.
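
A sketch of a custom implementation. It assumes attribute.Taggable is satisfied by a SetTag method alone, as the SetTag methods on HtmlCrawler and RestCrawler suggest; check the attribute package for the actual interface.

type noopCrawler struct {
	tag *attribute.Tag
}

func (n *noopCrawler) SetTag(t *attribute.Tag) { n.tag = t }

func (n *noopCrawler) Crawl(c *crawler.Call) *crawler.Data {
	// A real implementation would execute c and collect data and
	// follow-up calls; this one returns an empty result.
	return crawler.NewData(n.tag, c, "", nil, nil)
}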

type Data

type Data struct {
	*attribute.Tag
	Call       *Call
	Data       string
	FoundCalls []*Call
	Error      error
}

Data holds everything produced by a Crawler.Crawl call: the originating Call, the raw data that was fetched, the collection of newly found Calls, and any error that occurred.
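
A sketch of consuming a result, where c is any Crawler, call is a previously built *Call, and the standard log package is imported:

d := c.Crawl(call)
if d.Error != nil {
	log.Fatal("crawl failed: ", d.Error)
}
log.Printf("fetched %d bytes from %s", len(d.Data), d.Call.URL)
for _, fc := range d.FoundCalls {
	log.Println("found:", fc.RequestType, fc.URL)
}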

func NewData

func NewData(t *attribute.Tag, call *Call, data string, foundCalls []*Call, err error) *Data

type HtmlCrawler

type HtmlCrawler struct {
	*attribute.Tag
	// contains filtered or unexported fields
}

HtmlCrawler crawls HTTP(S) URLs and returns their raw data. Because it uses http.Client.Do, nothing is rendered, so data hidden behind client-side API calls will not be fetched. HtmlCrawler is concurrency safe and keeps a registry of all URLs it has found.

func NewHtmlCrawler

func NewHtmlCrawler(c *http.Client) *HtmlCrawler

func (*HtmlCrawler) AddDiscoveryUrlRegex

func (hc *HtmlCrawler) AddDiscoveryUrlRegex(expr string)

AddDiscoveryUrlRegex registers a regular expression used to match URLs that should be collected for discovery.

func (*HtmlCrawler) AddExtractUrlRegex

func (hc *HtmlCrawler) AddExtractUrlRegex(expr string)

AddExtractUrlRegex registers a regular expression used to match URLs that should be collected for extraction.

func (*HtmlCrawler) Crawl

func (hc *HtmlCrawler) Crawl(c *Call) *Data

Crawl crawls the given Call and returns the data and URLs it has found while doing so.
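
End to end, a HtmlCrawler might be used as follows. The patterns and timeout are illustrative; "time" is assumed to be imported alongside the imports from the Call example:

hc := crawler.NewHtmlCrawler(&http.Client{Timeout: 10 * time.Second})

// Follow category pages only to find links; store matching article pages.
hc.AddDiscoveryUrlRegex(`^https://example\.com/category/`)
hc.AddExtractUrlRegex(`^https://example\.com/article/\d+$`)

seed := crawler.NewCall(
	crawler.NewRequest(http.MethodGet, "https://example.com/", nil),
	crawler.DiscoverRequestType,
)
data := hc.Crawl(seed)

data.FoundCalls then holds the discovered Calls, ready to be fed back into Crawl or handed to a Manager.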

func (*HtmlCrawler) SetTag

func (hc *HtmlCrawler) SetTag(t *attribute.Tag)

type Manager

type Manager struct {
	// contains filtered or unexported fields
}

Manager oversees all registered Crawler instances.

func NewManager

func NewManager(db *database.Db) *Manager

func (*Manager) RegisterCrawler

func (m *Manager) RegisterCrawler(c Crawler, calls []*Call)

func (*Manager) RegisterCrawlers

func (m *Manager) RegisterCrawlers(crawlers map[Crawler][]*Call)

func (*Manager) Start

func (m *Manager) Start(amountOfWorkers int)

Start begins crawling with the registered Crawler and Call instances, using a supervisor.Supervisor to crawl concurrently with the given number of workers.
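
A sketch of wiring the pieces together, reusing hc and seed from the HtmlCrawler example. The construction of *database.Db is not documented on this page, so db is assumed to be opened elsewhere, and the worker count is arbitrary:

m := crawler.NewManager(db)
m.RegisterCrawler(hc, []*crawler.Call{seed})
m.Start(4) // crawl with four concurrent workers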

type RestCrawler

type RestCrawler struct {
	*attribute.Tag
	// contains filtered or unexported fields
}

RestCrawler crawls REST APIs using the provided Call instance.

func NewRestCrawler

func NewRestCrawler(c *http.Client) *RestCrawler

NewRestCrawler returns a new instance of RestCrawler.

func (*RestCrawler) Crawl

func (rc *RestCrawler) Crawl(c *Call) *Data

Crawl executes the given Call and returns a Data instance containing the response body as a string, along with any other relevant data found along the way.
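
A sketch of fetching a JSON endpoint (the API URL is made up):

rc := crawler.NewRestCrawler(http.DefaultClient)
call := crawler.NewCall(
	crawler.NewRequest(http.MethodGet, "https://api.example.com/v1/items", nil),
	crawler.ExtractRequestType,
)
if d := rc.Crawl(call); d.Error == nil {
	fmt.Println(d.Data) // raw response body, e.g. JSON
}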

func (*RestCrawler) SetTag

func (rc *RestCrawler) SetTag(t *attribute.Tag)
