crawler

package module
v0.0.0-...-2f0050d Latest
Warning

This package is not in the latest version of its module.

Published: May 4, 2020 License: Apache-2.0 Imports: 6 Imported by: 3

Documentation

Overview

Package crawler defines all the functionality for page crawling.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CrawlResult

type CrawlResult struct {
	URL   string `json:"URL"`
	Title string `json:"Title"`
}

CrawlResult defines the result of crawling a single page.

type Crawler

type Crawler struct {
	ID         string        `json:"ID"`
	BaseURL    string        `json:"BaseURL"`
	StartURL   string        `json:"StartURL"`
	PagesLimit int           `json:"PagesLimit"`
	Results    []CrawlResult `json:"Results"`
}

Crawler defines a default crawler.
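
Both types carry JSON struct tags, so a crawler and its results can be serialized with encoding/json. The sketch below redeclares the two documented types locally so that it compiles on its own; encodeCrawler and the sample values are hypothetical and not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, so the example is self-contained.
type CrawlResult struct {
	URL   string `json:"URL"`
	Title string `json:"Title"`
}

type Crawler struct {
	ID         string        `json:"ID"`
	BaseURL    string        `json:"BaseURL"`
	StartURL   string        `json:"StartURL"`
	PagesLimit int           `json:"PagesLimit"`
	Results    []CrawlResult `json:"Results"`
}

// encodeCrawler is a hypothetical helper that serializes a Crawler to JSON.
func encodeCrawler(c Crawler) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	c := Crawler{
		ID:         "job-1",
		BaseURL:    "https://example.com",
		StartURL:   "https://example.com/",
		PagesLimit: 10,
		Results:    []CrawlResult{{URL: "https://example.com/", Title: "Home"}},
	}
	out, err := encodeCrawler(c)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Because the tags repeat the field names exactly, the JSON keys come out capitalized ("ID", "BaseURL", and so on) rather than in the lower camel case many JSON APIs use.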

func (*Crawler) Crawl

func (c *Crawler) Crawl() error

Crawl crawls the whole host of the given StartURL and saves the data (URLs and titles) to the Crawler struct.
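
The documentation does not say how Crawl traverses a host. One plausible shape, shown here as a sketch only, is a breadth-first walk with a visited set that stops once a page limit is reached. All names below (crawlHost, sameHost, fetchFunc, fakeSite, fakeFetch) are hypothetical; fakeFetch stands in for CrawlPage so the example runs without network access, and CrawlResult is a trimmed local copy.

```go
package main

import (
	"fmt"
	"strings"
)

type CrawlResult struct {
	URL   string
	Title string
}

// fetchFunc stands in for CrawlPage: it returns the links found on a page
// and the page's result. Hypothetical, so the sketch needs no network.
type fetchFunc func(url string) ([]string, CrawlResult, error)

// sameHost reports whether link lives under baseURL (assumed to have no
// trailing slash). A real crawler would compare parsed hosts instead.
func sameHost(baseURL, link string) bool {
	return link == baseURL || strings.HasPrefix(link, baseURL+"/")
}

// crawlHost walks breadth-first from start, stays under baseURL, and stops
// after limit pages. A limit <= 0 means "no limit" in this sketch.
func crawlHost(baseURL, start string, limit int, fetch fetchFunc) []CrawlResult {
	var results []CrawlResult
	visited := map[string]bool{start: true}
	queue := []string{start}

	for len(queue) > 0 {
		if limit > 0 && len(results) >= limit {
			break
		}
		page := queue[0]
		queue = queue[1:]

		links, res, err := fetch(page)
		if err != nil {
			continue // skip pages that fail to load
		}
		results = append(results, res)

		for _, l := range links {
			// Stay on the same host and never enqueue a URL twice.
			if !sameHost(baseURL, l) || visited[l] {
				continue
			}
			visited[l] = true
			queue = append(queue, l)
		}
	}
	return results
}

// fakeSite is an in-memory stand-in for a real website: page -> links.
var fakeSite = map[string][]string{
	"https://example.com/":  {"https://example.com/a", "https://example.com/b", "https://other.org/"},
	"https://example.com/a": {"https://example.com/", "https://example.com/c"},
	"https://example.com/b": {},
	"https://example.com/c": {},
}

func fakeFetch(url string) ([]string, CrawlResult, error) {
	links, ok := fakeSite[url]
	if !ok {
		return nil, CrawlResult{}, fmt.Errorf("page not found: %s", url)
	}
	return links, CrawlResult{URL: url, Title: "Title of " + url}, nil
}

func main() {
	for _, r := range crawlHost("https://example.com", "https://example.com/", 3, fakeFetch) {
		fmt.Println(r.URL)
	}
}
```

Marking a URL as visited at enqueue time, not at fetch time, keeps the same page from entering the queue twice when several pages link to it.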

func (*Crawler) CrawlPage

func (c *Crawler) CrawlPage(url string) ([]string, CrawlResult, error)

CrawlPage crawls a single page and returns its links as a []string, a CrawlResult (page URL and title), and an error.

func (*Crawler) FormatRelative

func (c *Crawler) FormatRelative(urls map[string]int) (formatedUrls []string)

FormatRelative converts relative links encountered during crawling to absolute links.
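
As an illustration of relative-to-absolute conversion, the standard net/url package can resolve each link against a base URL. This is a minimal sketch, not the package's implementation: absolutize is a hypothetical helper, the base URL is passed in explicitly, and the int values of the map are ignored because the documentation does not say what they represent.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// absolutize resolves every key of links against base and returns the
// absolute URLs in sorted order. The map values are ignored here; their
// meaning in the real package is not documented.
func absolutize(base string, links map[string]int) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(links))
	for link := range links {
		ref, err := url.Parse(link)
		if err != nil {
			continue // skip hrefs that are not valid URLs
		}
		// ResolveReference leaves absolute links unchanged and resolves
		// relative ones against the base, as a browser would.
		out = append(out, baseURL.ResolveReference(ref).String())
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out, nil
}

func main() {
	links := map[string]int{
		"/about":                   1,
		"contact.html":             2,
		"https://example.com/blog": 1,
	}
	abs, err := absolutize("https://example.com/docs/", links)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	for _, u := range abs {
		fmt.Println(u)
	}
}
```

Note that a root-relative link such as "/about" replaces the whole path of the base, while "contact.html" is resolved inside the base's directory.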

func (*Crawler) GetLinks

func (c *Crawler) GetLinks(doc *goquery.Document) []string

GetLinks returns all the a[href] values from the goquery.Document.

func (*Crawler) GetRequest

func (c *Crawler) GetRequest(url string) (*goquery.Document, error)

GetRequest is a helper function for CrawlPage. It requests a page and returns a *goquery.Document and an error.

func (*Crawler) GetResult

func (c *Crawler) GetResult(doc *goquery.Document, url string) CrawlResult

GetResult returns a CrawlResult for a single page.

func (*Crawler) ParseBase

func (c *Crawler) ParseBase() error

ParseBase parses the base URL of the host.
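
The documentation does not state which field ParseBase reads or writes. Assuming it derives a value such as BaseURL from StartURL, the scheme and host can be extracted with net/url. The sketch below is hypothetical: baseOf is not part of the package, and rejecting relative URLs is a choice made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// baseOf returns the scheme and host of rawURL, for example
// "https://example.com". Hypothetical helper, not the package's API.
func baseOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// A crawl needs an absolute start URL; reject anything else.
	if u.Scheme == "" || u.Host == "" {
		return "", errors.New("URL must be absolute: " + rawURL)
	}
	// u.Host keeps the port, if one was given.
	return u.Scheme + "://" + u.Host, nil
}

func main() {
	base, err := baseOf("https://example.com:8080/docs/index.html?page=2")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(base)
}
```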
