crawler

package
v0.9.2

Published: Apr 25, 2024 License: Apache-2.0 Imports: 35 Imported by: 0

Documentation

Overview

Package crawler implements the crawling logic of the application. It's responsible for crawling a website and extracting information from it.


Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ConnectSelenium

func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)

ConnectSelenium connects to the Selenium server instance.

func CrawlWebsite

func CrawlWebsite(tID *sync.WaitGroup, db cdb.Handler, source cdb.Source, sel SeleniumInstance, SeleniumInstances chan SeleniumInstance, re *rules.RuleEngine)

CrawlWebsite crawls a website. It is the main entry point and is called from main.go whenever there is a Source to crawl.

func DefaultActionConfig

func DefaultActionConfig(url string) cfg.SourceConfig

func DefaultCrawlingConfig

func DefaultCrawlingConfig(url string) cfg.SourceConfig

func FuzzURL added in v0.9.2

func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)

FuzzURL takes a base URL and a CrawlingRule, generating fuzzed URLs based on the rule's parameters.
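The rule's actual parameters are not documented here, but the idea can be sketched: substitute each fuzzing term into a placeholder in the base URL. The `{fuzz}` placeholder and the term list below are illustrative assumptions, not the package's real CrawlingRule format.

```go
package main

import (
	"fmt"
	"strings"
)

// fuzzURL is a simplified sketch of what FuzzURL might do: for each
// fuzzing term, substitute it into the {fuzz} placeholder of the base
// URL. The placeholder convention is an assumption for illustration.
func fuzzURL(baseURL string, terms []string) ([]string, error) {
	if !strings.Contains(baseURL, "{fuzz}") {
		return nil, fmt.Errorf("base URL %q has no {fuzz} placeholder", baseURL)
	}
	urls := make([]string, 0, len(terms))
	for _, t := range terms {
		urls = append(urls, strings.ReplaceAll(baseURL, "{fuzz}", t))
	}
	return urls, nil
}

func main() {
	urls, err := fuzzURL("https://example.com/{fuzz}", []string{"admin", "login"})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u) // https://example.com/admin, then https://example.com/login
	}
}
```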

func IsValidURL

func IsValidURL(u string) bool

IsValidURL checks if the string is a valid URL.
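A common way to implement such a check with the standard library is to require that the string parses and carries both a scheme and a host; the real IsValidURL may apply different criteria, so this is only a sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// isValidURL sketches one plausible implementation: the string must
// parse with net/url and have both a scheme and a host.
func isValidURL(u string) bool {
	parsed, err := url.Parse(u)
	return err == nil && parsed.Scheme != "" && parsed.Host != ""
}

func main() {
	fmt.Println(isValidURL("https://example.com/path")) // true
	fmt.Println(isValidURL("not a url"))                // false
}
```

Note that `url.Parse` alone is very permissive (it accepts bare relative paths), which is why the scheme and host checks are needed.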

func NewSeleniumService

func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)

NewSeleniumService initializes the Selenium driver. The commented-out code could be used to initialize a local Selenium server instead of a container-based one; however, the container-based Selenium server proved more stable and reliable than the local one, and it is also easier to set up and more secure.

func QuitSelenium

func QuitSelenium(wd *selenium.WebDriver)

QuitSelenium quits the Selenium server instance.

func ReturnSeleniumInstance added in v0.9.2

func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance)

ReturnSeleniumInstance returns the Selenium server instance.

func StartCrawler

func StartCrawler(cf cfg.Config)

StartCrawler initializes the crawler.

func StopSelenium

func StopSelenium(sel *selenium.Service) error

StopSelenium stops the Selenium server instance (if running locally).

func UpdateSourceState added in v0.9.2

func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)

UpdateSourceState updates the state of a Source in the database after crawling it, recording any crawl error.

Types

type MetaTag

type MetaTag struct {
	Name    string
	Content string
}

MetaTag represents a single meta tag, including its name and content.
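Since PageInfo (below) carries a slice of these tags, a caller will typically scan the slice for a tag by name. The helper below is not part of the package; it is only a sketch of how the struct might be consumed.

```go
package main

import "fmt"

// MetaTag mirrors the struct from the documentation above.
type MetaTag struct {
	Name    string
	Content string
}

// metaContent is an illustrative helper (not in the package) that looks
// up the content of the first tag with the given name.
func metaContent(tags []MetaTag, name string) (string, bool) {
	for _, t := range tags {
		if t.Name == name {
			return t.Content, true
		}
	}
	return "", false
}

func main() {
	tags := []MetaTag{
		{Name: "description", Content: "A demo page"},
		{Name: "keywords", Content: "crawler, go"},
	}
	if c, ok := metaContent(tags, "description"); ok {
		fmt.Println(c) // A demo page
	}
}
```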

type PageInfo

type PageInfo struct {
	URL string // The URL of the web page.

	Title        string             // The title of the web page.
	Summary      string             // A summary of the web page content.
	BodyText     string             // The main body text of the web page.
	HTML         string             // The HTML content of the web page.
	MetaTags     []MetaTag          // The meta tags of the web page.
	Keywords     map[string]string  // The keywords of the web page.
	DetectedType string             // The detected document type of the web page.
	DetectedLang string             // The detected language of the web page.
	NetInfo      *neti.NetInfo      // The network information of the web page.
	HTTPInfo     *httpi.HTTPDetails // The HTTP header information of the web page.
	ScrapedData  string             // The scraped data from the web page.
	Links        []string           // The links found in the web page.
	// contains filtered or unexported fields
}

PageInfo represents the information of a web page.

type ScraperRuleEngine

type ScraperRuleEngine struct {
	*rs.RuleEngine // generic rule engine
}

ScraperRuleEngine extends RuleEngine from the ruleset package.

type Screenshot

type Screenshot struct {
	IndexID         int64  `json:"index_id"`
	ScreenshotLink  string `json:"screenshot_link"`
	Height          int    `json:"height"`
	Width           int    `json:"width"`
	ByteSize        int    `json:"byte_size"`
	ThumbnailHeight int    `json:"thumbnail_height"`
	ThumbnailWidth  int    `json:"thumbnail_width"`
	ThumbnailLink   string `json:"thumbnail_link"`
	Format          string `json:"format"`
}

Screenshot represents the metadata of a webpage screenshot.

func TakeScreenshot

func TakeScreenshot(wd *selenium.WebDriver, filename string) (Screenshot, error)

TakeScreenshot takes a screenshot of the current page.

type SeleniumInstance

type SeleniumInstance struct {
	Service *selenium.Service
	Config  cfg.Selenium
}

SeleniumInstance holds a Selenium service and its configuration.
