crawler

package
v0.9.2

Published: Apr 25, 2024 License: Apache-2.0 Imports: 35 Imported by: 0

Documentation

Overview

Package crawler implements the crawling logic of the application. It's responsible for crawling a website and extracting information from it.


Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ConnectSelenium

func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)

ConnectSelenium connects to the Selenium server instance.

func CrawlWebsite

func CrawlWebsite(tID *sync.WaitGroup, db cdb.Handler, source cdb.Source, sel SeleniumInstance, SeleniumInstances chan SeleniumInstance, re *rules.RuleEngine)

CrawlWebsite crawls a website. It is the main entry point and is called from main.go whenever there is a Source to crawl.

func DefaultActionConfig

func DefaultActionConfig(url string) cfg.SourceConfig

func DefaultCrawlingConfig

func DefaultCrawlingConfig(url string) cfg.SourceConfig

func FuzzURL added in v0.9.2

func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)

FuzzURL takes a base URL and a CrawlingRule, generating fuzzed URLs based on the rule's parameters.
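The rule's actual parameters are not documented here, but the idea can be sketched: substitute each fuzzing term into a placeholder in the base URL. The `{fuzz}` placeholder and the term list below are illustrative assumptions, not the package's real CrawlingRule format.

```go
package main

import (
	"fmt"
	"strings"
)

// fuzzURL is a simplified sketch of what FuzzURL might do: for each
// fuzzing term, substitute it into the {fuzz} placeholder of the base
// URL. The placeholder convention is an assumption for illustration.
func fuzzURL(baseURL string, terms []string) ([]string, error) {
	if !strings.Contains(baseURL, "{fuzz}") {
		return nil, fmt.Errorf("base URL %q has no {fuzz} placeholder", baseURL)
	}
	urls := make([]string, 0, len(terms))
	for _, t := range terms {
		urls = append(urls, strings.ReplaceAll(baseURL, "{fuzz}", t))
	}
	return urls, nil
}

func main() {
	urls, err := fuzzURL("https://example.com/{fuzz}", []string{"admin", "login"})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u) // https://example.com/admin, then https://example.com/login
	}
}
```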

func IsValidURL

func IsValidURL(u string) bool

IsValidURL checks if the string is a valid URL.
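A common way to implement such a check with the standard library is to require that the string parses and carries both a scheme and a host; the real IsValidURL may apply different criteria, so this is only a sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// isValidURL sketches one plausible implementation: the string must
// parse with net/url and have both a scheme and a host.
func isValidURL(u string) bool {
	parsed, err := url.Parse(u)
	return err == nil && parsed.Scheme != "" && parsed.Host != ""
}

func main() {
	fmt.Println(isValidURL("https://example.com/path")) // true
	fmt.Println(isValidURL("not a url"))                // false
}
```

Note that `url.Parse` alone is very permissive (it accepts bare relative paths), which is why the scheme and host checks are needed.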

func NewSeleniumService

func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)

NewSeleniumService initializes the Selenium driver. The commented-out code could be used to initialize a local Selenium server instead of a container-based one; however, the container-based Selenium server proved more stable and reliable than the local one, and it is also easier to set up and more secure.

func QuitSelenium

func QuitSelenium(wd *selenium.WebDriver)

QuitSelenium quits the Selenium server instance.

func ReturnSeleniumInstance added in v0.9.2

func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance)

ReturnSeleniumInstance returns the Selenium server instance.

func StartCrawler

func StartCrawler(cf cfg.Config)

StartCrawler initializes the crawler.

func StopSelenium

func StopSelenium(sel *selenium.Service) error

StopSelenium stops the Selenium server instance (if running locally).

func UpdateSourceState added in v0.9.2

func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)

UpdateSourceState updates the state of a Source in the database after crawling it, recording any crawl error.

Types

type MetaTag

type MetaTag struct {
	Name    string
	Content string
}

MetaTag represents a single meta tag, including its name and content.
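Since PageInfo (below) carries a slice of these tags, a caller will typically scan the slice for a tag by name. The helper below is not part of the package; it is only a sketch of how the struct might be consumed.

```go
package main

import "fmt"

// MetaTag mirrors the struct from the documentation above.
type MetaTag struct {
	Name    string
	Content string
}

// metaContent is an illustrative helper (not in the package) that looks
// up the content of the first tag with the given name.
func metaContent(tags []MetaTag, name string) (string, bool) {
	for _, t := range tags {
		if t.Name == name {
			return t.Content, true
		}
	}
	return "", false
}

func main() {
	tags := []MetaTag{
		{Name: "description", Content: "A demo page"},
		{Name: "keywords", Content: "crawler, go"},
	}
	if c, ok := metaContent(tags, "description"); ok {
		fmt.Println(c) // A demo page
	}
}
```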

type PageInfo

type PageInfo struct {
	URL string // The URL of the web page.

	Title        string             // The title of the web page.
	Summary      string             // A summary of the web page content.
	BodyText     string             // The main body text of the web page.
	HTML         string             // The HTML content of the web page.
	MetaTags     []MetaTag          // The meta tags of the web page.
	Keywords     map[string]string  // The keywords of the web page.
	DetectedType string             // The detected document type of the web page.
	DetectedLang string             // The detected language of the web page.
	NetInfo      *neti.NetInfo      // The network information of the web page.
	HTTPInfo     *httpi.HTTPDetails // The HTTP header information of the web page.
	ScrapedData  string             // The scraped data from the web page.
	Links        []string           // The links found in the web page.
	// contains filtered or unexported fields
}

PageInfo represents the information of a web page.

type ScraperRuleEngine

type ScraperRuleEngine struct {
	*rs.RuleEngine // generic rule engine
}

ScraperRuleEngine extends RuleEngine from the ruleset package.

type Screenshot

type Screenshot struct {
	IndexID         int64  `json:"index_id"`
	ScreenshotLink  string `json:"screenshot_link"`
	Height          int    `json:"height"`
	Width           int    `json:"width"`
	ByteSize        int    `json:"byte_size"`
	ThumbnailHeight int    `json:"thumbnail_height"`
	ThumbnailWidth  int    `json:"thumbnail_width"`
	ThumbnailLink   string `json:"thumbnail_link"`
	Format          string `json:"format"`
}

Screenshot represents the metadata of a webpage screenshot.

func TakeScreenshot

func TakeScreenshot(wd *selenium.WebDriver, filename string) (Screenshot, error)

TakeScreenshot takes a screenshot of the current page.

type SeleniumInstance

type SeleniumInstance struct {
	Service *selenium.Service
	Config  cfg.Selenium
}

SeleniumInstance holds a Selenium service and its configuration.
