Documentation ¶
Overview ¶
Package crawler implements the crawling logic of the application. It's responsible for crawling a website and extracting information from it.
Index ¶
- func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)
- func CrawlWebsite(tID *sync.WaitGroup, db cdb.Handler, source cdb.Source, sel SeleniumInstance, ...)
- func DefaultActionConfig(url string) cfg.SourceConfig
- func DefaultCrawlingConfig(url string) cfg.SourceConfig
- func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)
- func IsValidURL(u string) bool
- func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)
- func QuitSelenium(wd *selenium.WebDriver)
- func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance)
- func StartCrawler(cf cfg.Config)
- func StopSelenium(sel *selenium.Service) error
- func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)
- type MetaTag
- type PageInfo
- type ScraperRuleEngine
- type Screenshot
- type SeleniumInstance
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ConnectSelenium ¶
func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)
ConnectSelenium connects to the given Selenium server instance.
func CrawlWebsite ¶
func CrawlWebsite(tID *sync.WaitGroup, db cdb.Handler, source cdb.Source, sel SeleniumInstance, SeleniumInstances chan SeleniumInstance, re *rules.RuleEngine)
CrawlWebsite crawls a website. It is the main entry point for a crawl and is called from main.go whenever there is a Source to crawl.
func DefaultActionConfig ¶
func DefaultActionConfig(url string) cfg.SourceConfig
func DefaultCrawlingConfig ¶
func DefaultCrawlingConfig(url string) cfg.SourceConfig
func FuzzURL ¶ added in v0.9.2
func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)
FuzzURL takes a base URL and a CrawlingRule, generating fuzzed URLs based on the rule's parameters.
func IsValidURL ¶
func IsValidURL(u string) bool
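The internals of rules.CrawlingRule are not documented here, so the sketch below substitutes a plain map of query-parameter candidates (a hypothetical stand-in for the rule's parameters) to show the general fuzzing technique: one URL variant per candidate value.

```go
package main

import (
	"fmt"
	"net/url"
)

// fuzzURL is a hypothetical sketch of rule-driven URL fuzzing: for each
// candidate value of each query parameter, emit one variant of the base URL.
// The real FuzzURL derives its parameters from a rules.CrawlingRule instead
// of a plain map.
func fuzzURL(baseURL string, params map[string][]string) ([]string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	var out []string
	for name, values := range params {
		for _, v := range values {
			q := u.Query()
			q.Set(name, v)
			variant := *u // copy the parsed URL before mutating the query
			variant.RawQuery = q.Encode()
			out = append(out, variant.String())
		}
	}
	return out, nil
}

func main() {
	urls, err := fuzzURL("https://example.com/search", map[string][]string{
		"page": {"1", "2"},
	})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```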
func NewSeleniumService ¶
func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)
NewSeleniumService initializes the Selenium driver. The commented-out code can initialize a local Selenium server instead of a container-based one; in practice the container-based server has proven more stable and reliable, as well as easier to set up and more secure, so only the container path is used.
func QuitSelenium ¶
func QuitSelenium(wd *selenium.WebDriver)
QuitSelenium quits the Selenium server instance.
func ReturnSeleniumInstance ¶ added in v0.9.2
func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance)
ReturnSeleniumInstance returns the Selenium server instance.
func StartCrawler ¶
func StartCrawler(cf cfg.Config)
StartCrawler initializes the crawler.
func StopSelenium ¶
func StopSelenium(sel *selenium.Service) error
StopSelenium stops the Selenium server instance (if running locally).
func UpdateSourceState ¶
func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)
Types ¶
type PageInfo ¶
type PageInfo struct {
	URL          string             // The URL of the web page.
	Title        string             // The title of the web page.
	Summary      string             // A summary of the web page content.
	BodyText     string             // The main body text of the web page.
	HTML         string             // The HTML content of the web page.
	MetaTags     []MetaTag          // The meta tags of the web page.
	Keywords     map[string]string  // The keywords of the web page.
	DetectedType string             // The detected document type of the web page.
	DetectedLang string             // The detected language of the web page.
	NetInfo      *neti.NetInfo      // The network information of the web page.
	HTTPInfo     *httpi.HTTPDetails // The HTTP header information of the web page.
	ScrapedData  string             // The scraped data from the web page.
	Links        []string           // The links found in the web page.
	// contains filtered or unexported fields
}
PageInfo represents the information of a web page.
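To illustrate how a field such as Title might be populated, here is a deliberately naive, hypothetical extractor based on string search. The real package obtains page content through Selenium, and a production scraper should parse the DOM rather than scan raw HTML:

```go
package main

import (
	"fmt"
	"strings"
)

// extractTitle pulls the <title> text out of raw HTML with case-insensitive
// string searches. It is a simplified stand-in for the kind of logic that
// fills PageInfo.Title, not the package's actual implementation.
func extractTitle(html string) string {
	lower := strings.ToLower(html)
	start := strings.Index(lower, "<title>")
	if start == -1 {
		return ""
	}
	start += len("<title>")
	end := strings.Index(lower[start:], "</title>")
	if end == -1 {
		return ""
	}
	return strings.TrimSpace(html[start : start+end])
}

func main() {
	page := "<html><head><title> Example Domain </title></head><body></body></html>"
	fmt.Println(extractTitle(page)) // prints "Example Domain"
}
```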
type ScraperRuleEngine ¶
type ScraperRuleEngine struct {
*rs.RuleEngine // generic rule engine
}
ScraperRuleEngine extends RuleEngine from the ruleset package
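The extension works through Go struct embedding: the generic engine's methods are promoted onto the wrapper, which can then add scraper-specific behavior. A minimal sketch with simplified, hypothetical types (not the ruleset package's real ones):

```go
package main

import "fmt"

// ruleEngine is a stand-in for the generic rs.RuleEngine.
type ruleEngine struct{ rules []string }

// Count reports how many rules the engine holds.
func (e *ruleEngine) Count() int { return len(e.rules) }

// scraperRuleEngine embeds *ruleEngine, mirroring how ScraperRuleEngine
// embeds *rs.RuleEngine: Count is promoted onto the wrapper automatically.
type scraperRuleEngine struct {
	*ruleEngine // embedded generic engine
}

func main() {
	s := scraperRuleEngine{&ruleEngine{rules: []string{"a", "b"}}}
	fmt.Println(s.Count()) // promoted from the embedded *ruleEngine
}
```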
type Screenshot ¶
type Screenshot struct {
	IndexID         int64  `json:"index_id"`
	ScreenshotLink  string `json:"screenshot_link"`
	Height          int    `json:"height"`
	Width           int    `json:"width"`
	ByteSize        int    `json:"byte_size"`
	ThumbnailHeight int    `json:"thumbnail_height"`
	ThumbnailWidth  int    `json:"thumbnail_width"`
	ThumbnailLink   string `json:"thumbnail_link"`
	Format          string `json:"format"`
}
Screenshot represents the metadata of a webpage screenshot
func TakeScreenshot ¶
func TakeScreenshot(wd *selenium.WebDriver, filename string) (Screenshot, error)
TakeScreenshot takes a screenshot of the current page.