Documentation ¶
Overview ¶
Package app provides the crawling functionality for clamber.
To initiate a crawl, create a Crawler with an empty sync.WaitGroup and an empty struct map. DbWaitGroup ensures the clamber process does not exit before the crawler has finished writing to the database. AlreadyCrawled tracks the URLs that have already been crawled during the current crawl. The remaining fields are self-explanatory.
crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Logger:         log.Logger,  // placeholder: supply your logger
	Config:         app.Config,  // placeholder: supply your config
	Db:             app.DbStore, // placeholder: supply your database store
}
Create a Page object with the starting URL of your crawl.
page := &app.Page{Url: "https://golang.org"}
Call Crawl on the Crawler, passing in your page and the desired crawl depth.
crawler.Crawl(page, 5)
Ensure your Go process does not exit before the crawled data has been written to Dgraph. If you have other logic that must run first, place it before the line below; your application will block on Wait() until the database writes are complete.
crawler.DbWaitGroup.Wait()
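Putting the steps together, here is a minimal end-to-end sketch. The import path and the newLogger, loadConfig, and newDbStore helpers are hypothetical placeholders for however your application constructs its dependencies; they are not part of this package.

package main

import (
	"sync"

	"path/to/clamber/app" // placeholder: replace with clamber's real import path
)

func main() {
	// Hypothetical constructors; build these however your application does.
	logger := newLogger()  // yields a log.Logger
	cfg := loadConfig()    // yields a common.Config
	store := newDbStore()  // yields a common.DbStore

	crawler := app.Crawler{
		DbWaitGroup:    sync.WaitGroup{},
		AlreadyCrawled: make(map[string]struct{}),
		Logger:         logger,
		Config:         cfg,
		Db:             store,
	}

	page := &app.Page{Url: "https://golang.org"}
	crawler.Crawl(page, 5)

	// Any work that must finish before the process may exit goes here,
	// ahead of Wait(); Wait() blocks until all database writes complete.
	crawler.DbWaitGroup.Wait()
}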
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
	AlreadyCrawled map[string]struct{}
	sync.Mutex
	DbWaitGroup          sync.WaitGroup
	BgWaitGroup          sync.WaitGroup
	BgNotified           bool
	BgWaitNotified       bool
	Config               common.Config
	Db                   common.DbStore
	Logger               log.Logger
	BackgroundCrawlDepth int
	CrawlUid             uuid.UUID
	StartUrl             string
}
Crawler holds the state and dependencies used during a crawl
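The embedded sync.Mutex suggests shared state such as AlreadyCrawled is accessed from concurrent crawls. A de-duplication check in that style might look like the sketch below; hasCrawled is a hypothetical helper for illustration, not part of the package's exported API.

func (crawler *Crawler) hasCrawled(url string) bool {
	crawler.Lock()
	defer crawler.Unlock()
	if _, ok := crawler.AlreadyCrawled[url]; ok {
		return true
	}
	// Record the URL so no other goroutine crawls it again.
	crawler.AlreadyCrawled[url] = struct{}{}
	return false
}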
func (*Crawler) Crawl ¶
Crawl adds the page to the database (in a goroutine, so the write does not block further crawls from starting), fetches the child pages, then initiates a crawl for each one.
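In outline, the behaviour described above (a goroutine for the database write, then recursive crawls of the children) might look like the following sketch. The Db.Create call and the fetchChildren helper are assumptions for illustration, not the package's actual implementation, and hasCrawled is the hypothetical helper sketched earlier.

func (crawler *Crawler) Crawl(currentPage *Page, depth int) {
	if depth <= 0 || crawler.hasCrawled(currentPage.Url) {
		return
	}
	crawler.DbWaitGroup.Add(1)
	go func() {
		// Write the page in the background so other crawls can start.
		defer crawler.DbWaitGroup.Done()
		crawler.Db.Create(currentPage) // hypothetical store method
	}()
	for _, child := range fetchChildren(currentPage) { // hypothetical fetch
		crawler.Crawl(child, depth-1)
	}
}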
type DistributedCrawler ¶
type DistributedCrawler struct{}
DistributedCrawler holds objects related to the distributed crawler