Documentation ¶
Overview ¶
Package app provides clamber's crawling functionality.
To initiate a crawl, create a Crawler with an empty sync.WaitGroup and an empty map[string]struct{}. DbWaitGroup ensures the clamber process does not exit before the crawler has finished writing to the database. AlreadyCrawled tracks the URLs that have already been visited during the current crawl, so no page is crawled twice. The remaining fields configure logging and storage.
crawler := app.Crawler{
    DbWaitGroup:    sync.WaitGroup{},
    AlreadyCrawled: make(map[string]struct{}),
    Logger:         log.Logger,
    Store:          app.DbStore,
}
Create a page object with the starting URL of your crawl.
page := &app.Page{Url: "https://golang.org"}
Call Crawl on the Crawler, passing in your page. (Per the signature below, crawl depth is not a parameter of Crawl; it is configured on the Crawler itself via BackgroundCrawlDepth.)

crawler.Crawl(page)
Ensure your Go process does not exit before the crawled data has been saved to dgraph. If other logic needs to run first, place it before the line below; the application will block on Wait() until all database writes have completed.
crawler.DbWaitGroup.Wait()
Index ¶
- type Crawler
- func (crawler *Crawler) Crawl(currentPage *page.Page)
- func (crawler *Crawler) Create(currentPage *page.Page) (err error)
- func (crawler *Crawler) FindOrCreateLink(ctx *context.Context, parentUid string, currentUid string) (err error)
- func (crawler *Crawler) FindOrCreatePage(ctx *context.Context, p *page.Page) (uid string, err error)
- func (crawler *Crawler) Get(currentPage *page.Page) (resp *http.Response, err error)
- func (crawler *Crawler) Start() (err error)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
	AlreadyCrawled map[string]struct{}
	sync.Mutex
	DbWaitGroup          sync.WaitGroup
	BgWaitGroup          sync.WaitGroup
	BgNotified           bool
	BgWaitNotified       bool
	Store                *relationship.Store
	BackgroundCrawlDepth int
	CrawlUid             uuid.UUID
	Queue                *queue.Queue
}
Crawler holds the state and dependencies used during a crawl.
func (*Crawler) Crawl ¶
Crawl adds the page to the database in a goroutine (so it does not block further crawls), fetches the child pages, then initiates a crawl for each one.
func (*Crawler) Create ¶
Create checks whether the current page exists and creates it if not; it does the same for the parent page, and for the edge between them.