app

package module
v0.0.0-...-111903a
Published: Dec 10, 2019 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Overview

Package app provides clamber's crawling functionality.

To initiate a crawl, create a Crawler with an empty sync.WaitGroup and an empty struct map. DbWaitGroup ensures the clamber process does not exit before the crawler has finished writing to the database. AlreadyCrawled tracks the URLs that have already been crawled during that crawl process. The remaining fields take your own logger, configuration, and database store values.

crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Logger:         logger, // a log.Logger value
	Config:         config, // a common.Config value
	Db:             db,     // a common.DbStore value
}

Create a page object with the starting URL of your crawl.

page := &common.Page{Url: "https://golang.org"}

Call Crawl on the Crawler, passing in your page. Note that Crawl takes only the page; crawl depth is configured on the Crawler itself (for example via BackgroundCrawlDepth) rather than passed as an argument.

crawler.Crawl(page)

Ensure your Go process does not exit before the crawled data has been saved to Dgraph. If you have other logic that must run first, place it before the Wait() call below, as the application will block on Wait() until the database writes are complete.

crawler.DbWaitGroup.Wait()
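
Putting the steps together, a minimal program might look like the sketch below. The import paths, the logging package, and the logger, config, and db values are placeholders for your own setup, not part of the documented API.

package main

import (
	"sync"

	"github.com/go-kit/kit/log" // assumed logging package; the docs only show the log.Logger type

	"github.com/example/clamber/app"    // illustrative import path
	"github.com/example/clamber/common" // illustrative import path
)

func main() {
	var (
		logger log.Logger     // your logger value
		config common.Config  // your loaded configuration
		db     common.DbStore // your connected database store
	)

	crawler := app.Crawler{
		DbWaitGroup:    sync.WaitGroup{},
		AlreadyCrawled: make(map[string]struct{}),
		Logger:         logger,
		Config:         config,
		Db:             db,
	}

	page := &common.Page{Url: "https://golang.org"}
	crawler.Crawl(page)

	// Block until the crawler has finished writing to Dgraph.
	crawler.DbWaitGroup.Wait()
}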

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func InitFlags

func InitFlags(appFlags *Flags)

InitFlags loads the given flags into the global AppFlags variable.
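
The pointer fields on Flags pair naturally with the standard flag package; a plausible setup sketch follows. The flag names and defaults here are assumptions, not part of the documented API.

package main

import (
	"flag"

	"github.com/example/clamber/app" // illustrative import path
)

func main() {
	f := app.Flags{
		ConfigFile: flag.String("config", "config.json", "path to the config file"),
		Port:       flag.Int("port", 8000, "port to listen on"),
		Verbose:    flag.Bool("verbose", false, "enable verbose logging"),
	}
	flag.Parse()
	app.InitFlags(&f) // stores the parsed flags in the global app.AppFlags
}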

Types

type Crawler

type Crawler struct {
	AlreadyCrawled map[string]struct{}
	sync.Mutex
	DbWaitGroup          sync.WaitGroup
	BgWaitGroup          sync.WaitGroup
	BgNotified           bool
	BgWaitNotified       bool
	Config               common.Config
	Db                   common.DbStore
	Logger               log.Logger
	BackgroundCrawlDepth int
	CrawlUid             uuid.UUID
	StartUrl             string
}

Crawler holds objects related to the crawler

func (*Crawler) Crawl

func (crawler *Crawler) Crawl(currentPage *common.Page)

Crawl adds the page to the database (in a goroutine, so it does not block other crawls from starting), fetches the child pages, then initiates a crawl for each one.
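
The described behaviour follows a familiar recursive-crawl shape. The sketch below, written as if it lived inside package app, illustrates that shape only; it is not the package's actual implementation, and fetchChildren is a hypothetical helper standing in for the real link extraction.

func (crawler *Crawler) crawlSketch(page *common.Page) {
	// Persist the page in the background so new crawls are not blocked.
	crawler.DbWaitGroup.Add(1)
	go func() {
		defer crawler.DbWaitGroup.Done()
		_ = crawler.Create(page) // errors would be logged via crawler.Logger
	}()

	// Visit each child exactly once, guarding AlreadyCrawled with the embedded mutex.
	for _, child := range fetchChildren(page) { // fetchChildren is hypothetical
		crawler.Lock()
		_, seen := crawler.AlreadyCrawled[child.Url]
		if !seen {
			crawler.AlreadyCrawled[child.Url] = struct{}{}
		}
		crawler.Unlock()
		if !seen {
			crawler.crawlSketch(child)
		}
	}
}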

func (*Crawler) Create

func (crawler *Crawler) Create(currentPage *common.Page) (err error)

Create checks for the current page and creates it if it does not exist, does the same for the parent page, then checks for an edge between them and creates it if it does not exist.
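
This is a check-then-create (idempotent upsert) pattern. A sketch of that shape follows; findNode, createNode, findEdge, and createEdge are hypothetical stand-ins for the real DbStore calls, and the parent is taken as a parameter here, whereas the real method derives it from the current page.

func createSketch(db common.DbStore, current, parent *common.Page) error {
	// Ensure the current page node exists.
	if !findNode(db, current) {
		if err := createNode(db, current); err != nil {
			return err
		}
	}
	// Ensure the parent page node exists.
	if parent != nil && !findNode(db, parent) {
		if err := createNode(db, parent); err != nil {
			return err
		}
	}
	// Ensure the parent -> current edge exists.
	if parent != nil && !findEdge(db, parent, current) {
		return createEdge(db, parent, current)
	}
	return nil
}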

func (*Crawler) Get

func (crawler *Crawler) Get(currentPage *common.Page) (resp *http.Response, err error)

Get performs the HTTP request for a page and returns the response.
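
A helper in this shape typically wraps the standard net/http and time packages; a minimal sketch, where the client timeout is an assumption:

// getSketch performs the HTTP GET for a page, in the spirit of Crawler.Get.
func getSketch(page *common.Page) (*http.Response, error) {
	client := &http.Client{Timeout: 10 * time.Second} // timeout value is an assumption
	return client.Get(page.Url)
}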

type DistributedCrawler

type DistributedCrawler struct {
}

DistributedCrawler holds objects related to the distributed crawler.

type Flags

type Flags struct {
	ConfigFile *string
	Port       *int
	Verbose    *bool
}

Flags holds the application's command-line flags.

var (
	// AppFlags makes a global Flag struct
	AppFlags Flags
)
