Documentation ¶
Overview ¶
Package app provides the crawling functionality for clamber.
To initiate a crawl, create a Crawler with an empty sync.WaitGroup and an empty struct map. DbWaitGroup ensures the clamber process does not exit before the crawler has finished writing to the database. AlreadyCrawled tracks the URLs that have already been crawled during the current crawl. The remaining fields are self-explanatory.
crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Logger:         log.Logger,  // placeholder: supply your logger
	Config:         app.Config,  // placeholder: supply your config
	Db:             app.DbStore, // placeholder: supply your database store
}
Create a Page object with the starting URL of your crawl.
page := &app.Page{Url: "https://golang.org"}
Call Crawl on the Crawler, passing in your page and the desired crawl depth.
crawler.Crawl(page, 5)
Ensure your Go process does not exit before the crawled data has been written to Dgraph. If you have other logic that must run first, place it before the line below; your application will block on Wait() until the database writes are complete.
crawler.DbWaitGroup.Wait()
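Putting the steps together, here is a minimal end-to-end sketch. The import path and the newLogger, loadConfig, and newDbStore helpers are hypothetical placeholders for however your application constructs its dependencies; they are not part of this package.

package main

import (
	"sync"

	"path/to/clamber/app" // placeholder: replace with clamber's real import path
)

func main() {
	// Hypothetical constructors; build these however your application does.
	logger := newLogger()  // yields a log.Logger
	cfg := loadConfig()    // yields a common.Config
	store := newDbStore()  // yields a common.DbStore

	crawler := app.Crawler{
		DbWaitGroup:    sync.WaitGroup{},
		AlreadyCrawled: make(map[string]struct{}),
		Logger:         logger,
		Config:         cfg,
		Db:             store,
	}

	page := &app.Page{Url: "https://golang.org"}
	crawler.Crawl(page, 5)

	// Any work that must finish before the process may exit goes here,
	// ahead of Wait(); Wait() blocks until all database writes complete.
	crawler.DbWaitGroup.Wait()
}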
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
	AlreadyCrawled map[string]struct{}
	sync.Mutex
	DbWaitGroup          sync.WaitGroup
	BgWaitGroup          sync.WaitGroup
	BgNotified           bool
	BgWaitNotified       bool
	Config               common.Config
	Db                   common.DbStore
	Logger               log.Logger
	BackgroundCrawlDepth int
	CrawlUid             uuid.UUID
	StartUrl             string
}
Crawler holds the state and dependencies used during a crawl
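The embedded sync.Mutex suggests shared state such as AlreadyCrawled is accessed from concurrent crawls. A de-duplication check in that style might look like the sketch below; hasCrawled is a hypothetical helper for illustration, not part of the package's exported API.

func (crawler *Crawler) hasCrawled(url string) bool {
	crawler.Lock()
	defer crawler.Unlock()
	if _, ok := crawler.AlreadyCrawled[url]; ok {
		return true
	}
	// Record the URL so no other goroutine crawls it again.
	crawler.AlreadyCrawled[url] = struct{}{}
	return false
}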
func (*Crawler) Crawl ¶
Crawl adds the page to the database (in a goroutine, so the write does not block further crawls from starting), fetches the child pages, then initiates a crawl for each one.
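In outline, the behaviour described above (a goroutine for the database write, then recursive crawls of the children) might look like the following sketch. The Db.Create call and the fetchChildren helper are assumptions for illustration, not the package's actual implementation, and hasCrawled is the hypothetical helper sketched earlier.

func (crawler *Crawler) Crawl(currentPage *Page, depth int) {
	if depth <= 0 || crawler.hasCrawled(currentPage.Url) {
		return
	}
	crawler.DbWaitGroup.Add(1)
	go func() {
		// Write the page in the background so other crawls can start.
		defer crawler.DbWaitGroup.Done()
		crawler.Db.Create(currentPage) // hypothetical store method
	}()
	for _, child := range fetchChildren(currentPage) { // hypothetical fetch
		crawler.Crawl(child, depth-1)
	}
}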
type DistributedCrawler ¶
type DistributedCrawler struct{}
DistributedCrawler holds objects related to the distributed crawler