crawler

package module
v0.0.0-...-673f271 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 27, 2016 License: MIT Imports: 12 Imported by: 0

README

go-webcrawler

A simple, concurrent, distributed web crawler framework implemented in Go.

Installation

go get github.com/nladuo/go-webcrawler  

Dependencies

go get github.com/samuel/go-zookeeper
go get github.com/jinzhu/gorm
go get github.com/nladuo/DLocker
go get github.com/PuerkitoBio/goquery

About the Modes of go-webcrawler

go-webcrawler is a simple web crawler framework that lets you build concurrent and distributed web crawler applications. It has three modes: local memory mode, local sql mode and distributed sql mode.

Local Memory Mode

In this mode, the framework stores the intermediate data directly in memory. Use this mode if the URL list of your crawler application will not grow exponentially, or if your PC has enough memory to hold it.
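To make "intermediate data in memory" concrete, here is a minimal sketch of a mutex-protected FIFO of pending URLs. The type `memQueue` and its methods are hypothetical names chosen for illustration; they are not go-webcrawler's internal types, and the framework's actual queue may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// memQueue is a hypothetical in-memory FIFO of pending URLs.
// It only illustrates the idea; it is not go-webcrawler's own type.
type memQueue struct {
	mu   sync.Mutex
	urls []string
}

// Push appends a URL to the tail of the queue.
func (q *memQueue) Push(url string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.urls = append(q.urls, url)
}

// Pop removes and returns the URL at the head of the queue.
// The boolean is false when the queue is empty.
func (q *memQueue) Pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.urls) == 0 {
		return "", false
	}
	url := q.urls[0]
	q.urls = q.urls[1:]
	return url, true
}

// Len reports how many URLs are still waiting.
func (q *memQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.urls)
}

func main() {
	q := &memQueue{}
	q.Push("https://example.com/a")
	q.Push("https://example.com/b")
	for {
		url, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println("would fetch", url)
	}
}
```

Every queued URL lives in the slice until it is popped, which is why memory use tracks the size of the URL list in this mode.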

Local Sql Mode

In this mode, the framework stores the intermediate data in a SQL database. Because it uses an ORM framework for database access, you can use SQLite3, PostgreSQL, MySQL and so on. You do not need to worry about a massive number of request URLs exhausting your PC's memory.

Distributed Sql Mode

Like the Local Sql Mode, the Distributed Sql Mode stores the intermediate data in a SQL database. The difference is that the distributed mode needs ZooKeeper for coordination. You can check out the ZooKeeper configuration here.

Examples

The examples below give you a quick introduction to go-webcrawler.

1. GitHub stars crawler

Local Memory Mode Example
Local Sql Mode Example
Distributed Sql Mode Example

2. Douban movie top250 crawler

Local Memory Mode Example
Local Sql Mode Example
Distributed Sql Mode Example

License

MIT

Documentation

Index

Constants

const (
	ErrShutDownCrawler string = "Cannot ShutDown the crwaler when "
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

func NewDistributedSqlCrawler

func NewDistributedSqlCrawler(db *gorm.DB, config *model.DistributedConfig) *Crawler

Used for distributed mode. Requires ZooKeeper and a SQL database to store the internal data. Make sure your SQL database can be accessed by all the servers.

func NewLocalMemCrawler

func NewLocalMemCrawler(threadNum int) *Crawler

Local mode. Stores the internal data in a queue; suitable for simple applications.

func NewLocalSqlCrawler

func NewLocalSqlCrawler(db *gorm.DB, threadNum int) *Crawler

Local mode. Requires a SQL database to store the internal data, which reduces memory use.

func (*Crawler) AddBaseTask

func (this *Crawler) AddBaseTask(task model.Task)

Only the master crawler executes Crawler.AddBaseTask. So, under the Distributed Mode, you only need to change config.json to make your crawler work in a distributed fashion.
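The master-only behaviour can be pictured as a guard around the seeding step, so the same program runs unchanged on every node. The sketch below is an illustration under assumptions: `nodeConfig`, its `IsMaster` field, and `seedTasks` are hypothetical names, not the library's `model.DistributedConfig` or its internals.

```go
package main

import "fmt"

// nodeConfig is a hypothetical stand-in for per-node settings loaded
// from a file such as config.json. The field name IsMaster is an assumption.
type nodeConfig struct {
	IsMaster bool
}

// seedTasks returns the base URLs this node should enqueue:
// a copy of all of them on the master, none on any other node.
func seedTasks(cfg nodeConfig, baseURLs []string) []string {
	if !cfg.IsMaster {
		return nil
	}
	return append([]string(nil), baseURLs...)
}

func main() {
	base := []string{"https://example.com/start"}

	// The same code runs on every node; only the configuration differs.
	fmt.Println("master seeds:", seedTasks(nodeConfig{IsMaster: true}, base))
	fmt.Println("worker seeds:", seedTasks(nodeConfig{IsMaster: false}, base))
}
```

With a guard like this, calling the seeding function on every node is harmless, because only the node configured as master enqueues the base tasks.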

func (*Crawler) AddParser

func (this *Crawler) AddParser(parser model.Parser)

func (*Crawler) Run

func (this *Crawler) Run()

func (*Crawler) SetPProfPort

func (this *Crawler) SetPProfPort(port string)

Used for debugging.

func (*Crawler) SetProxyGenerator

func (this *Crawler) SetProxyGenerator(generater model.ProxyGenerator)

func (*Crawler) SetProxyTimeOut

func (this *Crawler) SetProxyTimeOut(timeout time.Duration)
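This page does not show how the crawler applies the proxy timeout internally. As a general illustration of the idea, the sketch below uses only the standard library to build an `http.Client` that routes requests through a proxy and abandons them after a timeout. The helper name `newProxyClient` and the proxy address are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient builds an http.Client that sends requests through
// proxyAddr and gives up on any request that takes longer than timeout.
// It is a hypothetical helper, not part of go-webcrawler.
func newProxyClient(proxyAddr string, timeout time.Duration) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   timeout,
	}, nil
}

func main() {
	// The address below is a placeholder for a real proxy.
	client, err := newProxyClient("http://127.0.0.1:8080", 5*time.Second)
	if err != nil {
		fmt.Println("bad proxy address:", err)
		return
	}
	fmt.Println("client timeout:", client.Timeout)
}
```

A short timeout matters with proxies because a slow or dead proxy would otherwise hold a crawler thread for the full duration of the request.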

func (*Crawler) WaitForShutDown

func (this *Crawler) WaitForShutDown()

Shuts down when the task list becomes empty.
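"Shut down when the task list becomes empty" needs care in a concurrent crawler, because a task that is still being processed may add new tasks. One common approach, sketched below with a `sync.WaitGroup`, counts queued and in-flight tasks together and stops only when that count reaches zero. The function `runUntilEmpty` is a hypothetical illustration, not the library's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// runUntilEmpty processes the seed tasks with workerNum goroutines and
// returns the number of tasks handled once no task is queued or in flight.
// handle may return follow-up tasks. workerNum must be at least 1.
func runUntilEmpty(seeds []string, workerNum int, handle func(string) []string) int {
	tasks := make(chan string)
	var pending sync.WaitGroup // queued + in-flight tasks
	var mu sync.Mutex
	processed := 0

	// enqueue counts the task first, then hands it over asynchronously
	// so that a worker adding follow-up tasks never blocks.
	enqueue := func(t string) {
		pending.Add(1)
		go func() { tasks <- t }()
	}

	for i := 0; i < workerNum; i++ {
		go func() {
			for t := range tasks {
				// Follow-ups are counted before this task is marked done,
				// so the counter cannot reach zero too early.
				for _, next := range handle(t) {
					enqueue(next)
				}
				mu.Lock()
				processed++
				mu.Unlock()
				pending.Done()
			}
		}()
	}

	for _, s := range seeds {
		enqueue(s)
	}
	pending.Wait() // the task list is empty and nothing is in flight
	close(tasks)   // lets the workers exit
	return processed
}

func main() {
	handle := func(t string) []string {
		if t == "root" {
			return []string{"child-1", "child-2"}
		}
		return nil
	}
	n := runUntilEmpty([]string{"root"}, 4, handle)
	fmt.Println("processed", n, "tasks")
}
```

Checking only the queue length would be racy here: the queue can be momentarily empty while a worker is still about to add follow-up tasks.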

Directories

Path                                       Synopsis
example
douban_movie_top250                        the local memory mode crawler
douban_movie_top250/distributed_sql_mode   the distributed sql mode crawler
douban_movie_top250/local_sql_mode         the local sql mode crawler
github_stars                               the local memory mode crawler
github_stars/distributed_sql_mode          the distributed sql mode crawler
github_stars/local_sql_mode                the local sql mode crawler
