engine

package
v0.1.4 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 19, 2023 | License: MIT | Imports: 11 | Imported by: 0

Documentation

Overview

Package engine implements the engines of the crawler.

Index

Constants

This section is empty.

Variables

var Store = &CrawlerStore{
	list: []*fetcher.Task{},
	Hash: map[string]*fetcher.Task{},
}

Store is the global CrawlerStore instance.

Functions

func AddJSReq added in v0.1.0

func AddJSReq(jreq map[string]interface{}) []*fetcher.Request

AddJSReq adds a JS dynamic crawl rule.

func AddJSReqs added in v0.1.0

func AddJSReqs(jreqs []map[string]interface{}) []*fetcher.Request

AddJSReqs adds JS dynamic crawl rules.

func GetFields added in v0.1.1

func GetFields(taskName, ruleName string) []string

GetFields returns the fields identified by taskName and ruleName.

Types

type Crawler added in v0.0.9

type Crawler struct {

	// store the visited fetcher.Request
	Visited     map[string]bool
	VisitedLock sync.Mutex
	// contains filtered or unexported fields
}

Crawler represents the global crawl instance.

func NewCrawler added in v0.0.9

func NewCrawler(opts ...Option) *Crawler

func (*Crawler) CreateWork added in v0.0.9

func (c *Crawler) CreateWork()

func (*Crawler) HandleResult added in v0.0.9

func (c *Crawler) HandleResult()

func (*Crawler) HasVisited added in v0.0.9

func (c *Crawler) HasVisited(r *fetcher.Request) bool

func (*Crawler) Run added in v0.0.9

func (c *Crawler) Run()

func (*Crawler) Schedule added in v0.0.9

func (c *Crawler) Schedule()

func (*Crawler) SetFailure added in v0.0.9

func (c *Crawler) SetFailure(req *fetcher.Request)

func (*Crawler) StoreVisited added in v0.0.9

func (c *Crawler) StoreVisited(reqs ...*fetcher.Request)
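
The exported Visited map and VisitedLock fields suggest that HasVisited and StoreVisited guard a set of already-seen requests with a mutex. The following is a minimal sketch of that pattern, not the package's actual code: the request type and its key method are hypothetical stand-ins for fetcher.Request, whose definition and key derivation are not shown on this page.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for fetcher.Request.
type request struct {
	Method string
	URL    string
}

// key is an assumed way to identify a request; the real package may
// derive its map key differently.
func (r *request) key() string { return r.Method + " " + r.URL }

// visitedSet mirrors the Visited and VisitedLock fields of Crawler.
type visitedSet struct {
	mu      sync.Mutex
	visited map[string]bool
}

func newVisitedSet() *visitedSet {
	return &visitedSet{visited: make(map[string]bool)}
}

// HasVisited reports whether the request was stored before.
func (v *visitedSet) HasVisited(r *request) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	return v.visited[r.key()]
}

// StoreVisited marks every given request as visited.
func (v *visitedSet) StoreVisited(reqs ...*request) {
	v.mu.Lock()
	defer v.mu.Unlock()
	for _, r := range reqs {
		v.visited[r.key()] = true
	}
}

func main() {
	v := newVisitedSet()
	r := &request{Method: "GET", URL: "https://example.com/"}
	fmt.Println(v.HasVisited(r)) // false
	v.StoreVisited(r)
	fmt.Println(v.HasVisited(r)) // true
}
```

Taking the lock in both methods keeps concurrent workers from racing on the map, which is presumably why the lock sits next to the map in the struct.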

type CrawlerStore added in v0.1.0

type CrawlerStore struct {
	Hash map[string]*fetcher.Task
	// contains filtered or unexported fields
}

CrawlerStore stores the crawler tasks.

func (*CrawlerStore) Add added in v0.1.0

func (cs *CrawlerStore) Add(task *fetcher.Task)

Add adds a task to the global crawler instance.

func (*CrawlerStore) AddJSTask added in v0.1.0

func (cs *CrawlerStore) AddJSTask(m *fetcher.TaskModel)

AddJSTask adds a JS dynamic crawl task.

type Option

type Option func(opts *options)

func WithChannelBuffer added in v0.1.3

func WithChannelBuffer(bufferSize int) Option

func WithFetcher

func WithFetcher(f fetcher.Fetcher) Option

func WithLogger

func WithLogger(logger *zap.Logger) Option

func WithScheduler added in v0.0.9

func WithScheduler(scheduler Scheduler) Option

func WithSeeds

func WithSeeds(seeds []*fetcher.Task) Option

func WithWorkCount

func WithWorkCount(workCount int) Option
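
Option has the shape of the functional options pattern: each With... function returns a closure that mutates an unexported options struct, and a constructor such as NewCrawler applies them in order. The following sketch shows the mechanics only. The fields of options, the default values, and the buildOptions helper are assumptions, since the real struct is unexported and not shown here.

```go
package main

import "fmt"

// options is a hypothetical stand-in; the real fields are unexported.
type options struct {
	workCount  int
	bufferSize int
}

// Option matches the signature documented above.
type Option func(opts *options)

// WithWorkCount sets the assumed workCount field.
func WithWorkCount(workCount int) Option {
	return func(opts *options) { opts.workCount = workCount }
}

// WithChannelBuffer sets the assumed bufferSize field.
func WithChannelBuffer(bufferSize int) Option {
	return func(opts *options) { opts.bufferSize = bufferSize }
}

// buildOptions is a hypothetical helper: it starts from assumed defaults
// and applies each Option in the order given.
func buildOptions(opts ...Option) options {
	o := options{workCount: 1, bufferSize: 0}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := buildOptions(WithWorkCount(5), WithChannelBuffer(64))
	fmt.Println(o.workCount, o.bufferSize) // 5 64
}
```

Against the real package, a call would presumably look like NewCrawler(WithWorkCount(5), WithChannelBuffer(64)), with any option left out falling back to the package's own default.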

type Schedule added in v0.0.9

type Schedule struct {
	Logger *zap.Logger
	// contains filtered or unexported fields
}

func NewSchedule added in v0.0.9

func NewSchedule() *Schedule

func (*Schedule) Pull added in v0.0.9

func (s *Schedule) Pull() *fetcher.Request

func (*Schedule) Push added in v0.0.9

func (s *Schedule) Push(reqs ...*fetcher.Request)

func (*Schedule) Schedule added in v0.0.9

func (s *Schedule) Schedule()

type Scheduler

type Scheduler interface {

	// Schedule starts the scheduler
	Schedule()

	// Push the request into scheduler
	Push(...*fetcher.Request)

	// Pull a request from scheduler
	Pull() *fetcher.Request
}

Scheduler defines the behavior of scheduling crawl requests.
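
Any type with these three methods can be passed to WithScheduler. As an illustration, here is a minimal FIFO scheduler sketch built on channels. It is not the package's Schedule type: the Request struct is a hypothetical stand-in for fetcher.Request, and fifoScheduler is an invented name.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for fetcher.Request.
type Request struct {
	URL string
}

// Scheduler repeats the interface documented above, using the stand-in type.
type Scheduler interface {
	Schedule()
	Push(...*Request)
	Pull() *Request
}

// fifoScheduler hands requests out in the order they were pushed.
type fifoScheduler struct {
	in  chan *Request
	out chan *Request
}

func newFIFOScheduler() *fifoScheduler {
	return &fifoScheduler{in: make(chan *Request), out: make(chan *Request)}
}

// Schedule runs the queue loop; call it in its own goroutine.
func (s *fifoScheduler) Schedule() {
	var queue []*Request
	for {
		var next *Request
		var out chan *Request // stays nil while the queue is empty, disabling the send case
		if len(queue) > 0 {
			next = queue[0]
			out = s.out
		}
		select {
		case r := <-s.in:
			queue = append(queue, r)
		case out <- next:
			queue = queue[1:]
		}
	}
}

// Push enqueues requests; it blocks until Schedule accepts each one.
func (s *fifoScheduler) Push(reqs ...*Request) {
	for _, r := range reqs {
		s.in <- r
	}
}

// Pull blocks until a request is available.
func (s *fifoScheduler) Pull() *Request {
	return <-s.out
}

func main() {
	var s Scheduler = newFIFOScheduler()
	go s.Schedule()
	s.Push(&Request{URL: "a"}, &Request{URL: "b"})
	fmt.Println(s.Pull().URL) // a
	fmt.Println(s.Pull().URL) // b
}
```

Owning the queue inside a single goroutine avoids an explicit lock. The nil-channel trick in the select keeps Pull blocked until something has been pushed.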
