fetcher

package
v0.1.4
Warning

This package is not in the latest version of its module.

Published: Mar 19, 2023 License: MIT Imports: 23 Imported by: 0

Documentation

Overview

Package fetcher implements a crawler fetcher.

Constants

This section is empty.

Variables

This section is empty.

Functions

func DetermineEncoding added in v0.1.4

func DetermineEncoding(r *bufio.Reader) encoding.Encoding

DetermineEncoding returns the character encoding of the HTML content read from r
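The signature takes a *bufio.Reader, which makes it possible to inspect the head of a response without consuming it. The sketch below illustrates only that peeking pattern, using the standard library alone. It is a simplified stand-in that recognizes UTF-8 and UTF-16 byte-order marks; sniffBOM and sniffAndRead are hypothetical names, and nothing here is taken from the package's actual implementation.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// sniffBOM is a hypothetical helper, not part of package fetcher.
// It peeks at the first bytes of the stream and reports the charset implied
// by a byte-order mark, or "" when none is found.
func sniffBOM(r *bufio.Reader) string {
	// Peek does not advance the reader. On short input it returns the bytes
	// that are available together with an error, which is ignored here.
	head, _ := r.Peek(3)
	switch {
	case bytes.HasPrefix(head, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8"
	case bytes.HasPrefix(head, []byte{0xFE, 0xFF}):
		return "utf-16be"
	case bytes.HasPrefix(head, []byte{0xFF, 0xFE}):
		return "utf-16le"
	}
	return ""
}

// sniffAndRead sniffs the charset and then reads the whole stream, showing
// that the peeked bytes are still delivered to later reads.
func sniffAndRead(s string) (charset, content string) {
	r := bufio.NewReader(strings.NewReader(s))
	charset = sniffBOM(r)
	rest, _ := io.ReadAll(r)
	return charset, string(rest)
}

func main() {
	charset, content := sniffAndRead("\xEF\xBB\xBF<html></html>")
	fmt.Printf("charset=%q bytes=%d\n", charset, len(content))
}
```

Because the reader is left untouched, the same *bufio.Reader can afterwards be wrapped in a decoder for the detected encoding.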

Types

type BaseFetcher

type BaseFetcher struct{}

BaseFetcher is a basic implementation of Fetcher

func (BaseFetcher) Get

func (bf BaseFetcher) Get(r *Request) ([]byte, error)

type BrowserFetcher

type BrowserFetcher struct {
	Timeout time.Duration
	Proxy   proxy.Func
	Logger  *zap.Logger
}

BrowserFetcher is a fetcher that simulates a browser

func (BrowserFetcher) Get

func (b BrowserFetcher) Get(r *Request) ([]byte, error)

type Context added in v0.1.0

type Context struct {
	Body []byte
	Req  *Request
}

Context is the crawling context

func (*Context) Output added in v0.1.1

func (c *Context) Output(data interface{}) *collector.OutputData

func (*Context) OutputJS added in v0.1.0

func (c *Context) OutputJS(reg string) ParseResult

OutputJS is called from JS code to parse a regular expression and obtain the crawl results

func (*Context) OutputStruct added in v0.1.3

func (c *Context) OutputStruct(dataStruct collector.DataStruct) *collector.OutputData

func (*Context) ParseJSReg added in v0.1.0

func (c *Context) ParseJSReg(ruleName string, reg string) ParseResult

ParseJSReg is called from JS code to parse a regular expression and obtain the list of request tasks
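As a rough illustration of regex-driven parsing, the sketch below turns matches in a page body into either follow-up requests or result items. The types are trimmed stand-ins for the ones documented here, parseLinks and collectMatches are hypothetical names, and the convention that the first capture group holds the URL is an assumption, not something the documentation states.

```go
package main

import (
	"fmt"
	"regexp"
)

// Trimmed stand-ins for the package's types.
type Request struct {
	Url      string
	Method   string
	RuleName string
}

type ParseResult struct {
	Requests []*Request
	Items    []interface{}
}

type Context struct {
	Body []byte
}

// parseLinks is a hypothetical analogue of ParseJSReg.
// Assumption: the first capture group of reg holds the URL to follow.
func (c *Context) parseLinks(ruleName, reg string) ParseResult {
	var result ParseResult
	re, err := regexp.Compile(reg)
	if err != nil {
		return result // an invalid pattern yields an empty result
	}
	for _, m := range re.FindAllSubmatch(c.Body, -1) {
		if len(m) < 2 {
			continue // the pattern has no capture group
		}
		result.Requests = append(result.Requests, &Request{
			Url:      string(m[1]),
			Method:   "GET",
			RuleName: ruleName,
		})
	}
	return result
}

// collectMatches is a hypothetical analogue of OutputJS: every whole match
// becomes one result item.
func (c *Context) collectMatches(reg string) ParseResult {
	var result ParseResult
	re, err := regexp.Compile(reg)
	if err != nil {
		return result
	}
	for _, m := range re.FindAll(c.Body, -1) {
		result.Items = append(result.Items, string(m))
	}
	return result
}

func main() {
	c := &Context{Body: []byte(`<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>`)}
	for _, r := range c.parseLinks("detail", `href="([^"]+)"`).Requests {
		fmt.Println(r.RuleName, r.Url)
	}
	fmt.Println(len(c.collectMatches(`<a [^>]+>`).Items))
}
```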

type Fetcher

type Fetcher interface {

	// Get fetches the HTML content for the given request
	Get(url *Request) ([]byte, error)
}

Fetcher defines the fetching behavior used by the crawler engine
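Any type with a matching Get method satisfies the interface. The sketch below shows one possible net/http-based implementation. Request and Fetcher are trimmed stand-ins, and simpleFetcher, stubTransport, and fetchWithStub are hypothetical names; the stub transport only exists so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Trimmed stand-ins for the package's types.
type Request struct {
	Url string
}

type Fetcher interface {
	Get(r *Request) ([]byte, error)
}

// simpleFetcher is a hypothetical Fetcher built on net/http.
type simpleFetcher struct {
	Client *http.Client
}

func (f simpleFetcher) Get(r *Request) ([]byte, error) {
	resp, err := f.Client.Get(r.Url)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", r.Url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch %s: unexpected status %d", r.Url, resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// stubTransport answers every request with a fixed body, so no socket is opened.
type stubTransport struct {
	body string
}

func (s stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(s.body)),
		Request:    req,
	}, nil
}

// fetchWithStub wires the fetcher to the stub and performs one Get.
func fetchWithStub(body string) (string, error) {
	var f Fetcher = simpleFetcher{
		Client: &http.Client{Transport: stubTransport{body: body}},
	}
	b, err := f.Get(&Request{Url: "http://example.invalid/page"})
	return string(b), err
}

func main() {
	got, err := fetchWithStub("<html>ok</html>")
	fmt.Println(got, err)
}
```

Injecting the *http.Client keeps timeout and proxy choices outside the fetcher, which is also what makes the stub possible.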

type ParseResult

type ParseResult struct {
	Requests []*Request
	Items    []interface{}
}

ParseResult holds the result of parsing a crawled response

type Property added in v0.1.0

type Property struct {
	// Name is the unique signature of the Task
	Name     string        `json:"name"`
	Url      string        `json:"url"`
	Cookie   string        `json:"cookie"`
	WaitTime time.Duration `json:"wait_time"`
	// Reload marks whether the site can be crawled repeatedly
	Reload   bool  `json:"reload"`
	MaxDepth int64 `json:"max_depth"`
	// Headers are added to the HTTP request headers
	Headers map[string]string `json:"headers"`
}

type RedirectFetcher added in v0.1.3

type RedirectFetcher struct {
	Timeout time.Duration
	Proxy   proxy.Func
	Logger  *zap.Logger
}

RedirectFetcher is a fetcher that deals with redirected links

func (RedirectFetcher) Get added in v0.1.3

func (b RedirectFetcher) Get(r *Request) ([]byte, error)

type Request

type Request struct {
	Task     *Task
	Url      string
	Method   string
	Depth    int64
	Priority int64
	RuleName string
	TempData *Temp
	// contains filtered or unexported fields
}

Request represents a single crawler request

func (Request) Check added in v0.0.9

func (r Request) Check() error

Check validates the request

func (Request) UniqueSign added in v0.0.9

func (r Request) UniqueSign() string

UniqueSign builds a unique signature for the request
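The documentation does not say what Check verifies or how the signature is derived. The sketch below shows one plausible shape for both, under two loudly stated assumptions: validation compares Depth against the task's MaxDepth, and the signature is an MD5 hex digest of the URL plus the method. The lower-case method names are hypothetical, and the types are trimmed stand-ins.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"errors"
	"fmt"
)

// Trimmed stand-ins for the package's types.
type Task struct {
	MaxDepth int64
}

type Request struct {
	Task   *Task
	Url    string
	Method string
	Depth  int64
}

// check is a hypothetical analogue of Request.Check.
// Assumption: a request is invalid once it exceeds the task's MaxDepth.
func (r Request) check() error {
	if r.Task == nil {
		return errors.New("request has no task")
	}
	if r.Depth > r.Task.MaxDepth {
		return errors.New("max depth limit reached")
	}
	return nil
}

// uniqueSign is a hypothetical analogue of Request.UniqueSign.
// Assumption: the signature is the MD5 hex digest of URL + method.
// MD5 serves as a fingerprint here, not as a security measure.
func (r Request) uniqueSign() string {
	sum := md5.Sum([]byte(r.Url + r.Method))
	return hex.EncodeToString(sum[:])
}

func main() {
	task := &Task{MaxDepth: 2}
	r := Request{Task: task, Url: "https://example.com", Method: "GET", Depth: 3}
	fmt.Println(r.check())
	fmt.Println(r.uniqueSign())
}
```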

type Rule added in v0.1.0

type Rule struct {
	ItemFields []string
	ParseFunc  func(*Context) (ParseResult, error)
}

Rule represents the rule corresponding to the request

type RuleModel added in v0.1.0

type RuleModel struct {
	Name      string `json:"name"`
	ParseFunc string `json:"parse_script"`
}

type RuleTree added in v0.1.0

type RuleTree struct {

	// the entry of crawling rules
	Root func() ([]*Request, error)

	// the hashmap of rules
	// key: rule's name
	// value: the specific rule
	Trunk map[string]*Rule
}
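To make the Root and Trunk roles concrete, the sketch below builds a two-rule tree and walks it: Root seeds the queue, and each request is dispatched to the rule named by its RuleName. The types are trimmed stand-ins, demoTree and run are hypothetical, and fetching is skipped, so this shows only the dispatch idea and is not how the package's engine is actually written.

```go
package main

import "fmt"

// Trimmed stand-ins for the package's types.
type Request struct {
	Url      string
	RuleName string
}

type Context struct {
	Body []byte
	Req  *Request
}

type ParseResult struct {
	Requests []*Request
	Items    []interface{}
}

type Rule struct {
	ItemFields []string
	ParseFunc  func(*Context) (ParseResult, error)
}

type RuleTree struct {
	Root  func() ([]*Request, error)
	Trunk map[string]*Rule
}

// demoTree builds a hypothetical tree: "list" emits one detail request,
// and "detail" emits the request URL as an item.
func demoTree() RuleTree {
	return RuleTree{
		Root: func() ([]*Request, error) {
			return []*Request{{Url: "https://example.com/list", RuleName: "list"}}, nil
		},
		Trunk: map[string]*Rule{
			"list": {ParseFunc: func(c *Context) (ParseResult, error) {
				next := &Request{Url: "https://example.com/detail", RuleName: "detail"}
				return ParseResult{Requests: []*Request{next}}, nil
			}},
			"detail": {
				ItemFields: []string{"url"},
				ParseFunc: func(c *Context) (ParseResult, error) {
					return ParseResult{Items: []interface{}{c.Req.Url}}, nil
				},
			},
		},
	}
}

// run walks the tree breadth-first and collects every emitted item.
func run(tree RuleTree) ([]interface{}, error) {
	queue, err := tree.Root()
	if err != nil {
		return nil, err
	}
	var items []interface{}
	for len(queue) > 0 {
		req := queue[0]
		queue = queue[1:]
		rule, ok := tree.Trunk[req.RuleName]
		if !ok || rule.ParseFunc == nil {
			return nil, fmt.Errorf("no rule named %q", req.RuleName)
		}
		// A real engine would fetch req.Url here and fill Context.Body.
		res, err := rule.ParseFunc(&Context{Req: req})
		if err != nil {
			return nil, err
		}
		queue = append(queue, res.Requests...)
		items = append(items, res.Items...)
	}
	return items, nil
}

func main() {
	items, err := run(demoTree())
	fmt.Println(items, err)
}
```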

type Task added in v0.0.9

type Task struct {
	Property

	Visited     map[string]bool
	VisitedLock sync.Mutex

	RootReq *Request
	Fetcher Fetcher
	Rule    RuleTree

	Logger  *log.Logger
	Storage collector.Store
	Limiter limiter.MultiLimiter
}

Task represents a complete crawl task, including its properties, rules, fetcher, storage, and rate limiter
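The Visited map sits next to a VisitedLock mutex, which suggests the map is shared between goroutines. A minimal sketch of that pattern follows. visitedSet is a hypothetical stand-in for the two fields, and markIfNew and countFirstVisits are illustrative names, not package API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// visitedSet is a hypothetical stand-in for Task.Visited and Task.VisitedLock.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// markIfNew records sign and reports whether this call was the first to see it.
// Checking and setting under one lock keeps two workers from both claiming
// the same request.
func (v *visitedSet) markIfNew(sign string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[sign] {
		return false
	}
	v.seen[sign] = true
	return true
}

// countFirstVisits lets many goroutines race on the same signature and
// counts how many of them were told it was new.
func countFirstVisits(workers int, sign string) int {
	v := &visitedSet{seen: make(map[string]bool)}
	var wg sync.WaitGroup
	var first int32
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if v.markIfNew(sign) {
				atomic.AddInt32(&first, 1)
			}
		}()
	}
	wg.Wait()
	return int(first)
}

func main() {
	fmt.Println(countFirstVisits(100, "example-signature"))
}
```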

type TaskModel added in v0.1.0

type TaskModel struct {
	Property
	Root  string      `json:"root_script"`
	Rules []RuleModel `json:"rules"`
}

type Temp added in v0.1.1

type Temp struct {
	// contains filtered or unexported fields
}

func (*Temp) Copy added in v0.1.4

func (t *Temp) Copy() *Temp

func (*Temp) Get added in v0.1.1

func (t *Temp) Get(key string) interface{}

func (*Temp) Set added in v0.1.1

func (t *Temp) Set(key string, value interface{}) error
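Temp's fields are unexported, so its internals are not visible here. One structure that would fit the three method signatures is a small map-backed store, sketched below. The data field, the lazy allocation in Set, and the shallow Copy are all assumptions, not the package's code.

```go
package main

import "fmt"

// Temp is a hypothetical reimplementation; the real type hides its fields.
type Temp struct {
	data map[string]interface{}
}

// Get returns the stored value, or nil when the key is absent.
// Reading from a nil map is safe in Go.
func (t *Temp) Get(key string) interface{} {
	return t.data[key]
}

// Set stores value under key, allocating the map on first use.
// It always returns nil here; the error return mirrors the documented signature.
func (t *Temp) Set(key string, value interface{}) error {
	if t.data == nil {
		t.data = make(map[string]interface{})
	}
	t.data[key] = value
	return nil
}

// Copy returns a shallow copy: the map is duplicated, the values are shared.
func (t *Temp) Copy() *Temp {
	c := &Temp{data: make(map[string]interface{}, len(t.data))}
	for k, v := range t.data {
		c.data[k] = v
	}
	return c
}

func main() {
	parent := &Temp{}
	_ = parent.Set("category", "books")

	// Copying before handing data to a child request keeps siblings from
	// overwriting each other's values.
	child := parent.Copy()
	_ = child.Set("category", "music")

	fmt.Println(parent.Get("category"), child.Get("category"))
}
```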
