spider

package
Version: v0.0.0-...-2d91a95

This package is not in the latest version of its module.

Published: Sep 7, 2017 | License: Apache-2.0 | Imports: 25 | Imported by: 0

Documentation

Constants

const (
	KeyIn      = util.UseKeyIn // to use Spider.KeyIn, set its initial value to KeyIn in the rule
	Limit      = math.MaxInt64 // to apply a custom limit in the rule, the initial value of Spider.Limit must be Limit
	ForcedStop = "- Take the initiative to terminate Spider -"
)
const (
	// alarm clock
	A = iota
	// countdown
	T
)

Variables

var Species = &SpiderSpecies{
	list: []*Spider{},
	hash: map[string]*Spider{},
}

Species is the global instance of the spider species list.

Functions

func PutContext

func PutContext(ctx *Context)

Types

type Bell

type Bell struct {
	Hour int
	Min  int
	Sec  int
}

type Clock

type Clock struct {
	// contains filtered or unexported fields
}

type Context

type Context struct {
	Request  *request.Request // original request
	Response *http.Response   // response stream; its URL is copied from *request.Request

	sync.Mutex
	// contains filtered or unexported fields
}

Context is a struct ...

func GetContext

func GetContext(sp *Spider, req *request.Request) *Context

func (*Context) AddQueue

func (context *Context) AddQueue(req *request.Request) *Context

AddQueue generates a request and adds it to the queue. Request.URL and Request.Rule must be set. Request.Spider does not need to be set manually; the system sets it automatically. Request.EnableCookie is controlled by the Spider field of the same name, and setting it on a request inside a rule has no effect. The Referer is filled in automatically by default.

The following fields have default values and may be left unset:

Request.Method defaults to GET.
Request.DialTimeout defaults to the constant request.DefaultDialTimeout; a value less than 0 means the wait time is not limited.
Request.ConnTimeout defaults to the constant request.DefaultConnTimeout; a value less than 0 means the download time is not limited.
Request.TryTimes defaults to the constant request.DefaultTryTimes; a value less than 0 means the number of retries after failure is not limited.
Request.RedirectTimes does not limit the number of redirects by default; a value less than 0 prohibits redirects.
Request.RetryPause defaults to the constant request.DefaultRetryPause.
Request.DownloaderID selects the downloader: 0 is the default Surf downloader (high concurrency, full functionality); 1 is the PhantomJS downloader (stronger at getting past site defenses, but slow and low concurrency).
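The default-filling behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the package's code: the Request type here is a local stand-in with only a few of the fields, fillDefaults is a hypothetical helper, and the default values are placeholders because the real constants in the request package are not shown on this page. It also assumes that a zero value means "not set" and that negative values are left untouched to mean "unlimited".

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholder defaults (assumptions); the real values are defined
// in the request package and are not listed in this documentation.
const (
	defaultDialTimeout = 2 * time.Minute
	defaultConnTimeout = 2 * time.Minute
	defaultTryTimes    = 3
	defaultRetryPause  = 2 * time.Second
)

// Request is a local stand-in holding only the fields discussed above.
type Request struct {
	URL         string
	Rule        string
	Method      string
	DialTimeout time.Duration
	ConnTimeout time.Duration
	TryTimes    int
	RetryPause  time.Duration
}

// fillDefaults (hypothetical) rejects requests without URL or Rule and
// fills unset (zero) fields. Negative values are kept as "no limit".
func fillDefaults(r *Request) error {
	if r.URL == "" || r.Rule == "" {
		return errors.New("Request.URL and Request.Rule must be set")
	}
	if r.Method == "" {
		r.Method = "GET"
	}
	if r.DialTimeout == 0 {
		r.DialTimeout = defaultDialTimeout
	}
	if r.ConnTimeout == 0 {
		r.ConnTimeout = defaultConnTimeout
	}
	if r.TryTimes == 0 {
		r.TryTimes = defaultTryTimes
	}
	if r.RetryPause == 0 {
		r.RetryPause = defaultRetryPause
	}
	return nil
}

func main() {
	// TryTimes is negative, so it stays negative (unlimited retries).
	r := &Request{URL: "http://example.com", Rule: "list", TryTimes: -1}
	if err := fillDefaults(r); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.Method, r.TryTimes, r.DialTimeout)
}
```

In a real rule, only URL and Rule would typically be supplied to AddQueue, leaving the rest to the package's own defaults.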

func (*Context) Aid

func (context *Context) Aid(aid map[string]interface{}, ruleName ...string) interface{}

Aid calls the helper function AidFunc of a rule. ruleName selects which rule's AidFunc to call; when it is empty, the current rule is used.

func (*Context) CopyRequest

func (context *Context) CopyRequest() *request.Request

Get a copy of the original request.

func (*Context) CopyTemps

func (context *Context) CopyTemps() request.Temp

Get a copy of the request's cached data.

func (*Context) CreatItem

func (context *Context) CreatItem(item map[int]interface{}, ruleName ...string) map[string]interface{}

CreatItem generates a text result. ruleName selects which rule's ItemFields to match against; when it is empty, the current rule is used.
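As a rough illustration of how an index-keyed item can be turned into a name-keyed result using a rule's ItemFields, consider the sketch below. The createItem helper is hypothetical and works on plain slices and maps rather than on a Context. Skipping indexes that have no matching field is an assumption of this sketch, not documented behavior.

```go
package main

import "fmt"

// createItem (hypothetical) maps each index in item to the field name
// at the same position in itemFields. Indexes outside the field list
// are skipped, which is an assumption made for this sketch.
func createItem(itemFields []string, item map[int]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(item))
	for idx, value := range item {
		if idx < 0 || idx >= len(itemFields) {
			continue
		}
		out[itemFields[idx]] = value
	}
	return out
}

func main() {
	fields := []string{"title", "price"}
	result := createItem(fields, map[int]interface{}{0: "Go in Action", 1: 39.9})
	fmt.Println(result["title"], result["price"])
}
```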

func (*Context) FileOutput

func (context *Context) FileOutput(name ...string)

FileOutput ... name specifies the file name; when it is empty, the original file name is kept unchanged.

func (*Context) GetCookie

func (context *Context) GetCookie() string

Get the cookie of the response.

func (*Context) GetDom

func (context *Context) GetDom() *goquery.Document

GetDom returns the goquery document bound to the crawled result.

func (*Context) GetError

func (context *Context) GetError() error

Get download error.

func (*Context) GetHeader

func (context *Context) GetHeader() http.Header

Get the response header information.

func (*Context) GetHost

func (context *Context) GetHost() string

func (*Context) GetItemField

func (context *Context) GetItemField(index int, ruleName ...string) (field string)

Get the result field name by index; an empty string is returned when it does not exist. If ruleName is empty, the current rule is used.

func (*Context) GetItemFieldIndex

func (context *Context) GetItemFieldIndex(field string, ruleName ...string) (index int)

Get the index of a result field name; the index is -1 when the field does not exist. If ruleName is empty, the current rule is used.

func (*Context) GetItemFields

func (context *Context) GetItemFields(ruleName ...string) []string

Get the list of result field names.

func (*Context) GetKeyIn

func (context *Context) GetKeyIn() string

Get the custom configuration.

func (*Context) GetLimit

func (context *Context) GetLimit() int

Get the acquisition limit.

func (*Context) GetMethod

func (context *Context) GetMethod() string

func (*Context) GetName

func (context *Context) GetName() string

Get the spider name.

func (*Context) GetReferer

func (context *Context) GetReferer() string

func (*Context) GetRequest

func (context *Context) GetRequest() *request.Request

Get the original request.

func (*Context) GetRequestHeader

func (context *Context) GetRequestHeader() http.Header

Get the request header information.

func (*Context) GetResponse

func (context *Context) GetResponse() *http.Response

Get the response stream.

func (*Context) GetRule

func (context *Context) GetRule(ruleName string) (*Rule, bool)

Get the specified rule.

func (*Context) GetRuleName

func (context *Context) GetRuleName() string

Get the current rule name.

func (*Context) GetRules

func (context *Context) GetRules() map[string]*Rule

Get the rule tree.

func (*Context) GetSpider

func (context *Context) GetSpider() *Spider

Get the spider.

func (*Context) GetStatusCode

func (context *Context) GetStatusCode() int

Get the response status code.

func (*Context) GetTemp

func (context *Context) GetTemp(key string, defaultValue interface{}) interface{}

Get the temporary cache data stored in the request under key. defaultValue must not be interface{}(nil).

func (*Context) GetTemps

func (context *Context) GetTemps() request.Temp

Get all the cached data in the request

func (*Context) GetText

func (context *Context) GetText() string

GetText returns the crawled content as a plain string.

func (*Context) GetURL

func (context *Context) GetURL() string

Get the URL from the original request. This guarantees that the URL is exactly the same before and after the request and that Chinese characters are not encoded.

func (*Context) JsAddQueue

func (context *Context) JsAddQueue(jreq map[string]interface{}) *Context

JsAddQueue adds a request to the queue; it is intended for dynamic rules.

func (*Context) Log

func (*Context) Log() logs.Logs

Get the log interface instance.

func (*Context) Output

func (context *Context) Output(item interface{}, ruleName ...string)

Output outputs a text result. When item is a map[int]interface{}, it is output according to the existing ItemFields of ruleName. When item is a map[string]interface{}, any fields missing from the ItemFields of ruleName are added automatically. When ruleName is empty, the current rule is used.

func (*Context) Parse

func (context *Context) Parse(ruleName ...string) *Context

Parse parses the response stream. ruleName selects which rule's ParseFunc to call; by default Root() is called.
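The dispatch described above (a named rule's ParseFunc, or Root when no name is given) can be sketched as follows. The Context, Rule and RuleTree types here are simplified local stand-ins that only mirror the shape of the documented types, and parse and newTree are hypothetical helpers. Returning false for an unknown rule is an assumption of this sketch.

```go
package main

import "fmt"

// Simplified local stand-ins for the documented types.
type Context struct{ visited []string }

type Rule struct{ ParseFunc func(*Context) }

type RuleTree struct {
	Root  func(*Context)   // entry point
	Trunk map[string]*Rule // rules by name
}

// parse (hypothetical) runs the named rule's ParseFunc, or Root when no
// rule name is given. It reports false when the named rule is missing.
func parse(tree *RuleTree, ctx *Context, ruleName ...string) bool {
	if len(ruleName) == 0 || ruleName[0] == "" {
		tree.Root(ctx)
		return true
	}
	rule, ok := tree.Trunk[ruleName[0]]
	if !ok || rule.ParseFunc == nil {
		return false
	}
	rule.ParseFunc(ctx)
	return true
}

// newTree builds a two-step tree: Root hands over to the "list" rule.
func newTree() *RuleTree {
	tree := &RuleTree{Trunk: map[string]*Rule{}}
	tree.Root = func(ctx *Context) {
		ctx.visited = append(ctx.visited, "root")
		parse(tree, ctx, "list")
	}
	tree.Trunk["list"] = &Rule{ParseFunc: func(ctx *Context) {
		ctx.visited = append(ctx.visited, "list")
	}}
	return tree
}

func main() {
	ctx := &Context{}
	parse(newTree(), ctx)
	fmt.Println(ctx.visited)
}
```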

func (*Context) PullFiles

func (context *Context) PullFiles() (fs []data.FileCell)

func (*Context) PullItems

func (context *Context) PullItems() (ds []data.DataCell)

func (*Context) ResetText

func (context *Context) ResetText(body string) *Context

ResetText resets the downloaded text content.

func (*Context) RunTimer

func (context *Context) RunTimer(id string) bool

Start the timer and report whether the timer can continue to be used.

func (*Context) SetError

func (context *Context) SetError(err error)

mark the download error.

func (*Context) SetKeyIn

func (context *Context) SetKeyIn(keyIn string) *Context

set the custom configuration.

func (*Context) SetLimit

func (context *Context) SetLimit(max int) *Context

set the acquisition limit.

func (*Context) SetPausetime

func (context *Context) SetPausetime(pause int64, runtime ...bool) *Context

Set a custom pause interval (random: Pausetime/2 ~ Pausetime*2), which takes priority over the value passed in externally. An existing value is overwritten if and only if runtime[0] is true.

func (*Context) SetReferer

func (context *Context) SetReferer(referer string) *Context

func (*Context) SetResponse

func (context *Context) SetResponse(resp *http.Response) *Context

func (*Context) SetTemp

func (context *Context) SetTemp(key string, value interface{}) *Context

Save the temporary data in the request.

func (*Context) SetTimer

func (context *Context) SetTimer(id string, tol time.Duration, bell *Bell) bool

Set the timer. id is the unique identifier of the timer. When bell == nil the timer is a countdown and tol is the sleep duration. When bell != nil the timer is an alarm, and tol specifies which occurrence of the bell to wake at, counting the bells encountered from now.

func (*Context) SetURL

func (context *Context) SetURL(url string) *Context

func (*Context) UpsertItemField

func (context *Context) UpsertItemField(field string, ruleName ...string) (index int)

Dynamically append a result field name to the specified rule and get its index; if the field already exists, its original index is returned. If ruleName is empty, the current rule is used.
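The lookup and upsert semantics of the item field methods (empty string or -1 when missing, original index when the field already exists) can be sketched on a plain field list. The Rule type below is a local stand-in with only ItemFields, and the three helper functions are hypothetical. The sketch ignores any locking the real package may perform.

```go
package main

import "fmt"

// Rule is a local stand-in holding only the result field list.
type Rule struct{ ItemFields []string }

// fieldName returns the field at index, or "" when it does not exist.
func fieldName(rule *Rule, index int) string {
	if index < 0 || index >= len(rule.ItemFields) {
		return ""
	}
	return rule.ItemFields[index]
}

// fieldIndex returns the index of field, or -1 when it does not exist.
func fieldIndex(rule *Rule, field string) int {
	for i, f := range rule.ItemFields {
		if f == field {
			return i
		}
	}
	return -1
}

// upsertField appends field when missing and returns its index;
// an existing field keeps its original index.
func upsertField(rule *Rule, field string) int {
	if i := fieldIndex(rule, field); i >= 0 {
		return i
	}
	rule.ItemFields = append(rule.ItemFields, field)
	return len(rule.ItemFields) - 1
}

func main() {
	rule := &Rule{ItemFields: []string{"title"}}
	fmt.Println(upsertField(rule, "price")) // appended
	fmt.Println(upsertField(rule, "title")) // already present
	fmt.Println(fieldIndex(rule, "author"), fieldName(rule, 1))
}
```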

type Rule

type Rule struct {
	ItemFields []string                                           // result field list (optional; declaring it guarantees the field order)
	ParseFunc  func(*Context)                                     // Content parsing function
	AidFunc    func(*Context, map[string]interface{}) interface{} // General helper function
}

Rule is a node of the collection rule tree.

type RuleModel

type RuleModel struct {
	Name      string `xml:"name,attr"`
	ParseFunc string `xml:"ParseFunc>Script"`
	AidFunc   string `xml:"AidFunc>Script"`
}

RuleModel is a struct for ...

type RuleTree

type RuleTree struct {
	Root  func(*Context)   // root node (execute entry)
	Trunk map[string]*Rule // node hash table (execution acquisition process)
}

RuleTree is the collection rule tree.

type Spider

type Spider struct {
	// The following fields are defined by the user
	Name            string                                                       // User interface displays the name (should be guaranteed uniqueness)
	Description     string                                                       // The user interface displays the description
	Pausetime       int64                                                        // random pause interval (50% ~ 200%); if defined directly in the rule, it is not overridden by the value passed from the interface
	Limit           int64                                                        // default limit on the number of requests, 0 means unlimited; if set to Limit in the rule, the rule's custom limit scheme is used
	KeyIn           string                                                       // custom input configuration; set its initial value to KeyIn in the rule before use
	EnableCookie    bool                                                         // whether all requests use cookie records
	NotDefaultField bool                                                         // whether to disable the output of the default field currentLink / parentLink / downloadTime
	Namespace       func(spider *Spider) string                                  // namespace, used for naming output files and paths
	SubNamespace    func(spider *Spider, dataCell map[string]interface{}) string // secondary namespace, used for naming output files and paths; may depend on the specific data content
	RuleTree        *RuleTree                                                    // define a specific collection rule tree
	// contains filtered or unexported fields
}

Spider holds the rules of a spider.

func (*Spider) CanStop

func (spider *Spider) CanStop() bool

func (*Spider) Copy

func (spider *Spider) Copy() *Spider

Return a copy of the spider.

func (*Spider) Defer

func (spider *Spider) Defer()

Perform the finishing work before the task exits.

func (*Spider) DoHistory

func (spider *Spider) DoHistory(req *request.Request, ok bool) bool

Return whether the request was added to the end of the queue as a new failed request.

func (*Spider) GetDescription

func (spider *Spider) GetDescription() string

Get the spider description

func (*Spider) GetEnableCookie

func (spider *Spider) GetEnableCookie() bool

Report whether all requests use cookies.

func (*Spider) GetID

func (spider *Spider) GetID() int

Get spider ID

func (*Spider) GetItemField

func (spider *Spider) GetItemField(rule *Rule, index int) (field string)

Return the result field name at index; an empty string is returned when it does not exist.

func (*Spider) GetItemFieldIndex

func (spider *Spider) GetItemFieldIndex(rule *Rule, field string) (index int)

Return the index of the result field name; the index is -1 when it does not exist.

func (*Spider) GetItemFields

func (spider *Spider) GetItemFields(rule *Rule) []string

Get the list of result field names of the specified rule.

func (*Spider) GetKeyIn

func (spider *Spider) GetKeyIn() string

Get custom configuration information

func (*Spider) GetLimit

func (spider *Spider) GetLimit() int64

Get the acquisition limit. A value < 0 means the scheme that limits the number of requests is used; a value > 0 means the custom limit scheme in the rule is used.

func (*Spider) GetName

func (spider *Spider) GetName() string

Get the spider name

func (*Spider) GetRule

func (spider *Spider) GetRule(ruleName string) (*Rule, bool)

Safely return the specified rule.

func (*Spider) GetRules

func (spider *Spider) GetRules() map[string]*Rule

Return the rule tree.

func (*Spider) GetSubName

func (spider *Spider) GetSubName() string

Get the spider secondary name

func (*Spider) IsStopping

func (spider *Spider) IsStopping() bool

func (*Spider) MustGetRule

func (spider *Spider) MustGetRule(ruleName string) *Rule

returns the specified rule

func (*Spider) OutDefaultField

func (spider *Spider) OutDefaultField() bool

Report whether to output the fields added by default: URL / ParentURL / downloadTime.

func (Spider) Register

func (spider Spider) Register() *Spider

Register adds the spider to the spider menu.

func (*Spider) ReqmatrixInit

func (spider *Spider) ReqmatrixInit() *Spider

func (*Spider) RequestFree

func (spider *Spider) RequestFree()

func (*Spider) RequestLen

func (spider *Spider) RequestLen() int

func (*Spider) RequestPull

func (spider *Spider) RequestPull() *request.Request

func (*Spider) RequestPush

func (spider *Spider) RequestPush(req *request.Request)

func (*Spider) RequestUse

func (spider *Spider) RequestUse()

func (*Spider) RunTimer

func (spider *Spider) RunTimer(id string) bool

Start the timer and report whether the timer can continue to be used.

func (*Spider) SetID

func (spider *Spider) SetID(id int)

set spider ID

func (*Spider) SetKeyIn

func (spider *Spider) SetKeyIn(keyword string)

Set up custom configuration information

func (*Spider) SetLimit

func (spider *Spider) SetLimit(max int64)

Set the acquisition limit. A value < 0 means the scheme that limits the number of requests is used; a value > 0 means the custom limit scheme in the rule is used.

func (*Spider) SetPausetime

func (spider *Spider) SetPausetime(pause int64, runtime ...bool)

Set a custom pause time, pause[0] ~ (pause[0] + pause[1]), which takes priority over the value passed in externally. An existing value is overwritten if and only if runtime[0] is true.

func (*Spider) SetTimer

func (spider *Spider) SetTimer(id string, tol time.Duration, bell *Bell) bool

Set the timer. id is the unique identifier of the timer. When bell == nil the timer is a countdown and tol is the sleep duration. When bell != nil the timer is an alarm, and tol specifies which occurrence of the bell to wake at, counting the bells encountered from now.

func (*Spider) Start

func (spider *Spider) Start()

start the spider

func (*Spider) Stop

func (spider *Spider) Stop()

Stop actively terminates the running spider.

func (*Spider) TryFlushFailure

func (spider *Spider) TryFlushFailure()

func (*Spider) TryFlushSuccess

func (spider *Spider) TryFlushSuccess()

func (*Spider) UpsertItemField

func (spider *Spider) UpsertItemField(rule *Rule, field string) (index int)

Dynamically append a result field name to the specified rule and return its index; if the field already exists, its original index is returned.

type SpiderModel

type SpiderModel struct {
	Name            string      `xml:"Name"`
	Description     string      `xml:"Description"`
	Pausetime       int64       `xml:"Pausetime"`
	EnableLimit     bool        `xml:"EnableLimit"`
	EnableKeyIn     bool        `xml:"EnableKeyIn"`
	EnableCookie    bool        `xml:"EnableCookie"`
	NotDefaultField bool        `xml:"NotDefaultField"`
	Namespace       string      `xml:"Namespace>Script"`
	SubNamespace    string      `xml:"SubNamespace>Script"`
	Root            string      `xml:"Root>Script"`
	Trunk           []RuleModel `xml:"Rule"`
}

SpiderModel is the model used by the rule interpreter.

type SpiderSpecies

type SpiderSpecies struct {
	// contains filtered or unexported fields
}

List of spider species

func (*SpiderSpecies) Add

func (spiderSpecies *SpiderSpecies) Add(sp *Spider) *Spider

Add a new category to the spider list

func (*SpiderSpecies) Get

func (spiderSpecies *SpiderSpecies) Get() []*Spider

Get all spider species

func (*SpiderSpecies) GetByName

func (spiderSpecies *SpiderSpecies) GetByName(name string) *Spider
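The list and hash fields visible in the Species variable suggest an ordered slice paired with a name index. A minimal sketch of such a registry is shown below. The registry type and its methods are hypothetical stand-ins, not the package's implementation, and letting a later spider with the same name replace the earlier one in the index is an assumption of this sketch.

```go
package main

import "fmt"

// Spider is a local stand-in carrying only the name.
type Spider struct{ Name string }

// registry (hypothetical) keeps insertion order in list and a
// name lookup in hash, mirroring the fields shown for Species.
type registry struct {
	list []*Spider
	hash map[string]*Spider
}

func newRegistry() *registry {
	return &registry{list: []*Spider{}, hash: map[string]*Spider{}}
}

// add appends the spider and indexes it by name. A duplicate name
// overwrites the index entry (assumption made for this sketch).
func (r *registry) add(sp *Spider) *Spider {
	r.list = append(r.list, sp)
	r.hash[sp.Name] = sp
	return sp
}

// get returns all spiders in insertion order.
func (r *registry) get() []*Spider { return r.list }

// getByName returns the spider with the given name, or nil.
func (r *registry) getByName(name string) *Spider { return r.hash[name] }

func main() {
	reg := newRegistry()
	reg.add(&Spider{Name: "news"})
	reg.add(&Spider{Name: "shop"})
	fmt.Println(len(reg.get()), reg.getByName("shop").Name, reg.getByName("none") == nil)
}
```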

type Timer

type Timer struct {
	sync.RWMutex
	// contains filtered or unexported fields
}
