goscrapy

package module
v0.0.0-...-75cde0a
Published: Oct 7, 2022 License: Apache-2.0 Imports: 15 Imported by: 0

README

goscrapy


goscrapy is a web crawling and web scraping framework written in Go. Its architecture is similar to Python's Scrapy and is illustrated by the data-flow diagram below:

Data Flow

Easy to use

An example of usage is available here.

goscrapy-cli is a command-line tool that auto-generates code templates for you; see goscrapy-cli for more details.
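A minimal sketch of wiring the pieces together is shown below. The import path, spider, and pipeline are hypothetical placeholders; it only assumes the engine API documented on this page (New, RegisterSipders, RegisterPipelines, Start):

package main

import (
	goscrapy "path/to/goscrapy" // illustrative import path; replace with the real module path
)

func main() {
	// New creates an engine; with no Options it presumably uses the
	// default downloader and FIFO scheduler.
	engine := goscrapy.New()

	// exampleSpider and examplePipeline are hypothetical implementations
	// of the Spider and Pipeline interfaces documented below.
	engine.RegisterSipders(&exampleSpider{})
	engine.RegisterPipelines(&examplePipeline{})

	// Start runs the crawl; Stop shuts the engine down.
	engine.Start()
}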

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Context

type Context struct {
	context.Context
	// contains filtered or unexported fields
}

Context represents the scraping and crawling context

func (*Context) Document

func (ctx *Context) Document() *goquery.Document

Document returns the HTML document

func (*Context) Request

func (ctx *Context) Request() *Request

Request returns the crawling request

func (*Context) Response

func (ctx *Context) Response() *Response

Response returns the downloading response
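A small sketch of reading these from inside a spider (logPage is a hypothetical helper that a Parse implementation might call; the selector is illustrative and fmt is assumed to be imported):

func logPage(ctx *goscrapy.Context) {
	fmt.Println("url:", ctx.Request().URL)                     // the crawling request
	fmt.Println("status:", ctx.Response().Status)              // the downloading response
	fmt.Println("title:", ctx.Document().Find("title").Text()) // the parsed goquery document
}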

type DefaultDownloader

type DefaultDownloader struct {
	// contains filtered or unexported fields
}

DefaultDownloader is a simple Downloader implementation

func (*DefaultDownloader) Download

func (dd *DefaultDownloader) Download(req *Request) (*Response, error)

Download sends the HTTP request and uses goquery to parse the fetched page into an HTML document

func (*DefaultDownloader) SetHTTPClient

func (dd *DefaultDownloader) SetHTTPClient(client *http.Client)

SetHTTPClient sets the HTTP client used to fetch pages

type Downloader

type Downloader interface {
	Download(*Request) (*Response, error)
}

Downloader is an interface representing the ability to download data from the internet. It is responsible for fetching web pages; the resulting response is taken over by the engine and, in turn, fed to spiders.
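As a hedged sketch, a custom Downloader might wrap net/http and goquery like this (error handling kept minimal; the User-Agent value is illustrative, and net/http, io, bytes, and github.com/PuerkitoBio/goquery are assumed to be imported):

type simpleDownloader struct {
	client *http.Client
}

func (d *simpleDownloader) Download(req *goscrapy.Request) (*goscrapy.Response, error) {
	httpReq, err := http.NewRequest(req.Method, req.URL, nil)
	if err != nil {
		return nil, err
	}
	if req.Header != nil {
		httpReq.Header = req.Header
	}
	httpReq.Header.Set("User-Agent", "goscrapy-example/0.1") // illustrative

	httpResp, err := d.client.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer httpResp.Body.Close()

	body, err := io.ReadAll(httpResp.Body)
	if err != nil {
		return nil, err
	}
	doc, err := goquery.NewDocumentFromReader(bytes.NewReader(body))
	if err != nil {
		return nil, err
	}

	return &goscrapy.Response{
		Status:        httpResp.Status,
		StatusCode:    httpResp.StatusCode,
		Request:       req,
		Document:      doc,
		Body:          body,
		ContentLength: httpResp.ContentLength,
		Header:        httpResp.Header,
	}, nil
}

Such a downloader could be handed to the engine with UseDownloader(&simpleDownloader{client: http.DefaultClient}).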

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine represents the scraping engine; it is responsible for managing the data flow among the scheduler, downloader, and spiders.

func New

func New(opts ...Option) *Engine

New creates a new goscrapy engine
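For instance, several of the Options documented below can be combined (the concrete values are illustrative):

engine := goscrapy.New(
	goscrapy.SetConcurrency(8),                             // illustrative worker count
	goscrapy.MaxCrawlingDepth(3),                           // drop requests deeper than 3
	goscrapy.WithDelay(500*time.Millisecond),               // pause between requests
	goscrapy.UseScheduler(goscrapy.NewWeightedScheduler()), // weight-based scheduling
)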

func (*Engine) RegisterPipelines

func (e *Engine) RegisterPipelines(pipelines ...Pipeline)

RegisterPipelines registers pipelines with the engine

func (*Engine) RegisterSipders

func (e *Engine) RegisterSipders(spiders ...Spider)

RegisterSipders adds working spiders to the engine

func (*Engine) Start

func (e *Engine) Start()

Start starts the engine

func (*Engine) Stop

func (e *Engine) Stop()

Stop stops the engine

type FIFOScheduler

type FIFOScheduler struct {
	// contains filtered or unexported fields
}

FIFOScheduler is the default scheduler implementation

func NewFIFOScheduler

func NewFIFOScheduler() *FIFOScheduler

NewFIFOScheduler creates a new FIFO scheduler with a queue size of 100

func (*FIFOScheduler) HasMore

func (ds *FIFOScheduler) HasMore() bool

HasMore returns true if there are more requests to be scheduled

func (*FIFOScheduler) PopRequest

func (ds *FIFOScheduler) PopRequest() (req *Request, hadMore bool)

PopRequest returns the next request

func (*FIFOScheduler) PushRequest

func (ds *FIFOScheduler) PushRequest(req *Request) (ok bool)

PushRequest adds a request to the scheduler

func (*FIFOScheduler) Start

func (ds *FIFOScheduler) Start() error

Start starts the scheduler

func (*FIFOScheduler) Stop

func (ds *FIFOScheduler) Stop() error

Stop stops the scheduler

type Items

type Items struct {
	sync.Map
	// contains filtered or unexported fields
}

Items holds scraped data. It embeds sync.Map and carries a name that pipelines use to decide whether to handle it.

func NewItems

func NewItems(name string) *Items

NewItems creates a new Items with the specified name; a goscrapy pipeline decides whether to handle an item based on the item's name.

func (*Items) Name

func (item *Items) Name() string

Name returns the items' name
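Since Items embeds sync.Map, scraped fields are stored and read with the usual sync.Map methods; the name and keys below are illustrative, and fmt is assumed to be imported:

items := goscrapy.NewItems("article") // pipelines match on this name
items.Store("title", "Example title")
items.Store("url", "http://www.example.com/post/1")

if v, ok := items.Load("title"); ok {
	fmt.Println(items.Name(), "->", v)
}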

type Option

type Option func(e *Engine)

Option is an engine configuration option

func MaxCrawlingDepth

func MaxCrawlingDepth(depth int) Option

MaxCrawlingDepth returns an Option that sets the maximum crawling depth. The engine drops requests whose current depth exceeds the limit.

func SetConcurrency

func SetConcurrency(num int) Option

SetConcurrency returns an Option that sets the engine's concurrency

func UseDownloader

func UseDownloader(d Downloader) Option

UseDownloader returns an Option that sets the downloader

func UseLogger

func UseLogger(lg logger.Logger) Option

UseLogger returns an Option that sets the logger

func UseScheduler

func UseScheduler(s Scheduler) Option

UseScheduler returns an Option that sets the scheduler

func WithDelay

func WithDelay(delay time.Duration) Option

WithDelay returns an Option that sets the duration to wait before handling the next request.

func WithRequestMiddlewares

func WithRequestMiddlewares(middlewares ...RequestHandleFunc) Option

WithRequestMiddlewares registers request middlewares. Requests are processed by request middlewares just before being passed to the downloader.

To abort a request in a middleware, use Request.Abort().

For example:
func ReqMiddleware(req *goscrapy.Request) error {
	if req.URL == "http://www.example.com" {
		req.Abort()
		return nil
	}

	return nil
}

func WithResponseMiddlewares

func WithResponseMiddlewares(middlewares ...ResponseHandleFunc) Option

WithResponseMiddlewares registers response middlewares. Responses are processed by response middlewares right after the downloader finishes downloading and hands the response over to the engine.
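For example, a response middleware could flag unexpected status codes. This is only a sketch; what the engine does with a non-nil error returned from a middleware is not documented here, and net/http and fmt are assumed to be imported:

func CheckStatus(resp *goscrapy.Response) error {
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %q for %s", resp.Status, resp.Request.URL)
	}
	return nil
}

// Registration (illustrative):
// goscrapy.New(goscrapy.WithResponseMiddlewares(CheckStatus))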

type Pipeline

type Pipeline interface {
	Name() string       // returns pipeline's name
	ItemList() []string // returns all items' name that this pipeline cares about
	Handle(items *Items) error
}

Pipeline processes scraped Items. A pipeline declares which item names it cares about via ItemList and handles matching items in Handle.
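A minimal Pipeline sketch that only handles items named "article" (the names and the printed output are illustrative; fmt is assumed to be imported):

type examplePipeline struct{}

func (p *examplePipeline) Name() string { return "example-pipeline" }

// ItemList declares which item names this pipeline cares about.
func (p *examplePipeline) ItemList() []string { return []string{"article"} }

func (p *examplePipeline) Handle(items *goscrapy.Items) error {
	// Items embeds sync.Map, so Range iterates over every stored field.
	items.Range(func(key, value interface{}) bool {
		fmt.Printf("%v = %v\n", key, value)
		return true
	})
	return nil
}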

type Request

type Request struct {
	Method string      `json:"method,omitempty"`
	URL    string      `json:"url,omitempty"`
	Header http.Header `json:"header,omitempty"`
	Query  url.Values  `json:"query,omitempty"`
	// Weight is used to decide the scheduling order. It only matters when using a
	// scheduler that schedules requests based on request weight.
	Weight int
	// contains filtered or unexported fields
}

Request represents a crawling request

func (*Request) Abort

func (r *Request) Abort()

Abort aborts the current request. Use it in a request middleware to ensure that a certain request will not be handled by the downloader and spiders.

func (*Request) ContextValue

func (r *Request) ContextValue(key string) interface{}

ContextValue returns the value associated with this request for key, or nil if no value is associated with key.

func (*Request) IsAborted

func (r *Request) IsAborted() bool

IsAborted returns true if the current request was aborted.

func (*Request) WithContextValue

func (r *Request) WithContextValue(key string, value interface{})

WithContextValue associates the value with the key in the request's context.
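Together with ContextValue, this lets a request middleware hand per-request data to later stages; a sketch with an illustrative key and hypothetical helper names:

// In a request middleware, attach a value to the outgoing request.
func TagCategory(req *goscrapy.Request) error {
	req.WithContextValue("category", "news") // illustrative key and value
	return nil
}

// Elsewhere (e.g. inside a spider's Parse), read it back.
func category(req *goscrapy.Request) string {
	if v := req.ContextValue("category"); v != nil {
		return v.(string)
	}
	return ""
}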

type RequestHandleFunc

type RequestHandleFunc func(*Request) error

RequestHandleFunc is a request handler function used as a request middleware

type Response

type Response struct {
	Status     string `json:"status,omitempty"`      // e.g. "200 OK"
	StatusCode int    `json:"status_code,omitempty"` // e.g. 200
	// Request represents the request that was sent to obtain this response.
	Request *Request `json:"request,omitempty"`
	// Document represents an HTML document to be manipulated.
	Document *goquery.Document `json:"-"`
	// Body represents the response body.
	Body []byte `json:"-"`
	// ContentLength records the length of the associated content. See http.Response for more details.
	ContentLength int64 `json:"content_length,omitempty"`
	// Header represents response header, maps header keys to values.
	Header http.Header `json:"header,omitempty"`
}

Response represents a crawling response

type ResponseHandleFunc

type ResponseHandleFunc func(*Response) error

ResponseHandleFunc is a response handler function used as a response middleware

type Scheduler

type Scheduler interface {
	Start() error                             // start scheduler
	Stop() error                              // stop scheduler
	PopRequest() (req *Request, hasMore bool) // return next request from scheduler
	PushRequest(req *Request) (ok bool)       // add new request into scheduler
	HasMore() bool                            // returns true if there are more request to be scheduled
}

Scheduler is responsible for managing all scraping and crawling requests

type Spider

type Spider interface {
	Name() string
	StartRequests() []*Request
	URLMatcher() URLMatcher
	Parse(ctx *Context) (*Items, []*Request, error)
}

Spider is an interface that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites). For spiders, the scraping cycle goes through something like this:

  1. Using initial Requests generated by StartRequests to crawl the first URLs.
  2. Parsing the response (web page), then returning an Items object (structured data) and follow-up Request objects. Those requests are added to the scheduler by the goscrapy engine and downloaded by the downloader later; a sketch of a minimal spider follows below.
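Below is a hedged sketch of a minimal spider. The site, item name, and selectors are illustrative, and net/http and github.com/PuerkitoBio/goquery are assumed to be imported:

type exampleSpider struct{}

func (s *exampleSpider) Name() string { return "example" }

// StartRequests seeds the crawl with the first URLs.
func (s *exampleSpider) StartRequests() []*goscrapy.Request {
	return []*goscrapy.Request{
		{Method: http.MethodGet, URL: "http://www.example.com"},
	}
}

// URLMatcher decides which downloaded pages this spider parses.
func (s *exampleSpider) URLMatcher() goscrapy.URLMatcher {
	return goscrapy.NewRegexpMatcher(`^http://www\.example\.com`)
}

// Parse extracts structured data and returns follow-up requests.
func (s *exampleSpider) Parse(ctx *goscrapy.Context) (*goscrapy.Items, []*goscrapy.Request, error) {
	items := goscrapy.NewItems("article") // illustrative item name
	items.Store("title", ctx.Document().Find("title").Text())

	// Follow every link on the page; the engine schedules these requests.
	var next []*goscrapy.Request
	ctx.Document().Find("a[href]").Each(func(_ int, sel *goquery.Selection) {
		if href, ok := sel.Attr("href"); ok {
			next = append(next, &goscrapy.Request{Method: http.MethodGet, URL: href})
		}
	})
	return items, next, nil
}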

type StringMatcher

type StringMatcher struct {
	// contains filtered or unexported fields
}

StringMatcher is a static string URL matcher

func NewStaticStringMatcher

func NewStaticStringMatcher(str string) *StringMatcher

NewStaticStringMatcher creates a new static string matcher

func (*StringMatcher) Match

func (m *StringMatcher) Match(url string) bool

Match returns true if url is matched

type URLMatcher

type URLMatcher interface {
	// Match returns true if url is matched
	Match(url string) bool
}

URLMatcher reports whether a URL matches; each Spider exposes one via its URLMatcher method

type URLRegExpMatcher

type URLRegExpMatcher struct {
	// contains filtered or unexported fields
}

URLRegExpMatcher is a regular-expression URL matcher

func NewRegexpMatcher

func NewRegexpMatcher(str string) *URLRegExpMatcher

NewRegexpMatcher creates a new regular-expression URL matcher

func (*URLRegExpMatcher) Match

func (m *URLRegExpMatcher) Match(url string) bool

Match returns true if url is matched

type WeightedScheduler

type WeightedScheduler struct {
	// contains filtered or unexported fields
}

WeightedScheduler is a scheduler that schedules requests according to their Weight

func NewWeightedScheduler

func NewWeightedScheduler() *WeightedScheduler

NewWeightedScheduler creates a new weighted scheduler, implemented on top of a max-heap.
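A sketch of the expected behavior: higher-weight requests should be popped first since the scheduler is a max-heap. Whether Start must be called before pushing is an assumption, the URLs and weights are illustrative, and fmt is assumed to be imported:

sched := goscrapy.NewWeightedScheduler()
sched.Start()

sched.PushRequest(&goscrapy.Request{URL: "http://www.example.com/archive", Weight: 1})
sched.PushRequest(&goscrapy.Request{URL: "http://www.example.com/hot", Weight: 10})

if req, ok := sched.PopRequest(); ok {
	fmt.Println(req.URL, req.Weight) // expected: the higher-weight request
}
sched.Stop()

When used with the engine, pass the scheduler in via UseScheduler and set Weight on the requests your spiders return.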

func (*WeightedScheduler) HasMore

func (sched *WeightedScheduler) HasMore() bool

HasMore returns true if the queue has more requests

func (*WeightedScheduler) Len

func (sched *WeightedScheduler) Len() int

Len returns the number of elements in the collection.

func (*WeightedScheduler) Less

func (sched *WeightedScheduler) Less(i, j int) bool

Less reports whether the element with index i should sort before the element with index j.

func (*WeightedScheduler) Pop

func (sched *WeightedScheduler) Pop() interface{}

Pop removes and returns the element at Len() - 1. It exists to implement heap.Interface; to pop a request from the scheduler, use PopRequest instead.

func (*WeightedScheduler) PopRequest

func (sched *WeightedScheduler) PopRequest() (req *Request, ok bool)

PopRequest pops the next request from the scheduler

func (*WeightedScheduler) Push

func (sched *WeightedScheduler) Push(x interface{})

Push pushes a value onto the heap. It exists to implement heap.Interface; to push a request onto the scheduler, use PushRequest instead.

func (*WeightedScheduler) PushRequest

func (sched *WeightedScheduler) PushRequest(req *Request) (ok bool)

PushRequest pushes a request onto the scheduler

func (*WeightedScheduler) Start

func (sched *WeightedScheduler) Start() error

Start starts the scheduler

func (*WeightedScheduler) Stop

func (sched *WeightedScheduler) Stop() error

Stop stops the scheduler

func (*WeightedScheduler) Swap

func (sched *WeightedScheduler) Swap(i, j int)

Swap swaps the elements with indexes i and j.

Directories

Path Synopsis
cmd
pkg
