Documentation ¶
Index ¶
- Variables
- func ParseHTML(resp *http.Response) (*html.Node, error)
- func ParseJSON(resp *http.Response) (*jsonquery.Node, error)
- func ParseXML(resp *http.Response) (*xmlquery.Node, error)
- type Crawler
- func (c *Crawler) Crawl(req *http.Request) error
- func (c *Crawler) EnqueueURL(URL string) error
- func (c *Crawler) Handle(pattern string, handler Handler)
- func (c *Crawler) Handler(res *http.Response) (h Handler, pattern string)
- func (c *Crawler) StartURLs(URLs []string)
- func (c *Crawler) UseCompression() *Crawler
- func (c *Crawler) UseCookies() *Crawler
- func (c *Crawler) UseMiddleware(m ...Middleware) *Crawler
- func (c *Crawler) UsePipeline(p ...Pipeline) *Crawler
- func (c *Crawler) UseProxy(proxyURL *url.URL) *Crawler
- func (c *Crawler) UseRobotstxt() *Crawler
- type Handler
- type HandlerFunc
- type HttpMessageHandler
- type HttpMessageHandlerFunc
- type Item
- type Logger
- type MediaType
- type Middleware
- type Pipeline
- type PipelineHandler
- type PipelineHandlerFunc
- type ProxyKey
Constants ¶
This section is empty.
Variables ¶
var NilLogger nilLogger
NilLogger is a Logger that does not log any messages.
Functions ¶
Types ¶
type Crawler ¶
type Crawler struct {
	// CheckRedirect specifies the policy for handling redirects.
	CheckRedirect func(req *http.Request, via []*http.Request) error

	// MaxConcurrentRequests specifies the maximum number of concurrent
	// requests that will be performed.
	// Default is 16.
	MaxConcurrentRequests int

	// MaxConcurrentRequestsPerSite specifies the maximum number of
	// concurrent requests that will be performed to any single domain.
	// Default is 1.
	MaxConcurrentRequestsPerSite int

	// RequestTimeout specifies the time to wait before an HTTP request times out.
	// Default is 30s.
	RequestTimeout time.Duration

	// DownloadDelay specifies the delay to wait before accessing the same
	// website again.
	// Default is 0.25s.
	DownloadDelay time.Duration

	// MaxConcurrentItems specifies the maximum number of items
	// to process in parallel in the pipeline.
	// Default is 32.
	MaxConcurrentItems int

	// UserAgent specifies the user-agent sent to the remote server.
	UserAgent string

	// ErrorLog specifies an optional logger for errors from HTTP transports
	// and unexpected behavior from handlers.
	// If nil, logging goes to os.Stderr via the log package's
	// standard logger.
	ErrorLog Logger

	// Exit is an optional channel whose closure indicates that the Crawler
	// instance should stop work and exit.
	Exit <-chan struct{}

	// contains filtered or unexported fields
}
Crawler is the core of the web crawl server: it crawls websites and calls the pipeline to process the data received from their pages.
func NewCrawler ¶
func NewCrawler() *Crawler
NewCrawler returns a new Crawler with default settings.
func (*Crawler) EnqueueURL ¶
EnqueueURL puts the given URL into the backup URL queue.
func (*Crawler) Handle ¶
Handle registers the Handler for the given pattern. If the pattern is "*", the Handler matches every request that no other pattern matches.
func (*Crawler) UseCompression ¶
UseCompression enables the HTTP compression middleware, which supports gzip and deflate for HTTP requests and responses.
func (*Crawler) UseCookies ¶
UseCookies enables the cookies middleware.
func (*Crawler) UseMiddleware ¶
func (c *Crawler) UseMiddleware(m ...Middleware) *Crawler
UseMiddleware adds a Middleware to the crawler.
func (*Crawler) UsePipeline ¶
UsePipeline adds a Pipeline to the crawler.
func (*Crawler) UseRobotstxt ¶
UseRobotstxt enables robots.txt support.
type Handler ¶
Handler is the HTTP response handler interface that defines how to extract scraped items from pages.
ServeSpider should write each extracted Item to the channel.
func VoidHandler ¶
func VoidHandler() Handler
VoidHandler returns a Handler that does nothing.
type HandlerFunc ¶
HandlerFunc is an adapter to allow the use of ordinary functions as a Handler.
func (HandlerFunc) ServeSpider ¶
func (f HandlerFunc) ServeSpider(c chan<- Item, resp *http.Response)
ServeSpider extracts data from the received HTTP response and writes it to the channel c.
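The adapter pattern described above can be sketched without importing the package. This is a minimal, self-contained illustration, not the package's source: the local Item, Handler, and HandlerFunc types are stand-ins that mirror the documented ServeSpider signature, the definition of Item is an assumption (it is not shown in this section), and bodyLength and fakeResponse are hypothetical helpers.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Item is a stand-in for the package's Item type, whose definition is not
// shown in this section (assumption: it can hold any value).
type Item interface{}

// Handler mirrors the documented ServeSpider method.
type Handler interface {
	ServeSpider(c chan<- Item, resp *http.Response)
}

// HandlerFunc adapts an ordinary function to the Handler interface.
type HandlerFunc func(c chan<- Item, resp *http.Response)

// ServeSpider calls f(c, resp).
func (f HandlerFunc) ServeSpider(c chan<- Item, resp *http.Response) { f(c, resp) }

// bodyLength is a hypothetical handler: it reads the response body and
// emits its length in bytes as a single Item.
func bodyLength(c chan<- Item, resp *http.Response) {
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return
	}
	c <- len(b)
}

// fakeResponse builds an in-memory response so the sketch needs no network.
func fakeResponse(body string) *http.Response {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	var h Handler = HandlerFunc(bodyLength)
	items := make(chan Item, 1) // buffered so the handler does not block
	h.ServeSpider(items, fakeResponse("<html></html>"))
	fmt.Println(<-items)
}
```

With the real package, a function of the same shape would presumably be registered through Handle after conversion to HandlerFunc.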
type HttpMessageHandler ¶
HttpMessageHandler is an interface that receives an HTTP request and returns an HTTP response.
type HttpMessageHandlerFunc ¶
HttpMessageHandlerFunc is an adapter to allow the use of ordinary functions as HttpMessageHandler.
type MediaType ¶
type MediaType struct {
	// Type is the HTTP content type, such as
	// "text/html" or "image/jpeg".
	Type string

	// Charset is the character encoding of the HTTP content.
	Charset string
}
MediaType describes the content type of an HTTP request or HTTP response.
func ParseMediaType ¶
ParseMediaType parses the specified string v into a MediaType struct.
func (MediaType) ContentType ¶
ContentType returns the HTTP header content-type value.
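To illustrate how a Content-Type header value splits into a type and a charset and is then rebuilt, here is a minimal sketch based on the standard library's mime package. It does not call the package's own ParseMediaType or ContentType, whose signatures are not shown in this section; the helpers parse and header are hypothetical, and the local MediaType only mirrors the documented fields.

```go
package main

import (
	"fmt"
	"mime"
)

// MediaType mirrors the struct documented above.
type MediaType struct {
	Type    string
	Charset string
}

// parse is a hypothetical helper built on the standard library's
// mime.ParseMediaType; it is not the package's own ParseMediaType.
func parse(v string) (MediaType, error) {
	t, params, err := mime.ParseMediaType(v)
	if err != nil {
		return MediaType{}, err
	}
	return MediaType{Type: t, Charset: params["charset"]}, nil
}

// header rebuilds a Content-Type header value from m.
func header(m MediaType) string {
	if m.Charset == "" {
		return m.Type
	}
	return mime.FormatMediaType(m.Type, map[string]string{"charset": m.Charset})
}

func main() {
	m, err := parse("text/html; charset=utf-8")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(m.Type, m.Charset)
	fmt.Println(header(m))
}
```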
type Middleware ¶
type Middleware func(HttpMessageHandler) HttpMessageHandler
Middleware is the HTTP message transport middle layer: it passes an HTTP request from one message handler to the next until an HTTP response is returned.
func CompressionMiddleware ¶
func CompressionMiddleware() Middleware
CompressionMiddleware is a middleware that allows compressed (gzip, deflate) traffic to be sent to and received from sites.
func CookiesMiddleware ¶
func CookiesMiddleware() Middleware
CookiesMiddleware is an HTTP cookies middleware that tracks cookies across HTTP requests.
func ProxyMiddleware ¶
ProxyMiddleware is an HTTP proxy middleware that sends HTTP requests to remote sites through a proxy.
ProxyMiddleware supports the HTTP, HTTPS, and SOCKS5 protocols, e.g. http://127.0.0.1:8080, https://127.0.0.1:8080, or socks5://127.0.0.1:1080.
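Since UseProxy takes a *url.URL, it can help to validate a proxy address before handing it over. The sketch below uses only the standard library; checkProxy and the supported map are hypothetical names, and the set of accepted schemes simply follows the protocols listed above. The package may perform its own validation that differs from this.

```go
package main

import (
	"fmt"
	"net/url"
)

// supported lists the proxy schemes named in the text above.
var supported = map[string]bool{"http": true, "https": true, "socks5": true}

// checkProxy is a hypothetical helper: it parses raw and reports an error
// if the URL has no host or uses a scheme outside the supported set.
func checkProxy(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if !supported[u.Scheme] || u.Host == "" {
		return nil, fmt.Errorf("unsupported proxy URL %q", raw)
	}
	return u, nil
}

func main() {
	for _, raw := range []string{
		"http://127.0.0.1:8080",
		"https://127.0.0.1:8080",
		"socks5://127.0.0.1:1080",
		"ftp://127.0.0.1:21",
	} {
		u, err := checkProxy(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", u.Scheme, u.Host)
	}
}
```

A URL that passes this check has the type that the UseProxy method in the index expects.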
func RobotstxtMiddleware ¶
func RobotstxtMiddleware() Middleware
RobotstxtMiddleware is a middleware for robots.txt that makes HTTP requests more polite.
type Pipeline ¶
type Pipeline func(PipelineHandler) PipelineHandler
Pipeline passes an Item from one PipelineHandler to the next PipelineHandler in the chain.
type PipelineHandler ¶
type PipelineHandler interface {
ServePipeline(Item)
}
PipelineHandler is an interface for a handler in the pipeline.
type PipelineHandlerFunc ¶
type PipelineHandlerFunc func(Item)
PipelineHandlerFunc is an adapter to allow the use of ordinary functions as PipelineHandler.
func (PipelineHandlerFunc) ServePipeline ¶
func (f PipelineHandlerFunc) ServePipeline(v Item)
ServePipeline processes the given Item.
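The Pipeline, PipelineHandler, and PipelineHandlerFunc types compose in the same way as HTTP middleware. The following sketch redeclares them locally so that it runs on its own; the definition of Item is an assumption (it is not shown in this section), and trimStrings, dropEmpty, and build are hypothetical names. How the crawler itself orders the stages passed to UsePipeline is not stated here, so the ordering in build is only one possibility.

```go
package main

import (
	"fmt"
	"strings"
)

// Item is a stand-in for the package's Item type (assumption: any value).
type Item interface{}

// PipelineHandler, PipelineHandlerFunc, and Pipeline mirror the
// declarations documented above.
type PipelineHandler interface {
	ServePipeline(Item)
}

type PipelineHandlerFunc func(Item)

// ServePipeline calls f(v).
func (f PipelineHandlerFunc) ServePipeline(v Item) { f(v) }

type Pipeline func(PipelineHandler) PipelineHandler

// trimStrings is a hypothetical stage that trims whitespace from string
// items and forwards every item to the next handler.
func trimStrings(next PipelineHandler) PipelineHandler {
	return PipelineHandlerFunc(func(v Item) {
		if s, ok := v.(string); ok {
			v = strings.TrimSpace(s)
		}
		next.ServePipeline(v)
	})
}

// dropEmpty is a hypothetical stage that stops empty strings from
// travelling further down the chain.
func dropEmpty(next PipelineHandler) PipelineHandler {
	return PipelineHandlerFunc(func(v Item) {
		if s, ok := v.(string); ok && s == "" {
			return
		}
		next.ServePipeline(v)
	})
}

// build wraps sink so that the first stage receives each Item first.
func build(sink PipelineHandler, stages ...Pipeline) PipelineHandler {
	for i := len(stages) - 1; i >= 0; i-- {
		sink = stages[i](sink)
	}
	return sink
}

func main() {
	var out []Item
	sink := PipelineHandlerFunc(func(v Item) { out = append(out, v) })
	p := build(sink, trimStrings, dropEmpty)
	for _, v := range []Item{"  title  ", "   ", 42} {
		p.ServePipeline(v)
	}
	fmt.Println(out)
}
```

Because each stage decides whether to call the next handler, a stage can transform an Item, drop it, or pass it through unchanged.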