Documentation ¶
Overview ¶
A scraper library for aggressively fast testing of local sites
This package is not actually for downloading websites, which is better accomplished by using wget with some clever flags. This is about scanning over a local website, using custom functions to verify the structural sanity of the pages.
Example ¶
s, err := NewScraper(ScraperConfig{
	// Real code would use HttpRetriever as the Retriever.
	Retriever: testHtmlRetriever,
	Bucket:    NewCountBucket(1),
	Remarks:   os.Stdout,
	Debug:     os.Stdout,
})
if err != nil {
	fmt.Println(err)
	return
}
base_url := "http://www.example.com/"
s.Routes.AppendPrefix(base_url, func(req ScraperRequest, resp ServerResponse) {
	// Get HTML tree
	root, _ := resp.Parse()

	// Verify that there is only one <head> element
	num_heads := len(root.Find("head"))
	if num_heads != 1 {
		req.Remarks.Printf("%d heads, expected 1!\n", num_heads)
	}

	// Queue up any links on this page, for further scraping
	root.Find("a").Queue()
})
s.Scrape(base_url)
s.Wait()
Output:

http://www.example.com/: Found a route
http://www.example.com/first: Found a route
http://www.example.com/second: Found a route
http://www.example.com/third: Found a route
Index ¶
- func HttpRetriever(req ScraperRequest) (*http.Response, error)
- type Bucket
- type CountBucket
- type Node
- type NodeSet
- type RequestAuth
- type RequestContext
- type ResponseStats
- type Retriever
- type Route
- type RouteAction
- type RouteSet
- type SRQueuer
- type Scraper
- type ScraperConfig
- type ScraperRequest
- type ServerResponse
- type StringTest
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func HttpRetriever ¶
func HttpRetriever(req ScraperRequest) (*http.Response, error)
Retrieves pages via HTTP or HTTPS, depending on URL.
Types ¶
type Bucket ¶
A Bucket allows for filtering requests according to previous (or parallel) requests. See CountBucket for an example of what Buckets can do.
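The Bucket interface's exact definition isn't shown on this page. Assuming it is satisfied by a Check(url string) bool method, as CountBucket's is, a custom bucket is a small amount of code. The hostBucket below is a hypothetical example (not part of this package) that only admits URLs on a whitelisted host:

```go
package main

import (
	"fmt"
	"net/url"
)

// hostBucket is a hypothetical custom bucket that only admits URLs
// whose host is on a whitelist. It assumes the Bucket interface is
// satisfied by a Check(url string) bool method, like CountBucket's.
type hostBucket struct {
	allowed map[string]bool
}

func newHostBucket(hosts ...string) *hostBucket {
	b := &hostBucket{allowed: make(map[string]bool)}
	for _, h := range hosts {
		b.allowed[h] = true
	}
	return b
}

// Check reports whether a request for rawURL should proceed.
func (b *hostBucket) Check(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	return b.allowed[u.Host]
}

func main() {
	b := newHostBucket("www.example.com")
	fmt.Println(b.Check("http://www.example.com/page")) // true
	fmt.Println(b.Check("http://elsewhere.net/page"))   // false
}
```

A bucket like this would keep the scraper from wandering off-site when pages link to external hosts.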
type CountBucket ¶
type CountBucket struct {
	MaxHits int
	// contains filtered or unexported fields
}
Allows up to MaxHits requests for each unique URL. Use with MaxHits = 1 as a simple deduplicator.
func NewCountBucket ¶
func NewCountBucket(max_hits int) *CountBucket
func (*CountBucket) Check ¶
func (b *CountBucket) Check(url string) bool
func (*CountBucket) SetMaxHits ¶
func (b *CountBucket) SetMaxHits(max_hits int)
Gets around some issues with casting.
type Node ¶
Wrapper for html.Node with selector capabilities.
type NodeSet ¶
type NodeSet []Node
func WrapNodes ¶
func WrapNodes(raw_nodes []*html.Node, req *ScraperRequest) NodeSet
Turn a []*html.Node slice into a NodeSet.
func (NodeSet) Attr ¶
Return a slice of attribute values, one per element in the NodeSet, for the given attribute name.
func (NodeSet) Queue ¶
func (ns NodeSet) Queue()
Queue the href values for a NodeSet, so that those URLs are appended to the scraper queue.
Each node is queued via its own req. You could conceivably have nodes from multiple ScraperRequests in the same NodeSet and call Queue on the set with sane results, but that's a bizarre and unlikely use case.
type RequestAuth ¶
type RequestContext ¶
type RequestContext struct {
Referer string
}
Used to convey information about the page and context where this request was queued. This can be useful for complex bucketing, or simply tracking down which pages are referring to 404 links.
type ResponseStats ¶
Some statistics on the completed request, for example, the time required to retrieve the file from the network.
type Retriever ¶
type Retriever func(ScraperRequest) (*http.Response, error)
In a production environment, you will always use HttpRetriever.
type Route ¶
type Route struct {
	Selector StringTest
	Action   RouteAction
}
A binding between a URL-matching function and an action to perform on pages whose URL matches.
You always want to set up your Routes before starting to scrape, or else none of the scraped pages will match.
type RouteAction ¶
type RouteAction func(req ScraperRequest, resp ServerResponse)
Callback that's run for each parsed page.
type RouteSet ¶
type RouteSet struct {
Routes []Route
}
A slice of Routes. Order is important!
func NewRouteSet ¶
func NewRouteSet() *RouteSet
func (*RouteSet) AppendExact ¶
func (rs *RouteSet) AppendExact(url string, action RouteAction)
Shorthand to add a new Route at the end of the set, where exact URL matching is used (see StringTestExact).
func (*RouteSet) AppendPrefix ¶
func (rs *RouteSet) AppendPrefix(prefix string, action RouteAction)
Shorthand to add a new Route at the end of the set, where prefix URL matching is used (see StringTestPrefix).
type SRQueuer ¶
type SRQueuer interface {
	CreateRequest(string) ScraperRequest
	DoRequest(ScraperRequest)
}
Scraper implements this, but it is left as an interface so that it can be mocked in tests.
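A mock queuer only needs to record what was queued. The sketch below mirrors the SRQueuer shape (CreateRequest / DoRequest) using a stand-in scraperRequest type, since the package's real ScraperRequest carries more fields than a test needs:

```go
package main

import "fmt"

// Stand-in for the package's ScraperRequest; only Url matters here.
type scraperRequest struct {
	Url string
}

// mockQueuer records every URL queued through it, mirroring the
// SRQueuer interface shape so route actions can be exercised
// without touching the network.
type mockQueuer struct {
	queued []string
}

func (m *mockQueuer) CreateRequest(url string) scraperRequest {
	return scraperRequest{Url: url}
}

func (m *mockQueuer) DoRequest(req scraperRequest) {
	m.queued = append(m.queued, req.Url)
}

func main() {
	q := &mockQueuer{}
	q.DoRequest(q.CreateRequest("http://www.example.com/first"))
	q.DoRequest(q.CreateRequest("http://www.example.com/second"))
	fmt.Println(q.queued)
}
```

After the code under test runs, assertions go against the recorded queued slice.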
type Scraper ¶
type Scraper struct {
	Routes *RouteSet
	// contains filtered or unexported fields
}
func NewScraper ¶
func NewScraper(config ScraperConfig) (Scraper, error)
May return an error if config validation fails.
func (*Scraper) CreateRequest ¶
func (s *Scraper) CreateRequest(url string) ScraperRequest
Creates a new ScraperRequest with its properties all initialized.
func (*Scraper) DoRequest ¶
func (s *Scraper) DoRequest(req ScraperRequest)
Scrape a URL based on the given ScraperRequest.
type ScraperConfig ¶
type ScraperConfig struct {
	Retriever Retriever
	Bucket    Bucket
	Remarks   io.Writer
	Debug     io.Writer
	Auth      *RequestAuth
}
func (ScraperConfig) Validate ¶
func (sc ScraperConfig) Validate() error
type ScraperRequest ¶
type ScraperRequest struct {
	Url          string
	RequestQueue SRQueuer
	Remarks      *log.Logger
	Debug        *log.Logger
	Context      RequestContext
	Auth         *RequestAuth
}
Represents a single request.
This is handed to all RouteActions.
func (ScraperRequest) ContextualizeUrl ¶
func (sr ScraperRequest) ContextualizeUrl(rel_url string) (string, error)
De-relativize a URL based on the existing request's URL.
This is how "/foo/" URLs queued up for scraping are turned into more actionable "http://origin.host.name/foo/" URLs.
Current behavior only produces an absolute URL if the request's URL is itself absolute; the contextualization is only as good as the request URL's "absoluteness". You won't get an error if the result is ambiguous. THIS MAY CHANGE IN FUTURE RELEASES.
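The resolution described above looks comparable to what the standard library's url.ResolveReference does. This is a sketch of the behavior using net/url, not this package's actual implementation:

```go
package main

import (
	"fmt"
	"net/url"
)

// contextualize resolves rel against the request URL base, roughly
// the behavior ContextualizeUrl is described as having. This uses
// the standard library as an illustration only.
func contextualize(base, rel string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(rel)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	abs, _ := contextualize("http://origin.host.name/a/b.html", "/foo/")
	fmt.Println(abs) // http://origin.host.name/foo/
}
```

Note that if base is itself relative, ResolveReference can only return another relative URL, which matches the "absoluteness" caveat above.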
func (ScraperRequest) QueueAnother ¶
func (sr ScraperRequest) QueueAnother(queue_url string)
Queue another URL for scraping. Duplicate queued items are ignored.
type ServerResponse ¶
type ServerResponse struct {
	Response *http.Response
	Request  ScraperRequest
	Stats    ResponseStats
}
Represents a response from the server, containing the original http.Response object, extra contextual/statistic data, and various convenience functions.
func GetResponse ¶
func GetResponse(req ScraperRequest, ret Retriever) (ServerResponse, error)
Get a ServerResponse based on a ScraperRequest and a Retriever.
Handles things like ResponseStats, so that Retrievers can just focus on handing off an http.Response.
func (ServerResponse) Parse ¶
func (resp ServerResponse) Parse() (Node, error)
Get the HTML node contents of the scraped data.
This consumes resp.Response.Body.
type StringTest ¶
Used to determine if a URL fits a Route.
func StringTestExact ¶
func StringTestExact(pattern string) StringTest
Create a callback that only returns true for exact matches.
func StringTestPrefix ¶
func StringTestPrefix(pattern string) StringTest
Create a callback that only returns true if the URL starts with the given prefix.
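StringTest's underlying type isn't shown on this page; presumably it is a func(string) bool. Under that assumption, plausible stand-ins for the two constructors are a one-liner each (these are illustrative sketches, not the package's code):

```go
package main

import (
	"fmt"
	"strings"
)

// stringTestExact sketches an exact-match StringTest constructor,
// assuming StringTest is a func(string) bool.
func stringTestExact(pattern string) func(string) bool {
	return func(url string) bool { return url == pattern }
}

// stringTestPrefix sketches a prefix-match StringTest constructor.
func stringTestPrefix(pattern string) func(string) bool {
	return func(url string) bool { return strings.HasPrefix(url, pattern) }
}

func main() {
	exact := stringTestExact("http://example.com/")
	prefix := stringTestPrefix("http://example.com/")
	fmt.Println(exact("http://example.com/"))     // true
	fmt.Println(exact("http://example.com/sub"))  // false
	fmt.Println(prefix("http://example.com/sub")) // true
}
```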