scrap

package module
v0.0.0-...-46d0f06
Published: Jun 18, 2014 License: GPL-2.0 Imports: 10 Imported by: 0

README

scrap

Scraper written in Golang for high-performance site testing/validation.

See http://godoc.org/github.com/campadrenalin/scrap for more information. This library was originally developed to be part of the automated testing infrastructure for www.inspire.com.

Documentation

Overview

A scraper library for aggressively fast testing of local sites

This package is not actually for downloading websites, which is better accomplished by using wget with some clever flags. This is about scanning over a local website, using custom functions to verify the structural sanity of the pages.

Example
s, err := NewScraper(ScraperConfig{
	// Real code would use HttpRetriever as the Retriever.
	Retriever: testHtmlRetriever,
	Bucket:    NewCountBucket(1),
	Remarks:   os.Stdout,
	Debug:     os.Stdout,
})
if err != nil {
	fmt.Println(err)
	return
}

base_url := "http://www.example.com/"
s.Routes.AppendPrefix(base_url, func(req ScraperRequest, resp ServerResponse) {
	// Get HTML tree
	root, _ := resp.Parse()

	// Verify that there is only one <head> element
	num_heads := len(root.Find("head"))
	if num_heads != 1 {
		req.Remarks.Printf("%d heads, expected 1!\n", num_heads)
	}

	// Queue up any links on this page, for further scraping
	root.Find("a").Queue()
})

s.Scrape(base_url)
s.Wait()
Output:

http://www.example.com/: Found a route
http://www.example.com/first: Found a route
http://www.example.com/second: Found a route
http://www.example.com/third: Found a route

Constants

This section is empty.

Variables

This section is empty.

Functions

func HttpRetriever

func HttpRetriever(req ScraperRequest) (*http.Response, error)

Retrieves pages via HTTP or HTTPS, depending on URL.
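
As a sketch, a production-style configuration plugs HttpRetriever into the ScraperConfig (the overview example substitutes a test retriever); the choice of writers below is purely illustrative.

// Sketch: production-style setup with HttpRetriever.
s, err := NewScraper(ScraperConfig{
	Retriever: HttpRetriever,
	Bucket:    NewCountBucket(1), // visit each URL at most once
	Remarks:   os.Stdout,
	Debug:     ioutil.Discard, // discard debug chatter
})
if err != nil {
	fmt.Println(err)
	return
}
s.Scrape("http://www.example.com/")
s.Wait()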

Types

type Bucket

type Bucket interface {
	Check(url string) bool
}

A Bucket filters requests according to previous (or parallel) requests. See CountBucket for an example of what a Bucket can do.
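
Because Bucket is a one-method interface, custom filters are easy to write. The type below is hypothetical (not part of this package) and simply refuses any URL outside a fixed prefix.

// Hypothetical Bucket: only admit URLs under a fixed prefix.
type PrefixBucket struct {
	Prefix string
}

func (b PrefixBucket) Check(url string) bool {
	return strings.HasPrefix(url, b.Prefix)
}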

type CountBucket

type CountBucket struct {
	MaxHits int
	// contains filtered or unexported fields
}

Allows up to MaxHits requests for each unique URL. Use with MaxHits = 1 as a simple deduplicator.

func NewCountBucket

func NewCountBucket(max_hits int) *CountBucket

func (*CountBucket) Check

func (b *CountBucket) Check(url string) bool

func (*CountBucket) SetMaxHits

func (b *CountBucket) SetMaxHits(max_hits int)

Sets MaxHits. Provided to get around some issues with casting.

type Node

type Node struct {
	*html.Node
	// contains filtered or unexported fields
}

Wrapper for html.Node with selector capabilities.

func (*Node) Find

func (n *Node) Find(sel string) NodeSet

Find a set of descendant nodes based on a CSS3 selector.

func (Node) Queue

func (n Node) Queue()

Queue this node's 'href' attr value as a URL to scrape.

type NodeSet

type NodeSet []Node

func WrapNodes

func WrapNodes(raw_nodes []*html.Node, req *ScraperRequest) NodeSet

Turn a []*html.Node slice into a NodeSet.

func (NodeSet) Attr

func (ns NodeSet) Attr(name string) []string

Return a slice containing the value of the named attr for each element in the NodeSet.
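
For example, a RouteAction might collect the href value of every anchor on a page; a sketch (the logging format is just illustrative):

// Sketch: report every anchor target on the page.
func reportLinks(req ScraperRequest, resp ServerResponse) {
	root, err := resp.Parse()
	if err != nil {
		req.Remarks.Printf("parse error: %v\n", err)
		return
	}
	for _, href := range root.Find("a").Attr("href") {
		req.Remarks.Printf("link: %s\n", href)
	}
}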

func (NodeSet) Queue

func (ns NodeSet) Queue()

Queue the href values for a NodeSet, so that those URLs are appended to the scraper queue.

Each node is queued via its own req. You could conceivably have nodes from multiple ScraperRequests in the same NodeSet and call Queue on the set with sensible results, but that is a bizarre and unlikely use case.

type RequestAuth

type RequestAuth struct {
	Username string
	Password string
}

type RequestContext

type RequestContext struct {
	Referer string
}

Used to convey information about the page and context where this request was queued. This can be useful for complex bucketing, or simply tracking down which pages are referring to 404 links.
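
A sketch of the 404-tracking use case: a RouteAction that reports the referring page whenever the response status is Not Found (the message format is illustrative).

// Sketch: flag broken links together with the page that linked to them.
func check404(req ScraperRequest, resp ServerResponse) {
	if resp.Response.StatusCode == http.StatusNotFound {
		req.Remarks.Printf("404 at %s (linked from %s)\n",
			req.Url, req.Context.Referer)
	}
}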

type ResponseStats

type ResponseStats struct {
	Start    time.Time
	Duration time.Duration
}

Some statistics on the completed request, for example, the time required to retrieve the file from the network.

type Retriever

type Retriever func(ScraperRequest) (*http.Response, error)

In a production environment, you will always use HttpRetriever.

type Route

type Route struct {
	Selector StringTest
	Action   RouteAction
}

A binding between a URL-matching function, and an action to perform on pages where the URL matches.

You always want to set up your Routes before starting to scrape, or else none of the scraped pages will match.

func (Route) Matches

func (r Route) Matches(url string) bool

Does the given url match this Route? Used by the Scraper to select the first matching Route.

func (Route) Run

func (r Route) Run(req ScraperRequest, ret Retriever, wg *sync.WaitGroup)

Runs r.Action in a goroutine, registering it with the WaitGroup.

type RouteAction

type RouteAction func(req ScraperRequest, resp ServerResponse)

Callback that's run for each parsed page.

type RouteSet

type RouteSet struct {
	Routes []Route
}

A slice of Routes. Order is important!

func NewRouteSet

func NewRouteSet() *RouteSet

func (*RouteSet) Append

func (rs *RouteSet) Append(r Route)

Add a new Route at the end of the set.

func (*RouteSet) AppendExact

func (rs *RouteSet) AppendExact(url string, action RouteAction)

Shorthand to add a new Route at the end of the set, where exact URL matching is used (see StringTestExact).

func (*RouteSet) AppendPrefix

func (rs *RouteSet) AppendPrefix(prefix string, action RouteAction)

Shorthand to add a new Route at the end of the set, where prefix URL matching is used (see StringTestPrefix).

func (*RouteSet) MatchUrl

func (rs *RouteSet) MatchUrl(url string) (Route, bool)

Return the first Route where the URL matches according to the Route's matching function. Order is important!
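
Because MatchUrl returns the first match, register more specific routes before broader ones. A sketch (checkHealthPage and checkGenericPage are hypothetical RouteActions):

// Sketch: the exact route is appended first, so MatchUrl picks it for
// "http://www.example.com/health"; the prefix route catches everything else.
s.Routes.AppendExact("http://www.example.com/health", checkHealthPage)
s.Routes.AppendPrefix("http://www.example.com/", checkGenericPage)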

type SRQueuer

type SRQueuer interface {
	CreateRequest(string) ScraperRequest
	DoRequest(ScraperRequest)
}

Scraper implements this, but we leave it as an interface so that it can be mocked in tests.
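
A sketch of the intended test usage: a hypothetical mock (not part of this package) that records queued URLs instead of scraping them. Fields such as Remarks and Debug are left nil here; fill them in if the code under test uses them.

// Hypothetical mock SRQueuer for tests: records URLs instead of scraping.
type mockQueuer struct {
	Queued []string
}

func (m *mockQueuer) CreateRequest(url string) ScraperRequest {
	return ScraperRequest{Url: url, RequestQueue: m}
}

func (m *mockQueuer) DoRequest(req ScraperRequest) {
	m.Queued = append(m.Queued, req.Url)
}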

type Scraper

type Scraper struct {
	Routes *RouteSet
	// contains filtered or unexported fields
}

func NewScraper

func NewScraper(config ScraperConfig) (Scraper, error)

May return an error if config validation fails.

func (*Scraper) CreateRequest

func (s *Scraper) CreateRequest(url string) ScraperRequest

Creates a new ScraperRequest with its properties all initialized.

func (*Scraper) DoRequest

func (s *Scraper) DoRequest(req ScraperRequest)

Scrape a URL based on the given ScraperRequest.

func (*Scraper) Scrape

func (s *Scraper) Scrape(url string)

Convenience function to create and queue a new request.

func (*Scraper) Wait

func (s *Scraper) Wait()

Wait for all outstanding queued items to finish. You almost always want to do this, so that your main function doesn't end (thus ending the entire process) while you still have all your goroutines out in limbo.

type ScraperConfig

type ScraperConfig struct {
	Retriever Retriever
	Bucket    Bucket
	Remarks   io.Writer
	Debug     io.Writer
	Auth      *RequestAuth
}

func (ScraperConfig) Validate

func (sc ScraperConfig) Validate() error

type ScraperRequest

type ScraperRequest struct {
	Url          string
	RequestQueue SRQueuer
	Remarks      *log.Logger
	Debug        *log.Logger
	Context      RequestContext
	Auth         *RequestAuth
}

Represents a single request.

This is handed to all RouteActions.

func (ScraperRequest) ContextualizeUrl

func (sr ScraperRequest) ContextualizeUrl(rel_url string) (string, error)

De-relativize a URL based on the existing request's URL.

This is how "/foo/" URLs queued up for scraping are turned into more actionable "http://origin.host.name/foo/" URLs.

The current behavior is to return an absolute URL only if the request's URL is itself absolute; the contextualization is only as good as the request URL's "absoluteness". You won't get an error if the result is ambiguous. THIS MAY CHANGE IN FUTURE RELEASES.
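
A sketch of the typical use inside a RouteAction (so req is a ScraperRequest), resolving a relative link before queueing it:

// Sketch: resolve a relative link against the current request's URL.
abs, err := req.ContextualizeUrl("/foo/")
if err != nil {
	req.Remarks.Printf("could not resolve /foo/: %v\n", err)
	return
}
req.QueueAnother(abs) // e.g. "http://www.example.com/foo/" when req.Url is absolute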

func (ScraperRequest) QueueAnother

func (sr ScraperRequest) QueueAnother(queue_url string)

Queue another URL for scraping. Duplicate queued items are ignored.

type ServerResponse

type ServerResponse struct {
	Response *http.Response
	Request  ScraperRequest
	Stats    ResponseStats
}

Represents a response from the server, containing the original http.Response object, extra contextual/statistic data, and various convenience functions.

func GetResponse

func GetResponse(req ScraperRequest, ret Retriever) (ServerResponse, error)

Get a ServerResponse based on a ScraperRequest and a Retriever.

Handles things like ResponseStats, so that Retrievers can just focus on handing off an http.Response.

func (ServerResponse) Parse

func (resp ServerResponse) Parse() (Node, error)

Get the HTML node contents of the scraped data.

This consumes resp.Response.Body.

type StringTest

type StringTest func(string) bool

Used to determine if a URL fits a Route.

func StringTestExact

func StringTestExact(pattern string) StringTest

Create a callback that only returns true for exact matches.

func StringTestPrefix

func StringTestPrefix(pattern string) StringTest

Create a callback that only returns true if the URL starts with the given prefix.
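
Since a StringTest is just a func(string) bool, you can also supply your own matcher and register it with RouteSet.Append; a sketch using a regexp-backed test (the pattern and action are illustrative):

// Sketch: a regexp-backed StringTest, registered as a full Route.
productPattern := regexp.MustCompile(`^http://www\.example\.com/product/\d+$`)
s.Routes.Append(Route{
	Selector: productPattern.MatchString,
	Action: func(req ScraperRequest, resp ServerResponse) {
		req.Remarks.Println("product page found")
	},
})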

Directories

Path Synopsis
Additional Bucket types for more complex filtering
