scrap

package module
v0.0.0-...-46d0f06
Published: Jun 18, 2014 License: GPL-2.0 Imports: 10 Imported by: 0

README

scrap

Scraper written in Golang for high-performance site testing/validation.

See http://godoc.org/github.com/campadrenalin/scrap for more information. This library was originally developed to be part of the automated testing infrastructure for www.inspire.com.

Documentation

Overview

A scraper library for aggressively fast testing of local sites

This package is not actually for downloading websites, which is better accomplished by using wget with some clever flags. This is about scanning over a local website, using custom functions to verify the structural sanity of the pages.

Example
s, err := NewScraper(ScraperConfig{
	// Real code would use HttpRetriever as the Retriever.
	Retriever: testHtmlRetriever,
	Bucket:    NewCountBucket(1),
	Remarks:   os.Stdout,
	Debug:     os.Stdout,
})
if err != nil {
	fmt.Println(err)
	return
}

base_url := "http://www.example.com/"
s.Routes.AppendPrefix(base_url, func(req ScraperRequest, resp ServerResponse) {
	// Get HTML tree
	root, _ := resp.Parse()

	// Verify that there is only one <head> element
	num_heads := len(root.Find("head"))
	if num_heads != 1 {
		req.Remarks.Printf("%d heads, expected 1!\n", num_heads)
	}

	// Queue up any links on this page, for further scraping
	root.Find("a").Queue()
})

s.Scrape(base_url)
s.Wait()
Output:

http://www.example.com/: Found a route
http://www.example.com/first: Found a route
http://www.example.com/second: Found a route
http://www.example.com/third: Found a route

Constants

This section is empty.

Variables

This section is empty.

Functions

func HttpRetriever

func HttpRetriever(req ScraperRequest) (*http.Response, error)

Retrieves pages via HTTP or HTTPS, depending on URL.
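
As a sketch, a production-style configuration plugs HttpRetriever into the ScraperConfig (the overview example substitutes a test retriever); the choice of writers below is purely illustrative.

// Sketch: production-style setup with HttpRetriever.
s, err := NewScraper(ScraperConfig{
	Retriever: HttpRetriever,
	Bucket:    NewCountBucket(1), // visit each URL at most once
	Remarks:   os.Stdout,
	Debug:     ioutil.Discard, // discard debug chatter
})
if err != nil {
	fmt.Println(err)
	return
}
s.Scrape("http://www.example.com/")
s.Wait()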

Types

type Bucket

type Bucket interface {
	Check(url string) bool
}

A Bucket filters requests according to previous (or parallel) requests. See CountBucket for an example of what a Bucket can do.
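
Because Bucket is a one-method interface, custom filters are easy to write. The type below is hypothetical (not part of this package) and simply refuses any URL outside a fixed prefix.

// Hypothetical Bucket: only admit URLs under a fixed prefix.
type PrefixBucket struct {
	Prefix string
}

func (b PrefixBucket) Check(url string) bool {
	return strings.HasPrefix(url, b.Prefix)
}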

type CountBucket

type CountBucket struct {
	MaxHits int
	// contains filtered or unexported fields
}

Allows up to MaxHits requests for each unique URL. Use with MaxHits = 1 as a simple deduplicator.

func NewCountBucket

func NewCountBucket(max_hits int) *CountBucket

func (*CountBucket) Check

func (b *CountBucket) Check(url string) bool

func (*CountBucket) SetMaxHits

func (b *CountBucket) SetMaxHits(max_hits int)

Sets MaxHits. Provided to get around some issues with casting.

type Node

type Node struct {
	*html.Node
	// contains filtered or unexported fields
}

Wrapper for html.Node with selector capabilities.

func (*Node) Find

func (n *Node) Find(sel string) NodeSet

Find a set of descendant nodes based on a CSS3 selector.

func (Node) Queue

func (n Node) Queue()

Queue this node's 'href' attr value as a URL to scrape.

type NodeSet

type NodeSet []Node

func WrapNodes

func WrapNodes(raw_nodes []*html.Node, req *ScraperRequest) NodeSet

Turn a []*html.Node slice into a NodeSet.

func (NodeSet) Attr

func (ns NodeSet) Attr(name string) []string

Return a slice containing the value of the named attr for each element in the NodeSet.
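
For example, a RouteAction might collect the href value of every anchor on a page; a sketch (the logging format is just illustrative):

// Sketch: report every anchor target on the page.
func reportLinks(req ScraperRequest, resp ServerResponse) {
	root, err := resp.Parse()
	if err != nil {
		req.Remarks.Printf("parse error: %v\n", err)
		return
	}
	for _, href := range root.Find("a").Attr("href") {
		req.Remarks.Printf("link: %s\n", href)
	}
}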

func (NodeSet) Queue

func (ns NodeSet) Queue()

Queue the href values for a NodeSet, so that those URLs are appended to the scraper queue.

Each node is queued via its own req. You could conceivably have nodes from multiple ScraperRequests in the same NodeSet and call Queue on the set with sensible results, but that is a bizarre and unlikely use case.

type RequestAuth

type RequestAuth struct {
	Username string
	Password string
}

type RequestContext

type RequestContext struct {
	Referer string
}

Used to convey information about the page and context where this request was queued. This can be useful for complex bucketing, or simply tracking down which pages are referring to 404 links.
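
A sketch of the 404-tracking use case: a RouteAction that reports the referring page whenever the response status is Not Found (the message format is illustrative).

// Sketch: flag broken links together with the page that linked to them.
func check404(req ScraperRequest, resp ServerResponse) {
	if resp.Response.StatusCode == http.StatusNotFound {
		req.Remarks.Printf("404 at %s (linked from %s)\n",
			req.Url, req.Context.Referer)
	}
}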

type ResponseStats

type ResponseStats struct {
	Start    time.Time
	Duration time.Duration
}

Some statistics on the completed request, for example, the time required to retrieve the file from the network.

type Retriever

type Retriever func(ScraperRequest) (*http.Response, error)

In a production environment, you will always use HttpRetriever.

type Route

type Route struct {
	Selector StringTest
	Action   RouteAction
}

A binding between a URL-matching function, and an action to perform on pages where the URL matches.

You always want to set up your Routes before starting to scrape, or else none of the scraped pages will match.

func (Route) Matches

func (r Route) Matches(url string) bool

Does the given url match this Route? Used by the Scraper to select the first matching Route.

func (Route) Run

func (r Route) Run(req ScraperRequest, ret Retriever, wg *sync.WaitGroup)

Runs r.Action in a goroutine, registering it with the WaitGroup.

type RouteAction

type RouteAction func(req ScraperRequest, resp ServerResponse)

Callback that's run for each parsed page.

type RouteSet

type RouteSet struct {
	Routes []Route
}

A slice of Routes. Order is important!

func NewRouteSet

func NewRouteSet() *RouteSet

func (*RouteSet) Append

func (rs *RouteSet) Append(r Route)

Add a new Route at the end of the set.

func (*RouteSet) AppendExact

func (rs *RouteSet) AppendExact(url string, action RouteAction)

Shorthand to add a new Route at the end of the set, where exact URL matching is used (see StringTestExact).

func (*RouteSet) AppendPrefix

func (rs *RouteSet) AppendPrefix(prefix string, action RouteAction)

Shorthand to add a new Route at the end of the set, where prefix URL matching is used (see StringTestPrefix).

func (*RouteSet) MatchUrl

func (rs *RouteSet) MatchUrl(url string) (Route, bool)

Return the first Route where the URL matches according to the Route's matching function. Order is important!
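
Because MatchUrl returns the first match, register more specific routes before broader ones. A sketch (checkHealthPage and checkGenericPage are hypothetical RouteActions):

// Sketch: the exact route is appended first, so MatchUrl picks it for
// "http://www.example.com/health"; the prefix route catches everything else.
s.Routes.AppendExact("http://www.example.com/health", checkHealthPage)
s.Routes.AppendPrefix("http://www.example.com/", checkGenericPage)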

type SRQueuer

type SRQueuer interface {
	CreateRequest(string) ScraperRequest
	DoRequest(ScraperRequest)
}

Scraper implements this, but we leave it as an interface so that it can be mocked in tests.
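
A sketch of the intended test usage: a hypothetical mock (not part of this package) that records queued URLs instead of scraping them. Fields such as Remarks and Debug are left nil here; fill them in if the code under test uses them.

// Hypothetical mock SRQueuer for tests: records URLs instead of scraping.
type mockQueuer struct {
	Queued []string
}

func (m *mockQueuer) CreateRequest(url string) ScraperRequest {
	return ScraperRequest{Url: url, RequestQueue: m}
}

func (m *mockQueuer) DoRequest(req ScraperRequest) {
	m.Queued = append(m.Queued, req.Url)
}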

type Scraper

type Scraper struct {
	Routes *RouteSet
	// contains filtered or unexported fields
}

func NewScraper

func NewScraper(config ScraperConfig) (Scraper, error)

May return an error if config validation fails.

func (*Scraper) CreateRequest

func (s *Scraper) CreateRequest(url string) ScraperRequest

Creates a new ScraperRequest with its properties all initialized.

func (*Scraper) DoRequest

func (s *Scraper) DoRequest(req ScraperRequest)

Scrape a URL based on the given ScraperRequest.

func (*Scraper) Scrape

func (s *Scraper) Scrape(url string)

Convenience function to create and queue a new request.

func (*Scraper) Wait

func (s *Scraper) Wait()

Wait for all outstanding queued items to finish. You almost always want to do this, so that your main function doesn't end (thus ending the entire process) while you still have all your goroutines out in limbo.

type ScraperConfig

type ScraperConfig struct {
	Retriever Retriever
	Bucket    Bucket
	Remarks   io.Writer
	Debug     io.Writer
	Auth      *RequestAuth
}

func (ScraperConfig) Validate

func (sc ScraperConfig) Validate() error

type ScraperRequest

type ScraperRequest struct {
	Url          string
	RequestQueue SRQueuer
	Remarks      *log.Logger
	Debug        *log.Logger
	Context      RequestContext
	Auth         *RequestAuth
}

Represents a single request.

This is handed to all RouteActions.

func (ScraperRequest) ContextualizeUrl

func (sr ScraperRequest) ContextualizeUrl(rel_url string) (string, error)

De-relativize a URL based on the existing request's URL.

This is how "/foo/" URLs queued up for scraping are turned into more actionable "http://origin.host.name/foo/" URLs.

The current behavior is to return an absolute URL only if the request's URL is itself absolute; the contextualization is only as good as the request URL's "absoluteness". You won't get an error if the result is ambiguous. THIS MAY CHANGE IN FUTURE RELEASES.
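
A sketch of the typical use inside a RouteAction (so req is a ScraperRequest), resolving a relative link before queueing it:

// Sketch: resolve a relative link against the current request's URL.
abs, err := req.ContextualizeUrl("/foo/")
if err != nil {
	req.Remarks.Printf("could not resolve /foo/: %v\n", err)
	return
}
req.QueueAnother(abs) // e.g. "http://www.example.com/foo/" when req.Url is absolute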

func (ScraperRequest) QueueAnother

func (sr ScraperRequest) QueueAnother(queue_url string)

Queue another URL for scraping. Duplicate queued items are ignored.

type ServerResponse

type ServerResponse struct {
	Response *http.Response
	Request  ScraperRequest
	Stats    ResponseStats
}

Represents a response from the server, containing the original http.Response object, extra contextual/statistic data, and various convenience functions.

func GetResponse

func GetResponse(req ScraperRequest, ret Retriever) (ServerResponse, error)

Get a ServerResponse based on a ScraperRequest and a Retriever.

Handles things like ResponseStats, so that Retrievers can just focus on handing off an http.Response.

func (ServerResponse) Parse

func (resp ServerResponse) Parse() (Node, error)

Get the HTML node contents of the scraped data.

This consumes resp.Response.Body.

type StringTest

type StringTest func(string) bool

Used to determine if a URL fits a Route.

func StringTestExact

func StringTestExact(pattern string) StringTest

Create a callback that only returns true for exact matches.

func StringTestPrefix

func StringTestPrefix(pattern string) StringTest

Create a callback that only returns true if the URL starts with the given prefix.
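
Since a StringTest is just a func(string) bool, you can also supply your own matcher and register it with RouteSet.Append; a sketch using a regexp-backed test (the pattern and action are illustrative):

// Sketch: a regexp-backed StringTest, registered as a full Route.
productPattern := regexp.MustCompile(`^http://www\.example\.com/product/\d+$`)
s.Routes.Append(Route{
	Selector: productPattern.MatchString,
	Action: func(req ScraperRequest, resp ServerResponse) {
		req.Remarks.Println("product page found")
	},
})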

Directories

Path Synopsis
Additional Bucket types for more complex filtering
