scraper

v0.0.3 Published: Feb 18, 2021 License: Apache-2.0

README

Scraper

Scraper is a straightforward Go web scraper with a simple, flexible interface, inspired by BeautifulSoup.

Quickstart

  1. Create a Scraper from any io.ReadCloser-compatible type:

    // http.Response.Body
    response, _ := http.Get("URL goes here")
    page, _ := scraper.NewFromBuffer(response.Body)
    
    // os.File
    fileHandle, _ := os.Open("file name goes here")
    page, _ := scraper.NewFromBuffer(fileHandle)
    
  2. Construct a scraper.Filter with one or more criteria:

    filter := scraper.Filter{
       Tag: "div",
       Attributes: scraper.Attributes{
          "id":    "div-1",
          "class": "tp-modal",
       },
    }
    
  3. Use the Filter to run a concurrent search on your Scraper page.
    Every returned element is a Scraper page that can be searched:

    for element := range page.FindAll(filter) {
       for link := range element.FindAll(scraper.Filter{Tag: "a"}) {
          fmt.Printf("URL: %v found under %v\n", link.Attributes()["href"], element.Type())
       }
    }
    

Next steps

  • Find and FindOne implementations
  • Concurrent scraping
  • Resilience for broken pages (BeautifulSoup-esque)
  • Support for wildcards in attributes
  • Tests
  • Full documentation

Documentation

Overview

Package scraper provides a straightforward interface for scraping web content.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ContentMissingError

func ContentMissingError() error

func MarshallingError

func MarshallingError(err error) error

func RenderingError

func RenderingError(err error) error

Types

type Attributes

type Attributes map[string]string

Attributes specifies tag attributes to be searched for by the Scraper's Find methods. It is a convenience shorthand for `map[string]string` and can contain any number of attribute key/value pairs. Note that when multiple attributes are given, a node must match all of them (they are combined with `&&`).

scraperInstance.FindAll(scraper.Filter{Attributes: scraper.Attributes{"class": "someClass"}})
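
For instance, a filter that lists two attribute pairs only matches nodes carrying both of them. This is a minimal sketch; the attribute values are illustrative and `page` is assumed to be a `*Scraper` obtained from one of the New functions:

// Only nodes with BOTH id="div-1" AND class="tp-modal" match.
filter := scraper.Filter{
    Attributes: scraper.Attributes{
        "id":    "div-1",
        "class": "tp-modal",
    },
}
for match := range page.FindAll(filter) {
    fmt.Println(match.Type())
}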

type EmptyTarget

type EmptyTarget struct {
	// contains filtered or unexported fields
}

func (EmptyTarget) Content

func (EmptyTarget) Content() *html.Node

func (EmptyTarget) IsValid

func (EmptyTarget) IsValid() bool

func (EmptyTarget) Render

func (EmptyTarget) Render() (string, error)

func (EmptyTarget) RenderingError

func (EmptyTarget) RenderingError() error

type Filter

type Filter struct {
	Tag        string
	Attributes Attributes
	IsExact    bool
	// contains filtered or unexported fields
}

Filter is the input to the Scraper's Find methods. It can be populated with a tag type, attributes (see `Attributes`), or both. Note that when multiple criteria are given, a node must match all of them (they are combined with `&&`).

scraperInstance.FindAll(scraper.Filter{Tag:"div"})
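
For instance, a filter can combine a tag name with attributes, in which case a node must satisfy every criterion to match. A minimal sketch, with illustrative values and `page` assumed to be a `*Scraper`:

// Only <a> elements that also carry class="external" match this filter.
external := scraper.Filter{
    Tag: "a",
    Attributes: scraper.Attributes{
        "class": "external",
    },
}
for link := range page.FindAll(external) {
    fmt.Println(link.Attributes()["href"])
}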

type Scraper

type Scraper struct {
	// contains filtered or unexported fields
}

Scraper is the base type used to scrape content. Do not instantiate it directly; use one of the provided scraper.New* functions instead.

func NewFromBuffer

func NewFromBuffer(buffer io.ReadCloser) (*Scraper, error)

NewFromBuffer instantiates a new Scraper instance from any `io.ReadCloser`, such as an `http.Response.Body` (net/http). You should consider using `NewFromURI` if your requested resource is trivial to get. Note that this function will close the reader (e.g. the response `Body`) for you.
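
A minimal sketch of the usual flow; the URL is a placeholder and the `net/http`, `log` and `fmt` imports are assumed:

response, err := http.Get("https://example.com")
if err != nil {
    log.Fatal(err)
}
// NewFromBuffer closes response.Body, so no defer is needed here.
page, err := scraper.NewFromBuffer(response.Body)
if err != nil {
    log.Fatal(err)
}
fmt.Println(page.Type())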

func NewFromNode

func NewFromNode(node *html.Node) (*Scraper, error)

NewFromNode instantiates a new Scraper instance from a given `html.Node` (golang.org/x/net/html). It is used internally to allow scraping the results of a previous scrape, but it is exposed here in case you want to build a hybrid pipeline.
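
A sketch of that hybrid use case, feeding a node parsed directly with golang.org/x/net/html back into a Scraper; the markup is illustrative and the `html`, `strings`, `log` and `fmt` imports are assumed:

root, err := html.Parse(strings.NewReader(`<p>hello <a href="/home">home</a></p>`))
if err != nil {
    log.Fatal(err)
}
page, err := scraper.NewFromNode(root)
if err != nil {
    log.Fatal(err)
}
for link := range page.FindAll(scraper.Filter{Tag: "a"}) {
    fmt.Println(link.Attributes()["href"]) // "/home"
}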

func (Scraper) Attributes

func (scraper Scraper) Attributes() Attributes

Attributes returns a map of all attributes on the node.

func (Scraper) Content

func (scraper Scraper) Content() *html.Node

Content returns the node the Scraper instance is wrapping. It should be considered a lower-level API.

func (Scraper) Find

func (scraper Scraper) Find(filter Filter) *Scraper

Find returns the first node matching the provided Filter. Note that this method is currently very inefficient and needs to be reimplemented.
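
For example, grabbing just the first heading of a page. The filter is illustrative and `page` is assumed to be a `*Scraper`; the nil check is purely defensive, since the behaviour when nothing matches is not documented here:

heading := page.Find(scraper.Filter{Tag: "h1"})
if heading != nil {
    fmt.Println(heading.TextOptimistic())
}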

func (Scraper) FindAll

func (scraper Scraper) FindAll(filter Filter) <-chan *Scraper

FindAll returns all nodes matching the provided Filter. TODO: better way to track completion?
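
The returned channel can be drained with a plain range loop, for example to collect every match into a slice. The filter is illustrative and `page` is assumed to be a `*Scraper`:

var rows []*scraper.Scraper
for row := range page.FindAll(scraper.Filter{Tag: "tr"}) {
    rows = append(rows, row)
}
fmt.Printf("found %d rows\n", len(rows))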

func (Scraper) Render

func (scraper Scraper) Render() (string, error)

Render returns a rendered version of the Scraper's content. Note that the rendering is best-effort (see golang.org/x/net/html/render.go).
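
For example, rendering each matched element back to an HTML string. The filter is illustrative and `page` is assumed to be a `*Scraper`:

for form := range page.FindAll(scraper.Filter{Tag: "form"}) {
    markup, err := form.Render()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(markup)
}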

func (Scraper) Text

func (scraper Scraper) Text() (string, bool)

Text returns the text embedded in the node. If other tags are nested under it, it returns an empty string and a false `ok` value.
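
The boolean reports whether the text could be read directly; when the node has nested tags it comes back false, so check it before trusting the string. A sketch, assuming `element` is a `*Scraper`:

if text, ok := element.Text(); ok {
    fmt.Println("leaf text:", text)
} else {
    fmt.Println("element has nested tags; no direct text")
}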

func (Scraper) TextOptimistic

func (scraper Scraper) TextOptimistic() string

TextOptimistic is an optimistic version of Text that simply returns an empty string if anything goes wrong (see the Text docs). It is useful for inlining operations when you trust your inputs.
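
Because it has a single return value, TextOptimistic can be used inline, for example directly inside a Printf call. A sketch, assuming `link` is a `*Scraper` found earlier:

fmt.Printf("%s -> %s\n", link.TextOptimistic(), link.Attributes()["href"])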

func (Scraper) Type

func (scraper Scraper) Type() string

Type returns the tag type for HTML nodes. For text nodes, it will return the text itself.
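
For instance, Type can report what kind of element each match is when a filter selects on attributes alone. The class name is illustrative and `page` is assumed to be a `*Scraper`:

for match := range page.FindAll(scraper.Filter{Attributes: scraper.Attributes{"class": "card"}}) {
    // Prints the tag name of each matching element, e.g. "div" or "section".
    fmt.Println(match.Type())
}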

type Target

type Target interface {
	// Render returns a pretty-rendered version of the target's scope
	Render() (string, error)
	// Content returns the tree-structure representation of the target
	Content() *html.Node
	IsValid() bool
}

Target represents a scope that can be parsed into structured data or rendered as such. It is an implementation detail meant to allow better encapsulation for the different ways of instantiating a Scraper.

