scraper

v0.0.3 Published: Feb 18, 2021 License: Apache-2.0

README

Scraper

Scraper is a straightforward Go web scraper with a simple, flexible interface, inspired by BeautifulSoup.

Quickstart

  1. Create a Scraper from any io.ReadCloser-compatible type:

    // http.Response.Body
    response, _ := http.Get("URL goes here")
    page, _ := scraper.NewFromBuffer(response.Body)
    
    // os.File
    fileHandle, _ := os.Open("file name goes here")
    page, _ := scraper.NewFromBuffer(fileHandle)
    
  2. Construct a scraper.Filter with one or more criteria:

    filter := scraper.Filter{
       Tag: "div",
       Attributes: scraper.Attributes{
          "id":    "div-1",
          "class": "tp-modal",
       },
    }
    
  3. Use the Filter to run a concurrent search on your Scraper page.
    Every returned element is a Scraper page that can be searched:

    for element := range page.FindAll(filter) {
       for link := range element.FindAll(scraper.Filter{Tag: "a"}) {
          fmt.Printf("URL: %v found under %v\n", link.Attributes()["href"], element.Type())
       }
    }
    

Next steps

  • Find and FindOne implementations
  • Concurrent scraping
  • Resilience for broken pages (BeautifulSoup-esque)
  • Support for wildcards in attributes
  • Tests
  • Full documentation

Documentation

Overview

Package scraper provides a straightforward interface for scraping web content.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ContentMissingError

func ContentMissingError() error

func MarshallingError

func MarshallingError(err error) error

func RenderingError

func RenderingError(err error) error

Types

type Attributes

type Attributes map[string]string

Attributes specifies tag attributes to be searched for by the Scraper's Find methods. It is a convenience shorthand for `map[string]string` and can contain any number of attribute key/value pairs. Note that when multiple attributes are given, a node must match all of them (they are combined with `&&`).

scraperInstance.FindAll(scraper.Filter{Attributes: scraper.Attributes{"class": "someClass"}})
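
For instance, a filter that lists two attribute pairs only matches nodes carrying both of them. This is a minimal sketch; the attribute values are illustrative and `page` is assumed to be a `*Scraper` obtained from one of the New functions:

// Only nodes with BOTH id="div-1" AND class="tp-modal" match.
filter := scraper.Filter{
    Attributes: scraper.Attributes{
        "id":    "div-1",
        "class": "tp-modal",
    },
}
for match := range page.FindAll(filter) {
    fmt.Println(match.Type())
}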

type EmptyTarget

type EmptyTarget struct {
	// contains filtered or unexported fields
}

func (EmptyTarget) Content

func (EmptyTarget) Content() *html.Node

func (EmptyTarget) IsValid

func (EmptyTarget) IsValid() bool

func (EmptyTarget) Render

func (EmptyTarget) Render() (string, error)

func (EmptyTarget) RenderingError

func (EmptyTarget) RenderingError() error

type Filter

type Filter struct {
	Tag        string
	Attributes Attributes
	IsExact    bool
	// contains filtered or unexported fields
}

Filter is the input to the Scraper's Find methods. It can be populated with a tag type, attributes (see `Attributes`), or both. Note that when multiple criteria are given, a node must match all of them (they are combined with `&&`).

scraperInstance.FindAll(scraper.Filter{Tag:"div"})
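
For instance, a filter can combine a tag name with attributes, in which case a node must satisfy every criterion to match. A minimal sketch, with illustrative values and `page` assumed to be a `*Scraper`:

// Only <a> elements that also carry class="external" match this filter.
external := scraper.Filter{
    Tag: "a",
    Attributes: scraper.Attributes{
        "class": "external",
    },
}
for link := range page.FindAll(external) {
    fmt.Println(link.Attributes()["href"])
}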

type Scraper

type Scraper struct {
	// contains filtered or unexported fields
}

Scraper is the base type used to scrape content. Do not instantiate it directly; use one of the provided scraper.New* functions instead.

func NewFromBuffer

func NewFromBuffer(buffer io.ReadCloser) (*Scraper, error)

NewFromBuffer instantiates a new Scraper instance from any `io.ReadCloser`, such as an `http.Response.Body` (net/http). You should consider using `NewFromURI` if your requested resource is trivial to get. Note that this function will close the reader (e.g. the response `Body`) for you.
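
A minimal sketch of the usual flow; the URL is a placeholder and the `net/http`, `log` and `fmt` imports are assumed:

response, err := http.Get("https://example.com")
if err != nil {
    log.Fatal(err)
}
// NewFromBuffer closes response.Body, so no defer is needed here.
page, err := scraper.NewFromBuffer(response.Body)
if err != nil {
    log.Fatal(err)
}
fmt.Println(page.Type())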

func NewFromNode

func NewFromNode(node *html.Node) (*Scraper, error)

NewFromNode instantiates a new Scraper instance from a given `html.Node` (golang.org/x/net/html). It is used internally to allow scraping the results of a previous scrape, but it is exposed here in case you want to build a hybrid pipeline.
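
A sketch of that hybrid use case, feeding a node parsed directly with golang.org/x/net/html back into a Scraper; the markup is illustrative and the `html`, `strings`, `log` and `fmt` imports are assumed:

root, err := html.Parse(strings.NewReader(`<p>hello <a href="/home">home</a></p>`))
if err != nil {
    log.Fatal(err)
}
page, err := scraper.NewFromNode(root)
if err != nil {
    log.Fatal(err)
}
for link := range page.FindAll(scraper.Filter{Tag: "a"}) {
    fmt.Println(link.Attributes()["href"]) // "/home"
}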

func (Scraper) Attributes

func (scraper Scraper) Attributes() Attributes

Attributes returns a map of all attributes on the node.

func (Scraper) Content

func (scraper Scraper) Content() *html.Node

Content returns the node the Scraper instance is wrapping. It should be considered a lower-level API.

func (Scraper) Find

func (scraper Scraper) Find(filter Filter) *Scraper

Find returns the first node matching the provided Filter. Note that this method is currently very inefficient and needs to be reimplemented.
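
For example, grabbing just the first heading of a page. The filter is illustrative and `page` is assumed to be a `*Scraper`; the nil check is purely defensive, since the behaviour when nothing matches is not documented here:

heading := page.Find(scraper.Filter{Tag: "h1"})
if heading != nil {
    fmt.Println(heading.TextOptimistic())
}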

func (Scraper) FindAll

func (scraper Scraper) FindAll(filter Filter) <-chan *Scraper

FindAll returns all nodes matching the provided Filter. TODO: better way to track completion?
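
The returned channel can be drained with a plain range loop, for example to collect every match into a slice. The filter is illustrative and `page` is assumed to be a `*Scraper`:

var rows []*scraper.Scraper
for row := range page.FindAll(scraper.Filter{Tag: "tr"}) {
    rows = append(rows, row)
}
fmt.Printf("found %d rows\n", len(rows))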

func (Scraper) Render

func (scraper Scraper) Render() (string, error)

Render returns a rendered version of the Scraper's content. Note that the rendering is best-effort (see golang.org/x/net/html/render.go).
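
For example, rendering each matched element back to an HTML string. The filter is illustrative and `page` is assumed to be a `*Scraper`:

for form := range page.FindAll(scraper.Filter{Tag: "form"}) {
    markup, err := form.Render()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(markup)
}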

func (Scraper) Text

func (scraper Scraper) Text() (string, bool)

Text returns the text embedded in the node. If other tags are nested under it, it returns an empty string and a false `ok` value.
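
The boolean reports whether the text could be read directly; when the node has nested tags it comes back false, so check it before trusting the string. A sketch, assuming `element` is a `*Scraper`:

if text, ok := element.Text(); ok {
    fmt.Println("leaf text:", text)
} else {
    fmt.Println("element has nested tags; no direct text")
}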

func (Scraper) TextOptimistic

func (scraper Scraper) TextOptimistic() string

TextOptimistic is an optimistic version of Text that simply returns an empty string if anything goes wrong (see the Text docs). It is useful for inlining operations when you trust your inputs.
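
Because it has a single return value, TextOptimistic can be used inline, for example directly inside a Printf call. A sketch, assuming `link` is a `*Scraper` found earlier:

fmt.Printf("%s -> %s\n", link.TextOptimistic(), link.Attributes()["href"])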

func (Scraper) Type

func (scraper Scraper) Type() string

Type returns the tag type for HTML nodes. For text nodes, it will return the text itself.
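
For instance, Type can report what kind of element each match is when a filter selects on attributes alone. The class name is illustrative and `page` is assumed to be a `*Scraper`:

for match := range page.FindAll(scraper.Filter{Attributes: scraper.Attributes{"class": "card"}}) {
    // Prints the tag name of each matching element, e.g. "div" or "section".
    fmt.Println(match.Type())
}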

type Target

type Target interface {
	// Render returns a pretty-rendered version of the target's scope
	Render() (string, error)
	// Content returns the tree-structure representation of the target
	Content() *html.Node
	IsValid() bool
}

Target represents a scope that can be parsed into structured data or rendered as such. It is an implementation detail meant to allow better encapsulation for the different ways of instantiating a Scraper.

