scraper

v0.0.0-...-9035759
Published: Sep 20, 2019 License: Apache-2.0 Imports: 9 Imported by: 0

README

Scraper - HTML Unmarshaling for Go

Scraper is a Go package that parses HTML documents and unmarshals them into Go structs based on CSS selectors. Selectors are specified using the "scraper" struct field tag. For documentation and examples, see the GoDoc.

Documentation

Overview

Package scraper provides a means to parse and unmarshal HTML into Go structs. Usage is best described by example:

package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

type MyType struct {
	Name string `scraper:"#name"`
	URL  string `scraper:"a" scrapeType:"attr:href"`
}

func main() {
	document := `<html><body><h1 id="name">Hello Scraper!</h1><a href="https://github.org/mh-orange/scraper">Scraper</a> is Grrrrrreat!</body></html>`
	v := &MyType{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
	// &{Name:Hello Scraper! URL:https://github.org/mh-orange/scraper}
}

Structs are unmarshaled by matching CSS selectors to elements in an HTML document tree. Scraper uses the Cascadia (https://github.com/andybalholm/cascadia) package to parse and match CSS selectors.

To specify matching and unmarshaling rules, use the "scraper" and "scrapeType" struct field tags. The "scraper" tag defines the CSS selector, and the "scrapeType" tag indicates whether the value should be the text content or an attribute of the matching element. If the "scrapeType" tag is omitted, the text content is used. For example, to match an element with the id "name" and capture its text content:

type MyType struct {
	Name string `scraper:"#name"`
}

Another example, which uses the href attribute of a matching "a" tag:

type MyType struct {
	URL string `scraper:"a" scrapeType:"attr:href"`
}

Note that the attribute name is specified after the type (attr) and a separating colon.

Types that implement encoding.BinaryUnmarshaler or encoding.TextUnmarshaler are honored:

type Name struct {
	First string
	Last  string
}

func (n *Name) UnmarshalText(text []byte) (err error) {
	tokens := strings.Split(string(text), ", ")
	if len(tokens) == 2 {
		n.Last = tokens[0]
		n.First = tokens[1]
	} else {
		err = errors.New("Wanted comma separated last and first names")
	}
	return err
}

type Class struct {
	Students []Name `scraper:"ul li"`
}

Constants

const (
	// SelectorTagName is used to reflect the appropriate struct field tag.  The SelectorTagName
	// is the tag used to specify a CSS selector to match for the field
	SelectorTagName = "scraper"

	// TypeTagName (scrapeType) is the tag used to specify what kind of value lookup should be performed.  The
	// default is `text` and simply gathers the text nodes from the matching html subtree.  The
	// alternative type is `attr` which will assign value based on a matching attribute.  The
	// attribute name (for the matched node) is specified following a colon
	TypeTagName = "scrapeType"
)

Scraper uses struct field tags to determine how to unmarshal an HTML element tree into a type. This is similar to how encoding/json uses tags to match json field names to struct field names. There are two tags that scraper uses in its processing, `scraper` and `scrapeType`. Example:

type MyType struct {
	URL string `scraper:"a.myurl" scrapeType:"attr:href"` // parses the href attribute from the matching a element
}

Variables

var (
	// ErrUnknownTagType indicates that the scrapeType tag is an unknown value
	ErrUnknownTagType = errors.New("Unknown tag type ")
)

Functions

func Unmarshal

func Unmarshal(text []byte, v interface{}) error

Unmarshal will parse the input text and unmarshal it into v

Example
package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

func main() {
	// Parse and unmarshal an HTML document into a very basic Go struct
	document := `<html><body><h1 id="name">Hello Scraper!</h1><a href="https://github.org/mh-orange/scraper">Scraper</a> is Grrrrrreat!</body></html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`

		// URL is assigned the HREF attribute of the first A element found
		URL string `scraper:"a" scrapeType:"attr:href"`
	}{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}
Output:

&{Name:Hello Scraper! URL:https://github.org/mh-orange/scraper}

Example (Nested)
package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

func main() {
	// Scraper can be used to unmarshal structs with other structs
	// in them
	document := `
		<html>
			<body>
				<h1 id="name">Hello Scraper!</h1>
				<ul>
					<li>Item 1</li>
					<li>Item 2</li>
					<li>Item 3</li>
				</ul>
			</body>
		</html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`

		// Items is matched with the ul tag and then names is matched by the
		// li tags within.  Nested structs will be unmarshaled with the matching
		// _subtree_ not the entire document
		Items struct {
			Names []string `scraper:"li"`
		} `scraper:"ul"`
	}{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}
Output:

&{Name:Hello Scraper! Items:{Names:[Item 1 Item 2 Item 3]}}

Example (Slice)
package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

func main() {
	// Scraper can be used to unmarshal structs with slices
	// of things as well
	document := `
		<html>
			<body>
				<h1 id="name">Hello Scraper!</h1>
				<ul>
					<li>Item 1</li>
					<li>Item 2</li>
					<li>Item 3</li>
				</ul>
			</body>
		</html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`

		// Items is appended with the text content of each element matching the
		// "ul li" CSS selector
		Items []string `scraper:"ul li"`
	}{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}
Output:

&{Name:Hello Scraper! Items:[Item 1 Item 2 Item 3]}

Types

type BinaryUnmarshaler

type BinaryUnmarshaler interface {
	encoding.BinaryUnmarshaler
}

BinaryUnmarshaler is the interface implemented by an object that can unmarshal the byte string (either text content or attribute) from an element matched by a scraper selector.
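
As an illustration, the sketch below shows a type that satisfies the standard encoding.BinaryUnmarshaler interface, which this interface embeds. The Color type and its "#rrggbb" input format are assumptions made up for the example. The method is called directly here; how the package invokes it for a tagged field is assumed to mirror the TextUnmarshaler example elsewhere on this page.

```go
package main

import (
	"encoding"
	"encoding/hex"
	"errors"
	"fmt"
)

// Color is a hypothetical field type holding an RGB value.
type Color struct {
	R, G, B uint8
}

// UnmarshalBinary parses a value of the form "#rrggbb".
func (c *Color) UnmarshalBinary(data []byte) error {
	if len(data) != 7 || data[0] != '#' {
		return errors.New("want a color of the form #rrggbb")
	}
	raw, err := hex.DecodeString(string(data[1:]))
	if err != nil {
		return err
	}
	c.R, c.G, c.B = raw[0], raw[1], raw[2]
	return nil
}

// Compile-time check that *Color satisfies the standard interface.
var _ encoding.BinaryUnmarshaler = (*Color)(nil)

func main() {
	var c Color
	if err := c.UnmarshalBinary([]byte("#ff8800")); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", c)
}
```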

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

Decoder will read from an io.Reader, parse the content into a root *html.Node and then unmarshal the content into a receiver

Example
package main

import (
	"fmt"
	"strings"

	"github.com/mh-orange/scraper"
)

func main() {
	// Decoder is useful for unmarshaling from an input stream
	document := `<html><body><h1 id="name">Hello Scraper!</h1></body></html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`
	}{}

	reader := strings.NewReader(document)
	if err := scraper.NewDecoder(reader).Decode(v); err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}
Output:

&{Name:Hello Scraper!}

func NewDecoder

func NewDecoder(r io.Reader, options ...Option) *Decoder

NewDecoder initializes a decoder for the given reader and options

func (*Decoder) Decode

func (dec *Decoder) Decode(v interface{}) error

Decode the input stream and unmarshal it into v

type HTMLUnmarshaler

type HTMLUnmarshaler interface {
	UnmarshalHTML(*html.Node) error
}

HTMLUnmarshaler is the interface implemented by types that can unmarshal parsed html directly. The input is a parsed element tree starting at the element that matched the CSS selector specified in the scraper tag

type InvalidUnmarshalError

type InvalidUnmarshalError struct {
	Type reflect.Type
	Want reflect.Kind
}

An InvalidUnmarshalError describes an invalid argument passed to Unmarshal. (The argument to Unmarshal must be a non-nil pointer.)

func (*InvalidUnmarshalError) Error

func (e *InvalidUnmarshalError) Error() string

type Option

type Option func(*Unmarshaler) error

Option updates an Unmarshaler with various capabilities

func TrimSpace

func TrimSpace() Option

TrimSpace tells the unmarshaler to trim values using strings.TrimSpace. When a field is set, the value (either text content or attribute value) is trimmed prior to type conversion and assignment.
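
The sketch below illustrates why trimming before type conversion matters: text content taken from indented HTML often carries surrounding whitespace, which strconv.Atoi rejects. The toInt helper is a hypothetical stand-in for the conversion step and is not part of the package.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toInt is a hypothetical conversion step: it optionally trims the raw
// value before handing it to strconv.Atoi.
func toInt(raw string, trim bool) (int, error) {
	if trim {
		raw = strings.TrimSpace(raw)
	}
	return strconv.Atoi(raw)
}

func main() {
	raw := "\n\t 42 \n" // text content as it might appear in indented HTML

	// Without trimming, the surrounding whitespace makes the conversion fail.
	_, err := toInt(raw, false)
	fmt.Println("untrimmed conversion failed:", err != nil)

	// With trimming, the same value converts cleanly.
	n, err := toInt(raw, true)
	fmt.Println(n, err)
}
```

Going by the signatures listed on this page, the option would presumably be passed when constructing a decoder or unmarshaler, for example scraper.NewDecoder(reader, scraper.TrimSpace()).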

type TextUnmarshaler

type TextUnmarshaler interface {
	encoding.TextUnmarshaler
}

TextUnmarshaler is the interface implemented by an object that can unmarshal the byte string (either text content or attribute) from an element matched by a scraper selector.

Example
package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/mh-orange/scraper"
)

type Name struct {
	First string
	Last  string
}

func (n *Name) UnmarshalText(text []byte) (err error) {
	tokens := strings.Split(string(text), ", ")
	if len(tokens) == 2 {
		n.Last = tokens[0]
		n.First = tokens[1]
	} else {
		err = errors.New("Wanted comma separated last and first names")
	}
	return err
}

type Class struct {
	Students []Name `scraper:"ul li"`
}

func main() {
	document := `
		<html>
			<body>
				<h1 id="name">Class Roster</h1>
				<ul>
					<li>Stone, John</li>
					<li>Priya, Ponnappa</li>
					<li>Wong, Mia</li>
				</ul>
			</body>
		</html>`
	v := &Class{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}
Output:

&{Students:[{First:John Last:Stone} {First:Ponnappa Last:Priya} {First:Mia Last:Wong}]}

type UnmarshalTypeError

type UnmarshalTypeError struct {
	Value string       // description of value - "bool", "array", "number -5"
	Type  reflect.Type // type of Go value it could not be assigned to
}

An UnmarshalTypeError describes a value that was not appropriate for a value of a specific Go type.

func (*UnmarshalTypeError) Error

func (e *UnmarshalTypeError) Error() string

type Unmarshaler

type Unmarshaler struct {
	// contains filtered or unexported fields
}

Unmarshaler processes an HTML tree and unmarshals/parses it into a receiver. The unmarshaler looks for struct field tags matching `scraper` and `scrapeType`

func NewUnmarshaler

func NewUnmarshaler(root *html.Node, options ...Option) (u *Unmarshaler)

NewUnmarshaler creates a scraper Unmarshaler with its root set to the input *html.Node and setting any options given. If any of the options generate an error, then that error is passed through upon calling Unmarshal. This allows for chaining the NewUnmarshaler function with Unmarshal:

err := NewUnmarshaler(root).Unmarshal(v)
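
The deferred-error behavior described above can be sketched with the functional options pattern. The types and functions below (unmarshaler, option, trimSpace, failing, newUnmarshaler) are local stand-ins written for this illustration; they show one way such a constructor could store an option error and surface it later, not how the package is actually implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// unmarshaler and option are local stand-ins, not the package's types.
type unmarshaler struct {
	trim bool
	err  error // first error returned by an option, if any
}

type option func(*unmarshaler) error

// trimSpace is a stand-in option that only sets a flag.
func trimSpace() option {
	return func(u *unmarshaler) error {
		u.trim = true
		return nil
	}
}

// failing is a hypothetical option that always reports an error.
func failing() option {
	return func(*unmarshaler) error {
		return errors.New("bad option")
	}
}

// newUnmarshaler applies the options and remembers the first error instead
// of returning it, which keeps the constructor usable in a call chain.
func newUnmarshaler(options ...option) *unmarshaler {
	u := &unmarshaler{}
	for _, opt := range options {
		if err := opt(u); err != nil {
			u.err = err
			break
		}
	}
	return u
}

// Unmarshal surfaces the stored option error before doing any work.
func (u *unmarshaler) Unmarshal(v interface{}) error {
	if u.err != nil {
		return u.err
	}
	// Real unmarshaling would happen here.
	return nil
}

func main() {
	var v struct{}
	fmt.Println(newUnmarshaler(trimSpace()).Unmarshal(&v))
	fmt.Println(newUnmarshaler(trimSpace(), failing()).Unmarshal(&v))
}
```

Returning a single value from the constructor is what makes the one-line chained call possible; the cost is that a misconfigured option is only reported once Unmarshal is called.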

func (*Unmarshaler) Unmarshal

func (u *Unmarshaler) Unmarshal(v interface{}) (err error)

Unmarshal the document into v
