scraper

Version: v0.0.0-...-9035759 (latest) | Published: Sep 20, 2019 | License: Apache-2.0 | Imports: 9 | Imported by: 0

Repository

github.com/mh-orange/scraper

README ¶

Scraper - HTML Unmarshaling for Go

Scraper is a Go package that parses HTML documents and unmarshals them into Go structs based on CSS selectors. Selectors are specified using the "scraper" struct field tag. For documentation and examples, please see the GoDoc.

Documentation ¶

Overview ¶

Package scraper provides a means to parse and unmarshal HTML into Go structs. Usage is best described by example:

package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

type MyType struct {
	Name string `scraper:"#name"`
	URL  string `scraper:"a" scrapeType:"attr:href"`
}

func main() {
	document := `<html><body><h1 id="name">Hello Scraper!</h1><a href="https://github.org/mh-orange/scraper">Scraper</a> is Grrrrrreat!</body></html>`
	v := &MyType{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
	// &{Name:Hello Scraper! URL:https://github.org/mh-orange/scraper}
}

Structs are unmarshaled by matching CSS selectors to elements in an html document tree. Scraper uses the wonderful Cascadia (https://github.com/andybalholm/cascadia) package to parse and match CSS selectors.

To specify matching and unmarshaling rules, use the "scraper" and "scrapeType" struct field tags. The "scraper" tag defines the CSS selector, and the "scrapeType" tag indicates whether the value should be the text content or an attribute of the matching element. If the "scrapeType" tag is omitted, the text content is used. For example, to match an element with the id "name" and capture its text content:

type MyType struct {
	Name string `scraper:"#name"`
}

Another example, which uses the href attribute of a matching "a" tag:

type MyType struct {
	URL string `scraper:"a" scrapeType:"attr:href"`
}

Note that the attribute name is specified after the type (attr) and a separating colon.

Types that implement encoding.BinaryUnmarshaler or encoding.TextUnmarshaler are honored:

type Name struct {
	First string
	Last  string
}

func (n *Name) UnmarshalText(text []byte) (err error) {
	tokens := strings.Split(string(text), ", ")
	if len(tokens) == 2 {
		n.Last = tokens[0]
		n.First = tokens[1]
	} else {
		err = errors.New("Wanted comma separated last and first names")
	}
	return err
}

type Class struct {
	Students []Name `scraper:"ul li"`
}

Index ¶

Constants
Variables
func Unmarshal(text []byte, v interface{}) error
type BinaryUnmarshaler
type Decoder
- func NewDecoder(r io.Reader, options ...Option) *Decoder
- func (dec *Decoder) Decode(v interface{}) error
type HTMLUnmarshaler
type InvalidUnmarshalError
- func (e *InvalidUnmarshalError) Error() string
type Option
- func TrimSpace() Option
type TextUnmarshaler
type UnmarshalTypeError
- func (e *UnmarshalTypeError) Error() string
type Unmarshaler
- func NewUnmarshaler(root *html.Node, options ...Option) (u *Unmarshaler)
- func (u *Unmarshaler) Unmarshal(v interface{}) (err error)

Constants ¶

const (
	// SelectorTagName is used to reflect the appropriate struct field tag.  The SelectorTagName
	// is the tag used to specify a CSS selector to match for the field
	SelectorTagName = "scraper"

	// TypeTagName (scrapeType) is the tag used to specify what kind of value lookup should be performed.  The
	// default is `text` and simply gathers the text nodes from the matching html subtree.  The
	// alternative type is `attr` which will assign value based on a matching attribute.  The
	// attribute name (for the matched node) is specified following a colon
	TypeTagName = "scrapeType"
)

Scraper uses struct field tags to determine how to unmarshal an HTML element tree into a type. This is similar to how encoding/json uses tags to match json field names to struct field names. There are two tags that scraper uses in its processing, `scraper` and `scrapeType`. Example:

type MyType struct {
  URL string `scraper:"a.myurl" scrapeType:"attr:href"` // parses the href attribute from the matching a
}
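
To make the tag format concrete, the following is a minimal sketch of how the two tags could be read with the standard reflect package. It is an illustration only and not scraper's actual implementation; the helper parseTypeTag is hypothetical, and the "text" default is taken from the description above.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

type MyType struct {
	Name string `scraper:"#name"`
	URL  string `scraper:"a.myurl" scrapeType:"attr:href"`
}

// parseTypeTag is a hypothetical helper (not part of scraper). It splits a
// scrapeType value such as "attr:href" into its kind and attribute name.
// An empty tag falls back to "text", the documented default.
func parseTypeTag(tag string) (kind, attr string) {
	if tag == "" {
		return "text", ""
	}
	parts := strings.SplitN(tag, ":", 2)
	if len(parts) == 2 {
		return parts[0], parts[1]
	}
	return parts[0], ""
}

func main() {
	t := reflect.TypeOf(MyType{})
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		kind, attr := parseTypeTag(f.Tag.Get("scrapeType"))
		fmt.Printf("%s: selector=%q kind=%s attr=%q\n",
			f.Name, f.Tag.Get("scraper"), kind, attr)
	}
}
```

The tag names used in the lookups match the SelectorTagName and TypeTagName constants shown above.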

Variables ¶

var (
	// ErrUnknownTagType indicates that the scrapeType tag is an unknown value
	ErrUnknownTagType = errors.New("Unknown tag type ")
)

Functions ¶

func Unmarshal ¶

func Unmarshal(text []byte, v interface{}) error

Unmarshal will parse the input text and unmarshal it into v

Example ¶

package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

func main() {
	// Parse and unmarshal an HTML document into a very basic Go struct
	document := `<html><body><h1 id="name">Hello Scraper!</h1><a href="https://github.org/mh-orange/scraper">Scraper</a> is Grrrrrreat!</body></html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`

		// URL is assigned the HREF attribute of the first A element found
		URL string `scraper:"a" scrapeType:"attr:href"`
	}{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}

Output:

&{Name:Hello Scraper! URL:https://github.org/mh-orange/scraper}

Example (Nested) ¶

package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

func main() {
	// Scraper can be used to unmarshal structs with other structs
	// in them
	document := `
		<html>
			<body>
				<h1 id="name">Hello Scraper!</h1>
				<ul>
					<li>Item 1</li>
					<li>Item 2</li>
					<li>Item 3</li>
				</ul>
			</body>
		</html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`

		// Items is matched with the ul tag and then names is matched by the
		// li tags within.  Nested structs will be unmarshaled with the matching
		// _subtree_ not the entire document
		Items struct {
			Names []string `scraper:"li"`
		} `scraper:"ul"`
	}{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}

Output:

&{Name:Hello Scraper! Items:{Names:[Item 1 Item 2 Item 3]}}

Example (Slice) ¶

package main

import (
	"fmt"

	"github.com/mh-orange/scraper"
)

func main() {
	// Scraper can be used to unmarshal structs with slices
	// of things as well
	document := `
		<html>
			<body>
				<h1 id="name">Hello Scraper!</h1>
				<ul>
					<li>Item 1</li>
					<li>Item 2</li>
					<li>Item 3</li>
				</ul>
			</body>
		</html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`

		// Items is appended with the text content of each element matching the
		// "ul li" CSS selector
		Items []string `scraper:"ul li"`
	}{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}

Output:

&{Name:Hello Scraper! Items:[Item 1 Item 2 Item 3]}

Types ¶

type BinaryUnmarshaler ¶

type BinaryUnmarshaler interface {
	encoding.BinaryUnmarshaler
}

BinaryUnmarshaler is the interface implemented by an object that can unmarshal the byte string (either text content or attribute) from an element matched by a scraper selector.

type Decoder ¶

type Decoder struct {
	// contains filtered or unexported fields
}

Decoder will read from an io.Reader, parse the content into a root *html.Node and then unmarshal the content into a receiver

Example ¶

package main

import (
	"fmt"
	"strings"

	"github.com/mh-orange/scraper"
)

func main() {
	// Decoder is useful for unmarshaling from an input stream
	document := `<html><body><h1 id="name">Hello Scraper!</h1></body></html>`
	v := &struct {
		// Name is assigned the text content from the element with the ID "name"
		Name string `scraper:"#name"`
	}{}

	reader := strings.NewReader(document)
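	// Decode returns an error if parsing or unmarshaling fails
	if err := scraper.NewDecoder(reader).Decode(v); err != nil {
		panic(err.Error())
	}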
	fmt.Printf("%+v\n", v)
}

Output:

&{Name:Hello Scraper!}

func NewDecoder ¶

func NewDecoder(r io.Reader, options ...Option) *Decoder

NewDecoder initializes a decoder for the given reader and options

func (*Decoder) Decode ¶

func (dec *Decoder) Decode(v interface{}) error

Decode the input stream and unmarshal it into v

type HTMLUnmarshaler ¶

type HTMLUnmarshaler interface {
	UnmarshalHTML(*html.Node) error
}

HTMLUnmarshaler is the interface implemented by types that can unmarshal parsed html directly. The input is a parsed element tree starting at the element that matched the CSS selector specified in the scraper tag

type InvalidUnmarshalError ¶

type InvalidUnmarshalError struct {
	Type reflect.Type
	Want reflect.Kind
}

An InvalidUnmarshalError describes an invalid argument passed to Unmarshal. (The argument to Unmarshal must be a non-nil pointer.)

func (*InvalidUnmarshalError) Error ¶

func (e *InvalidUnmarshalError) Error() string
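
The non-nil pointer requirement can be illustrated with a short reflect check. This is a sketch under stated assumptions, not the library's code: checkTarget is a hypothetical helper, and it returns a plain error rather than an InvalidUnmarshalError.

```go
package main

import (
	"fmt"
	"reflect"
)

// checkTarget is a hypothetical helper (not part of scraper). It reports an
// error unless v is a non-nil pointer, which is the kind of argument an
// unmarshaling function needs in order to modify the caller's value.
func checkTarget(v interface{}) error {
	rv := reflect.ValueOf(v)
	if rv.Kind() != reflect.Ptr || rv.IsNil() {
		return fmt.Errorf("target must be a non-nil pointer, got %T", v)
	}
	return nil
}

func main() {
	type page struct{ Name string }

	var nilPage *page
	fmt.Println(checkTarget(page{}))  // a struct value is rejected
	fmt.Println(checkTarget(nilPage)) // a nil pointer is rejected
	fmt.Println(checkTarget(&page{})) // a non-nil pointer is accepted
}
```

In practice this means calling Unmarshal with &v (or a value created with &MyType{}), as the examples in this documentation do.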

type Option ¶

type Option func(*Unmarshaler) error

Option updates an Unmarshaler with various capabilities

func TrimSpace ¶

func TrimSpace() Option

TrimSpace tells the unmarshaler to trim values using strings.TrimSpace. When a field is set, the value (either text content or attribute value) is trimmed prior to type conversion and assignment.
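
Trimming before conversion matters because text scraped from indented HTML often carries surrounding whitespace. The sketch below uses only the standard library to show the effect; toInt is a hypothetical helper and does not reflect how scraper performs its conversions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toInt is a hypothetical helper (not part of scraper). It optionally trims
// raw with strings.TrimSpace and then converts the result to an int.
func toInt(raw string, trim bool) (int, error) {
	if trim {
		raw = strings.TrimSpace(raw)
	}
	return strconv.Atoi(raw)
}

func main() {
	// Text content as it might appear inside an indented element.
	raw := "\n\t\t42\n\t"

	_, err := toInt(raw, false)
	fmt.Println("without trimming, conversion failed:", err != nil)

	n, err := toInt(raw, true)
	fmt.Println("with trimming:", n, err)
}
```

Based on the signatures in this package, the option is passed as a variadic argument, for example scraper.NewDecoder(reader, scraper.TrimSpace()).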

type TextUnmarshaler ¶

type TextUnmarshaler interface {
	encoding.TextUnmarshaler
}

TextUnmarshaler is the interface implemented by an object that can unmarshal the byte string (either text content or attribute) from an element matched by a scraper selector.

Example ¶

package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/mh-orange/scraper"
)

type Name struct {
	First string
	Last  string
}

func (n *Name) UnmarshalText(text []byte) (err error) {
	tokens := strings.Split(string(text), ", ")
	if len(tokens) == 2 {
		n.Last = tokens[0]
		n.First = tokens[1]
	} else {
		err = errors.New("Wanted comma separated last and first names")
	}
	return err
}

type Class struct {
	Students []Name `scraper:"ul li"`
}

func main() {
	document := `
		<html>
			<body>
				<h1 id="name">Class Roster</h1>
				<ul>
					<li>Stone, John</li>
					<li>Priya, Ponnappa</li>
					<li>Wong, Mia</li>
				</ul>
			</body>
		</html>`
	v := &Class{}
	err := scraper.Unmarshal([]byte(document), v)
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("%+v\n", v)
}

Output:

&{Students:[{First:John Last:Stone} {First:Ponnappa Last:Priya} {First:Mia Last:Wong}]}
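
Because UnmarshalText receives only a byte slice, a custom type can be exercised without any HTML at all. The following sketch, which uses only the standard library, adds a compile-time check that *Name satisfies encoding.TextUnmarshaler and calls the method directly. The sample inputs are made up for illustration.

```go
package main

import (
	"encoding"
	"errors"
	"fmt"
	"strings"
)

type Name struct {
	First string
	Last  string
}

// UnmarshalText parses text in "Last, First" form.
func (n *Name) UnmarshalText(text []byte) error {
	tokens := strings.Split(string(text), ", ")
	if len(tokens) != 2 {
		return errors.New("wanted comma separated last and first names")
	}
	n.Last, n.First = tokens[0], tokens[1]
	return nil
}

// Compile-time assertion: the build fails if *Name stops implementing
// encoding.TextUnmarshaler.
var _ encoding.TextUnmarshaler = (*Name)(nil)

func main() {
	var n Name
	if err := n.UnmarshalText([]byte("Stone, John")); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", n)

	// Input without the ", " separator is rejected.
	fmt.Println(n.UnmarshalText([]byte("John Stone")))
}
```

The method needs a pointer receiver so that the parsed fields are stored in the caller's value rather than in a copy.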

type UnmarshalTypeError ¶

type UnmarshalTypeError struct {
	Value string       // description of value - "bool", "array", "number -5"
	Type  reflect.Type // type of Go value it could not be assigned to
}

An UnmarshalTypeError describes a value that was not appropriate for a value of a specific Go type.

func (*UnmarshalTypeError) Error ¶

func (e *UnmarshalTypeError) Error() string
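
Errors of this shape are usually inspected with errors.As. The sketch below is self-contained and uses a hypothetical stand-in type, typeError, that mirrors the two fields shown above; it is not the scraper type, and setInt is likewise invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strconv"
)

// typeError is a hypothetical stand-in with the same two fields as
// UnmarshalTypeError. It is not the scraper type itself.
type typeError struct {
	Value string       // description of the offending value
	Type  reflect.Type // Go type the value could not be assigned to
}

func (e *typeError) Error() string {
	return "cannot unmarshal " + e.Value + " into Go value of type " + e.Type.String()
}

// setInt converts raw into *dst and reports a *typeError on failure.
func setInt(raw string, dst *int) error {
	n, err := strconv.Atoi(raw)
	if err != nil {
		return &typeError{Value: strconv.Quote(raw), Type: reflect.TypeOf(*dst)}
	}
	*dst = n
	return nil
}

// isTypeError reports whether err is, or wraps, a *typeError.
func isTypeError(err error) bool {
	var te *typeError
	return errors.As(err, &te)
}

func main() {
	var age int
	err := setInt("forty", &age)
	fmt.Println(isTypeError(err), err)

	// errors.As also sees through wrapping added with %w.
	wrapped := fmt.Errorf("field Age: %w", err)
	fmt.Println(isTypeError(wrapped))
}
```

With the real package, the same pattern would target a *scraper.UnmarshalTypeError variable, since Error is declared on the pointer receiver.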

type Unmarshaler ¶

type Unmarshaler struct {
	// contains filtered or unexported fields
}

Unmarshaler processes an HTML tree and unmarshals/parses it into a receiver. The unmarshaler looks for struct field tags matching `scraper` and `scrapeType`

func NewUnmarshaler ¶

func NewUnmarshaler(root *html.Node, options ...Option) (u *Unmarshaler)

NewUnmarshaler creates a scraper Unmarshaler with its root set to the input *html.Node and applies any options given. If any of the options returns an error, that error is held and returned when Unmarshal is called. This allows for chaining the NewUnmarshaler function with Unmarshal:

err := NewUnmarshaler(root).Unmarshal(v)
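
The deferred-error behavior described here is a common variation of the functional options pattern. The sketch below shows the general idea with hypothetical names (worker, option, newWorker, withTrim, failing); it is an assumption about the shape of such a design, not a copy of scraper's source.

```go
package main

import (
	"errors"
	"fmt"
)

// worker is a hypothetical type that remembers the first option error.
type worker struct {
	trim bool
	err  error
}

// option mirrors the shape of a functional option that can fail.
type option func(*worker) error

// withTrim is a hypothetical option that always succeeds.
func withTrim() option {
	return func(w *worker) error { w.trim = true; return nil }
}

// failing is a hypothetical option that always fails.
func failing() option {
	return func(*worker) error { return errors.New("bad option") }
}

// newWorker applies the options and stores the first error instead of
// returning it, so the constructor has a single return value.
func newWorker(opts ...option) *worker {
	w := &worker{}
	for _, opt := range opts {
		if err := opt(w); err != nil {
			w.err = err
			break
		}
	}
	return w
}

// run surfaces any stored option error before doing real work.
func (w *worker) run() error {
	if w.err != nil {
		return w.err
	}
	return nil
}

func main() {
	fmt.Println(newWorker(withTrim()).run())
	fmt.Println(newWorker(withTrim(), failing()).run())
}
```

Returning a single value from the constructor is what makes the one-line chained call possible; the trade-off is that a misconfigured option is only reported when the work is attempted.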

func (*Unmarshaler) Unmarshal ¶

func (u *Unmarshaler) Unmarshal(v interface{}) (err error)

Unmarshal the document into v
