Documentation ¶
Overview ¶
Package scraper provides a means to parse and unmarshal HTML into Go structs. Usage is best described by example:
	package main

	import (
		"fmt"

		"github.com/mh-orange/scraper"
	)

	type MyType struct {
		Name string `scraper:"#name"`
		URL  string `scraper:"a" scrapeType:"attr:href"`
	}

	func main() {
		document := `<html><body><h1 id="name">Hello Scraper!</h1><a href="https://github.org/mh-orange/scraper">Scraper</a> is Grrrrrreat!</body></html>`
		v := &MyType{}
		err := scraper.Unmarshal([]byte(document), v)
		if err != nil {
			panic(err.Error())
		}
		fmt.Printf("%+v\n", v) // &{Name:Hello Scraper! URL:https://github.org/mh-orange/scraper}
	}
Structs are unmarshaled by matching CSS selectors to elements in an HTML document tree. Scraper uses the Cascadia (https://github.com/andybalholm/cascadia) package to parse and match CSS selectors.
To specify matching and unmarshaling rules, use the "scraper" and "scrapeType" struct field tags. The "scraper" tag defines the CSS selector, and the "scrapeType" tag indicates whether the value should be the text content or an attribute of the matching element. If the "scrapeType" tag is omitted, the text content is used. For example, to match an element with the id "name" and capture its text content:
	type MyType struct {
		Name string `scraper:"#name"`
	}
Another example, which uses the href attribute of a matching "a" tag:
	type MyType struct {
		URL string `scraper:"a" scrapeType:"attr:href"`
	}
Note that the attribute name is specified after the type (attr) and a separating colon.
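The `type:attribute` form described above can be illustrated with a small standalone sketch. The helper `splitScrapeType` is hypothetical and is not part of the scraper API; it only shows one way such a tag value could be split, assuming an empty value falls back to the documented default of `text`.

```go
package main

import (
	"fmt"
	"strings"
)

// splitScrapeType is a hypothetical helper, not part of the scraper package.
// It splits a scrapeType tag value such as "attr:href" into the lookup kind
// ("attr") and the attribute name ("href"). An empty value is treated as
// "text", which is the default described in the documentation.
func splitScrapeType(value string) (kind, attr string) {
	if value == "" {
		return "text", ""
	}
	// Everything before the first colon is the kind; the rest is the
	// attribute name. If there is no colon, attr stays empty.
	kind, attr, _ = strings.Cut(value, ":")
	return kind, attr
}

func main() {
	for _, v := range []string{"", "text", "attr:href"} {
		kind, attr := splitScrapeType(v)
		fmt.Printf("%q -> kind=%q attr=%q\n", v, kind, attr)
	}
}
```

How the package itself validates the kind (for example, when it reports ErrUnknownTagType) is not shown here.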
Types that implement encoding.BinaryUnmarshaler or encoding.TextUnmarshaler are honored:
	type Name struct {
		First string
		Last  string
	}

	func (n *Name) UnmarshalText(text []byte) (err error) {
		tokens := strings.Split(string(text), ", ")
		if len(tokens) == 2 {
			n.Last = tokens[0]
			n.First = tokens[1]
		} else {
			err = errors.New("Wanted comma separated last and first names")
		}
		return err
	}

	type Class struct {
		Students []Name `scraper:"ul li"`
	}
Constants ¶
	const (
		// SelectorTagName is used to reflect the appropriate struct field tag. The SelectorTagName
		// is the tag used to specify a CSS selector to match for the field
		SelectorTagName = "scraper"

		// TypeTagName (scrapeType) is the tag used to specify what kind of value lookup should be performed. The
		// default is `text` and simply gathers the text nodes from the matching html subtree. The
		// alternative type is `attr` which will assign value based on a matching attribute. The
		// attribute name (for the matched node) is specified following a colon
		TypeTagName = "scrapeType"
	)
Scraper uses struct field tags to determine how to unmarshal an HTML element tree into a type. This is similar to how encoding/json uses tags to match json field names to struct field names. There are two tags that scraper uses in its processing, `scraper` and `scrapeType`. Example:
	type MyType struct {
		URL string `scraper:"a.myurl" scrapeType:"attr:href"` // parses the href attribute from the matching a
	}
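As a rough illustration of tag-driven processing, the following sketch reads both tags from a struct using only the standard `reflect` package. The `fieldRule` type and `rules` function are hypothetical names introduced for this example, and skipping fields without a `scraper` tag is an assumption, not documented behavior of the package.

```go
package main

import (
	"fmt"
	"reflect"
)

type MyType struct {
	Name string `scraper:"#name"`
	URL  string `scraper:"a.myurl" scrapeType:"attr:href"`
	Note string // no scraper tag
}

// fieldRule is a hypothetical record of the tags found on one field.
type fieldRule struct {
	Field, Selector, Type string
}

// rules lists the selector and lookup type declared on each tagged field
// of a struct (or pointer to struct). Fields without a "scraper" tag are
// skipped, which is an assumption made for this sketch.
func rules(v interface{}) []fieldRule {
	t := reflect.TypeOf(v)
	if t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	var out []fieldRule
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		selector, ok := f.Tag.Lookup("scraper")
		if !ok {
			continue
		}
		typ := f.Tag.Get("scrapeType")
		if typ == "" {
			typ = "text" // documented default when scrapeType is omitted
		}
		out = append(out, fieldRule{Field: f.Name, Selector: selector, Type: typ})
	}
	return out
}

func main() {
	for _, r := range rules(&MyType{}) {
		fmt.Printf("%s: selector=%q type=%q\n", r.Field, r.Selector, r.Type)
	}
}
```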
Variables ¶
	var (
		// ErrUnknownTagType indicates that the scrapeType tag is an unknown value
		ErrUnknownTagType = errors.New("Unknown tag type ")
	)
Functions ¶
func Unmarshal ¶
Unmarshal will parse the input text and unmarshal it into v
Example ¶
	package main

	import (
		"fmt"

		"github.com/mh-orange/scraper"
	)

	func main() {
		// Parse and unmarshal an HTML document into a very basic Go struct
		document := `<html><body><h1 id="name">Hello Scraper!</h1><a href="https://github.org/mh-orange/scraper">Scraper</a> is Grrrrrreat!</body></html>`
		v := &struct {
			// Name is assigned the text content from the element with the ID "name"
			Name string `scraper:"#name"`

			// URL is assigned the HREF attribute of the first A element found
			URL string `scraper:"a" scrapeType:"attr:href"`
		}{}

		err := scraper.Unmarshal([]byte(document), v)
		if err != nil {
			panic(err.Error())
		}
		fmt.Printf("%+v\n", v)
	}
Output: &{Name:Hello Scraper! URL:https://github.org/mh-orange/scraper}
Example (Nested) ¶
	package main

	import (
		"fmt"

		"github.com/mh-orange/scraper"
	)

	func main() {
		// Scraper can be used to unmarshal structs with other structs
		// in them
		document := `
	<html>
		<body>
			<h1 id="name">Hello Scraper!</h1>
			<ul>
				<li>Item 1</li>
				<li>Item 2</li>
				<li>Item 3</li>
			</ul>
		</body>
	</html>`

		v := &struct {
			// Name is assigned the text content from the element with the ID "name"
			Name string `scraper:"#name"`

			// Items is matched with the ul tag and then names is matched by the
			// li tags within. Nested structs will be unmarshaled with the matching
			// _subtree_ not the entire document
			Items struct {
				Names []string `scraper:"li"`
			} `scraper:"ul"`
		}{}

		err := scraper.Unmarshal([]byte(document), v)
		if err != nil {
			panic(err.Error())
		}
		fmt.Printf("%+v\n", v)
	}
Output: &{Name:Hello Scraper! Items:{Names:[Item 1 Item 2 Item 3]}}
Example (Slice) ¶
	package main

	import (
		"fmt"

		"github.com/mh-orange/scraper"
	)

	func main() {
		// Scraper can be used to unmarshal structs with slices
		// of things as well
		document := `
	<html>
		<body>
			<h1 id="name">Hello Scraper!</h1>
			<ul>
				<li>Item 1</li>
				<li>Item 2</li>
				<li>Item 3</li>
			</ul>
		</body>
	</html>`

		v := &struct {
			// Name is assigned the text content from the element with the ID "name"
			Name string `scraper:"#name"`

			// Items is appended with the text content of each element matching the
			// "ul li" CSS selector
			Items []string `scraper:"ul li"`
		}{}

		err := scraper.Unmarshal([]byte(document), v)
		if err != nil {
			panic(err.Error())
		}
		fmt.Printf("%+v\n", v)
	}
Output: &{Name:Hello Scraper! Items:[Item 1 Item 2 Item 3]}
Types ¶
type BinaryUnmarshaler ¶
type BinaryUnmarshaler interface { encoding.BinaryUnmarshaler }
BinaryUnmarshaler is the interface implemented by an object that can unmarshal the byte string (either text content or attribute) from an element matched by a scraper selector.
type Decoder ¶
type Decoder struct {
// contains filtered or unexported fields
}
Decoder will read from an io.Reader, parse the content into a root *html.Node and then unmarshal the content into a receiver
Example ¶
	package main

	import (
		"fmt"
		"strings"

		"github.com/mh-orange/scraper"
	)

	func main() {
		// Decoder is useful for unmarshaling from an input stream
		document := `<html><body><h1 id="name">Hello Scraper!</h1></body></html>`
		v := &struct {
			// Name is assigned the text content from the element with the ID "name"
			Name string `scraper:"#name"`
		}{}

		reader := strings.NewReader(document)
		scraper.NewDecoder(reader).Decode(v)
		fmt.Printf("%+v\n", v)
	}
Output: &{Name:Hello Scraper!}
func NewDecoder ¶
NewDecoder initializes a decoder for the given reader and options
type HTMLUnmarshaler ¶
HTMLUnmarshaler is the interface implemented by types that can unmarshal parsed html directly. The input is a parsed element tree starting at the element that matched the CSS selector specified in the scraper tag
type InvalidUnmarshalError ¶
An InvalidUnmarshalError describes an invalid argument passed to Unmarshal. (The argument to Unmarshal must be a non-nil pointer.)
func (*InvalidUnmarshalError) Error ¶
func (e *InvalidUnmarshalError) Error() string
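The "non-nil pointer" requirement can be checked with the standard `reflect` package. The sketch below uses a hypothetical helper, `isValidTarget`, to show the kind of test that would lead to an InvalidUnmarshalError; it follows the pattern used by encoding/json and is not taken from the scraper source.

```go
package main

import (
	"fmt"
	"reflect"
)

// isValidTarget is a hypothetical helper that reports whether v is a
// non-nil pointer, the requirement stated for the argument to Unmarshal.
func isValidTarget(v interface{}) bool {
	rv := reflect.ValueOf(v)
	// For an untyped nil, rv is the zero Value with Kind Invalid, so the
	// short-circuit keeps IsNil from being called on it.
	return rv.Kind() == reflect.Ptr && !rv.IsNil()
}

func main() {
	type page struct{ Name string }
	var nilPage *page

	fmt.Println(isValidTarget(&page{})) // true: non-nil pointer
	fmt.Println(isValidTarget(page{}))  // false: not a pointer
	fmt.Println(isValidTarget(nilPage)) // false: nil pointer
	fmt.Println(isValidTarget(nil))     // false: untyped nil
}
```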
type Option ¶
type Option func(*Unmarshaler) error
Option updates an Unmarshaler with various capabilities
type TextUnmarshaler ¶
type TextUnmarshaler interface { encoding.TextUnmarshaler }
TextUnmarshaler is the interface implemented by an object that can unmarshal the byte string (either text content or attribute) from an element matched by a scraper selector.
Example ¶
	package main

	import (
		"errors"
		"fmt"
		"strings"

		"github.com/mh-orange/scraper"
	)

	type Name struct {
		First string
		Last  string
	}

	func (n *Name) UnmarshalText(text []byte) (err error) {
		tokens := strings.Split(string(text), ", ")
		if len(tokens) == 2 {
			n.Last = tokens[0]
			n.First = tokens[1]
		} else {
			err = errors.New("Wanted comma separated last and first names")
		}
		return err
	}

	type Class struct {
		Students []Name `scraper:"ul li"`
	}

	func main() {
		document := `
	<html>
		<body>
			<h1 id="name">Class Roster</h1>
			<ul>
				<li>Stone, John</li>
				<li>Priya, Ponnappa</li>
				<li>Wong, Mia</li>
			</ul>
		</body>
	</html>`

		v := &Class{}

		err := scraper.Unmarshal([]byte(document), v)
		if err != nil {
			panic(err.Error())
		}
		fmt.Printf("%+v\n", v)
	}
Output: &{Students:[{First:John Last:Stone} {First:Ponnappa Last:Priya} {First:Mia Last:Wong}]}
type UnmarshalTypeError ¶
	type UnmarshalTypeError struct {
		Value string       // description of value - "bool", "array", "number -5"
		Type  reflect.Type // type of Go value it could not be assigned to
	}
An UnmarshalTypeError describes a value that was not appropriate for a value of a specific Go type.
func (*UnmarshalTypeError) Error ¶
func (e *UnmarshalTypeError) Error() string
type Unmarshaler ¶
type Unmarshaler struct {
// contains filtered or unexported fields
}
Unmarshaler processes an HTML tree and unmarshals/parses it into a receiver. The unmarshaler looks for struct field tags matching `scraper` and `scrapeType`
func NewUnmarshaler ¶
func NewUnmarshaler(root *html.Node, options ...Option) (u *Unmarshaler)
NewUnmarshaler creates a scraper Unmarshaler with its root set to the input *html.Node and setting any options given. If any of the options generate an error, then that error is passed through upon calling Unmarshal. This allows for chaining the NewUnmarshaler function with Unmarshal:
err := NewUnmarshaler(root).Unmarshal(v)
func (*Unmarshaler) Unmarshal ¶
func (u *Unmarshaler) Unmarshal(v interface{}) (err error)
Unmarshal the document into v