README
scrape
A simple, higher-level interface for Go web scraping.
When scraping with Go, I find myself redefining tree traversal and other utility functions.
This package is a place to put some simple tools which build on top of the Go HTML parsing library.
For the full interface, check out the godoc.
Sample
Scrape defines traversal functions like Find and FindAll while attempting to be generic. It also defines convenience functions such as Attr and Text.
// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}

// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
	// Print the title
	fmt.Println(scrape.Text(title))
}
A full example: Scraping Hacker News
package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// close the response body once we are done with it
	defer resp.Body.Close()
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}

	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
Documentation
Overview
Package scrape provides a searching API on top of golang.org/x/net/html.
Index
- func Attr(node *html.Node, key string) string
- func Find(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func FindAll(node *html.Node, matcher Matcher) []*html.Node
- func FindAllNested(node *html.Node, matcher Matcher) []*html.Node
- func FindNextSibling(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func FindParent(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func FindPrevSibling(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func Text(node *html.Node) string
- func TextJoin(node *html.Node, join func([]string) string) string
- type Matcher
Constants
This section is empty.
Variables
This section is empty.
Functions
func Find
Find returns the first node which matches the matcher using depth-first search. If no node is found, ok will be false.
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
matcher := func(n *html.Node) bool {
	return n.DataAtom == atom.Body
}
body, ok := scrape.Find(root, matcher)
func FindAll
FindAll returns all nodes which match the provided Matcher. After discovering a matching node, it will _not_ discover matching subnodes of that node.
func FindAllNested
FindAllNested returns all nodes which match the provided Matcher and _will_ discover matching subnodes of matching nodes.
func FindNextSibling
FindNextSibling returns the first node which matches the matcher by searching forward through the node's subsequent siblings. If no node is found, ok will be false.
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
matcher := func(n *html.Node) bool {
	return n.DataAtom == atom.Body
}
// <body> is a sibling of <head>, so search forward from there
head, _ := scrape.Find(root, scrape.ByTag(atom.Head))
body, ok := scrape.FindNextSibling(head, matcher)
func FindParent
FindParent searches up the HTML tree from the current node until either a match is found or the top is reached.
func FindPrevSibling
FindPrevSibling returns the first node which matches the matcher by searching backward through the node's preceding siblings. If no node is found, ok will be false.
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
matcher := func(n *html.Node) bool {
	return n.DataAtom == atom.Head
}
// <head> precedes <body>, so search backward from <body>
body, _ := scrape.Find(root, scrape.ByTag(atom.Body))
head, ok := scrape.FindPrevSibling(body, matcher)