README
scrape
A simple, higher-level interface for Go web scraping.
When scraping with Go, I find myself redefining tree traversal and other utility functions.
This package is a place to put some simple tools which build on top of the Go HTML parsing library.
For the full interface, check out the godoc.
Sample
Scrape defines traversal functions like Find and FindAll while attempting to be generic. It also defines convenience functions such as Attr and Text.
// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}

// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
	// Print the title
	fmt.Println(scrape.Text(title))
}
A full example: Scraping Hacker News
package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// close the response body once we are done with it
	defer resp.Body.Close()
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}

	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
Documentation
Overview
Package scrape provides a searching API on top of golang.org/x/net/html.
Index
- func Attr(node *html.Node, key string) string
- func Find(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func FindAll(node *html.Node, matcher Matcher) []*html.Node
- func FindAllNested(node *html.Node, matcher Matcher) []*html.Node
- func FindNextSibling(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func FindParent(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func FindPrevSibling(node *html.Node, matcher Matcher) (n *html.Node, ok bool)
- func Text(node *html.Node) string
- func TextJoin(node *html.Node, join func([]string) string) string
- type Matcher
Constants
This section is empty.
Variables
This section is empty.
Functions
func Find
Find returns the first node which matches the matcher using depth-first search. If no node is found, ok will be false.
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
matcher := func(n *html.Node) bool {
	return n.DataAtom == atom.Body
}
body, ok := scrape.Find(root, matcher)
func FindAll
FindAll returns all nodes which match the provided Matcher. After discovering a matching node, it will _not_ discover matching subnodes of that node.
func FindAllNested
FindAllNested returns all nodes which match the provided Matcher and _will_ discover matching subnodes of matching nodes.
func FindNextSibling
FindNextSibling returns the first node which matches the matcher by searching forward through the node's subsequent siblings. If no node is found, ok will be false.
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
matcher := func(n *html.Node) bool {
	return n.DataAtom == atom.Body
}
// <body> is a sibling of <head>, so search forward from there
head, _ := scrape.Find(root, scrape.ByTag(atom.Head))
body, ok := scrape.FindNextSibling(head, matcher)
func FindParent
FindParent searches up the HTML tree from the current node until either a match is found or the top is reached.
func FindPrevSibling
FindPrevSibling returns the first node which matches the matcher by searching backward through the node's preceding siblings. If no node is found, ok will be false.
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
matcher := func(n *html.Node) bool {
	return n.DataAtom == atom.Head
}
// <head> precedes <body>, so search backward from <body>
body, _ := scrape.Find(root, scrape.ByTag(atom.Body))
head, ok := scrape.FindPrevSibling(body, matcher)