webgrab

Version: v0.1.4 (latest)
Published: Dec 10, 2023
License: BSD-3-Clause
Imports: 7
Imported by: 0

Repository

github.com/glitchruk/webgrab

README ¶

🌍🤏 WebGrab

WebGrab is a small Go library for scraping web pages into tagged structs. It is built on top of the goquery library.

Installation

go get github.com/glitchruk/webgrab

Usage

package main

import (
    "fmt"

    "github.com/glitchruk/webgrab"
)

type Page struct {
    Title    string `grab:"title"`
    Body     string `grab:"body"`
    Keywords string `grab:"meta[name=keywords]" attr:"content"`
}

func main() {
    page := Page{}
    
    grabber := webgrab.New()
    grabber.Timeout = 30
    grabber.MaxRedirects = 10
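    // Grab returns an error, so check it before using the result.
    if err := grabber.Grab("http://example.com", &page); err != nil {
        fmt.Println("grab failed:", err)
        return
    }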

    fmt.Println(page.Title)
    fmt.Println(page.Body)
    fmt.Println(page.Keywords)
}

Tag Syntax

The defined tags are:

grab:"selector" - The selector to use to grab the value.
attr:"attribute" - The attribute of the selected element to grab.
extract:"regexp" - A regular expression to extract a value from a string.
filter:"regexp" - A regular expression to filter the value of a field.

The selector is a goquery selector, which uses CSS selector syntax. The attr tag is optional and names the attribute of the selected element to grab. If no attribute is specified, the text of the selected element is grabbed instead.
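
For readers new to struct tags, the following sketch shows how tags such as grab and attr can be read with the standard reflect package. It is an illustration only and is not taken from WebGrab's source; the rule type and the rulesOf helper are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
)

// Page mirrors part of the struct from the usage example.
type Page struct {
	Title    string `grab:"title"`
	Keywords string `grab:"meta[name=keywords]" attr:"content"`
}

// rule is a hypothetical record of what one tagged field asks for.
type rule struct {
	Field    string
	Selector string
	Attr     string // empty means "use the element's text"
}

// rulesOf reads the grab and attr tags from every field of a struct value.
func rulesOf(v interface{}) []rule {
	t := reflect.TypeOf(v)
	var rules []rule
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		sel, ok := f.Tag.Lookup("grab")
		if !ok {
			continue // fields without a grab tag are skipped
		}
		rules = append(rules, rule{Field: f.Name, Selector: sel, Attr: f.Tag.Get("attr")})
	}
	return rules
}

func main() {
	for _, r := range rulesOf(Page{}) {
		fmt.Printf("%s: selector=%q attr=%q\n", r.Field, r.Selector, r.Attr)
	}
}
```

An empty Attr here corresponds to the case where the text of the selected element would be used.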

Arrays

If the field is a slice, all matching elements will be grabbed. For example, to grab all links from a page:

type Page struct {
    Links []string `grab:"a[href]" attr:"href"`
}

Nested Structs

It is possible to use nested structs to grab values from the page. For example, to grab the title and meta keywords from a page:

type Page struct {
    Title string `grab:"title"`
    Meta  struct {
        Keywords string `grab:"meta[name=keywords]" attr:"content"`
        Author   string `grab:"meta[name=author]" attr:"content"`
    }
}
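
To make the nesting concrete, here is a sketch of how a tagged struct with an inner struct can be walked recursively using reflect. It only lists which selector belongs to which field path; it does not fetch anything and is not WebGrab's implementation. The selectorsOf and walk helpers are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// Page is the nested struct from the example above.
type Page struct {
	Title string `grab:"title"`
	Meta  struct {
		Keywords string `grab:"meta[name=keywords]" attr:"content"`
		Author   string `grab:"meta[name=author]" attr:"content"`
	}
}

// walk descends into struct-typed fields and collects "path -> selector"
// entries for every field that carries a grab tag.
func walk(t reflect.Type, prefix string) []string {
	var out []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		path := prefix + f.Name
		if f.Type.Kind() == reflect.Struct {
			out = append(out, walk(f.Type, path+".")...)
			continue
		}
		if sel, ok := f.Tag.Lookup("grab"); ok {
			out = append(out, path+" -> "+sel)
		}
	}
	return out
}

// selectorsOf is a convenience wrapper that takes a struct value.
func selectorsOf(v interface{}) []string {
	return walk(reflect.TypeOf(v), "")
}

func main() {
	for _, line := range selectorsOf(Page{}) {
		fmt.Println(line)
	}
}
```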

Extract

The extract tag can be used to extract a value from a string using a regular expression. For example, to extract the title from a Wikipedia page:

type Page struct {
    Title string `grab:"title" extract:"(.+) - Wikipedia"`
}
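
The effect of such a pattern can be tried out with the standard regexp package. The sketch below assumes that extract yields the first capture group and leaves the value unchanged when the pattern does not match; both behaviours are assumptions, and the extract helper is a hypothetical stand-in rather than WebGrab's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// extract returns the first capture group of pattern in s. If the pattern
// does not match or has no capture group, s is returned unchanged
// (an assumed fallback, chosen for this sketch).
func extract(pattern, s string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	m := re.FindStringSubmatch(s)
	if len(m) < 2 {
		return s, nil
	}
	return m[1], nil
}

func main() {
	title, err := extract(`(.+) - Wikipedia`, "Go (programming language) - Wikipedia")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(title) // prints: Go (programming language)
}
```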

Filter

The filter tag can be used to filter the value of a field. For example, to get all links that end with .html:

type Page struct {
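    // [.] matches a literal dot. A backslash escape such as \. is not a valid
    // escape in a conventional struct tag value, so reflect would not return it.
    Links []string `grab:"a[href]" attr:"href" filter:".*[.]html$"`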
}
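
As a rough illustration of this kind of filtering, the sketch below keeps only the values that match a regular expression, using the standard regexp package. It assumes that filter means "keep matching values"; the keepMatching helper and the sample links are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepMatching returns the values that match pattern, preserving order.
func keepMatching(pattern string, values []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var kept []string
	for _, v := range values {
		if re.MatchString(v) {
			kept = append(kept, v)
		}
	}
	return kept, nil
}

func main() {
	links := []string{"/index.html", "/logo.png", "/docs/intro.html", "/feed.xml"}
	kept, err := keepMatching(`.*\.html$`, links)
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(kept) // prints: [/index.html /docs/intro.html]
}
```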

Documentation ¶

Index ¶

type Grabber
- func New() *Grabber
- func (g Grabber) Grab(url string, data interface{}) error

Types ¶

type Grabber ¶

type Grabber struct {
	// Timeout is the timeout in seconds for the grabber.
	Timeout int

	// MaxRedirects is the maximum number of redirects to follow.
	MaxRedirects int

	// UserAgent is the user agent to use for the grabber.
	UserAgent string
}

Grabber holds the configuration used when grabbing a page: the timeout in seconds, the maximum number of redirects to follow, and the user agent.
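
For orientation, settings like these are commonly mapped onto the standard net/http client as sketched below. How WebGrab applies them internally is not stated here, so treat this as an assumption; newClient, newRequest and redirectAllowed are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// redirectAllowed reports whether another request may be made after
// `followed` requests have already been issued in the redirect chain.
func redirectAllowed(followed, max int) bool {
	return followed < max
}

// newClient builds an http.Client from a timeout in seconds and a redirect cap.
func newClient(timeoutSeconds, maxRedirects int) *http.Client {
	return &http.Client{
		Timeout: time.Duration(timeoutSeconds) * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if !redirectAllowed(len(via), maxRedirects) {
				return errors.New("stopped after too many redirects")
			}
			return nil
		},
	}
}

// newRequest creates a GET request and sets the User-Agent header when given.
func newRequest(url, userAgent string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if userAgent != "" {
		req.Header.Set("User-Agent", userAgent)
	}
	return req, nil
}

func main() {
	client := newClient(30, 10)
	req, err := newRequest("http://example.com", "demo-agent/1.0")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Println(client.Timeout, req.Header.Get("User-Agent"))
}
```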

func New ¶

func New() *Grabber

New returns a new Grabber with default values.

func (Grabber) Grab ¶

func (g Grabber) Grab(url string, data interface{}) error

Grab grabs the data from the given URL and stores it in the given data struct.

Source Files ¶

grab.go