scraper

package module v1.0.3

Published: Jun 1, 2022 License: MIT Imports: 7 Imported by: 0

README

goScraper

goScraper is a small web-scraping library for Go.

Installation

The package can be installed manually using

go get github.com/keinberger/goScraper

When using Go modules, it can also simply be imported:

import "github.com/keinberger/goScraper"

Usage

The package exports several functions. However, the main scrape functions

func (w Website) Scrape(funcs *map[string]interface{}, vars ...interface{}) (string, error)
func (e *Element) ScrapeTreeForElement(nodeTree *html.Node) (content string, err error)
func (e *HtmlElement) GetElementNodes(htmlNode *html.Node) ([]*html.Node, error)

should be the preferred way to use the library.

Because these functions build on the other exported functions, they bundle the features of the library together while requiring only minimal input. For the main Scrape() function, the user only has to provide a custom Website variable.

Example using Scrape()

This example shows how to scrape a website for specific html elements. The matched elements are returned chained together, separated by a custom separator.

The example calls Scrape() on a custom Website variable. The variadic arguments of Scrape() are optional and are not needed in this example.

package main

import (
	"fmt"
	"github.com/keinberger/goScraper"
)

func main() {
	website := scraper.Website{
		URL: "https://wikipedia.org/wiki/wikipedia",
		Elements: []scraper.Element{
			{
				HtmlElement: scraper.HtmlElement{
					Typ: "h1",
					Tags: []scraper.Tag{
						{
							Typ:   "id",
							Value: "firstHeading",
						},
					},
				},
			},
			{
				HtmlElement: scraper.HtmlElement{
					Typ: "td",
					Tags: []scraper.Tag{
						{
							Typ:   "class",
							Value: "infobox-data",
						},
					},
				},
				Index: 0,
			},
		},
		Separator: ", ",
	}

	scraped, err := website.Scrape(nil)
	if err != nil {
		panic(err)
	}

	fmt.Println(scraped)
}
Example using ScrapeTreeForElement()

This example uses ScrapeTreeForElement, which returns the content of an html element (*html.Node) inside a larger node tree. This function is especially useful when one wants only a single html element from a website while still retaining control over the formatting settings.

package main

import (
	"fmt"
	"github.com/keinberger/scraper"
)

func main() {
	htmlNode, err := scraper.GetHTMLNode("https://wikipedia.org/wiki/wikipedia")
	if err != nil {
		panic(err)
	}

	element := scraper.Element{
		HtmlElement: scraper.HtmlElement{
			Typ: "li",
			Tags: []scraper.Tag{
				{
					Typ:   "id",
					Value: "ca-viewsource",
				},
			},
		},
	}
	content, err := element.ScrapeTreeForElement(htmlNode)
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
}
Other exported functions

GetElementNodes returns all html nodes ([]*html.Node) found in the node tree htmlNode *html.Node that have the same properties as e

func (e *HtmlElement) GetElementNodes(htmlNode *html.Node) ([]*html.Node, error)

GetTextOfNode returns the text content of an html element node *html.Node

func GetTextOfNode(node *html.Node, notRecursive bool) (text string) 

RenderNode returns the string representation of a node *html.Node

func RenderNode(node *html.Node) string

GetHTMLNode returns the node tree *html.Node of the html string data

func GetHTMLNode(data string) (*html.Node, error)

GetHTML returns the HTML data of URL

func GetHTML(URL string) (string, error)

Contributions

I created this project as a side-project from my normal work. Any contributions are very welcome. Just open up new issues or create a pull request if you want to contribute.

Documentation

Index

Constants

View Source
const (
	// ErrMissingElement will be returned if the element is missing
	ErrMissingElement = iota
	// ErrNoNodeFound will be returned if no element was found
	ErrNoNodeFound
	// ErrIdxOutOfRange will be returned if the index of an array is out of range
	ErrIdxOutOfRange
)

Variables

This section is empty.

Functions

func GetHTML

func GetHTML(URL string) (string, error)

GetHTML returns the HTML data of URL

func GetHTMLNode

func GetHTMLNode(data string) (*html.Node, error)

GetHTMLNode returns the node tree of the html string data

func GetTextOfNode

func GetTextOfNode(node *html.Node, notRecursive bool) (text string)

GetTextOfNode returns the content of an html element

func RenderNode

func RenderNode(node *html.Node) string

RenderNode returns the string representation of an html.Node

Types

type Element

type Element struct {
	HtmlElement        `json:"htmlElement"`
	Settings           `json:"settings"`
	ContentIsFollowURL *Website `json:"followURL"`
	Index              int      `json:"index"`
}

Element defines the data structure for an element to be looked up by the scraper

func (*Element) ScrapeTreeForElement

func (e *Element) ScrapeTreeForElement(nodeTree *html.Node) (content string, err error)

ScrapeTreeForElement scrapes the node tree for the element and formats its content accordingly

type ErrType

type ErrType int

type Error

type Error struct {
	ErrType
	// contains filtered or unexported fields
}

Error defines the data structure for a custom error

func (Error) Error

func (e Error) Error() string

Error returns the error msg of an error

type FormatSettings

type FormatSettings struct {
	Replacements []ReplaceObj `json:"replacements"`
	Trim         []string     `json:"trim"`
	AddBefore    string       `json:"addBefore"`
	AddAfter     string       `json:"addAfter"`
}

FormatSettings defines the data structure for optional formatting settings of a LookUpElement

type HtmlElement

type HtmlElement struct {
	Typ  string `json:"typ"`
	Tags []Tag  `json:"tags"`
}

HtmlElement defines the data structure for an HTML element

func (*HtmlElement) GetElementNodes

func (e *HtmlElement) GetElementNodes(htmlNode *html.Node) ([]*html.Node, error)

GetElementNodes returns an array of html.Node inside of htmlNode having the same properties as element e

type ReplaceObj

type ReplaceObj struct {
	ToBeReplaced string `json:"toBeReplaced"`
	Replacement  string `json:"replacement"`
}

ReplaceObj defines the data structure for an object that is to be replaced

type Settings

type Settings struct {
	FormatSettings           FormatSettings `json:"formatting"`
	DisallowRecursiveContent bool           `json:"disallowRecursiveContent"`
}

Settings defines the data structure for optional settings of a LookUpElement

type Tag

type Tag struct {
	Typ   string `json:"typ"`
	Value string `json:"value"`
}

Tag defines the data structure for an HTML Tag

type Website

type Website struct {
	URL       string    `json:"URL"`
	Elements  []Element `json:"Elements"`
	Separator string    `json:"separator"`
}

Website defines the website data type for the scraper

func (Website) Scrape

func (w Website) Scrape(funcs *map[string]interface{}, vars ...interface{}) (string, error)

Scrape scrapes the website w, returning the found elements in a single string, separated by Separator
