soup

Version: v1.1.2-0...-9580d08
Warning

This package is not in the latest version of its module.

Published: Sep 20, 2019 License: MIT Imports: 7 Imported by: 0

README

soup


Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented so far:

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string, error) {} // Takes the URL as an argument, returns the HTML string
func GetWithClient(string, *http.Client) (string, error) {} // Takes the URL and a custom HTTP client as arguments, returns the HTML string
func Header(string, string) {} // Takes a key-value pair to set as a header for the HTTP request made in Get()
func Cookie(string, string) {} // Takes a key-value pair to set as a cookie to be sent with the HTTP request made in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the constructed DOM
func Find(...string) Root {} // Element tag (and optionally an attribute key-value pair) as arguments, pointer to the first occurrence returned
func FindAll(...string) []Root {} // Same as Find(), but pointers to all occurrences returned
func FindStrict(...string) Root {} // Element tag (and optionally an attribute key-value pair) as arguments, pointer to the first occurrence with exactly matching values returned
func FindAllStrict(...string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned
func Children() []Root {} // Find all direct children of this DOM element
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookup to their respective values
func Text() string {} // Full text inside a non-nested tag returned, first half returned in a nested one
func FullText() string {} // Full text inside a nested/non-nested tag returned
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default

Root is a struct containing three fields:

  • Pointer containing the pointer to the current html node
  • NodeValue containing the current html node's value, i.e. the tag name for an ElementNode, or the text in case of a TextNode
  • Error containing an error if one occurs, else nil.

Installation

Install the package using the command

go get github.com/anaskhan96/soup

Example

The example below scrapes the "Comics I Enjoy" section (link text and URLs) from xkcd.

More Examples

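package main

import (
	"fmt"
	"os"

	"github.com/anaskhan96/soup"
)

func main() {
	// Fetch the page; Get returns the response body as a string.
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	// Find stores an error in the Error field when nothing matches.
	comicLinks := doc.Find("div", "id", "comicLinks")
	if comicLinks.Error != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", comicLinks.Error)
		os.Exit(1)
	}
	// Print the text and target of every link inside the div.
	for _, link := range comicLinks.FindAll("a") {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}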

Contributions

This package was developed in my free time. However, contributions from everybody in the community are welcome, to make it a better web scraper. If you think there should be a particular feature or function included in the package, feel free to open up a new issue or pull request.

Documentation

Index

Constants

This section is empty.

Variables

var Cookies = make(map[string]string)

Cookies contains all HTTP cookies to send

var Headers = make(map[string]string)

Headers contains all HTTP headers to send

Functions

func Cookie(n string, v string)

Cookie sets a cookie to be sent with the HTTP request

func Get

func Get(url string) (string, error)

Get returns the HTML served at the url as a string, using the default HTTP client

func GetWithClient

func GetWithClient(url string, client *http.Client) (string, error)

GetWithClient returns the HTML served at the url as a string, using the provided HTTP client

func Header(n string, v string)

Header sets a new HTTP header

func SetDebug

func SetDebug(d bool)

SetDebug sets the debug status. Setting this to true causes panics to be thrown and logged onto the console. Setting this to false causes errors to be saved in the Error field of the returned struct.

Types

type Root

type Root struct {
	Pointer   *html.Node
	NodeValue string
	Error     error
}

Root is a structure containing a pointer to an html node, the node value, and an error variable that holds an error if one occurred

func HTMLParse

func HTMLParse(s string) Root

HTMLParse parses the HTML returning a start pointer to the DOM

func (Root) Attrs

func (r Root) Attrs() map[string]string

Attrs returns a map containing all attributes

func (Root) Children

func (r Root) Children() []Root

Children returns all direct children of this DOM element.

func (Root) Find

func (r Root) Find(args ...string) Root

Find finds the first occurrence of the given tag name, with or without attribute key and value specified, and returns a struct with a pointer to it

func (Root) FindAll

func (r Root) FindAll(args ...string) []Root

FindAll finds all occurrences of the given tag name, with or without attribute key and value specified, and returns a slice of structs, each holding a pointer to the respective occurrence

func (Root) FindAllStrict

func (r Root) FindAllStrict(args ...string) []Root

FindAllStrict finds all occurrences of the given tag name only if all the values of the provided attribute are an exact match

func (Root) FindNextElementSibling

func (r Root) FindNextElementSibling() Root

FindNextElementSibling finds the next element sibling of the pointer in the DOM returning a struct with a pointer to it

func (Root) FindNextSibling

func (r Root) FindNextSibling() Root

FindNextSibling finds the next sibling of the pointer in the DOM returning a struct with a pointer to it

func (Root) FindPrevElementSibling

func (r Root) FindPrevElementSibling() Root

FindPrevElementSibling finds the previous element sibling of the pointer in the DOM returning a struct with a pointer to it

func (Root) FindPrevSibling

func (r Root) FindPrevSibling() Root

FindPrevSibling finds the previous sibling of the pointer in the DOM returning a struct with a pointer to it

func (Root) FindStrict

func (r Root) FindStrict(args ...string) Root

FindStrict finds the first occurrence of the given tag name only if all the values of the provided attribute are an exact match

func (Root) FullText

func (r Root) FullText() string

FullText returns the string inside even a nested element

func (Root) Text

func (r Root) Text() string

Text returns the string inside a non-nested element

Directories

Path Synopsis
examples
