soup

v1.0.0
Warning: This package is not in the latest version of its module.
Published: Mar 29, 2017 License: MIT Imports: 5 Imported by: 0

README

soup

Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go whose interface closely resembles that of BeautifulSoup.

Functions implemented so far:

func Get(string) (string, error) // Takes a URL and returns the page's HTML as a string
func HTMLParse(string) Root // Takes an HTML string and returns the root of the parsed DOM
func Find(...string) Root // Takes an element tag and an optional attribute key-value pair; returns the first occurrence
func FindAll(...string) []Root // Same as Find(), but returns all occurrences
func FindNextSibling() Root // Returns the next sibling of the element in the DOM
func FindNextElementSibling() Root // Returns the next element sibling of the element in the DOM
func FindPrevSibling() Root // Returns the previous sibling of the element in the DOM
func FindPrevElementSibling() Root // Returns the previous element sibling of the element in the DOM
func Attrs() map[string]string // Returns a map from each attribute name of the element to its value
func Text() string // Returns the full text inside a non-nested tag

The Root struct returned by these functions has two fields:

  • Pointer, a pointer to the current HTML node (*html.Node)
  • NodeValue, the current HTML node's value: the tag name for an ElementNode, or the text for a TextNode

Installation

Install the package with:

go get github.com/anaskhan96/soup

Example

The example below scrapes the "Comics I Enjoy" section (link text and URLs) from xkcd. More examples are in the examples directory.

package main

import (
	"fmt"
	"os"

	"github.com/anaskhan96/soup"
)

func main() {
	// Fetch the page; report the error instead of exiting silently.
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	// Collect every <a> inside <div id="comicLinks">.
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}

Contributions

This package was developed in my free time, and contributions from the community are welcome to help make it a better web scraper. If you think the package is missing a particular feature or function, feel free to open an issue or a pull request.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Get

func Get(url string) (string, error)

Get fetches the given URL and returns the HTML of the response as a string.

Types

type Node

type Node interface {
	Find(args ...string) Root
	Attrs() map[string]string
	Text() string
	FindAll(args ...string) []Root
	FindNextSibling() Root
	FindPrevSibling() Root
	FindNextElementSibling() Root
	FindPrevElementSibling() Root
}

type Root

type Root struct {
	Pointer   *html.Node
	NodeValue string
}

func HTMLParse

func HTMLParse(s string) Root

HTMLParse parses the HTML string and returns a Root pointing to the start of the DOM.

func (Root) Attrs

func (r Root) Attrs() map[string]string

Attrs returns a map from each attribute key of the element to its value.

func (Root) Find

func (r Root) Find(args ...string) Root

Find finds the first occurrence of the given tag name, optionally filtered by an attribute key and value, and returns a Root pointing to it.

func (Root) FindAll

func (r Root) FindAll(args ...string) []Root

FindAll finds all occurrences of the given tag name, optionally filtered by an attribute key and value, and returns a slice of Root values, one pointing to each match.
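The argument convention shared by Find and FindAll (a tag name, optionally followed by one attribute key and value) can be sketched on a toy tree. Everything below is hypothetical: the element type and the matches, find, and findAll helpers are simplified stand-ins written for this illustration, not the package's implementation, which works on *html.Node.

```go
package main

import "fmt"

// element is a hypothetical, simplified element node used only in this sketch.
type element struct {
	tag      string
	attrs    map[string]string
	children []*element
}

// matches reports whether e satisfies args given as (tag) or (tag, key, value).
func matches(e *element, args ...string) bool {
	if len(args) == 0 || e.tag != args[0] {
		return false
	}
	if len(args) >= 3 {
		return e.attrs[args[1]] == args[2]
	}
	return true
}

// find returns the first matching descendant in document order, or nil.
func find(root *element, args ...string) *element {
	for _, c := range root.children {
		if matches(c, args...) {
			return c
		}
		if m := find(c, args...); m != nil {
			return m
		}
	}
	return nil
}

// findAll collects every matching descendant in document order.
func findAll(root *element, args ...string) []*element {
	var found []*element
	for _, c := range root.children {
		if matches(c, args...) {
			found = append(found, c)
		}
		found = append(found, findAll(c, args...)...)
	}
	return found
}

func main() {
	one := &element{tag: "a", attrs: map[string]string{"href": "/one"}}
	two := &element{tag: "a", attrs: map[string]string{"href": "/two"}}
	box := &element{tag: "div", attrs: map[string]string{"id": "comicLinks"}, children: []*element{one, two}}
	footer := &element{tag: "div", attrs: map[string]string{"id": "footer"}}
	doc := &element{tag: "html", children: []*element{footer, box}}

	// Tag plus attribute narrows the search to one specific div.
	target := find(doc, "div", "id", "comicLinks")
	for _, a := range findAll(target, "a") {
		fmt.Println(a.attrs["href"]) // prints /one, then /two
	}
}
```

With only a tag name, find(doc, "div") would return the footer div, because it comes first in document order.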

func (Root) FindNextElementSibling

func (r Root) FindNextElementSibling() Root

FindNextElementSibling finds the next element sibling of the node in the DOM and returns a Root pointing to it.

func (Root) FindNextSibling

func (r Root) FindNextSibling() Root

FindNextSibling finds the next sibling of the node in the DOM and returns a Root pointing to it.

func (Root) FindPrevElementSibling

func (r Root) FindPrevElementSibling() Root

FindPrevElementSibling finds the previous element sibling of the node in the DOM and returns a Root pointing to it.

func (Root) FindPrevSibling

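func (r Root) FindPrevSibling() Root

FindPrevSibling finds the previous sibling of the node in the DOM and returns a Root pointing to it.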
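The difference between the plain sibling methods and their element-sibling counterparts matters because whitespace between tags is parsed as a text node. The sketch below models that with a hypothetical linked node type; nextSibling and nextElementSibling are illustrative helpers written under that assumption, not the package's code.

```go
package main

import "fmt"

type nodeKind int

const (
	textNode nodeKind = iota
	elementNode
)

// node is a hypothetical, stripped-down DOM node used only in this sketch.
type node struct {
	kind nodeKind
	data string // tag name for elements, text content for text nodes
	next *node
}

// nextSibling returns the immediately following node, whatever its kind.
func nextSibling(n *node) *node {
	return n.next
}

// nextElementSibling skips over non-element nodes and returns the next
// element, or nil if there is none.
func nextElementSibling(n *node) *node {
	for s := n.next; s != nil; s = s.next {
		if s.kind == elementNode {
			return s
		}
	}
	return nil
}

func main() {
	// Models "<li>one</li>\n<li>two</li>": the newline is a text node.
	second := &node{kind: elementNode, data: "li"}
	gap := &node{kind: textNode, data: "\n", next: second}
	first := &node{kind: elementNode, data: "li", next: gap}

	fmt.Printf("%q\n", nextSibling(first).data)        // prints "\n"
	fmt.Printf("%q\n", nextElementSibling(first).data) // prints "li"
}
```

When walking over tags in formatted HTML, the element-sibling variants are usually the ones wanted, since the plain variants may land on whitespace text.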

func (Root) Text

func (r Root) Text() string

Text returns the text inside an element that contains no nested elements.

Directories

Path Synopsis
examples
