htmlutil

v1.1.0
Published: Feb 14, 2019 | License: Apache-2.0 | Imports: 5 | Imported by: 0

README

htmlutil

coverage: 100%

Package htmlutil implements a wrapper for Go's HTML5 tokeniser / parser implementation. It makes finding and extracting information much easier, aiming to be powerful and intuitive while remaining a minimal and logical extension.

See the godoc

As of v1.0.0 the API is stable and used in multiple (personal) projects. Unless I run into a compelling use case I am declaring this feature complete. It would be nice to add some examples though, maybe later.

Change Log

2019-02-14 v1.1.0 classes methods

2019-02-11 v1.0.0 initial release

Documentation

Overview

There are three core components: the `htmlutil.Node` struct (a wrapper for `*html.Node`), the `htmlutil.Parse` function (optional), and a ubiquitous filter algorithm used throughout this implementation. The filter algorithm provides functionality similar to CSS selectors, and is powered by optional (varargs) parameters in the form of chained closures with a signature of `func(htmlutil.Node) bool`.

Filter behavior

  • based on a recursive algorithm where each node can match at most one filter, consuming it (for that sub-tree); a node is added to the result if `len(filters) == 0`
  • in general every node in the tree is searched; there is also a "find" mode where only one result is returned
  • nil filters are preemptively stripped, and so are treated as if they were omitted
  • each node will be present in the result at most once, and will retain (depth first) order
  • behavior is undefined if the tree is not "well formed" (e.g. it contains cycles)
  • providing no filters will return ALL nodes (or if only one result is needed, the first node)
  • filter closures will not be called with a node with a nil `Data` field
  • filter closures will receive nodes with a `Depth` field relative to the original
  • the node's `Match` field stores the last "matched" node in the chain (note: duplicate matches for the same `*html.Node` are squashed); the root node is always treated as an initial match
  • resulting node values will retain the match chain (will always be non-nil if the root was non-nil)
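
The "find" mode described by the bullets above can be pictured with a small sketch. This is not the package's implementation: it is a literal reading of the first few bullets, written against a hypothetical stand-in type (`elem`) so that it needs only the standard library. The names `elem`, `filter`, `stripNil`, `find`, `tag` and `sampleTree` are all invented for illustration, and `Match` / `Depth` tracking is left out.

```go
package main

import "fmt"

// elem is a hypothetical stand-in for *html.Node (assumption: the real
// package wraps *html.Node in htmlutil.Node instead).
type elem struct {
	ID       string
	Tag      string
	Children []*elem
}

// filter mirrors the shape of the documented closures, but takes the
// stand-in type rather than htmlutil.Node.
type filter func(*elem) bool

// stripNil drops nil filters so they behave as if they were omitted.
func stripNil(filters []filter) []filter {
	out := make([]filter, 0, len(filters))
	for _, f := range filters {
		if f != nil {
			out = append(out, f)
		}
	}
	return out
}

// find returns the first node, in depth-first order, at which no filters
// remain, or nil if there is no such node.
func find(root *elem, filters ...filter) *elem {
	return findRec(root, stripNil(filters))
}

func findRec(n *elem, filters []filter) *elem {
	if n == nil {
		return nil
	}
	// A node is tested against the head filter only; a match consumes
	// that filter for this node's sub-tree.
	if len(filters) > 0 && filters[0](n) {
		filters = filters[1:]
	}
	if len(filters) == 0 {
		return n
	}
	for _, c := range n.Children {
		if r := findRec(c, filters); r != nil {
			return r
		}
	}
	return nil
}

// tag builds a filter matching elements by tag name.
func tag(name string) filter {
	return func(e *elem) bool { return e.Tag == name }
}

// sampleTree has one tr outside the table and one inside it.
func sampleTree() *elem {
	return &elem{ID: "root", Tag: "html", Children: []*elem{
		{ID: "stray", Tag: "tr"},
		{ID: "table", Tag: "table", Children: []*elem{
			{ID: "body", Tag: "tbody", Children: []*elem{
				{ID: "row", Tag: "tr", Children: []*elem{{ID: "cell", Tag: "td"}}},
			}},
		}},
	}}
}

func main() {
	root := sampleTree()
	// The nil filter is stripped; the chain reads "a tr somewhere under a table".
	if n := find(root, tag("table"), nil, tag("tr")); n != nil {
		fmt.Println(n.ID) // prints "row"
	}
}
```

In this sketch the stray `tr` is skipped because the `table` filter has not been consumed when it is visited, and calling `find` with no filters returns the root itself.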

General behavior

  • a nil `Data` field for a `htmlutil.Node` indicates no node / no result, and methods should return default values, or another intuitive analog (this behavior makes chaining far simpler)
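
As a rough illustration of why nil-tolerant methods make chaining simpler, the sketch below uses a hypothetical stand-in (`raw`) in place of `*html.Node` and re-creates just two methods under the same names. It is an assumption-laden miniature of the pattern, not the package's code.

```go
package main

import "fmt"

// raw is a hypothetical stand-in for *html.Node.
type raw struct {
	Tag        string
	FirstChild *raw
}

// Node mirrors the documented wrapper shape: a nil Data means "no node".
type Node struct {
	Data  *raw
	Depth int
}

// FirstChild returns the first child, or an empty Node when there is none.
// It never panics, even when the receiver itself is empty.
func (n Node) FirstChild() Node {
	if n.Data == nil || n.Data.FirstChild == nil {
		return Node{}
	}
	return Node{Data: n.Data.FirstChild, Depth: n.Depth + 1}
}

// Tag returns the tag name, or "" for an empty Node.
func (n Node) Tag() string {
	if n.Data == nil {
		return ""
	}
	return n.Data.Tag
}

func main() {
	root := Node{Data: &raw{Tag: "ul", FirstChild: &raw{Tag: "li"}}}

	// Each step may come back empty, yet no nil check is needed in between.
	fmt.Printf("%q\n", root.FirstChild().Tag())                           // prints "li"
	fmt.Printf("%q\n", root.FirstChild().FirstChild().FirstChild().Tag()) // prints ""
}
```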

Types

type Node

type Node struct {
	// Data is the underlying html data for this node
	Data *html.Node
	// Depth is the relative depth to the top of the tree (being parsed, filtered, etc)
	Depth int
	// Match is the last match (set by filter impl.), and is used to check previous matches for chained filters
	Match *Node
}

Node is the data structure this package provides to enable its utility methods, plus extra metadata for filter / find / get calls: the last match (the `Match` field) and the overall (relative) depth. Together these allow matching on things such as "all the table row elements that are direct children of a given tbody", à la CSS selectors.

func Parse

func Parse(r io.Reader, filters ...func(node Node) bool) (Node, error)

Parse first performs `html.Parse`, passing through any error, before applying a find to the resulting Node (wrapped like `Node{Data: node}`). It returns the first matching Node, or an error if no matches were found.

func (Node) Attr

func (n Node) Attr() []html.Attribute

Attr will return the value of `n.Data.Attr`, returning nil if `n.Data` is nil

func (Node) Children

func (n Node) Children(filters ...func(node Node) bool) (children []Node)

Children builds a slice containing all child nodes using the `Range` method, passing through filters

func (Node) Classes (added in v1.1.0)

func (n Node) Classes() []string

Classes will return all the (whitespace-separated) values for the (first) `class` attribute, or an empty slice if n is not a valid element node with a class attribute containing at least one non-whitespace character

func (Node) FilterNodes

func (n Node) FilterNodes(filters ...func(node Node) bool) []Node

FilterNodes returns all nodes from the sub-tree (a search including the receiver) matching the filters (see package comment for filter behavior)

func (Node) FindNode

func (n Node) FindNode(filters ...func(node Node) bool) (Node, bool)

FindNode returns the first node from the sub-tree (a search including the receiver) matching the filters (see package comment for filter behavior)

func (Node) FirstChild

func (n Node) FirstChild(filters ...func(node Node) bool) Node

FirstChild will return the leftmost child node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match. Note that depth will be automatically incremented.

func (Node) GetAttr

func (n Node) GetAttr(namespace string, key string) (html.Attribute, bool)

GetAttr matches on the first attribute (if any) for this node with the same namespace and key (key being case insensitive if namespace is empty), returning false if no match was found
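
The matching rule can be sketched as follows. The `attr` type and the `getAttr` function are hypothetical stand-ins (the real method works on `html.Attribute`, whose `Namespace`, `Key` and `Val` fields the stand-in mirrors). The sketch assumes that keys are compared exactly when a namespace is given, which is how the description above reads.

```go
package main

import (
	"fmt"
	"strings"
)

// attr mirrors the Namespace, Key and Val fields of html.Attribute.
type attr struct {
	Namespace, Key, Val string
}

// getAttr returns the first attribute with the same namespace and key.
// The key comparison ignores case only when the namespace is empty.
func getAttr(attrs []attr, namespace, key string) (attr, bool) {
	for _, a := range attrs {
		if a.Namespace != namespace {
			continue
		}
		if a.Key == key || (namespace == "" && strings.EqualFold(a.Key, key)) {
			return a, true
		}
	}
	return attr{}, false
}

func main() {
	attrs := []attr{
		{"", "href", "/plain"},
		{"xlink", "href", "/namespaced"},
	}

	a, ok := getAttr(attrs, "", "HREF")
	fmt.Println(a.Val, ok) // prints "/plain true"

	_, ok = getAttr(attrs, "xlink", "HREF")
	fmt.Println(ok) // prints "false"
}
```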

func (Node) GetAttrVal

func (n Node) GetAttrVal(namespace string, key string) string

GetAttrVal returns the value of any attribute matched by `n.GetAttr`

func (Node) GetNode

func (n Node) GetNode(filters ...func(node Node) bool) Node

GetNode returns the node returned by FindNode, without the boolean flag indicating whether there was a match. It is provided for chaining purposes, since this package deliberately handles a nil `Data` field gracefully.

func (Node) HasClass (added in v1.1.0)

func (n Node) HasClass(class string) bool

HasClass will return true if n is a valid element node with the given html class (case sensitive)
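
The class handling described for `Classes` and `HasClass` can be approximated on a bare attribute value. The helpers `classes` and `hasClass` below are hypothetical, and splitting with `strings.Fields` is an assumption: it splits on any Unicode whitespace, which may differ slightly from what the package does.

```go
package main

import (
	"fmt"
	"strings"
)

// classes splits a class attribute value on runs of whitespace.
func classes(classAttr string) []string {
	return strings.Fields(classAttr)
}

// hasClass reports whether class is one of the values, compared
// case-sensitively.
func hasClass(classAttr, class string) bool {
	for _, c := range classes(classAttr) {
		if c == class {
			return true
		}
	}
	return false
}

func main() {
	value := "  btn   btn-primary\n"
	fmt.Println(len(classes(value)))    // prints "2"
	fmt.Println(hasClass(value, "btn")) // prints "true"
	fmt.Println(hasClass(value, "BTN")) // prints "false"
}
```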

func (Node) InnerHTML

func (n Node) InnerHTML(filters ...func(node Node) bool) string

InnerHTML builds a string using the outer html of all children matching all filters (see the `FindNode` method)

func (Node) InnerText

func (n Node) InnerText(filters ...func(node Node) bool) string

InnerText builds a string using the outer text of all children matching all filters (see the `FindNode` method)

func (Node) LastChild

func (n Node) LastChild(filters ...func(node Node) bool) Node

LastChild will return the rightmost child node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match. Note that depth will be automatically incremented.

func (Node) NextSibling

func (n Node) NextSibling(filters ...func(node Node) bool) Node

NextSibling will return the leftmost next sibling node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match

func (Node) Offset

func (n Node) Offset() int

Offset is the difference between the depth of this node and the depth of the last match, returning the depth of this node if `n.Match` is nil
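
One use for this value is a "direct child of the last match" filter, in the spirit of the CSS `>` combinator. The sketch below re-creates the documented field layout and the `Offset` rule on a hypothetical stand-in (`raw` replaces `*html.Node`). The `directChild` helper is invented for illustration and is not part of the package.

```go
package main

import "fmt"

// raw is a hypothetical stand-in for *html.Node.
type raw struct {
	Tag string
}

// Node mirrors the documented fields of htmlutil.Node.
type Node struct {
	Data  *raw
	Depth int
	Match *Node
}

// Offset follows the documented rule: depth relative to the last match,
// or the node's own depth when there is no match.
func (n Node) Offset() int {
	if n.Match == nil {
		return n.Depth
	}
	return n.Depth - n.Match.Depth
}

// directChild builds a filter that accepts elements with the given tag
// that sit exactly one level below the last match.
func directChild(tag string) func(Node) bool {
	return func(n Node) bool {
		return n.Data != nil && n.Data.Tag == tag && n.Offset() == 1
	}
}

func main() {
	tbody := Node{Data: &raw{Tag: "tbody"}, Depth: 3}
	row := Node{Data: &raw{Tag: "tr"}, Depth: 4, Match: &tbody}
	nested := Node{Data: &raw{Tag: "tr"}, Depth: 6, Match: &tbody}

	isRow := directChild("tr")
	fmt.Println(isRow(row))    // prints "true"
	fmt.Println(isRow(nested)) // prints "false"
}
```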

func (Node) OuterHTML

func (n Node) OuterHTML() string

OuterHTML encodes this node as HTML using the `html.Render` function. Note that it will return an empty string if `n.Data` is nil, and will panic if any error is returned (which should only occur if the sub-tree is not "well formed").

func (Node) OuterText

func (n Node) OuterText() string

OuterText builds a string from the data of all text nodes in the sub-tree, starting from and including `n`

func (Node) Parent

func (n Node) Parent(filters ...func(node Node) bool) Node

Parent will return the first parent node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match. Note that depth will be automatically decremented (potentially multiple times).

func (Node) PrevSibling

func (n Node) PrevSibling(filters ...func(node Node) bool) Node

PrevSibling will return the rightmost previous sibling node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match

func (Node) Range

func (n Node) Range(fn func(i int, node Node) bool, filters ...func(node Node) bool)

Range iterates on any children matching any filters (see the `FindNode` method), providing the (filtered) index and node to the provided fn. Note that it will panic if fn is nil.

func (Node) SiblingIndex

func (n Node) SiblingIndex(filters ...func(node Node) bool) int

SiblingIndex returns the total number of previous siblings matching any filters (see the `FindNode` method)

func (Node) SiblingLength

func (n Node) SiblingLength(filters ...func(node Node) bool) int

SiblingLength returns the total number of siblings matching any filters (see the `FindNode` method) incremented by one for the current node, or returns 0 if the receiver has nil data (is empty)
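
The counting rules for `SiblingIndex` and `SiblingLength` can be made concrete with a small model. The sketch below represents a sibling list as a slice of tag names and uses a single optional predicate in place of the package's filter chain. Both simplifications are assumptions, and `siblingIndex` and `siblingLength` are hypothetical helpers rather than the package's methods.

```go
package main

import "fmt"

// siblingIndex counts the siblings before position pos that satisfy
// match (a nil match accepts everything).
func siblingIndex(siblings []string, pos int, match func(tag string) bool) int {
	count := 0
	for i := 0; i < pos && i < len(siblings); i++ {
		if match == nil || match(siblings[i]) {
			count++
		}
	}
	return count
}

// siblingLength counts the other siblings that satisfy match, plus one
// for the node at pos itself. It returns 0 when pos is out of range,
// standing in for an empty receiver.
func siblingLength(siblings []string, pos int, match func(tag string) bool) int {
	if pos < 0 || pos >= len(siblings) {
		return 0
	}
	count := 1
	for i, s := range siblings {
		if i != pos && (match == nil || match(s)) {
			count++
		}
	}
	return count
}

func main() {
	siblings := []string{"tr", "#text", "tr", "tr"}
	isRow := func(tag string) bool { return tag == "tr" }

	// The node at position 2 is the second tr among three.
	fmt.Println(siblingIndex(siblings, 2, isRow))  // prints "1"
	fmt.Println(siblingLength(siblings, 2, isRow)) // prints "3"
}
```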

func (Node) String

func (n Node) String() string

String is an alias for `n.OuterHTML`

func (Node) Tag

func (n Node) Tag() string

Tag will return `n.Data.Data` if the node has a type of `html.ElementNode`, otherwise it will return an empty string

func (Node) Type

func (n Node) Type() html.NodeType

Type will return the value of `n.Data.Type`, returning `html.ErrorNode` if `n.Data` is nil
