parsing

v1.0.7
Published: Jan 12, 2023 License: GPL-3.0 Imports: 6 Imported by: 0

README

Exploring HTML structure

HTML is parsed using golang.org/x/net/html which produces a tree.

The module provides basic functionality to compare HTML tags or nodes and their trees. Searching for an HTML tag using a *html.Node value ignores pointers and always returns the first match. Because some properties are ignored, tags like <button> are easy to count. The text value of a tag (title, error message, ...) can be checked.

Good to know

Parsing does not apply a complete HTML syntax check. For instance, a tag like <p> whose closing tag is missing can make a comparison fail.

Siblings must always have the same order or comparison fails. Order of attributes is treated as irrelevant.

How to start

Detailed documentation includes examples.

Versions

v1.0.6 updates the golang.org/x/net package to remove CVE-2022-27664, which does not affect x/net/html.
v1.0.5 requires Go 1.16+ as use of the ioutil package is removed.
v1.0.4 requires Go 1.17+ which implements lazy loading of modules to avoid go.mod updates.
v1.0.0 was created on Go 1.12 which supports modules.

Documentation

Overview

Package parsing provides basic search and comparison of HTML documents. To limit storage of references, it uses the net/html package and its Node type to structure HTML.

Search a tag in a Node with options

  • searching a tag by its name, whatever its attributes; its type is optional
  • searching a tag by its non-pointer values: type, name, attributes and namespace
  • comparing tags, including their lists of attributes, where order is irrelevant
  • comparing Node structures, optionally restricted to one type

Three ways to print a node tree

  • select the type of node and the node value where to stop.
  • select the type of nodes, or none.
  • complete, with indentation.

Good to know

  • a non-matching closed tag is one element.
  • a non-closed tag is closed by the following opening tag. The elements that follow are discarded as the tag is closed by the parser.

Constants

This section is empty.

Variables

This section is empty.

Functions

func AttrIncluded

func AttrIncluded(m, n *html.Node) bool

AttrIncluded returns true if the list of attributes of n is included in that of the reference node m, whatever their order.
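The order-independent inclusion described above can be sketched as follows. This is a minimal illustration and not the package's implementation: the `attr` type is a hypothetical stand-in for html.Attribute so that the sketch compiles with the standard library alone, and `attrsIncluded` is a name chosen for this example.

```go
package main

import "fmt"

// attr is a hypothetical stand-in for html.Attribute.
type attr struct {
	Namespace, Key, Val string
}

// attrsIncluded reports whether every attribute of sub also appears in ref,
// whatever the position of the attributes in either list.
func attrsIncluded(ref, sub []attr) bool {
	seen := make(map[attr]bool, len(ref))
	for _, a := range ref {
		seen[a] = true
	}
	for _, a := range sub {
		if !seen[a] {
			return false
		}
	}
	return true
}

func main() {
	ref := []attr{{Key: "class", Val: "fixed"}, {Key: "id", Val: "t1"}}
	sub := []attr{{Key: "id", Val: "t1"}}
	fmt.Println(attrsIncluded(ref, sub)) // true: sub is a subset of ref
	fmt.Println(attrsIncluded(sub, ref)) // false: "class" is missing from sub
}
```

Using a map keeps the check linear in the number of attributes. Repeated attributes are not counted in this sketch.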

func Equal

func Equal(m, n *html.Node) bool

Equal returns true if all fields of nodes m and n are equal except pointers. reflect.DeepEqual(tag1, tag2) is unusable as pointers are checked too.
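To illustrate why reflect.DeepEqual is not suitable here, the sketch below compares two nodes that describe the same tag but sit in different trees. The `node` and `attr` types are hypothetical stand-ins for html.Node and html.Attribute, reduced to a few fields, and `shallowEqual` is a name chosen for this example. It only shows the idea of skipping pointer fields and is not the package's code.

```go
package main

import (
	"fmt"
	"reflect"
)

// attr is a hypothetical stand-in for html.Attribute.
type attr struct{ Namespace, Key, Val string }

// node is a hypothetical stand-in for html.Node, reduced to a few fields.
type node struct {
	Parent, FirstChild *node // pointer fields, skipped by shallowEqual
	Type               int
	Data               string
	Namespace          string
	Attr               []attr
}

// sameAttrs reports whether a and b hold the same attributes in any order.
func sameAttrs(a, b []attr) bool {
	if len(a) != len(b) {
		return false
	}
	count := make(map[attr]int, len(a))
	for _, x := range a {
		count[x]++
	}
	for _, x := range b {
		if count[x] == 0 {
			return false
		}
		count[x]--
	}
	return true
}

// shallowEqual compares the non-pointer fields of m and n only.
func shallowEqual(m, n *node) bool {
	if m == nil || n == nil {
		return m == n
	}
	return m.Type == n.Type && m.Data == n.Data &&
		m.Namespace == n.Namespace && sameAttrs(m.Attr, n.Attr)
}

func main() {
	a := &node{Type: 3, Data: "p", Attr: []attr{{Key: "class", Val: "ex1"}}}
	b := &node{Type: 3, Data: "p", Attr: []attr{{Key: "class", Val: "ex1"}}}
	a.FirstChild = &node{Type: 1, Data: "some text", Parent: a} // only a has a child

	fmt.Println(reflect.DeepEqual(a, b)) // false: the child pointer is followed
	fmt.Println(shallowEqual(a, b))      // true: same tag, pointers skipped
}
```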

func ExploreNode

func ExploreNode(n *html.Node, s string, t html.NodeType)

ExploreNode prints node tags with name s and type t. Without a name, all tags are printed. When the type is ErrorNode (iota == 0), tags of all types are printed.

Example (All)

ExampleExploreNode_all prints the complete node tree.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.ExploreNode(o, "", html.ErrorNode)
}
Output:

(Document)
 html (Element)
 head (Element) body (Element)
 p (Element) [{ class ex1}]
 HTML Fragment to compare against  (Text) em (Element)
 others below (Text) to test  (Text) sub (Element)
 diffs (Text)
Example (Tags)

ExampleExploreNode_tags only prints text.

package main

import (
	"bytes"
	"fmt"
	"log"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, err := html.Parse(b) // Only place where err of Parse is checked
	if err != nil {
		log.Fatalf("parsing error:%v\n", err)
	}
	parsing.ExploreNode(o, "", html.TextNode)
}
Output:

HTML Fragment to compare against  (Text)
 others below (Text) to test  (Text)
 diffs (Text)

func FindNode

func FindNode(m *html.Node, n html.Node) *html.Node

FindNode finds the first occurrence in m of a node matching n.

func FindTag

func FindTag(n *html.Node, s string, t html.NodeType) *html.Node

FindTag finds the first occurrence of a tag name (i.e. whatever its attributes). If ErrorNode is passed, any tag type will be searched.

func FindTags

func FindTags(n *html.Node, s string, t html.NodeType) (a []*html.Node)

FindTags finds all occurrences of a tag name whatever their attributes. If ErrorNode is passed, any tag type will be searched.
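The convention that ErrorNode, the zero value of the node type, means "any type" can be sketched with a small recursive search. The `node` and `nodeType` types below are hypothetical stand-ins for html.Node and html.NodeType, and `findAll` and `sampleForm` are names chosen for this example. The traversal order and the matching rule are assumptions, not the package's implementation.

```go
package main

import "fmt"

// nodeType is a hypothetical stand-in for html.NodeType.
type nodeType int

const (
	errorNode nodeType = iota // zero value, used as "any type" in the search
	textNode
	documentNode
	elementNode
)

// node is a hypothetical stand-in for html.Node, reduced to a few fields.
type node struct {
	FirstChild, NextSibling *node
	Type                    nodeType
	Data                    string
}

// findAll returns, in depth-first order, every node under n whose Data is
// name and whose type is t. Passing errorNode disables the type filter.
func findAll(n *node, name string, t nodeType) []*node {
	if n == nil {
		return nil
	}
	var found []*node
	if n.Data == name && (t == errorNode || n.Type == t) {
		found = append(found, n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		found = append(found, findAll(c, name, t)...)
	}
	return found
}

// sampleForm builds a form holding a button element, a text node whose
// content is the word "button", and a second button element.
func sampleForm() *node {
	b2 := &node{Type: elementNode, Data: "button"}
	txt := &node{Type: textNode, Data: "button", NextSibling: b2}
	b1 := &node{Type: elementNode, Data: "button", NextSibling: txt}
	return &node{Type: elementNode, Data: "form", FirstChild: b1}
}

func main() {
	form := sampleForm()
	fmt.Println(len(findAll(form, "button", elementNode))) // 2: elements only
	fmt.Println(len(findAll(form, "button", errorNode)))   // 3: any type
}
```

Returning only the first element of the result gives the behaviour described for a first-match search.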

func GetText

func GetText(m *html.Node, b *bytes.Buffer)

GetText writes the text content of a tree structure into b, like PrintNodes but without any formatting. TODO: check usage of the (*Tokenizer).Text equivalent in the net/html package.
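Collecting text from a tree amounts to a depth-first walk that keeps only text nodes. The sketch below shows this idea under stated assumptions: `node` is a hypothetical stand-in for html.Node, and `collectText` and `textOf` are names chosen for this example, not functions of the package.

```go
package main

import (
	"bytes"
	"fmt"
)

// Node kinds of the hypothetical stand-in type.
const (
	textNode = iota + 1
	elementNode
)

// node is a hypothetical stand-in for html.Node, reduced to a few fields.
type node struct {
	FirstChild, NextSibling *node
	Type                    int
	Data                    string
}

// collectText appends the Data of every text node under n to b,
// in document order and without any separator.
func collectText(n *node, b *bytes.Buffer) {
	if n == nil {
		return
	}
	if n.Type == textNode {
		b.WriteString(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, b)
	}
}

// textOf is a small convenience wrapper returning the text as a string.
func textOf(n *node) string {
	var b bytes.Buffer
	collectText(n, &b)
	return b.String()
}

func main() {
	// Equivalent of <p>Hello <em>world</em>!</p>
	bang := &node{Type: textNode, Data: "!"}
	em := &node{Type: elementNode, Data: "em", NextSibling: bang,
		FirstChild: &node{Type: textNode, Data: "world"}}
	hello := &node{Type: textNode, Data: "Hello ", NextSibling: em}
	p := &node{Type: elementNode, Data: "p", FirstChild: hello}

	fmt.Println(textOf(p)) // Hello world!
}
```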

Example
package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	_, _ = fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b) // Any parsing error would have occurred elsewhere
	w := new(bytes.Buffer)
	parsing.GetText(o, w)
	if s := fmt.Sprint(w); s != "HTML Fragment to compare against others below to test diffs" {
		fmt.Println("incorrect text")
	}
}
Output:

func IdenticalNodes

func IdenticalNodes(m, n *html.Node, t html.NodeType) *html.Node

IdenticalNodes fails if trees have different size

func IncludedNode

func IncludedNode(m, n *html.Node) *html.Node

IncludedNode checks if n is included in m. Included means that the subtree is identical to m, including the order of siblings. If it is identical, nil is returned. Otherwise, the tag from which the trees diverge is returned. If m has more tags than n, nil is returned as the search stops when the exploration of one subtree is exhausted.
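The return convention described above (nil when no difference is found, otherwise the node where the trees diverge) can be sketched as a lockstep walk over two trees. The `node` type is a hypothetical stand-in for html.Node, and `firstDivergence`, `elem` and `text` are names chosen for this example. Returning the node of the first tree is an assumption of this sketch, and only the node type and data are compared.

```go
package main

import "fmt"

// Node kinds of the hypothetical stand-in type.
const (
	textNode = iota + 1
	elementNode
)

// node is a hypothetical stand-in for html.Node, reduced to a few fields.
type node struct {
	FirstChild, NextSibling *node
	Type                    int
	Data                    string
}

// elem builds an element node and links its children as siblings.
func elem(name string, children ...*node) *node {
	n := &node{Type: elementNode, Data: name}
	for i, c := range children {
		if i == 0 {
			n.FirstChild = c
		} else {
			children[i-1].NextSibling = c
		}
	}
	return n
}

// text builds a text node.
func text(s string) *node { return &node{Type: textNode, Data: s} }

// firstDivergence walks m and n in lockstep, children first and then
// siblings. It returns the node of m where the trees first differ, or nil
// when no difference is found before one of the walks is exhausted.
func firstDivergence(m, n *node) *node {
	if m == nil || n == nil {
		return nil // one side is exhausted: the search stops here
	}
	if m.Type != n.Type || m.Data != n.Data {
		return m
	}
	if d := firstDivergence(m.FirstChild, n.FirstChild); d != nil {
		return d
	}
	return firstDivergence(m.NextSibling, n.NextSibling)
}

func main() {
	m := elem("ul", elem("li", text("a")), elem("li", text("b")))
	n := elem("ul", elem("li", text("a")), elem("li", text("c")))

	if d := firstDivergence(m, n); d != nil {
		fmt.Println("diverge at:", d.Data) // diverge at: b
	}
	fmt.Println(firstDivergence(m, m) == nil) // true: identical trees
}
```

Because siblings are compared position by position, swapping two siblings in one tree is reported as a divergence, which matches the rule that siblings must keep the same order.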

Example

ExampleIncludedNode uses the test files to demonstrate usage.

// f1 is the main table tag included in f2
toFind := html.Node{Type: html.ElementNode,
	Data: "table",
	Attr: []html.Attribute{{Namespace: "", Key: "class", Val: "fixed"}},
}
pm, _ := ParseFile(f1)
m := FindNode(pm, toFind) // searching <table> in f1
if m == nil {
	fmt.Printf("%s not found in %s \n", PrintData(&toFind), f1)
}

pn, _ := ParseFile(f2)
n := FindNode(pn, toFind) // searching <table> in f2
if n == nil {
	fmt.Printf("%s not found in %s \n", PrintData(&toFind), f2)
}
// Is m (from f1) included in n (from f2)
if f := IncludedNode(n, m); f != nil {
	fmt.Printf("nodes structures diverge from : %s\n", PrintData(f))
}
Output:

func IncludedNodeTyped

func IncludedNodeTyped(m, n *html.Node, t html.NodeType) *html.Node

IncludedNodeTyped is like IncludedNode, but only tags of type t are compared.

func IsTextNode

func IsTextNode(b io.ReadCloser, ns *html.Node, s string) error

IsTextNode checks the presence of a node and its text value in a buffer. An error message is returned if the node is not found or if the text is not the expected one.

func IsTextTag

func IsTextTag(b io.ReadCloser, t, s string) error

IsTextTag checks the presence of a tag and its text value in a buffer. An error message is returned if the tag is not found or if the text is not the expected one.

func ParseFile

func ParseFile(f string) (*html.Node, error)

ParseFile returns a *Node containing the parsed file, or an error (reading the file or parsing it).

func PrintData

func PrintData(n *html.Node) string

PrintData returns a string with Node information (not its relationships). Passing nil will panic.

func PrintNodes

func PrintNodes(m, n *html.Node, t html.NodeType, d int)

PrintNodes prints the tree structure of node m until a node equal to n is found. If nil is passed, the complete tree is printed. Values are indented based on the recursion depth d, which is usually 0 when called. Passing html.ErrorNode (iota) displays every tag except the error node.

Example (WSearch)

ExamplePrintNodes_wSearch is the previous example stopping at a searched node.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)

	var tagToFind html.Node
	tagToFind.Type = html.ElementNode
	tagToFind.Data = "p"
	tagToFind.Attr = []html.Attribute{{Namespace: "", Key: "class", Val: "ex1"}}

	parsing.PrintNodes(o, &tagToFind, html.ErrorNode, 0)
}
Output:

html (Element)
. head (Element) body (Element)
.. p (Element) [{ class ex1}]
tag found: p (Element) [{ class ex1}]
... HTML Fragment to compare against  (Text) em (Element)
.... others below (Text) to test  (Text) sub (Element)
.... diffs (Text)
Example (WoSearch)

ExamplePrintNodes_woSearch prints all nodes without using search.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.PrintNodes(o, nil, html.ErrorNode, 0)
}
Output:

html (Element)
. head (Element) body (Element)
.. p (Element) [{ class ex1}]
... HTML Fragment to compare against  (Text) em (Element)
.... others below (Text) to test  (Text) sub (Element)
.... diffs (Text)

func PrintTags

func PrintTags(n *html.Node, s string, tagOnly bool)

PrintTags prints the node structure until a tag name is found (whatever its attributes). Without a name, all tags are printed. tagOnly selects ElementNode; otherwise tags are printed whatever their type. If the node tree has no ErrorNode, there is no difference with the previous function, i.e. if exploreNode(n, "", html.ErrorNode) prints nothing, then both are equivalent.

Example (WSearch)

ExamplePrintTags_wSearch is the previous example stopping at a searched tag

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b) // err ignored as failure is detected before
	parsing.PrintTags(o, "em", true)
}
Output:

html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
em (Element)
[em] found. Stopping exploration
sub (Element)
Example (WoSearch)

ExamplePrintTags_woSearch is not using the search part.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.PrintTags(o, "", false) // +1,6%
}
Output:

(Document)
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
HTML Fragment to compare against  (Text)
em (Element)
others below (Text)
to test  (Text)
sub (Element)
diffs (Text)

Types

This section is empty.
