htmlsqueeze

v0.0.4
Published: Feb 18, 2021 License: MIT Imports: 2 Imported by: 0

README

htmlsqueeze

htmlsqueeze is a small Go library to extract text out of HTML DOM trees. It is based on the notions of predicates and extractors. Predicates are rules stating which nodes are to be extracted when traversing the HTML DOM tree. Extractors are functions that define how the text is to be extracted from a node.

TODO

  • implement some more predicates
  • implement some more extractors
  • convenience functions to build up lists of predicate lists

Example

Given this HTML page (htmlText):

<div class="main">
	<div class="odd">
		<p class="yes">a</p>
		<p class="no">b</p>
		<p class="yes">c</p>
		<p class="no">d</p>
	</div>
	<div class="even">
		<p class="yes">e</p>
		<p class="no">f</p>
		<p class="yes">g</p>
		<p class="no">h</p>
	</div>
	<div class="odd">
		<p class="yes">i</p>
		<p class="no">j</p>
		<p class="yes">k</p>
		<p class="no">l</p>
	</div>
	<div class="even">
		<p class="yes">m</p>
		<p class="no">n</p>
		<p class="yes">o</p>
		<p class="no">p</p>
	</div>
</div>

The text content of the nodes matching the CSS selector div.odd p.yes can be extracted as follows:

// Error handling is omitted for brevity.
doc, _ := html.Parse(strings.NewReader(htmlText))
// One predicate list per selector step: first div.odd, then p.yes.
predicates := [][]htmlsqueeze.Predicate{
    {htmlsqueeze.TagMatcher("div"), htmlsqueeze.ClassMatcher("odd")},
    {htmlsqueeze.TagMatcher("p"), htmlsqueeze.ClassMatcher("yes")},
}
found := htmlsqueeze.Squeeze(doc, predicates, htmlsqueeze.ExtractChildText)

Or, more simply, using the convenience function SqueezeSelector:

doc, _ := html.Parse(strings.NewReader(htmlText))
found := htmlsqueeze.SqueezeSelector(doc, "div.odd p.yes", htmlsqueeze.ExtractChildText)

Documentation

Functions

func Apply

func Apply(n *html.Node, predicates [][]Predicate) []*html.Node

Apply applies the given predicates to the given node, and returns the matching nodes. If a node satisfies all the predicates of the first sub-list, the remaining predicates are applied to the node's children; otherwise all the predicates are applied to the node's children. If no more predicates are left to be satisfied, the node is considered a match and returned.
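The traversal rule described above can be illustrated with a minimal, self-contained sketch. It does not use the package's own code: node, predicate, tagIs, classIs, matchAll, and apply are hypothetical stand-ins for html.Node, Predicate, TagMatcher, ClassMatcher, MatchAll, and Apply, and the handling of edge cases (such as nested matches) is one possible reading of the description, not a confirmed implementation.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node, reduced to what the sketch needs.
type node struct {
	tag, class, text string
	children         []*node
}

// predicate mirrors the shape of the package's Predicate type for the stand-in node.
type predicate func(n *node) bool

func tagIs(tag string) predicate     { return func(n *node) bool { return n.tag == tag } }
func classIs(class string) predicate { return func(n *node) bool { return n.class == class } }

// matchAll reports whether n satisfies every predicate in the list.
func matchAll(n *node, preds []predicate) bool {
	for _, p := range preds {
		if !p(n) {
			return false
		}
	}
	return true
}

// apply walks the tree. When n satisfies the first predicate list, only the
// remaining lists are passed on to its children; otherwise all lists are.
// A node that consumes the last list is returned as a match.
func apply(n *node, preds [][]predicate) []*node {
	if len(preds) == 0 {
		return []*node{n}
	}
	rest := preds
	if matchAll(n, preds[0]) {
		rest = preds[1:]
		if len(rest) == 0 {
			return []*node{n}
		}
	}
	var found []*node
	for _, c := range n.children {
		found = append(found, apply(c, rest)...)
	}
	return found
}

func main() {
	root := &node{tag: "div", class: "main", children: []*node{
		{tag: "div", class: "odd", children: []*node{
			{tag: "p", class: "yes", text: "a"},
			{tag: "p", class: "no", text: "b"},
		}},
		{tag: "div", class: "even", children: []*node{
			{tag: "p", class: "yes", text: "e"},
		}},
	}}
	preds := [][]predicate{
		{tagIs("div"), classIs("odd")},
		{tagIs("p"), classIs("yes")},
	}
	for _, m := range apply(root, preds) {
		fmt.Println(m.text) // prints "a"
	}
}
```

In this sketch a matched node is returned without searching its descendants for further matches; whether the package behaves the same way for nested matches is not stated in the description.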

func ExtractChildText

func ExtractChildText(n *html.Node) string

ExtractChildText returns the text of n's first child, if it is a text node, and the empty string otherwise.

func ExtractChildrenTexts added in v0.0.4

func ExtractChildrenTexts(n *html.Node) string

ExtractChildrenTexts returns the texts of all of n's children that are text nodes, separated by a space.
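The difference between the two extractors can be shown with a small model. This is a sketch only: node, firstChildText, and childrenTexts are hypothetical stand-ins for html.Node, ExtractChildText, and ExtractChildrenTexts, and the sketch assumes the texts are joined as-is (it does not model any whitespace trimming the package might perform).

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, simplified stand-in for html.Node.
type node struct {
	isText   bool   // true for text nodes, false for element nodes
	data     string // text content for text nodes, tag name for elements
	children []*node
}

// firstChildText models ExtractChildText: the text of the first child
// if that child is a text node, and the empty string otherwise.
func firstChildText(n *node) string {
	if len(n.children) > 0 && n.children[0].isText {
		return n.children[0].data
	}
	return ""
}

// childrenTexts models ExtractChildrenTexts: the texts of all direct
// children that are text nodes, joined by a single space.
func childrenTexts(n *node) string {
	var parts []string
	for _, c := range n.children {
		if c.isText {
			parts = append(parts, c.data)
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	// A p element with three children: text "Hello", element b, text "world".
	p := &node{data: "p", children: []*node{
		{isText: true, data: "Hello"},
		{data: "b", children: []*node{{isText: true, data: "big"}}},
		{isText: true, data: "world"},
	}}
	fmt.Println(firstChildText(p)) // Hello
	fmt.Println(childrenTexts(p))  // Hello world
}
```

Neither function in this sketch descends into element children, so the text inside the nested b element is not part of either result.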

func MatchAll

func MatchAll(n *html.Node, predicates []Predicate) bool

MatchAll applies the given predicates to a node, returns true if the node satisfies all those predicates, and false otherwise.

func Squeeze

func Squeeze(n *html.Node, predicates [][]Predicate, extract Extractor) []string

Squeeze applies the given predicates to the given node, applies the given extractor to the matching nodes, and returns the extracted text as a list of strings.

func SqueezeSelector added in v0.0.3

func SqueezeSelector(n *html.Node, selectors string, extract Extractor) []string

SqueezeSelector is a convenience wrapper around Squeeze: it builds the predicates from the given selectors by invoking TagClassMatchersOf, then proceeds like Squeeze.

func TagClassMatchers added in v0.0.3

func TagClassMatchers(selectors []string) [][]Predicate

TagClassMatchers expects multiple selectors. For each selector, TagClassMatcher is invoked to produce a list of predicates. Those predicates are returned as a list of predicate lists, which can be used for the Squeeze and Apply functions.

func TagClassMatchersOf added in v0.0.3

func TagClassMatchersOf(selectors string) [][]Predicate

TagClassMatchersOf expects multiple selectors encoded in a single string, such as "div.main p.text", which are split into fields, and given as a parameter to TagClassMatchers to produce the specified predicates.

Types

type Extractor

type Extractor func(n *html.Node) string

Extractor is a function that extracts the text of a node.

type Predicate

type Predicate func(n *html.Node) bool

Predicate returns true if the given node satisfies a condition, and false otherwise.

func ClassMatcher

func ClassMatcher(name string) Predicate

ClassMatcher creates a predicate that tests if a node has a class attribute containing the given name.
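A predicate factory of this kind could look roughly like the following sketch. The types attr, node, and predicate and the function hasClass are hypothetical stand-ins, not the package's code. The sketch assumes that "containing" means the name appears as one of the whitespace-separated tokens of the class attribute; the package may instead use a different comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// attr and node are hypothetical stand-ins for html.Attribute and html.Node.
type attr struct{ key, val string }

type node struct {
	tag   string
	attrs []attr
}

// predicate mirrors the shape of the package's Predicate type.
type predicate func(n *node) bool

// hasClass builds a predicate that is true when the node's class attribute
// lists name as one of its whitespace-separated tokens.
func hasClass(name string) predicate {
	return func(n *node) bool {
		for _, a := range n.attrs {
			if a.key != "class" {
				continue
			}
			for _, token := range strings.Fields(a.val) {
				if token == name {
					return true
				}
			}
		}
		return false
	}
}

func main() {
	n := &node{tag: "p", attrs: []attr{{"class", "yes highlight"}}}
	fmt.Println(hasClass("yes")(n))  // true
	fmt.Println(hasClass("high")(n)) // false: "high" is only part of a token
}
```

Returning a closure keeps the predicate signature uniform, so predicates built from different parameters can be placed in the same list.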

func DontMatch added in v0.0.3

func DontMatch() Predicate

DontMatch is a dummy matcher that never matches.

func TagClassMatcher added in v0.0.3

func TagClassMatcher(selector string) []Predicate

TagClassMatcher expects a selector like "div.main" (matching div elements with class main) or "div" (matching any div element) and produces the corresponding list of predicates. If the selector is malformed, a predicate list containing DontMatch is returned, which matches nothing.
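How a selector string might be broken down before predicates are built can be sketched as follows. The names step, parseStep, and parseSteps are hypothetical and not part of the package. The sketch also assumes its own definition of "malformed" (an empty tag, an empty class, or more than one dot); the package's actual rules are not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// step is a hypothetical parsed form of one selector such as "div.main".
type step struct {
	tag, class string // class is empty for a bare tag selector
	ok         bool   // false when the selector is considered malformed
}

// parseStep splits one selector into its tag and optional class.
func parseStep(selector string) step {
	parts := strings.Split(selector, ".")
	switch {
	case len(parts) == 1 && parts[0] != "":
		return step{tag: parts[0], ok: true}
	case len(parts) == 2 && parts[0] != "" && parts[1] != "":
		return step{tag: parts[0], class: parts[1], ok: true}
	default:
		// Malformed: this is where a never-matching predicate would be used.
		return step{}
	}
}

// parseSteps splits a whitespace-separated selector string into fields
// and parses each field on its own.
func parseSteps(selectors string) []step {
	var steps []step
	for _, field := range strings.Fields(selectors) {
		steps = append(steps, parseStep(field))
	}
	return steps
}

func main() {
	for _, s := range parseSteps("div.odd p.yes span") {
		fmt.Printf("tag=%q class=%q ok=%v\n", s.tag, s.class, s.ok)
	}
	fmt.Println(parseStep("div.a.b").ok) // false under this sketch's rules
}
```

Mapping a malformed selector to a never-matching entry, rather than returning an error, keeps the result usable as a predicate list: the traversal simply finds nothing for that step.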

func TagMatcher

func TagMatcher(name string) Predicate

TagMatcher creates a predicate that tests if a node is an element node of the given name.
