htmlsqueeze

Version: v0.0.4 | Published: Feb 18, 2021 | License: MIT | Imports: 2 | Imported by: 0

Repository

github.com/patrickbucher/htmlsqueeze

README ¶

htmlsqueeze

htmlsqueeze is a small Go library to extract text out of HTML DOM trees. It is based on the notions of predicates and extractors. Predicates are rules stating which nodes are to be extracted when traversing the HTML DOM tree. Extractors are functions that define how the text is to be extracted from a node.

TODO

implement some more predicates
implement some more extractors
convenience functions to build up lists of predicate lists

Example

Given this HTML page (htmlText):

<div class="main">
	<div class="odd">
		<p class="yes">a</p>
		<p class="no">b</p>
		<p class="yes">c</p>
		<p class="no">d</p>
	</div>
	<div class="even">
		<p class="yes">e</p>
		<p class="no">f</p>
		<p class="yes">g</p>
		<p class="no">h</p>
	</div>
	<div class="odd">
		<p class="yes">i</p>
		<p class="no">j</p>
		<p class="yes">k</p>
		<p class="no">l</p>
	</div>
	<div class="even">
		<p class="yes">m</p>
		<p class="no">n</p>
		<p class="yes">o</p>
		<p class="no">p</p>
	</div>
</div>

The text content of the nodes matching the CSS selector div.odd p.yes can be extracted as follows:

// Imports assumed: "strings", "golang.org/x/net/html",
// and "github.com/patrickbucher/htmlsqueeze".
doc, err := html.Parse(strings.NewReader(htmlText))
if err != nil {
    panic(err) // handle the parse error as appropriate
}
// One predicate list per selector step: div.odd, then p.yes.
predicates := [][]htmlsqueeze.Predicate{
    {htmlsqueeze.TagMatcher("div"), htmlsqueeze.ClassMatcher("odd")},
    {htmlsqueeze.TagMatcher("p"), htmlsqueeze.ClassMatcher("yes")},
}
found := htmlsqueeze.Squeeze(doc, predicates, htmlsqueeze.ExtractChildText)

Or, more simply, using the convenience function SqueezeSelector:

doc, err := html.Parse(strings.NewReader(htmlText))
if err != nil {
    panic(err) // handle the parse error as appropriate
}
found := htmlsqueeze.SqueezeSelector(doc, "div.odd p.yes", htmlsqueeze.ExtractChildText)

Documentation ¶

Index ¶

func Apply(n *html.Node, predicates [][]Predicate) []*html.Node
func ExtractChildText(n *html.Node) string
func ExtractChildrenTexts(n *html.Node) string
func MatchAll(n *html.Node, predicates []Predicate) bool
func Squeeze(n *html.Node, predicates [][]Predicate, extract Extractor) []string
func SqueezeSelector(n *html.Node, selectors string, extract Extractor) []string
func TagClassMatchers(selectors []string) [][]Predicate
func TagClassMatchersOf(selectors string) [][]Predicate
type Extractor
type Predicate

Functions ¶

func Apply ¶

func Apply(n *html.Node, predicates [][]Predicate) []*html.Node

Apply applies the given predicates to the tree rooted at the given node and returns the matching nodes. If a node satisfies all the predicates of the first sub-list, the remaining sub-lists are applied to the node's children; otherwise, the full list of sub-lists is applied to the node's children unchanged. Once no sub-lists are left to be satisfied, the node is considered a match and is returned.
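The traversal described for Apply can be sketched with the standard library alone. This is one reading of the description, not the library's source: the node type, tagIs, classIs, matchAll, apply, sampleTree, and matchedTexts are all hypothetical stand-ins, and classIs compares the class for equality, whereas ClassMatcher is documented as testing whether the class attribute contains the name.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for html.Node, so that
// the sketch runs without golang.org/x/net/html.
type node struct {
	tag, class, text string // text is set for text nodes only
	children         []*node
}

// predicate mirrors the shape of htmlsqueeze.Predicate for the stand-in type.
type predicate func(n *node) bool

func tagIs(name string) predicate {
	return func(n *node) bool { return n.tag == name }
}

// classIs is a simplification: it requires the class to equal name.
func classIs(name string) predicate {
	return func(n *node) bool { return n.class == name }
}

// matchAll reports whether n satisfies every predicate in ps.
func matchAll(n *node, ps []predicate) bool {
	for _, p := range ps {
		if !p(n) {
			return false
		}
	}
	return true
}

// apply consumes the first sub-list when n satisfies it, returns n once no
// sub-lists remain, and otherwise descends into the children with whatever
// sub-lists are still pending.
func apply(n *node, preds [][]predicate) []*node {
	if len(preds) == 0 {
		return nil
	}
	rest := preds
	if matchAll(n, preds[0]) {
		rest = preds[1:]
		if len(rest) == 0 {
			return []*node{n}
		}
	}
	var found []*node
	for _, c := range n.children {
		found = append(found, apply(c, rest)...)
	}
	return found
}

// sampleTree builds a reduced version of the README's example page.
func sampleTree() *node {
	p := func(class, text string) *node {
		return &node{tag: "p", class: class, children: []*node{{text: text}}}
	}
	return &node{tag: "div", class: "main", children: []*node{
		{tag: "div", class: "odd", children: []*node{p("yes", "a"), p("no", "b")}},
		{tag: "div", class: "even", children: []*node{p("yes", "e"), p("no", "f")}},
		{tag: "div", class: "odd", children: []*node{p("yes", "i"), p("no", "j")}},
	}}
}

// matchedTexts returns the first-child text of every node matching the
// equivalent of "div.odd p.yes".
func matchedTexts() []string {
	preds := [][]predicate{
		{tagIs("div"), classIs("odd")},
		{tagIs("p"), classIs("yes")},
	}
	var out []string
	for _, n := range apply(sampleTree(), preds) {
		out = append(out, n.children[0].text)
	}
	return out
}

func main() {
	fmt.Println(matchedTexts()) // prints [a i]
}
```

In this sketch a node that completes the last sub-list is returned without descending further, so nested matches below it are not reported; whether the library behaves the same way is not stated in the documentation.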

func ExtractChildText ¶

func ExtractChildText(n *html.Node) string

ExtractChildText returns the text of n's first child, if it is a text node, and the empty string otherwise.

func ExtractChildrenTexts ¶ added in v0.0.4

func ExtractChildrenTexts(n *html.Node) string

ExtractChildrenTexts returns the texts of all of n's children that are text nodes, separated by a space.
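The difference between the two extractors shows up when an element mixes text and child elements. The following is a minimal sketch under assumptions, not the library's code: node, extractor, firstChildText, childrenTexts, and sample are hypothetical names, and how the real functions treat surrounding whitespace is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for html.Node, so that the sketch runs
// with the standard library only.
type node struct {
	isText   bool
	data     string // text content for text nodes, tag name for elements
	children []*node
}

// extractor mirrors the shape of htmlsqueeze.Extractor for the stand-in type.
type extractor func(n *node) string

// firstChildText returns the text of the first child if that child is a
// text node, and the empty string otherwise.
func firstChildText(n *node) string {
	if len(n.children) > 0 && n.children[0].isText {
		return n.children[0].data
	}
	return ""
}

// childrenTexts joins the texts of all direct text children with a space.
// Text nested inside child elements is not included.
func childrenTexts(n *node) string {
	var parts []string
	for _, c := range n.children {
		if c.isText {
			parts = append(parts, c.data)
		}
	}
	return strings.Join(parts, " ")
}

// sample models <p>Hello<b>big</b>world</p>.
func sample() *node {
	return &node{data: "p", children: []*node{
		{isText: true, data: "Hello"},
		{data: "b", children: []*node{{isText: true, data: "big"}}},
		{isText: true, data: "world"},
	}}
}

func main() {
	p := sample()
	for _, ex := range []extractor{firstChildText, childrenTexts} {
		fmt.Printf("%q\n", ex(p))
	}
	// prints "Hello" and then "Hello world"
}
```

Because both functions share the extractor signature, either can be passed where an extractor is expected, which is the same design that lets Squeeze accept ExtractChildText or ExtractChildrenTexts interchangeably.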

func MatchAll ¶

func MatchAll(n *html.Node, predicates []Predicate) bool

MatchAll applies the given predicates to a node and returns true if the node satisfies all of them, and false otherwise.

func Squeeze ¶

func Squeeze(n *html.Node, predicates [][]Predicate, extract Extractor) []string

Squeeze applies the given predicates to the given node, applies the given extractor to the matching nodes, and returns the extracted text as a list of strings.

func SqueezeSelector ¶ added in v0.0.3

func SqueezeSelector(n *html.Node, selectors string, extract Extractor) []string

SqueezeSelector is a convenience interface for Squeeze, which produces the predicates using the given selectors by invoking TagClassMatchersOf.

func TagClassMatchers ¶ added in v0.0.3

func TagClassMatchers(selectors []string) [][]Predicate

TagClassMatchers expects multiple selectors. For each selector, TagClassMatcher is invoked to produce a list of predicates. Those predicates are returned as a list of predicate lists, which can be used for the Squeeze and Apply functions.

func TagClassMatchersOf ¶ added in v0.0.3

func TagClassMatchersOf(selectors string) [][]Predicate

TagClassMatchersOf expects multiple selectors encoded in a single string, such as "div.main p.text", which are split into fields, and given as a parameter to TagClassMatchers to produce the specified predicates.

Types ¶

type Extractor ¶

type Extractor func(n *html.Node) string

Extractor is a function that extracts text from a node.

type Predicate ¶

type Predicate func(n *html.Node) bool

Predicate returns true if the given node satisfies a condition, and false otherwise.

func ClassMatcher ¶

func ClassMatcher(name string) Predicate

ClassMatcher creates a predicate that tests if a node has a class attribute containing the given name.

func DontMatch ¶ added in v0.0.3

func DontMatch() Predicate

DontMatch is a dummy matcher that never matches.

func TagClassMatcher ¶ added in v0.0.3

func TagClassMatcher(selector string) []Predicate

TagClassMatcher expects a selector like "div.main" (matching div elements with class main) or "div" (matching any div element) and produces a list of corresponding predicates. If the selector is malformed, a predicate list containing only DontMatch is returned, which matches nothing.
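The selector handling described for TagClassMatcher and TagClassMatchersOf can be sketched as follows. This is an illustration under assumptions rather than the library's implementation: matcher, parseSelector, and parseSelectors are hypothetical names, predicates are replaced by printable values, and the rule for what counts as malformed (an empty tag, an empty class, or more than one dot) is a guess, since the documentation does not define it.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher is a hypothetical, printable stand-in for a predicate.
type matcher struct {
	kind string // "tag", "class", or "none" (the DontMatch case)
	name string
}

// parseSelector turns "div" or "div.main" into a list of matchers.
// Assumption: every other shape is treated as malformed and yields a
// single never-matching entry, mirroring the documented DontMatch fallback.
func parseSelector(selector string) []matcher {
	parts := strings.Split(selector, ".")
	switch {
	case len(parts) == 1 && parts[0] != "":
		return []matcher{{"tag", parts[0]}}
	case len(parts) == 2 && parts[0] != "" && parts[1] != "":
		return []matcher{{"tag", parts[0]}, {"class", parts[1]}}
	default:
		return []matcher{{"none", ""}}
	}
}

// parseSelectors splits a string such as "div.main p.text" into
// whitespace-separated fields and parses each field separately.
func parseSelectors(selectors string) [][]matcher {
	var lists [][]matcher
	for _, s := range strings.Fields(selectors) {
		lists = append(lists, parseSelector(s))
	}
	return lists
}

func main() {
	fmt.Println(parseSelectors("div.main p.text"))
	fmt.Println(parseSelectors("div a..b")) // second field is malformed here
}
```

Returning a never-matching entry instead of an error keeps the result usable as a predicate list, at the cost of silently matching nothing when a selector contains a typo.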

func TagMatcher ¶

func TagMatcher(name string) Predicate

TagMatcher creates a predicate that tests if a node is an element node of the given name.
