html_util

package
v0.0.5
Published: Oct 1, 2022 License: MIT Imports: 9 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var TextRegex = regexp.MustCompile("[^!-~]") // without space
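The pattern matches any single character outside the printable, non-space ASCII range '!' (0x21) to '~' (0x7E), so spaces, tabs, newlines, and non-ASCII runes all match. How the package itself applies TextRegex is not documented here; the sketch below only illustrates the pattern, and stripNonPrintable is a hypothetical helper that is not part of html_util.

```go
package main

import (
	"fmt"
	"regexp"
)

// textRegex uses the same pattern as TextRegex. It matches every character
// that is NOT in the range '!'..'~', including spaces and control characters.
var textRegex = regexp.MustCompile("[^!-~]")

// stripNonPrintable is a hypothetical helper (not part of html_util) that
// removes every character matched by the pattern.
func stripNonPrintable(s string) string {
	return textRegex.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripNonPrintable(" Total:\t42 €\n")) // prints "Total:42"
}
```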

Functions

func GetAttributeByKey

func GetAttributeByKey(node *html.Node, key string) (html.Attribute, error)

func GetChildren

func GetChildren(node *html.Node) []*html.Node

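GetChildren returns the children of node as a slice of pointers. As with GetNodesByCondition, pointers are returned even though slices of pointers are sometimes considered bad practice, so that callers can directly modify substructures of a bigger tree.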

func GetElementNodeByTagName

func GetElementNodeByTagName(name string, startNode *html.Node) *html.Node

GetElementNodeByTagName returns the first node with the given tag name, searching from the provided starting node. It returns nil if none is found.

func GetElementsInTableRowByConditionForOneOfTheElements

func GetElementsInTableRowByConditionForOneOfTheElements(tableNode *html.Node, cond func(n *html.Node) bool) []*html.Node

GetElementsInTableRowByConditionForOneOfTheElements returns all child elements (with tag <td>) of the table row node (with tag <tr>) for which at least one child fulfills the provided condition cond.

func GetFirstTextNode

func GetFirstTextNode(startNode *html.Node) *html.Node

func GetFirstTextNodeWithCondition

func GetFirstTextNodeWithCondition(startNode *html.Node, cond func(s string) bool) *html.Node

func GetNextNodeByCondition

func GetNextNodeByCondition(startNode *html.Node, cond func(node *html.Node) bool) *html.Node

GetNextNodeByCondition returns the first node for which the provided condition yields true, excluding the start node.

func GetNextNodesByCondition

func GetNextNodesByCondition(startNode *html.Node, cond func(node *html.Node) bool) []*html.Node

GetNextNodesByCondition returns all nodes in the tree of startNode for which the provided condition yields true, excluding startNode. Note that this returns a slice of pointers to structs, which is considered bad practice. However, we want the actual pointers rather than copies of the nodes, in case we want to modify nodes that are part of a bigger tree structure.

func GetNodeByCondition

func GetNodeByCondition(startNode *html.Node, cond func(node *html.Node) bool) *html.Node

GetNodeByCondition returns the first node for which the provided condition yields true, including the start node.

func GetNodesByCondition

func GetNodesByCondition(startNode *html.Node, cond func(node *html.Node) bool) []*html.Node

GetNodesByCondition returns all nodes in the tree of startNode for which the provided condition yields true, including startNode. Note that this returns a slice of pointers to structs, which is considered bad practice. However, we want the actual pointers rather than copies of the nodes, in case we want to modify nodes that are part of a bigger tree structure.

func GetTextNodes

func GetTextNodes(startNode *html.Node) []*html.Node

func GetTextNodesByCondition

func GetTextNodesByCondition(startNode *html.Node, cond func(s string) bool) []*html.Node

func MakeByAttributeNameAndValueCondition

func MakeByAttributeNameAndValueCondition(attributeName, attributeValue string) func(node *html.Node) bool

func MakeByClassNameCondition

func MakeByClassNameCondition(className string) func(node *html.Node) bool

func MakeByIdCondition

func MakeByIdCondition(id string) func(node *html.Node) bool

func MakeByTagNameCondition

func MakeByTagNameCondition(name string) func(node *html.Node) bool

func MakeTextNodeComposite

func MakeTextNodeComposite(textNodes []*html.Node, compositeRune string) string

func MakeTextNodeCompositeWithNormalizerFunc

func MakeTextNodeCompositeWithNormalizerFunc(textNodes []*html.Node, compositeDelimiter string, normalizerFunc func(string) string) string

func ParseSelectHTMLNode

func ParseSelectHTMLNode(selectNode *html.Node) (map[string]string, string, error)

ParseSelectHTMLNode parses the HTML node with tag 'select' into its options. It returns a map of strings in which each key is the text content of an option and each value is the content of that option's 'value' attribute.

If multiple options have the same text content, earlier entries are overwritten and only the last one is kept. It also returns the currently selected option, which is the option with the attribute 'selected' if one exists, otherwise the first occurring option.

If multiple options have the 'selected' attribute, the last such option is returned as the selected option. It returns a nil map and a nil error if no options were found.
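The rules above (last duplicate wins, last 'selected' wins, first option as fallback) can be sketched without an HTML parser. This is a minimal illustration, not the package's implementation: the option struct and parseOptions are hypothetical stand-ins, and returning the selected option's text (rather than its value) is an assumption, since the documentation does not say which one the returned string holds.

```go
package main

import "fmt"

// option is a hypothetical stand-in for a parsed <option> element.
type option struct {
	Text     string // text content of the option
	Value    string // content of the 'value' attribute
	Selected bool   // whether the 'selected' attribute is present
}

// parseOptions mirrors the documented rules on plain data.
// Assumption: the selected option is reported by its text content.
func parseOptions(opts []option) (map[string]string, string) {
	if len(opts) == 0 {
		return nil, "" // no options: nil map
	}
	values := make(map[string]string, len(opts))
	selected := opts[0].Text // fallback: first occurring option
	for _, o := range opts {
		values[o.Text] = o.Value // a later duplicate overwrites an earlier one
		if o.Selected {
			selected = o.Text // the last 'selected' option wins
		}
	}
	return values, selected
}

func main() {
	values, selected := parseOptions([]option{
		{Text: "Red", Value: "r"},
		{Text: "Green", Value: "g", Selected: true},
		{Text: "Red", Value: "r2"},
	})
	fmt.Println(values, selected) // prints "map[Green:g Red:r2] Green"
}
```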

func WalkHtmlTree

func WalkHtmlTree(node *html.Node, f func(n *html.Node) bool)

WalkHtmlTree calls f on node. If f returns true, it calls WalkHtmlTree on each of node's children.
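The return value of f therefore acts as a "descend" switch: returning false prunes the subtree below the current node, while its siblings are still visited. The sketch below shows this traversal on a stand-in node type so that it runs with the standard library alone; the real function operates on *html.Node. The names node, walk, visited, and sampleTree are hypothetical, and the nil check is an assumption.

```go
package main

import "fmt"

// node is a stand-in for html.Node, reduced to the fields the walk needs.
type node struct {
	Data                    string
	FirstChild, NextSibling *node
}

// walk calls f on n and descends into n's children only if f returns true.
// The nil check is an assumption; the package may behave differently.
func walk(n *node, f func(*node) bool) {
	if n == nil || !f(n) {
		return
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		walk(c, f)
	}
}

// visited returns the visit order, pruning below the node named skip.
func visited(root *node, skip string) []string {
	var out []string
	walk(root, func(n *node) bool {
		out = append(out, n.Data)
		return n.Data != skip
	})
	return out
}

// sampleTree builds html -> [head -> [title], body].
func sampleTree() *node {
	title := &node{Data: "title"}
	body := &node{Data: "body"}
	head := &node{Data: "head", FirstChild: title, NextSibling: body}
	return &node{Data: "html", FirstChild: head}
}

func main() {
	fmt.Println(visited(sampleTree(), ""))     // prints "[html head title body]"
	fmt.Println(visited(sampleTree(), "head")) // prints "[html head body]"
}
```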

Types

type HtmlTable

type HtmlTable struct {
	Headers   []string   // Headers, equal to TableData[0, :] in numpy expression
	Index     []string   // Index, equal to TableData[:, 0] in numpy expression
	TableData [][]string // All data excluding headers and index
	// contains filtered or unexported fields
}

HtmlTable represents an HTML table as a struct. It contains only text content.

func ParseHtmlTable

func ParseHtmlTable(tableNode *html.Node, hasHeaderRow bool, hasIndexColumn bool, suffix string) (*HtmlTable, error)

ParseHtmlTable parses the given html.Node, which should point to a <table> ElementNode in an HTML tree, into an HtmlTable struct that can be used to easily look up existing indices, headers, and values. Content is set after normalizing with the identity normalizer func, normalizer(s) = s. To make keys that appear multiple times unique, '{suffix}_{keyCount}' is appended to them; the first occurrence is left unchanged.
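The key de-duplication can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: uniqueKeys is a hypothetical helper, the '{suffix}_{keyCount}' format is read literally, and keyCount is assumed to be the 1-based occurrence number, so the second occurrence gets 2. The package may count differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// uniqueKeys is a hypothetical helper (not part of html_util). The first
// occurrence of a key is kept as-is; later occurrences get suffix + "_" +
// count appended. Assumption: count is the 1-based occurrence number.
func uniqueKeys(keys []string, suffix string) []string {
	seen := make(map[string]int, len(keys))
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		seen[k]++
		if n := seen[k]; n > 1 {
			k = k + suffix + "_" + strconv.Itoa(n)
		}
		out = append(out, k)
	}
	return out
}

func main() {
	headers := []string{"Name", "Price", "Price", "Name", "Price"}
	fmt.Println(uniqueKeys(headers, "_dup"))
	// prints "[Name Price Price_dup_2 Name_dup_2 Price_dup_3]"
}
```

Under this reading, a lookup by the plain key finds the first occurrence, which is presumably why the package also offers the ...KeyNum variants that take the original key plus an occurrence number.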

func ParseHtmlTableWithNormalizer

func ParseHtmlTableWithNormalizer(tableNode *html.Node, hasHeaderRow bool, hasIndexColumn bool, suffix string, normalizerFunc func(string) string, allowCompositeTexts bool, compositeDelimiter string) (*HtmlTable, error)

ParseHtmlTableWithNormalizer parses the given html.Node, which should point to a <table> ElementNode in an HTML tree, into an HtmlTable struct that can be used to easily look up existing indices, headers, and values. Content is set after normalizing with normalizerFunc. To make keys that appear multiple times unique, '{suffix}_{keyCount}' is appended to them; the first occurrence is left unchanged. TODO: describe the meaning of the allowCompositeTexts and compositeDelimiter parameters.

func (HtmlTable) GetColumnByIndex

func (ht HtmlTable) GetColumnByIndex(j int) ([]string, string)

GetColumnByIndex is analogous to GetRowByIndex, but for columns. The number of columns is given by the length of Headers. GetColumnByIndex(0) returns the index column.

func (HtmlTable) GetColumnByKey

func (ht HtmlTable) GetColumnByKey(key string) ([]string, int, bool)

GetColumnByKey is analogous to GetRowByKey, but for columns.

func (HtmlTable) GetColumnByKeyNum

func (ht HtmlTable) GetColumnByKeyNum(key string, occurrence int) ([]string, int, bool)

GetColumnByKeyNum returns the column for the given original key (which may occur multiple times) and the given occurrence number.

func (HtmlTable) GetElementByIndex

func (ht HtmlTable) GetElementByIndex(i, j int) string

GetElementByIndex returns the element in the table data for row i and column j. It panics if either index is out of bounds.

func (HtmlTable) GetElementByKeys

func (ht HtmlTable) GetElementByKeys(rowKey, columnKey string) (string, int, int, bool)

GetElementByKeys returns the element in the table data with the provided row key and column key. If at least one key is missing, the returned string is "" and the returned bool is false.

func (HtmlTable) GetElementByKeysNum

func (ht HtmlTable) GetElementByKeysNum(rowKey, columnKey string, rowOccurrence, columnOccurrence int) (string, int, int, bool)

GetElementByKeysNum returns the element in the table data with the provided row key and column key and the corresponding occurrences. If at least one key is missing, the returned string is "" and the returned bool is false.

func (HtmlTable) GetRowByIndex

func (ht HtmlTable) GetRowByIndex(i int) ([]string, string)

GetRowByIndex returns a copy of the table row with index i, as well as the key of the corresponding index. It panics if the row is out of bounds. The number of rows is given by the length of Index. GetRowByIndex(0) returns the header row, GetRowByIndex(1) returns the first row below the header row, and so on.

Note: There is always a header row. Even if no header row was specified during parsing, the resulting table has an artificial header row like (Index 1 2 3 4 ...).

func (HtmlTable) GetRowByKey

func (ht HtmlTable) GetRowByKey(key string) ([]string, int, bool)

GetRowByKey returns a copy of the row with the given key as index if it exists. Otherwise, the returned slice is nil and the returned bool is false.

func (HtmlTable) GetRowByKeyNum

func (ht HtmlTable) GetRowByKeyNum(key string, occurrence int) ([]string, int, bool)

GetRowByKeyNum returns the row for the given original key (which may occur multiple times) and the given occurrence number.
