lhtml

package module
v0.2.1
Warning

This package is not in the latest version of its module.

Published: Aug 2, 2022 License: MIT Imports: 5 Imported by: 2

README

lhtml

lhtml is a lenient HTML parser for Go.

It differs from the standard html package because it will not re-order any of the encountered elements, nor will it try to sanitize your HTML file. This package is intended to be used for HTML-template based systems which want to process their own custom tags and attributes.

Features

lhtml differs from the standard html parser in the following ways:

  • Single parsing function that handles both documents and fragments
    • ParseHtml
  • Tags may optionally have multiple attributes with the same name
    • ParseOptions#AllowMultipleAttributesWithSameName
  • No sanitization of the resulting DOM
  • Provides node discovery functions
    • GetElementById
    • GetElementsByName
    • GetBefore
    • GetAfter
    • Get (at index)
    • First
    • Last
  • Manipulation functions
    • InsertFirst
    • InsertLast
    • EmptyChildren
    • Remove
    • Replace
  • Visitor functions when building tree, or to walk tree

API

lhtml exposes a single parsing API that works on both full HTML documents and HTML fragments.

func ParseHtml(reader io.Reader) (*HtmlElements, error)

A convenience function is also provided in case you would like to pass a string instead of an io.Reader:

func ParseHtmlString(html string) (*HtmlElements, error)

Usage

Simply add the library to your project:

$ go get github.com/sangupta/lhtml@v0.1.0

And then, use it to parse your HTML markup:

import (
    "fmt"

    "github.com/sangupta/lhtml"
)

func test() {
    html := "<html class='test1' class='test2' custom:title='hello'>Hello World <custom:PageBody /></html>"
    doc, err := lhtml.ParseHtmlString(html)
    if err != nil {
        panic(err)
    }

    // Print the tag name of every element node in the tree.
    visitor := func(node *lhtml.HtmlNode) bool {
        if node.NodeType == lhtml.ElementNode {
            fmt.Println(node.NodeName())
        }

        return true // keep walking
    }

    doc.Traverse(visitor)
}

Examples

No DOM sanitization

For example, the HTML title tag cannot contain another tag. Given the following html:

<html>
    <head>
        <title>
            <custom:PageTitle />
        </title>
    </head>
</html>

The standard Go implementation will parse it to:

<html>
    <head>
        <title>
            &lt; custom:PageTitle /&gt;
        </title>
    </head>
</html>

However, lhtml preserves the markup exactly as written above. It is left to the calling code to decide how to interpret and use the parsed DOM nodes.

Traversing the DOM

func test() {
    doc, err := lhtml.ParseHtmlString("<html><head><title>Example</title></head><body><h1>Hello World</h1></body></html>")
    if err != nil {
        panic(err)
    }

    s := ""
    called := 0
    visitor := func(node *lhtml.HtmlNode) bool {
        called++ // count every visited node, including text nodes
        if node.NodeType != lhtml.ElementNode {
            return true
        }
        s = s + " " + node.NodeName()
        return true
    }

    doc.Traverse(visitor)
    fmt.Println(s)      // " html head title body h1"
    fmt.Println(called) // 7 (5 element nodes, 2 text nodes)
}

Hacking

  • To build the Go docs locally:

  • To run all tests along with code coverage report

    • $ go test ./... -v -coverprofile coverage.out
    • $ go tool cover -html=coverage.out
  • To publish the Go module:

    • $ git tag v0.x.0
    • $ git push origin v0.x.0
    • $ GOPROXY=proxy.golang.org go list -m github.com/sangupta/lhtml@v0.x.0

Changelog

  • Version 0.1.0
    • Initial release $ go get github.com/sangupta/lhtml@v0.1.0

License

MIT License. Copyright (C) 2022, Sandeep Gupta.

Documentation

Types

type HtmlAttribute added in v0.2.0

type HtmlAttribute struct {
	Name  string // the name of this attribute
	Value string // the value of this attribute
}

Holds the values for an attribute pair.

type HtmlDocument added in v0.2.0

type HtmlDocument struct {
	HtmlElements // the child elements
}

A wrapper representing an HTML document, which is simply a list of HTML elements.

func (*HtmlDocument) Body added in v0.2.0

func (document *HtmlDocument) Body() *HtmlNode

Returns the `body` element in the document if any. Only the top level elements are searched for the desired element. `nil` is returned if the document is empty or the element is not found.

func (*HtmlDocument) GetDocType added in v0.2.0

func (document *HtmlDocument) GetDocType() *HtmlNode

Returns the DocType element of the document, if any. Only the top level elements are searched for the desired element. `nil` is returned if the document is empty or the element is not found.

func (*HtmlDocument) Head added in v0.2.0

func (document *HtmlDocument) Head() *HtmlNode

Returns the `head` element in the document if any. Only the top level elements are searched for the desired element. `nil` is returned if the document is empty or the element is not found.

type HtmlElements added in v0.2.0

type HtmlElements struct {
	// contains filtered or unexported fields
}

The structure that holds a group of HTML elements. This is similar to `HtmlDocument` except that it can hold HTML fragments as well. It differs from the `html` package in that the functions it provides are different from the standard ones.

func NewHtmlElements added in v0.2.0

func NewHtmlElements() *HtmlElements

Function that returns a new empty `HtmlElements` object. This has no nodes defined and is totally empty. It is used to initialize the internal structure.

func ParseHtml

func ParseHtml(reader io.Reader) (*HtmlElements, error)

A loose HTML parser that just returns the tags and their attributes in the order they appear. It makes no assumption about whether a tag is permitted inside another. For example, you cannot include `iframe` within an `input` tag, but this package will still parse such syntax.

This package should be useful for parsing and working with html-like template syntax where we define our own custom tags that can emit or alter the behavior of the final HTML code.

Thus, this function returns a wrapper over the actual `[]*html.Node` nodes parsed via the `html` package. This wrapper provides convenience functions to achieve some of the templating work quickly.

func ParseHtmlString

func ParseHtmlString(html string) (*HtmlElements, error)

Parse the given string as an HTML document or a fragment. It is a convenience function and calls `ParseHtml(reader io.Reader)` internally.

Returns the `HtmlElements` and any error encountered. If the error is `nil`, the `HtmlElements` instance is available. If the error is not `nil`, the `HtmlElements` will be `nil`.

func ParseWithOptions added in v0.2.0

func ParseWithOptions(reader io.Reader, options *ParseOptions) (*HtmlElements, error)

Generic parse function that takes the reader and the parse options, and tries to return the `HtmlElements` on a best-effort basis.

func (*HtmlElements) AsHtmlDocument added in v0.2.0

func (elements *HtmlElements) AsHtmlDocument() *HtmlDocument

Convert this `HtmlElements` instance into an `HtmlDocument` instance. Note that for fragments, most of the `HtmlDocument` functions will return `nil` unless you add the corresponding elements (for example, for a simple fragment, `document.Head()` will return `nil`).

func (*HtmlElements) Empty added in v0.2.0

func (elements *HtmlElements) Empty()

Remove all nodes from this list of elements. All removed nodes are detached.

func (*HtmlElements) First added in v0.2.0

func (elements *HtmlElements) First() *HtmlNode

Return the first node in this list of nodes.

func (*HtmlElements) Get added in v0.2.0

func (elements *HtmlElements) Get(index int) *HtmlNode

Return the node at the given index. If index is out of bounds, this function shall return `nil`.

func (*HtmlElements) GetAfter added in v0.2.0

func (elements *HtmlElements) GetAfter(child *HtmlNode) *HtmlNode

Get the node occurring after the given child in this list of nodes. Returns `nil` if the child is `nil`, if this list has no nodes, or if the given node is not in this list.

func (*HtmlElements) GetBefore added in v0.2.0

func (elements *HtmlElements) GetBefore(child *HtmlNode) *HtmlNode

Get the node occurring before the given child in this list of nodes. Returns `nil` if the child is `nil`, if this list has no nodes, or if the given node is not in this list.
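
The lookup rules for `GetBefore` and `GetAfter` can be sketched over a plain slice. This is only an illustration under assumptions: `node`, `indexOf`, `before` and `after` are hypothetical names, nodes are matched by pointer identity, and a missing neighbour at either end is assumed to yield `nil`. It is not the package's implementation.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML node; it is not lhtml's HtmlNode.
type node struct{ name string }

// indexOf returns the position of child (matched by pointer identity),
// or -1 when child is nil or not present in nodes.
func indexOf(nodes []*node, child *node) int {
	if child == nil {
		return -1
	}
	for i, n := range nodes {
		if n == child {
			return i
		}
	}
	return -1
}

// after returns the node following child, or nil when there is none.
func after(nodes []*node, child *node) *node {
	i := indexOf(nodes, child)
	if i < 0 || i+1 >= len(nodes) {
		return nil
	}
	return nodes[i+1]
}

// before returns the node preceding child, or nil when there is none.
func before(nodes []*node, child *node) *node {
	i := indexOf(nodes, child)
	if i <= 0 {
		return nil
	}
	return nodes[i-1]
}

func main() {
	head, body := &node{"head"}, &node{"body"}
	list := []*node{head, body}

	fmt.Println(after(list, head).name)    // body
	fmt.Println(before(list, body).name)   // head
	fmt.Println(before(list, head) == nil) // true
	fmt.Println(after(list, &node{"p"}) == nil) // true: not in the list
}
```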

func (*HtmlElements) GetChildrenByName added in v0.2.0

func (elements *HtmlElements) GetChildrenByName(name string) *HtmlElements

Find and return all elements in this list's direct children that match the given name/tag name/node name. Returns an instance of `HtmlElements` which contains all the selected nodes. If no match is found, an empty list is returned. This method never returns a `nil`.

func (*HtmlElements) GetElementById added in v0.2.0

func (elements *HtmlElements) GetElementById(id string) *HtmlNode

Find a node within this list of elements which has an ID value as the given value.

Returns `HtmlNode` instance if found, `nil` otherwise

func (*HtmlElements) GetElementsByName added in v0.2.0

func (elements *HtmlElements) GetElementsByName(name string) *HtmlElements

Find and return all elements in this list of elements and its children that match the given name/tag name/node name. This function searches the entire tree for a match.

Returns an instance of `HtmlElements` which contains all the selected nodes. If no match is found, an empty list is returned. This method never returns a `nil`.
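
The difference between a direct-children search (`GetChildrenByName`) and a whole-tree search (`GetElementsByName`) can be illustrated with a small stand-alone sketch. The `node` type and both helper functions are hypothetical and exist only for this example; document-order traversal is an assumption.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML element; it is not lhtml's HtmlNode.
type node struct {
	name     string
	children []*node
}

// childrenByName inspects only the direct children of parent.
func childrenByName(parent *node, name string) []*node {
	matches := []*node{} // non-nil even when empty, like the documented behaviour
	for _, child := range parent.children {
		if child.name == name {
			matches = append(matches, child)
		}
	}
	return matches
}

// elementsByName walks the whole subtree (root included) in document order.
func elementsByName(root *node, name string) []*node {
	matches := []*node{}
	var walk func(n *node)
	walk = func(n *node) {
		if n.name == name {
			matches = append(matches, n)
		}
		for _, child := range n.children {
			walk(child)
		}
	}
	walk(root)
	return matches
}

// sampleTree builds <div><p/><section><p/></section></div>.
func sampleTree() *node {
	return &node{name: "div", children: []*node{
		{name: "p"},
		{name: "section", children: []*node{{name: "p"}}},
	}}
}

func main() {
	root := sampleTree()
	fmt.Println(len(childrenByName(root, "p"))) // 1: only the direct child
	fmt.Println(len(elementsByName(root, "p"))) // 2: the nested <p> is found too
}
```

The flat result list loses the nesting, which matches the note elsewhere in this reference that the node hierarchy is not maintained in results.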

func (*HtmlElements) InsertAfter added in v0.2.0

func (elements *HtmlElements) InsertAfter(childNode *HtmlNode, newNode *HtmlNode) bool

Insert a newNode after given childNode. Returns `true` if the newNode was added successfully. Returns `false` if there are no elements in this instance or the child instance cannot be found.

func (*HtmlElements) InsertAt added in v0.2.0

func (elements *HtmlElements) InsertAt(index int, newNode *HtmlNode)

Insert a node at given index. If index is less than or equal to zero, the node is inserted as first element. If index is equal or greater than length, the node is inserted as last element.
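
The index clamping described above can be sketched on a plain string slice. `insertAt` is a hypothetical helper written for this illustration, not the package's code; it returns a new slice rather than modifying a list in place.

```go
package main

import "fmt"

// insertAt is a hypothetical helper: it clamps index into [0, len(items)]
// and returns a new slice with item placed at that position.
func insertAt(items []string, index int, item string) []string {
	if index < 0 {
		index = 0 // too small: insert as the first element
	}
	if index > len(items) {
		index = len(items) // too large: insert as the last element
	}
	result := make([]string, 0, len(items)+1)
	result = append(result, items[:index]...)
	result = append(result, item)
	result = append(result, items[index:]...)
	return result
}

func main() {
	base := []string{"a", "b", "c"}
	fmt.Println(insertAt(base, -5, "x")) // [x a b c]
	fmt.Println(insertAt(base, 1, "x"))  // [a x b c]
	fmt.Println(insertAt(base, 99, "x")) // [a b c x]
}
```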

func (*HtmlElements) InsertBefore added in v0.2.0

func (elements *HtmlElements) InsertBefore(childNode *HtmlNode, newNode *HtmlNode) bool

Insert a newNode before another childNode. Returns `true` if the newNode was added successfully. Returns `false` if there are no elements in this instance or the child instance cannot be found.

func (*HtmlElements) InsertFirst added in v0.2.0

func (elements *HtmlElements) InsertFirst(newNode *HtmlNode)

Insert the given newNode as the first node in the list of elements.

func (*HtmlElements) InsertLast added in v0.2.0

func (elements *HtmlElements) InsertLast(newNode *HtmlNode)

Insert the given node as the last node in the list of elements.

func (*HtmlElements) IsEmpty added in v0.2.0

func (elements *HtmlElements) IsEmpty() bool

Check if this document is empty or not. A document is considered empty if it has no child node.

func (*HtmlElements) Last added in v0.2.0

func (elements *HtmlElements) Last() *HtmlNode

Return the last node in this list of nodes.

func (*HtmlElements) Length added in v0.2.0

func (elements *HtmlElements) Length() int

Return the length of elements inside this instance.

func (*HtmlElements) Nodes added in v0.2.1

func (elements *HtmlElements) Nodes() []*HtmlNode

Return all child nodes of this element.

func (*HtmlElements) Remove added in v0.2.0

func (elements *HtmlElements) Remove(childNode *HtmlNode) bool

Remove given childNode from document if it is a direct child.

Returns `true` if the childNode was actually removed, `false` otherwise. The removed childNode is detached.

func (*HtmlElements) Replace added in v0.2.0

func (elements *HtmlElements) Replace(childNode *HtmlNode, newNode *HtmlNode) bool

Replace the given childNode with provided newNode replacement if it exists in the list of nodes within this element. Returns `true` if the node was actually replaced, `false` otherwise. The removed childNode is detached.

func (*HtmlElements) String added in v0.2.1

func (elements *HtmlElements) String() (string, error)

func (*HtmlElements) Traverse added in v0.2.0

func (doc *HtmlElements) Traverse(visitor HtmlNodeVisitor)

Traverse all nodes held in this `HtmlElements` instance. If a `nil` visitor is supplied, no tree traversal happens.

func (*HtmlElements) WrappedString added in v0.2.1

func (elements *HtmlElements) WrappedString(node *HtmlNode) (string, error)

type HtmlNode added in v0.2.0

type HtmlNode struct {
	Attributes []*HtmlAttribute

	IsSelfClosing bool
	NodeType      HtmlNodeType
	Data          string
	// contains filtered or unexported fields
}

Defines the structure for a `node` in the HTML. Before working with a node, check the `NodeType` value to confirm that the property you are reading will contain a value.

func (*HtmlNode) AddAttribute added in v0.2.0

func (node *HtmlNode) AddAttribute(key string, value string)

Add a new attribute to this node. By design, a single tag may hold multiple values for the same attribute name. This makes it possible to parse JSX-like syntax in which templates hold individual values, and then let the template engine merge them into a single value.

func (*HtmlNode) Children added in v0.2.1

func (node *HtmlNode) Children() []*HtmlNode

Get a list of all children of this `HtmlNode`.

func (*HtmlNode) ContainsAttributes added in v0.2.0

func (node *HtmlNode) ContainsAttributes() bool

func (*HtmlNode) First added in v0.2.0

func (node *HtmlNode) First() *HtmlNode

Return the first child node, if any. Returns `nil` if the node has no children.

func (*HtmlNode) Get added in v0.2.0

func (node *HtmlNode) Get(index int) *HtmlNode

Return the node at a given index. If the index is out of bounds, `nil` is returned.

func (*HtmlNode) GetAttribute added in v0.2.0

func (node *HtmlNode) GetAttribute(key string) *HtmlAttribute

Find and return the first attribute with the given name.

Returns an `HtmlAttribute` if found, `nil` otherwise

func (*HtmlNode) GetAttributeValue added in v0.2.1

func (node *HtmlNode) GetAttributeValue(key string) (string, error)

func (*HtmlNode) GetAttributeWithValue added in v0.2.0

func (node *HtmlNode) GetAttributeWithValue(key string, value string) *HtmlAttribute

Find and return the `HtmlAttribute` which has the given name and value

Returns an `HtmlAttribute` instance if found, `nil` otherwise

func (*HtmlNode) GetAttributes added in v0.2.0

func (node *HtmlNode) GetAttributes(key string) []*HtmlAttribute

Find and return all attributes that have the given name.

Returns a slice of `HtmlAttribute` if found, `nil` otherwise

func (*HtmlNode) GetChild added in v0.2.0

func (node *HtmlNode) GetChild(index int) *HtmlNode

Return the child at a given index. If the index is out of bounds, `nil` is returned.

func (*HtmlNode) GetChildAfter added in v0.2.1

func (node *HtmlNode) GetChildAfter(child *HtmlNode) *HtmlNode

Return the node after the given child node. Returns `nil` if the child is `nil`, or is not a direct child of this node, or if this is the last node in list.

func (*HtmlNode) GetChildBefore added in v0.2.1

func (node *HtmlNode) GetChildBefore(child *HtmlNode) *HtmlNode

Return the node before the given child node. Returns `nil` if the child is `nil`, or is not a direct child of this node, or if this is the first node in list.

func (*HtmlNode) GetChildByName added in v0.2.0

func (node *HtmlNode) GetChildByName(name string) *HtmlNode

func (*HtmlNode) GetElementById added in v0.2.0

func (node *HtmlNode) GetElementById(id string) *HtmlNode

Find a node within this node (including this one) which has an ID value as the given value.

Returns `HtmlNode` instance if found, `nil` otherwise

func (*HtmlNode) GetElementsByName added in v0.2.0

func (node *HtmlNode) GetElementsByName(name string) *HtmlElements

Return the elements/nodes that match the given tag name, including this element. The node hierarchy is not maintained in results.

func (*HtmlNode) HasAttribute added in v0.2.0

func (node *HtmlNode) HasAttribute(key string) bool

Check if the node has an attribute with the given name.

Returns `true` if the attribute exists, `false` otherwise

func (*HtmlNode) HasChildren added in v0.2.0

func (node *HtmlNode) HasChildren() bool

Quick check to see if this node has any children or not.

func (*HtmlNode) InsertAfterChild added in v0.2.0

func (node *HtmlNode) InsertAfterChild(child *HtmlNode, additional *HtmlNode) bool

Insert a node after given child. Returns `true` if the node is inserted. Returns `false` if the node has no children, or the given child does not belong to this node.

func (*HtmlNode) InsertAfterMe added in v0.2.0

func (node *HtmlNode) InsertAfterMe(additional *HtmlNode) bool

Insert a node after this node in its parent's child nodes. Returns `true` if the node was inserted. Returns `false` if this node has no parent.

func (*HtmlNode) InsertBeforeChild added in v0.2.0

func (node *HtmlNode) InsertBeforeChild(child *HtmlNode, additional *HtmlNode) bool

Insert a node before given child. Returns `true` if the node is inserted. Returns `false` if the node has no children, or the given child does not belong to this node.

func (*HtmlNode) InsertBeforeMe added in v0.2.0

func (node *HtmlNode) InsertBeforeMe(additional *HtmlNode) bool

Insert a node before this node in its parent's child nodes. Returns `true` if the node was inserted. Returns `false` if this node has no parent.

func (*HtmlNode) InsertChildAt added in v0.2.0

func (node *HtmlNode) InsertChildAt(index int, additional *HtmlNode)

Insert the child at the given index. If the index is less than zero the node is inserted as the first node. If the index is greater than the last node index it is inserted as the last node.

func (*HtmlNode) Last added in v0.2.0

func (node *HtmlNode) Last() *HtmlNode

Return the last child node, if any. Returns `nil` if the node has no children.

func (*HtmlNode) NextSibling added in v0.2.0

func (node *HtmlNode) NextSibling() *HtmlNode

Return the node after this node in the list. Returns `nil` if this node is detached, or has no next sibling.

func (*HtmlNode) NodeName added in v0.2.0

func (node *HtmlNode) NodeName() string

Return the node name, also known as tag name for this element.

func (*HtmlNode) NumAttributes added in v0.2.0

func (node *HtmlNode) NumAttributes() int

func (*HtmlNode) NumChildren added in v0.2.0

func (node *HtmlNode) NumChildren() int

Return the total number of children this node has.

func (*HtmlNode) Parent added in v0.2.0

func (node *HtmlNode) Parent() *HtmlNode

Return the parent of this node, if any. A node at the root level (such as <html />) does not have a parent, but may have an internal `wrappingElement`. This allows us to provide functions to replace/remove node directly.

func (*HtmlNode) PrevSibling added in v0.2.0

func (node *HtmlNode) PrevSibling() *HtmlNode

Return the node before this node in the list. Returns `nil` if this node is detached, or has no previous sibling.

func (*HtmlNode) RemoveAllChildren added in v0.2.0

func (node *HtmlNode) RemoveAllChildren()

Remove all children from this `HtmlNode`.

func (*HtmlNode) RemoveAttribute added in v0.2.0

func (node *HtmlNode) RemoveAttribute(key string) bool

func (*HtmlNode) RemoveChild added in v0.2.0

func (node *HtmlNode) RemoveChild(child *HtmlNode) bool

Remove the given child from this node.

Returns `true` if the node was actually removed, `false` otherwise

func (*HtmlNode) RemoveDuplicateAttributes added in v0.2.0

func (node *HtmlNode) RemoveDuplicateAttributes() bool

Function removes all duplicate attributes from the node. The first available value is kept and other values are dropped. The function returns `true` if the attributes were modified (duplicates were removed), `false` otherwise.
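
The "first value wins" rule can be sketched as follows. The `attr` type and `dedupe` function are hypothetical, and exact (case-sensitive) name matching is an assumption made for this example; the package may compare names differently depending on the parse options used.

```go
package main

import "fmt"

// attr is a hypothetical name/value pair; it is not lhtml's HtmlAttribute.
type attr struct {
	name, value string
}

// dedupe keeps the first attribute seen for each name, preserves the original
// order, and reports whether any duplicate was dropped.
func dedupe(attrs []attr) ([]attr, bool) {
	seen := make(map[string]bool, len(attrs))
	kept := make([]attr, 0, len(attrs))
	for _, a := range attrs {
		if seen[a.name] {
			continue // a later duplicate: drop it
		}
		seen[a.name] = true
		kept = append(kept, a)
	}
	return kept, len(kept) != len(attrs)
}

func main() {
	attrs := []attr{
		{"class", "test1"},
		{"class", "test2"},
		{"custom:title", "hello"},
	}
	kept, modified := dedupe(attrs)
	fmt.Println(kept, modified) // [{class test1} {custom:title hello}] true
}
```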

func (*HtmlNode) RemoveMe added in v0.2.0

func (node *HtmlNode) RemoveMe() bool

Remove this node from its parent node, or from the document.

Returns `true` if the node was actually removed, `false` otherwise

func (*HtmlNode) ReplaceChild added in v0.2.0

func (node *HtmlNode) ReplaceChild(original *HtmlNode, replacement *HtmlNode) bool

Replace a child of this node with given replacement.

Returns `true` if the node was actually replaced, `false` otherwise

func (*HtmlNode) ReplaceMe added in v0.2.0

func (node *HtmlNode) ReplaceMe(replacement *HtmlNode) bool

Replace this node with the provided replacement, checking whether it has a parent or is directly attached to the document.

Returns `true` if the node was actually replaced, `false` otherwise

func (*HtmlNode) SetAttribute added in v0.2.0

func (node *HtmlNode) SetAttribute(key string, value string) bool

func (*HtmlNode) String added in v0.2.1

func (node *HtmlNode) String() string

func (*HtmlNode) Traverse added in v0.2.0

func (node *HtmlNode) Traverse(visitor HtmlNodeVisitor) bool

Allow traversing over the `HtmlNode`.

func (*HtmlNode) WriteToBuilder added in v0.2.1

func (node *HtmlNode) WriteToBuilder(builder *strings.Builder)

type HtmlNodeType added in v0.2.0

type HtmlNodeType uint32

Enum to define the NodeType

const (
	ErrorNode HtmlNodeType = iota
	TextNode
	DocumentNode
	ElementNode
	CommentNode
	DoctypeNode
)

Enumeration
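
Because the constants are declared with `iota`, they take the values 0 through 5 in declaration order, so the zero value of the type is `ErrorNode`. The sketch below uses a local, lower-cased copy of the enum (hypothetical names, declared only so the example compiles on its own) and adds a `String` method that the package itself is not documented to provide.

```go
package main

import "fmt"

// nodeType is a local copy of the enum layout shown above; it is not
// imported from lhtml.
type nodeType uint32

const (
	errorNode nodeType = iota // 0, also the zero value
	textNode                  // 1
	documentNode              // 2
	elementNode               // 3
	commentNode               // 4
	doctypeNode               // 5
)

// String is a hypothetical helper that makes the values readable when printed.
func (t nodeType) String() string {
	names := [...]string{"Error", "Text", "Document", "Element", "Comment", "Doctype"}
	if int(t) < len(names) {
		return names[t]
	}
	return "Unknown"
}

func main() {
	var zero nodeType
	fmt.Println(zero, uint32(elementNode), nodeType(42)) // Error 3 Unknown
}
```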

type HtmlNodeVisitor added in v0.2.0

type HtmlNodeVisitor func(node *HtmlNode) bool

Defines a simple contract for a node visitor. The visitor receives a node, and then returns either `true` to continue traversing the html tree, or `false` to immediately stop walking the tree.
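
The visitor contract can be sketched with a small stand-alone tree walker. Everything here is hypothetical (`node`, `visitor`, `traverse`, `sampleDoc`, `countVisits`), and pre-order traversal is an assumption based on the element order printed in the README example; the sketch shows the continue/stop behaviour, not the package's implementation.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML node; text nodes have an empty name.
type node struct {
	name     string
	children []*node
}

// visitor follows the documented contract: true continues, false stops.
type visitor func(n *node) bool

// traverse walks the tree in pre-order and returns false as soon as the
// visitor asks to stop. A nil visitor means no traversal happens.
func traverse(n *node, visit visitor) bool {
	if visit == nil {
		return true
	}
	if !visit(n) {
		return false
	}
	for _, child := range n.children {
		if !traverse(child, visit) {
			return false
		}
	}
	return true
}

// sampleDoc builds
// <html><head><title>Example</title></head><body><h1>Hello World</h1></body></html>.
func sampleDoc() *node {
	text := func() *node { return &node{} }
	return &node{name: "html", children: []*node{
		{name: "head", children: []*node{{name: "title", children: []*node{text()}}}},
		{name: "body", children: []*node{{name: "h1", children: []*node{text()}}}},
	}}
}

// countVisits counts visited nodes, stopping once an element named stopAt is
// seen. An empty stopAt walks the whole tree.
func countVisits(root *node, stopAt string) int {
	visited := 0
	traverse(root, func(n *node) bool {
		visited++
		return stopAt == "" || n.name != stopAt
	})
	return visited
}

func main() {
	doc := sampleDoc()
	fmt.Println(countVisits(doc, ""))      // 7: five elements and two text nodes
	fmt.Println(countVisits(doc, "title")) // 3: html, head, title, then stop
}
```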

type ParseOptions added in v0.2.0

type ParseOptions struct {
	CaseSensitiveAttributes             bool
	AllowMultipleAttributesWithSameName bool
}
