degausser

Version: v1.0.5
Published: Jun 29, 2022 License: MIT Imports: 3 Imported by: 0

README

degausser

HTML to plain text conversion.

For when you want to eliminate HTML tags from a document and leave reasonably rendered text behind.

The target algorithm is similar to the HTMLElement.innerText property of the HTML5 DOM, with the limitation that it does not take layout or styling into account.

Usage

Example:

package main

import (
	"fmt"

	"github.com/flowpub/degausser/go/degausser"
)

func main() {
	html := `
<h3>For example:</h3>
<p id="source">
  <style>#source { color: red; }</style>
  Take a look at
  <br>
  <strong>how</strong>
  <em>this</em>
  text<br>is
  <mark>inter</mark>preted
  below.
  <span style="display:none">HIDDEN TEXT</span>
</p>
	`
	// Convert the markup to plain text; styling such as display:none is ignored.
	plain, err := degausser.HTMLToPlainText(html)
	if err != nil {
		panic(err)
	}

	// Write the result to standard output.
	fmt.Print(plain)
}

Output:

For example:

Take a look at
how this text
is interpreted below. HIDDEN TEXT

Documentation

Constants

This section is empty.

Variables

var MetadataContent = []string{
	"base",
	"command",
	"link",
	"meta",
	"noscript",
	"script",
	"style",
	"title",

	"html",
	"head",
}

MetadataContent is the set of node names in the HTML metadata content category.

The node names `html` and `head` are treated as special set members.

var PhrasingContent = []string{
	"a",
	"abbr",
	"audio",
	"b",
	"bdo",
	"br",
	"button",
	"canvas",
	"cite",
	"code",
	"command",
	"data",
	"datalist",
	"dfn",
	"em",
	"embed",
	"i",
	"iframe",
	"img",
	"input",
	"kbd",
	"keygen",
	"label",
	"mark",
	"math",
	"meter",
	"noscript",
	"object",
	"output",
	"progress",
	"q",
	"ruby",
	"samp",
	"script",
	"select",
	"small",
	"span",
	"strong",
	"sub",
	"sup",
	"svg",
	"textarea",
	"time",
	"var",
	"video",
	"wbr",

	"map",
	"area",
}

PhrasingContent is the set of node names in the HTML phrasing content category.

The node names `map` and `area` are treated as special set members.

Functions

func CollapseRepeatingSpaces

func CollapseRepeatingSpaces(input string) string

CollapseRepeatingSpaces returns a copy of the string input with each run of repeating spaces reduced to a single space.

func CollapseRepeatingWhitespace

func CollapseRepeatingWhitespace(input string) string

CollapseRepeatingWhitespace returns a copy of the string input with each run of repeating whitespace characters reduced to one.
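
To illustrate the difference between collapsing spaces and collapsing whitespace, here is a minimal sketch that uses only the standard library. The names collapseSpaces and collapseWhitespace are hypothetical stand-ins, not the package's implementation. Replacing a whitespace run with a single space is an assumption; the package may keep a different character from the run.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-ins for illustration; not the package's own code.
var (
	spaceRun      = regexp.MustCompile(` {2,}`)
	whitespaceRun = regexp.MustCompile(`\s{2,}`)
)

// collapseSpaces reduces each run of two or more U+0020 spaces to one space.
// Tabs and newlines are left untouched.
func collapseSpaces(input string) string {
	return spaceRun.ReplaceAllString(input, " ")
}

// collapseWhitespace reduces each run of two or more ASCII whitespace
// characters (space, tab, newline, ...) to one space.
// Assumption: a space is the character that is kept.
func collapseWhitespace(input string) string {
	return whitespaceRun.ReplaceAllString(input, " ")
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("a   b\t\tc"))     // "a b\t\tc"
	fmt.Printf("%q\n", collapseWhitespace("a   b\t\tc")) // "a b c"
}
```

The space-only variant is the narrower of the two: it leaves tabs and line breaks in place, which matters when line breaks carry meaning in the converted text.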

func GetNodeName

func GetNodeName(node *html.Node) string

GetNodeName returns the node name of the given node.

func HTMLToPlainText

func HTMLToPlainText(htmlMarkup string) (string, error)

HTMLToPlainText receives HTML markup text as input, and returns a transformed plain-text representation.

It implements an algorithm similar to an HTML5 DOM element node's `.innerText` property. This does not take layout or styling into account.

func IsElement

func IsElement(node *html.Node) bool

IsElement returns true if the given node is an element node.

func IsElementNodeOfType

func IsElementNodeOfType(node *html.Node, types []string) bool

IsElementNodeOfType returns true if the given node has a name that is a member of the types slice.
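
The membership test described above can be pictured as a linear scan over a slice of names such as PhrasingContent. The sketch below is an assumption about the general technique, not the package's code: isOneOf and the shortened phrasing slice are hypothetical, and it works on plain strings so that it needs no HTML parser.

```go
package main

import "fmt"

// phrasing is a shortened, illustrative subset of PhrasingContent.
var phrasing = []string{"a", "em", "span", "strong"}

// isOneOf reports whether name is a member of types, using a linear scan.
// Hypothetical helper; a real check would first obtain the node's name.
func isOneOf(name string, types []string) bool {
	for _, t := range types {
		if t == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isOneOf("em", phrasing))  // true
	fmt.Println(isOneOf("div", phrasing)) // false
}
```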

func IsTextNode

func IsTextNode(node *html.Node) bool

IsTextNode returns true if the given node is a text node.

func LastIn

func LastIn(slice []string) string

LastIn returns the last item of the given slice.

If the slice is empty, the null character is returned.
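
A minimal sketch of this behavior follows. The function lastIn is a hypothetical stand-in, and reading "the null character" as the one-byte string "\x00" is an assumption; the package may represent the empty case differently.

```go
package main

import "fmt"

// lastIn returns the last item of slice.
// Assumption: for an empty slice, "the null character" means "\x00".
func lastIn(slice []string) string {
	if len(slice) == 0 {
		return "\x00"
	}
	return slice[len(slice)-1]
}

func main() {
	fmt.Printf("%q\n", lastIn([]string{"a", "b", "c"})) // "c"
	fmt.Printf("%q\n", lastIn(nil))                     // "\x00"
}
```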

func TrimSpaces

func TrimSpaces(input string) string

TrimSpaces returns a slice of the string input with only spaces removed.

func TrimWhitespaceLeft

func TrimWhitespaceLeft(input string) string

TrimWhitespaceLeft returns a slice of the string input with all leading whitespace characters removed.
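
The two trimming helpers differ in which characters they strip. The sketch below shows comparable behavior with the standard library. The names trimSpaces and trimWhitespaceLeft are hypothetical, and trimming spaces from both ends is an assumption about TrimSpaces that the description above does not confirm.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// trimSpaces removes only U+0020 spaces from both ends of input.
// Assumption: the package's TrimSpaces trims both ends.
func trimSpaces(input string) string {
	return strings.Trim(input, " ")
}

// trimWhitespaceLeft removes all leading Unicode whitespace from input
// and leaves trailing whitespace in place.
func trimWhitespaceLeft(input string) string {
	return strings.TrimLeftFunc(input, unicode.IsSpace)
}

func main() {
	fmt.Printf("%q\n", trimSpaces("  \thi\n  "))         // "\thi\n"
	fmt.Printf("%q\n", trimWhitespaceLeft("  \thi\n  ")) // "hi\n  "
}
```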

func Walk

func Walk(root *html.Node, enter func(*html.Node), exit func(*html.Node))

Walk allows you to traverse a *html.Node tree.
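
The enter and exit callbacks suggest a depth-first traversal. The sketch below shows one plausible shape for it, on the assumption that enter runs before a node's children and exit runs after them. The node type is a hypothetical stand-in for *html.Node that keeps only the fields the sketch needs, so the example depends on nothing outside the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for *html.Node, reduced to the
// fields this sketch needs.
type node struct {
	Data                    string
	FirstChild, NextSibling *node
}

// walk visits the tree depth-first: enter runs before a node's children,
// exit runs after them. Assumption: the package's Walk uses this order.
func walk(n *node, enter, exit func(*node)) {
	if n == nil {
		return
	}
	enter(n)
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		walk(c, enter, exit)
	}
	exit(n)
}

// trace returns the enter/exit events of a walk as a single string.
func trace(root *node) string {
	var events []string
	walk(root,
		func(n *node) { events = append(events, "<"+n.Data+">") },
		func(n *node) { events = append(events, "</"+n.Data+">") },
	)
	return strings.Join(events, "")
}

func main() {
	em := &node{Data: "em"}
	strong := &node{Data: "strong", NextSibling: em}
	p := &node{Data: "p", FirstChild: strong}
	fmt.Println(trace(p)) // <p><strong></strong><em></em></p>
}
```

Pairing the callbacks this way lets a caller keep a stack of open elements: push in enter, pop in exit.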

Types

This section is empty.

Directories

Path Synopsis
cmd
