package html
v2.4.3

This package is not in the latest version of its module.
Published: Jun 3, 2020 License: MIT Imports: 4 Imported by: 1

README

HTML

This package is an HTML5 lexer written in Go. It follows the specification at The HTML syntax and converts an io.Reader into tokens until EOF.

Installation

Run the following command

go get -u github.com/dtrenin7/parse/v2/html

or add the following import and run your project with go get:

import "github.com/dtrenin7/parse/v2/html"

Lexer

Usage

The following initializes a new Lexer with io.Reader r:

l := html.NewLexer(r)

To tokenize until EOF or an error occurs, use:

for {
	tt, data := l.Next()
	switch tt {
	case html.ErrorToken:
		// error or EOF set in l.Err()
		return
	case html.StartTagToken:
		// ...
		for {
			ttAttr, dataAttr := l.Next()
			if ttAttr != html.AttributeToken {
				break
			}
			// ...
		}
	// ...
	}
}

All tokens:

ErrorToken TokenType = iota // extra token when errors occur
CommentToken
DoctypeToken
StartTagToken
StartTagCloseToken
StartTagVoidToken
EndTagToken
AttributeToken
TextToken
Examples
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/dtrenin7/parse/v2/html"
)

// Tokenize HTML from stdin.
func main() {
	l := html.NewLexer(os.Stdin)
	for {
		tt, data := l.Next()
		switch tt {
		case html.ErrorToken:
			if l.Err() != io.EOF {
				fmt.Println("Error at offset", l.Offset(), ":", l.Err())
			}
			return
		case html.StartTagToken:
			fmt.Println("Tag", string(data))
			for {
				ttAttr, dataAttr := l.Next()
				if ttAttr != html.AttributeToken {
					break
				}

				key := dataAttr
				val := l.AttrVal()
				fmt.Println("Attribute", string(key), "=", string(val))
			}
		// ...
		}
	}
}

License

Released under the MIT license.

Documentation

Overview

Package html is an HTML5 lexer following the specifications at http://www.w3.org/TR/html5/syntax.html.


Constants

This section is empty.

Variables

This section is empty.

Functions

func EscapeAttrVal

func EscapeAttrVal(buf *[]byte, orig, b []byte, isXML bool) []byte

EscapeAttrVal returns the escaped attribute value bytes without quotes.

Types

type Hash

type Hash uint32

Hash defines perfect hashes for a predefined list of strings

const (
	Iframe    Hash = 0x6    // iframe
	Math      Hash = 0x604  // math
	Plaintext Hash = 0x1e09 // plaintext
	Script    Hash = 0xa06  // script
	Style     Hash = 0x1405 // style
	Svg       Hash = 0x1903 // svg
	Textarea  Hash = 0x2308 // textarea
	Title     Hash = 0xf05  // title
	Xmp       Hash = 0x1c03 // xmp
)

Unique hash definitions to be used instead of strings
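The constant values suggest a common perfect-hash layout: the low byte stores the string's length and the remaining bits store its offset into one packed string in which entries overlap. A stdlib-only sketch of that assumed encoding, with the packed string reconstructed from the constants above:

```go
package main

import "fmt"

// Packed string reconstructed from the constants above; entries overlap
// (e.g. "script" and "title" share a 't'). This layout is an assumption,
// not taken from the package source.
const packed = "iframemathscriptitlestylesvgxmplaintextarea"

// decode splits a Hash value into an offset (high bits) and a length
// (low byte) and slices the packed string accordingly.
func decode(h uint32) string {
	return packed[h>>8 : h>>8+h&0xff]
}

func main() {
	fmt.Println(decode(0xa06))  // script
	fmt.Println(decode(0x1e09)) // plaintext
	fmt.Println(decode(0x2308)) // textarea
}
```

This is why a Hash comparison is cheap: matching a tag name against Script is a single uint32 equality test instead of a byte-by-byte string comparison.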

func ToHash

func ToHash(s []byte) Hash

ToHash returns the hash whose name is s. It returns zero if there is no such hash. It is case sensitive.

func (Hash) String

func (i Hash) String() string

String returns the hash's name.

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer is the state for the lexer.

func NewLexer

func NewLexer(r io.Reader) *Lexer

NewLexer returns a new Lexer for a given io.Reader.

Example
l := NewLexer(bytes.NewBufferString("<span class='user'>John Doe</span>"))
out := ""
for {
	tt, data := l.Next()
	if tt == ErrorToken {
		break
	}
	out += string(data)
}
fmt.Println(out)
Output:

<span class='user'>John Doe</span>

func (*Lexer) AttrVal

func (l *Lexer) AttrVal() []byte

AttrVal returns the attribute value when an AttributeToken was returned from Next.

func (*Lexer) Err

func (l *Lexer) Err() error

Err returns the error encountered during lexing; this is often io.EOF, but other errors can be returned as well.

func (*Lexer) Next

func (l *Lexer) Next() (TokenType, []byte)

Next returns the next Token. It returns ErrorToken when an error was encountered; the error can then be retrieved with Err().

func (*Lexer) Offset

func (l *Lexer) Offset() int

Offset returns the current position in the input stream.

func (*Lexer) Restore

func (l *Lexer) Restore()

Restore restores the NULL byte at the end of the buffer.

func (*Lexer) Text

func (l *Lexer) Text() []byte

Text returns the textual representation of a token. This excludes delimiters and additional leading/trailing characters.

type TokenType

type TokenType uint32

TokenType determines the type of token, e.g. a start tag or a text node.

const (
	ErrorToken TokenType = iota // extra token when errors occur
	CommentToken
	DoctypeToken
	StartTagToken
	StartTagCloseToken
	StartTagVoidToken
	EndTagToken
	AttributeToken
	TextToken
	SvgToken
	MathToken
)

TokenType values.

func (TokenType) String

func (tt TokenType) String() string

String returns the string representation of a TokenType.
