package html
v2.4.3

This package is not in the latest version of its module.
Published: Jun 3, 2020 License: MIT Imports: 4 Imported by: 1

README

HTML

This package is an HTML5 lexer written in Go. It follows the specification at The HTML syntax and converts an io.Reader into tokens until EOF.

Installation

Run the following command

go get -u github.com/dtrenin7/parse/v2/html

or add the following import and run your project with go get:

import "github.com/dtrenin7/parse/v2/html"

Lexer

Usage

The following initializes a new Lexer with io.Reader r:

l := html.NewLexer(r)

To tokenize until EOF or an error occurs, use:

for {
	tt, data := l.Next()
	switch tt {
	case html.ErrorToken:
		// error or EOF set in l.Err()
		return
	case html.StartTagToken:
		// ...
		for {
			ttAttr, dataAttr := l.Next()
			if ttAttr != html.AttributeToken {
				break
			}
			// ...
		}
	// ...
	}
}

All tokens:

ErrorToken TokenType = iota // extra token when errors occur
CommentToken
DoctypeToken
StartTagToken
StartTagCloseToken
StartTagVoidToken
EndTagToken
AttributeToken
TextToken
Examples
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/dtrenin7/parse/v2/html"
)

// Tokenize HTML from stdin.
func main() {
	l := html.NewLexer(os.Stdin)
	for {
		tt, data := l.Next()
		switch tt {
		case html.ErrorToken:
			if l.Err() != io.EOF {
				fmt.Println("Error at offset", l.Offset(), ":", l.Err())
			}
			return
		case html.StartTagToken:
			fmt.Println("Tag", string(data))
			for {
				ttAttr, dataAttr := l.Next()
				if ttAttr != html.AttributeToken {
					break
				}

				key := dataAttr
				val := l.AttrVal()
				fmt.Println("Attribute", string(key), "=", string(val))
			}
		// ...
		}
	}
}

License

Released under the MIT license.

Documentation

Overview

Package html is an HTML5 lexer following the specifications at http://www.w3.org/TR/html5/syntax.html.


Constants

This section is empty.

Variables

This section is empty.

Functions

func EscapeAttrVal

func EscapeAttrVal(buf *[]byte, orig, b []byte, isXML bool) []byte

EscapeAttrVal returns the escaped attribute value bytes without quotes.

Types

type Hash

type Hash uint32

Hash defines perfect hashes for a predefined list of strings

const (
	Iframe    Hash = 0x6    // iframe
	Math      Hash = 0x604  // math
	Plaintext Hash = 0x1e09 // plaintext
	Script    Hash = 0xa06  // script
	Style     Hash = 0x1405 // style
	Svg       Hash = 0x1903 // svg
	Textarea  Hash = 0x2308 // textarea
	Title     Hash = 0xf05  // title
	Xmp       Hash = 0x1c03 // xmp
)

Unique hash definitions to be used instead of strings
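The constant values suggest a common perfect-hash layout: the low byte stores the string's length and the remaining bits store its offset into one packed string in which entries overlap. A stdlib-only sketch of that assumed encoding, with the packed string reconstructed from the constants above:

```go
package main

import "fmt"

// Packed string reconstructed from the constants above; entries overlap
// (e.g. "script" and "title" share a 't'). This layout is an assumption,
// not taken from the package source.
const packed = "iframemathscriptitlestylesvgxmplaintextarea"

// decode splits a Hash value into an offset (high bits) and a length
// (low byte) and slices the packed string accordingly.
func decode(h uint32) string {
	return packed[h>>8 : h>>8+h&0xff]
}

func main() {
	fmt.Println(decode(0xa06))  // script
	fmt.Println(decode(0x1e09)) // plaintext
	fmt.Println(decode(0x2308)) // textarea
}
```

This is why a Hash comparison is cheap: matching a tag name against Script is a single uint32 equality test instead of a byte-by-byte string comparison.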

func ToHash

func ToHash(s []byte) Hash

ToHash returns the hash whose name is s. It returns zero if there is no such hash. It is case sensitive.

func (Hash) String

func (i Hash) String() string

String returns the hash's name.

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer is the state for the lexer.

func NewLexer

func NewLexer(r io.Reader) *Lexer

NewLexer returns a new Lexer for a given io.Reader.

Example
l := NewLexer(bytes.NewBufferString("<span class='user'>John Doe</span>"))
out := ""
for {
	tt, data := l.Next()
	if tt == ErrorToken {
		break
	}
	out += string(data)
}
fmt.Println(out)
Output:

<span class='user'>John Doe</span>

func (*Lexer) AttrVal

func (l *Lexer) AttrVal() []byte

AttrVal returns the attribute value when an AttributeToken was returned from Next.

func (*Lexer) Err

func (l *Lexer) Err() error

Err returns the error encountered during lexing; this is often io.EOF, but other errors can be returned as well.

func (*Lexer) Next

func (l *Lexer) Next() (TokenType, []byte)

Next returns the next Token. It returns ErrorToken when an error was encountered; the error can then be retrieved with Err().

func (*Lexer) Offset

func (l *Lexer) Offset() int

Offset returns the current position in the input stream.

func (*Lexer) Restore

func (l *Lexer) Restore()

Restore restores the NULL byte at the end of the buffer.

func (*Lexer) Text

func (l *Lexer) Text() []byte

Text returns the textual representation of a token. This excludes delimiters and additional leading/trailing characters.

type TokenType

type TokenType uint32

TokenType determines the type of token, e.g. a start tag or a text node.

const (
	ErrorToken TokenType = iota // extra token when errors occur
	CommentToken
	DoctypeToken
	StartTagToken
	StartTagCloseToken
	StartTagVoidToken
	EndTagToken
	AttributeToken
	TextToken
	SvgToken
	MathToken
)

TokenType values.

func (TokenType) String

func (tt TokenType) String() string

String returns the string representation of a TokenType.
