tokenizer

package
v2.10.0
Published: May 17, 2023 License: MIT Imports: 14 Imported by: 0

Documentation

Overview

Package tokenizer tokenizes CSS based on part four of the CSS Syntax Module Level 3 (W3C Candidate Recommendation Draft), 24 December 2021.

The main elements of this package are the New function, which returns a new Tokenizer, and the Tokenizer.Next method.

This package also exposes several low-level "Consume" functions, which implement specific algorithms in the CSS specification. Note that all "Consume" functions may panic on I/O error; the Tokenizer.Next method catches these panics. Also note that all "Consume" functions operate on a stream of filtered code points (see https://www.w3.org/TR/css-syntax-3/#input-preprocessing), not raw input. This filtering is implemented by css/tokenizer/filter.Transform and is handled automatically by a Tokenizer constructed with New.

Disclaimer: although this software runs against a thorough and diverse set of test cases, no claims are made about this software's performance or its conformance with the W3C specification itself (because there is no official W3C test suite for the tokenization step alone).

This software includes material derived from CSS Syntax Module Level 3, W3C Candidate Recommendation Draft, 24 December 2021. Copyright © 2021 W3C® (MIT, ERCIM, Keio, Beihang). See LICENSE-PARTS.txt and TRADEMARKS.md.

Index

Examples

Constants

This section is empty.

Variables

var (
	ErrUnexpectedEOF       = fmt.Errorf("unexpected end of file")
	ErrUnexpectedLinebreak = fmt.Errorf("unexpected line break")
	ErrUnexpectedInput     = fmt.Errorf("unexpected input")
	ErrBadUrl              = fmt.Errorf("invalid URL syntax")
)

Functions

func ConsumeBadUrl

func ConsumeBadUrl(rdr *runeio.Reader)

ConsumeBadUrl consumes the remnants of a bad url from a stream of code points, "cleaning up" after the tokenizer realizes that it’s in the middle of a <bad-url-token> rather than a <url-token>. It returns nothing; its sole use is to consume enough of the input stream to reach a recovery point where normal tokenizing can resume.

func ConsumeComments

func ConsumeComments(rdr *runeio.Reader) error

ConsumeComments consumes zero or more CSS comments.

func ConsumeEscapedCodepoint

func ConsumeEscapedCodepoint(rdr *runeio.Reader) rune

ConsumeEscapedCodepoint consumes an escaped code point. It assumes that the U+005C REVERSE SOLIDUS (\) has already been consumed and that the next input code point has already been verified to be part of a valid escape.

func ConsumeIdentLikeToken

func ConsumeIdentLikeToken(rdr *runeio.Reader) (token.Token, error)

ConsumeIdentLikeToken consumes an ident-like token from a stream of code points. It returns an <ident-token>, <function-token>, <url-token>, or <bad-url-token>.

func ConsumeIdentSequence

func ConsumeIdentSequence(rdr *runeio.Reader) string

ConsumeIdentSequence consumes an ident sequence from a stream of code points. It returns a string containing the largest name that can be formed from adjacent code points in the stream, starting from the first.

Note: This algorithm does not do the verification of the first few code points that are necessary to ensure the returned code points would constitute an <ident-token>. If that is the intended use, ensure that the stream starts with an ident sequence before calling this algorithm.

func ConsumeNumber

func ConsumeNumber(rdr *runeio.Reader) (nt token.NumberType, repr string, value float64)

ConsumeNumber consumes a number from a stream of code points. It returns a representation, a numeric value, and a type which is either "integer" or "number".

The representation is the token lexeme as it appears in the input stream. This preserves details such as whether 0.009 was written as .009 or as 9e-3.

Note: This algorithm does not do the verification of the first few code points that are necessary to ensure a number can be obtained from the stream. Ensure that the stream starts with a number before calling this algorithm.
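For illustration only, the sketch below shows the shape of a direct ConsumeNumber call. It is not part of the package documentation: this page only names the *runeio.Reader type, so the runeio import path shown here is an assumption; check the actual runeio package for its location and how to construct a reader.

package example

import (
	"fmt"

	"github.com/tawesoft/golib/v2/css/tokenizer"
	// Assumed import path: the tokenizer docs name the *runeio.Reader
	// type but not its package location; adjust as needed.
	"github.com/tawesoft/golib/v2/text/runeio"
)

// PrintNumber is a hypothetical helper showing a direct ConsumeNumber
// call. The caller must already have checked that rdr starts with a
// number, and rdr must wrap a filtered stream as described in the
// package overview.
func PrintNumber(rdr *runeio.Reader) {
	nt, repr, value := tokenizer.ConsumeNumber(rdr)
	// e.g. for the input "12.5e-1": repr is "12.5e-1" (the original
	// spelling is preserved), value is 1.25, and the type is "number"
	// rather than "integer".
	fmt.Printf("%v %q %v\n", nt, repr, value)
}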

func ConsumeNumericToken

func ConsumeNumericToken(rdr *runeio.Reader) token.Token

ConsumeNumericToken consumes a numeric token from a stream of code points. It returns either a <number-token>, <percentage-token>, or <dimension-token>.

func ConsumeString

func ConsumeString(rdr *runeio.Reader, endpoint rune) (t token.Token, err error)

ConsumeString consumes a string token. It assumes that the character that opens the string (if any) has already been consumed. It returns either a <string-token> or a <bad-string-token>. The endpoint argument specifies the code point that terminates the string (e.g. a double or single quotation mark).

func ConsumeUrlToken

func ConsumeUrlToken(rdr *runeio.Reader) (token.Token, error)

ConsumeUrlToken consumes a url token from a stream of code points. It returns either a <url-token> or a <bad-url-token>.

Note: This algorithm assumes that the initial "url(" has already been consumed. This algorithm also assumes that it’s being called to consume an "unquoted" value, like url(foo). A quoted value, like url("foo"), is parsed as a <function-token>. ConsumeIdentLikeToken automatically handles this distinction; this algorithm shouldn’t be called directly otherwise.

func ConsumeWhitespace

func ConsumeWhitespace(rdr *runeio.Reader) token.Token

ConsumeWhitespace consumes as much whitespace as possible and returns a <whitespace-token>.

func StringToNumber

func StringToNumber(x string) float64

StringToNumber converts a string to a number according to the CSS specification.

Note: This algorithm does not do any verification to ensure that the string contains only a number. Ensure that the string contains only a valid CSS number before calling this algorithm.
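For illustration, a minimal usage sketch; the expected results follow from the CSS number grammar.

package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/css/tokenizer"
)

func main() {
	// StringToNumber assumes each input is already a valid CSS number.
	fmt.Println(tokenizer.StringToNumber("128"))     // 128
	fmt.Println(tokenizer.StringToNumber("12.5e-1")) // 1.25
	fmt.Println(tokenizer.StringToNumber("+.5"))     // 0.5
}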

Types

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}
Example
package main

import (
	"fmt"
	"strings"

	"github.com/tawesoft/golib/v2/css/tokenizer"
	"github.com/tawesoft/golib/v2/css/tokenizer/token"
)

func main() {
	str := `/* example */
#something[rel~="external"] {
    background-color: rgb(128, 64, 64);
}`
	t := tokenizer.New(strings.NewReader(str))

	for {
		tok := t.NextExcept(token.TypeWhitespace)
		if tok.Is(token.TypeEOF) {
			break
		}
		fmt.Println(tok)
	}

	if len(t.Errors()) > 0 {
		fmt.Printf("%v\n", t.Errors())
	}

}
Output:

<hash-token>{type: "id", value: "something"}
<[-token>
<ident-token>{value: "rel"}
<delim-token>{delim: '~'}
<delim-token>{delim: '='}
<string-token>{value: "external"}
<]-token>
<{-token>
<ident-token>{value: "background-color"}
<colon-token>
<function-token>{value: "rgb"}
<number-token>{type: "integer", value: 128.000000, repr: "128"}
<comma-token>
<number-token>{type: "integer", value: 64.000000, repr: "64"}
<comma-token>
<number-token>{type: "integer", value: 64.000000, repr: "64"}
<)-token>
<semicolon-token>
<}-token>

func New

func New(r io.Reader) *Tokenizer

func (*Tokenizer) Errors

func (z *Tokenizer) Errors() []error

Errors reports parse errors.

func (*Tokenizer) Next

func (z *Tokenizer) Next() (result token.Token)

Next returns the next token from the input stream. Once the stream has ended, it returns token.EOF().

To detect parse errors, check z.Errors() once the stream has ended, or at any point if you want to fail fast without recovering.
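A sketch of that pattern, mirroring the package example above but using the plain Next method (so whitespace tokens are included) and reporting any parse errors at the end:

package main

import (
	"fmt"
	"strings"

	"github.com/tawesoft/golib/v2/css/tokenizer"
	"github.com/tawesoft/golib/v2/css/tokenizer/token"
)

func main() {
	t := tokenizer.New(strings.NewReader(`a { color: red }`))

	for {
		tok := t.Next()
		if tok.Is(token.TypeEOF) {
			break
		}
		fmt.Println(tok)
	}

	// Check for parse errors once the stream has ended.
	for _, err := range t.Errors() {
		fmt.Println("parse error:", err)
	}
}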

func (*Tokenizer) NextExcept

func (z *Tokenizer) NextExcept(types ...token.Type) (result token.Token)

NextExcept is like Tokenizer.Next, except that any tokens matching the given types are suppressed. For example, it is common to ignore whitespace. token.EOF() is never ignored.

func (*Tokenizer) Push added in v2.7.0

func (z *Tokenizer) Push(x token.Token)

Push places a token back on a pushback buffer (first in, first out) so that it is returned by Next before the input stream is advanced further. The buffer has a limited capacity, and Push panics if that capacity is exceeded.
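For example, the pushback buffer makes a one-token lookahead straightforward. The peek helper below is a hypothetical illustration and not part of this package:

package main

import (
	"fmt"
	"strings"

	"github.com/tawesoft/golib/v2/css/tokenizer"
	"github.com/tawesoft/golib/v2/css/tokenizer/token"
)

// peek returns the next token without consuming it, by immediately
// pushing it back onto the pushback buffer. Hypothetical helper for
// illustration only.
func peek(z *tokenizer.Tokenizer) token.Token {
	tok := z.Next()
	z.Push(tok) // the next call to Next returns this token again
	return tok
}

func main() {
	z := tokenizer.New(strings.NewReader("color: red"))
	fmt.Println(peek(z))  // <ident-token>{value: "color"}
	fmt.Println(z.Next()) // the same token again
}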

Directories

Path	Synopsis
filter	Package filter implements a [transform.Transformer] that performs the Unicode code point filtering preprocessing step defined in [CSS Syntax Module Level 3, section 3.3].
token	Package token defines CSS tokens produced by a tokenizer.
