htmlutil

package
v0.4.2

Warning: This package is not in the latest version of its module.
Published: Jun 2, 2023 | License: BSD-2-Clause | Imports: 11 | Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetRawTextFromHTML

func GetRawTextFromHTML(r io.Reader) (io.Reader, error)

GetRawTextFromHTML extracts text from an HTML document without retaining any particular formatting information.

Limitation: GetRawTextFromHTML is only a minimal, naive HTML-to-text extractor. It ignores HTML formatting directives and the rules for collapsing whitespace; it only strips HTML markup to expose the meaningful text for scraping or searching.
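To make the limitation concrete, here is a minimal sketch of what a naive tag-stripping extractor with the same signature could look like. It is not the package's implementation: stripTags and extract are hypothetical names, the sketch uses only the standard library, and it decodes no entities and does not treat script or style content specially.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// stripTags is a hypothetical helper, NOT part of htmlutil. It drops
// everything between '<' and '>' and collapses whitespace runs.
func stripTags(r io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}

	var b strings.Builder
	inTag := false
	for _, c := range string(raw) {
		switch {
		case c == '<':
			inTag = true
			b.WriteRune(' ') // treat a tag boundary as a word separator
		case c == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(c)
		}
	}

	// strings.Fields splits on any whitespace, so re-joining with a single
	// space collapses runs and trims both ends.
	text := strings.Join(strings.Fields(b.String()), " ")
	return strings.NewReader(text), nil
}

// extract is a small convenience wrapper (also hypothetical) that returns
// the extracted text as a string.
func extract(s string) string {
	r, err := stripTags(strings.NewReader(s))
	if err != nil {
		return ""
	}
	out, _ := io.ReadAll(r)
	return string(out)
}

func main() {
	fmt.Println(extract("<p>Hello <b>brave</b> new world</p>"))
	// prints "Hello brave new world"
}
```

Returning an io.Reader, as GetRawTextFromHTML does, lets the result feed directly into an indexer or another scanner without an intermediate string.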

Types

type Scanner

type Scanner struct {
	// AllowedTags is the white-list of allowed tags. For each tag, allowed
	// attributes can be expressed as a pattern:
	//  - **         : all attributes (except data-xxx and onxxx events that
	//                 should be explicitly allowed) are allowed.
	//  - *          : all attributes (except data-xxx and onxxx events that
	//                 should be explicitly allowed) are allowed if their value
	//                 is a space-separated list of names (letters, numbers, _
	//                 or -).
	//  - a=**       : attribute 'a' whatever its value is.
	//  - a or a=*   : attribute 'a' whose value is a space-separated list of
	//                 names (letters, numbers, _ or -).
	//  - a=_MIME    : attribute 'a' whose value is a mimetype specification
	//                 (names separated by '/' like 'text/css')
	//  - a=key      : attribute 'a' whose value is 'key'.
	//  - a=__URL    : attribute 'a' whose value is an 'allowed' URL.
	//                 An allowed URL is parsable, with a scheme matching
	//                 SafeSchemes.
	//                 'Anonymous' hosts (no recorded domain name) are not
	//                 accepted.
	//                 Absolute URLs for style sheets are not accepted.
	//                 Absolute URLs with target=_blank but without
	//                 rel="noopener" are not accepted.
	//                 URL queries are not accepted unless the _? suffix
	//                 is added.
	//  - a=__REL_URL: like __URL but only relative URLs are accepted.
	//
	// Several patterns can be listed for a given attribute name. Patterns
	// are checked in their declaration order (first
	// matching will pass/first non-matching check will fail).
	// LIMITATION: Be extra careful when using catch-all patterns. For instance,
	// {http-equiv=refresh, '*'} in effect allows any http-equiv to be
	// accepted, so catch-all patterns are actually quite tricky to use.
	// TODO: As of now, this is a "good enough" approach, but it probably needs
	// further polishing/rework to become acceptable.
	AllowedTags map[atom.Atom][]string

	// AllowedURLSchemes is the white-list of allowed URL schemes.
	// "*" allows any scheme.
	AllowedURLSchemes []string

	// AllowAbsoluteURLinCSS, when set to true, accepts external URLs in CSS
	// (by default, only relative or local URLs are accepted). Queries are
	// not accepted.
	AllowAbsoluteURLinCSS bool

	// AllowedCSSProperties is the white-list of accepted CSS properties.
	// "*" allows any property; "!xxx" fails immediately for property xxx even
	// if property xxx is allowed afterwards.
	AllowedCSSProperties []string

	// AllowedCSSFunctions is the white-list of accepted CSS functions.
	// "*" allows any function; "!xxx" fails immediately for function xxx even
	// if function xxx is allowed afterwards.
	AllowedCSSFunctions []string

	// AllowedCSSAtKeywords is the white-list of accepted at-keywords.
	// "*" allows any keyword; "!xxx" fails immediately for keyword xxx even
	// if keyword xxx is allowed afterwards.
	AllowedCSSAtKeywords []string
}
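The AllowedTags pattern syntax, and the catch-all limitation noted in its comment, may be easier to follow with a small sketch. The code below is not the package's implementation: attrAllowed, valueOK and needsExplicit are hypothetical names, and the sketch assumes that the first matching pattern accepts the attribute. It covers only the **, *, a, a=*, a=** and a=key forms; _MIME, __URL and __REL_URL are not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nameList matches a space-separated list of names (letters, digits, _ or -).
var nameList = regexp.MustCompile(`^[A-Za-z0-9_-]+( +[A-Za-z0-9_-]+)*$`)

// needsExplicit reports whether attr is a data-xxx or onxxx attribute, which
// the catch-all patterns never accept (assumed prefix test, for illustration).
func needsExplicit(attr string) bool {
	return strings.HasPrefix(attr, "data-") || strings.HasPrefix(attr, "on")
}

// valueOK checks a value against a value spec: "**" accepts anything, "*"
// accepts a list of names, anything else is compared as a literal key.
func valueOK(spec, val string) bool {
	switch spec {
	case "**":
		return true
	case "*":
		return nameList.MatchString(val)
	default:
		return val == spec
	}
}

// attrAllowed is a hypothetical matcher: patterns are tried in declaration
// order and the first one that matches accepts the attribute.
func attrAllowed(patterns []string, attr, val string) bool {
	for _, p := range patterns {
		if p == "*" || p == "**" { // catch-all patterns
			if !needsExplicit(attr) && valueOK(p, val) {
				return true
			}
			continue
		}
		name, spec, hasSpec := strings.Cut(p, "=")
		if name != attr {
			continue
		}
		if !hasSpec {
			spec = "*" // a bare "a" means the same as "a=*"
		}
		if valueOK(spec, val) {
			return true
		}
	}
	return false
}

func main() {
	strict := []string{"http-equiv=refresh"}
	loose := []string{"http-equiv=refresh", "*"}

	fmt.Println(attrAllowed(strict, "http-equiv", "content-type")) // false
	fmt.Println(attrAllowed(loose, "http-equiv", "content-type"))  // true: the catch-all wins
	fmt.Println(attrAllowed([]string{"**"}, "onclick", "alert(1)")) // false
}
```

Under these assumptions, the second call shows the limitation: once "*" is in the list, the narrower http-equiv=refresh pattern no longer restricts anything, because any name-like value is accepted by the catch-all.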

Scanner represents an HTML/CSS scanner that looks for possible security risks. This scanner is only intended for checking existing untrusted HTML/CSS in EPUB files; it does not properly handle all injection cases, notably obfuscated strings that rely, for example, on unusual character encodings.
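The "*" and "!xxx" entries accepted by AllowedCSSProperties, AllowedCSSFunctions and AllowedCSSAtKeywords can be read as an ordered allow/deny list. The following is a rough sketch of that reading, not the package's code: listAllows is a hypothetical name, and the sketch assumes entries are checked in declaration order with exact, case-sensitive comparison.

```go
package main

import "fmt"

// listAllows is a hypothetical helper illustrating an ordered white-list:
// "!name" rejects name immediately, "*" or "name" accepts it, and a name
// that matches no entry is rejected.
func listAllows(list []string, name string) bool {
	for _, entry := range list {
		switch {
		case entry == "!"+name:
			return false // explicit denial wins over any later entry
		case entry == "*" || entry == name:
			return true
		}
	}
	return false
}

func main() {
	props := []string{"!position", "*"}

	fmt.Println(listAllows(props, "color"))    // true
	fmt.Println(listAllows(props, "position")) // false
}
```

With this reading, a denial only takes effect if it is declared before the entry that would otherwise accept the name, so "!position" is placed ahead of "*" in the example.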

func NewMinimalScanner

func NewMinimalScanner(globalAttr ...string) *Scanner

NewMinimalScanner creates a new scanner that allows only minimal HTML features, with no CSS or JS. The attributes in globalAttr are added to the list of allowed attributes of every atom.

func NewPermissiveScanner

func NewPermissiveScanner() *Scanner

NewPermissiveScanner creates a new scanner that allows any attribute or CSS property, but not fetching external resources/URLs.

func NewScannerWithStyle

func NewScannerWithStyle(globalAttr ...string) *Scanner

NewScannerWithStyle creates a new scanner that extends NewStrictScanner to allow the use of common CSS properties (but not CSS functions or external CSS resources). If globalAttr is provided, these attributes are allowed for each atom, in addition to the "style" and "class" attributes that are accepted by default.

func (*Scanner) Scan

func (s *Scanner) Scan(r io.Reader) ([]string, error)

Scan checks that r contains only allowed tags and attributes. Scan returns a list of messages describing the issues encountered.

func (*Scanner) ScanCSS

func (s *Scanner) ScanCSS(r io.Reader) ([]string, error)

ScanCSS checks that r contains only allowed CSS style declarations. ScanCSS returns a list of messages describing the issues encountered.

Directories

Path Synopsis
Package css - parse CSS
