htmlsanitizer

package
Version: v0.0.0-...-642df0c
Published: Jan 14, 2024 | License: BSD-3-Clause, MIT | Imports: 5 | Imported by: 0

README

Golang HTML Sanitizer

htmlsanitizer is a super fast, allowlist-based HTML sanitizer written in Golang. A built-in, secure-by-default allowlist helps you filter out any dangerous HTML content.

Why use htmlsanitizer?

  • Fast: it is implemented internally as a finite state machine, so the time complexity is always O(n).
  • Highly customizable: you can modify the allowlist, or simply disable all HTML tags.
  • Dependency free.
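
To illustrate why a finite state machine gives linear time, here is a minimal sketch of a two-state machine that strips every tag in a single pass. It is not htmlsanitizer's actual implementation: the names stripTags, stateText and stateTag are hypothetical, and the sketch deliberately ignores comments, entities, and quoted attribute values that contain '>'.

```go
package main

import (
	"fmt"
	"strings"
)

type state int

const (
	stateText state = iota // outside any tag: bytes are copied through
	stateTag               // inside <...>: bytes are dropped
)

// stripTags visits each input byte exactly once, so it runs in O(n).
func stripTags(s string) string {
	var b strings.Builder
	st := stateText
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch st {
		case stateText:
			if c == '<' {
				st = stateTag
			} else {
				b.WriteByte(c)
			}
		case stateTag:
			if c == '>' {
				st = stateText
			}
		}
	}
	return b.String()
}

func main() {
	fmt.Println(stripTags("<p>Hello, <b>world</b>!</p>")) // prints "Hello, world!"
}
```

Each byte triggers at most one state transition and one write, which is where the O(n) bound comes from. A real allowlist-based sanitizer needs more states (tag name, attribute name, attribute value, and so on), but the same single-pass structure applies.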

Install

go get -u github.com/sym01/htmlsanitizer

Getting Started

Use the secure-by-default allowlist

Simply use the secure-by-default allowlist to sanitize untrusted HTML.

sanitizedHTML, err := htmlsanitizer.SanitizeString(rawHTML)

Disable the id attribute globally

By default, htmlsanitizer allows the id attribute globally. To disallow it, simply override GlobalAttr.

s := htmlsanitizer.NewHTMLSanitizer()
s.GlobalAttr = []string{"class"}

sanitizedHTML, err := s.SanitizeString(rawHTML)

Disable or add HTML tag

s := htmlsanitizer.NewHTMLSanitizer()
// remove <a> tag
s.RemoveTag("a")

// add a custom tag named my-tag, which allows my-attr attribute
customTag := &htmlsanitizer.Tag{
    Name: "my-tag",
    Attr: []string{"my-attr"},
}
s.AllowList.Tags = append(s.AllowList.Tags, customTag)

sanitizedHTML, err := s.SanitizeString(rawHTML)

Disable all HTML tags

You can also use htmlsanitizer to remove all HTML tags.

s := htmlsanitizer.NewHTMLSanitizer()
// just set AllowList to nil to disable all tags
s.AllowList = nil

sanitizedHTML, err := s.SanitizeString(rawHTML)

Documentation

Constants

This section is empty.

Variables

var DefaultAllowList = &AllowList{
	Tags: []*Tag{
		{"address", []string{}, []string{}},
		{"article", []string{}, []string{}},
		{"aside", []string{}, []string{}},
		{"footer", []string{}, []string{}},
		{"header", []string{}, []string{}},
		{"h1", []string{}, []string{}},
		{"h2", []string{}, []string{}},
		{"h3", []string{}, []string{}},
		{"h4", []string{}, []string{}},
		{"h5", []string{}, []string{}},
		{"h6", []string{}, []string{}},
		{"hgroup", []string{}, []string{}},
		{"main", []string{}, []string{}},
		{"nav", []string{}, []string{}},
		{"section", []string{}, []string{}},
		{"blockquote", []string{}, []string{"cite"}},
		{"dd", []string{}, []string{}},
		{"div", []string{}, []string{}},
		{"dl", []string{}, []string{}},
		{"dt", []string{}, []string{}},
		{"figcaption", []string{}, []string{}},
		{"figure", []string{}, []string{}},
		{"hr", []string{}, []string{}},
		{"li", []string{}, []string{}},
		{"main", []string{}, []string{}},
		{"ol", []string{}, []string{}},
		{"p", []string{}, []string{}},
		{"pre", []string{}, []string{}},
		{"ul", []string{}, []string{}},
		{"a", []string{"rel", "target", "referrerpolicy"}, []string{"href"}},
		{"abbr", []string{"title"}, []string{}},
		{"b", []string{}, []string{}},
		{"bdi", []string{}, []string{}},
		{"bdo", []string{}, []string{}},
		{"br", []string{}, []string{}},
		{"cite", []string{}, []string{}},
		{"code", []string{}, []string{}},
		{"data", []string{"value"}, []string{}},
		{"em", []string{}, []string{}},
		{"i", []string{}, []string{}},
		{"kbd", []string{}, []string{}},
		{"mark", []string{}, []string{}},
		{"q", []string{}, []string{"cite"}},
		{"s", []string{}, []string{}},
		{"small", []string{}, []string{}},
		{"span", []string{}, []string{}},
		{"strong", []string{}, []string{}},
		{"sub", []string{}, []string{}},
		{"sup", []string{}, []string{}},
		{"time", []string{"datetime"}, []string{}},
		{"u", []string{}, []string{}},
		{"area", []string{"alt", "coords", "shape", "target", "rel", "referrerpolicy"}, []string{"href"}},
		{"audio", []string{"autoplay", "controls", "crossorigin", "duration", "loop", "muted", "preload"}, []string{"src"}},
		{"img", []string{"alt", "crossorigin", "height", "width", "loading", "referrerpolicy"}, []string{"src"}},
		{"map", []string{"name"}, []string{}},
		{"track", []string{"default", "kind", "label", "srclang"}, []string{"src"}},
		{"video", []string{"autoplay", "buffered", "controls", "crossorigin", "duration", "loop", "muted", "preload", "height", "width"}, []string{"src", "poster"}},

		{"picture", []string{}, []string{}},
		{"source", []string{"type"}, []string{"src"}},

		{"del", []string{}, []string{}},
		{"ins", []string{}, []string{}},
		{"caption", []string{}, []string{}},
		{"col", []string{"span"}, []string{}},
		{"colgroup", []string{}, []string{}},
		{"table", []string{}, []string{}},
		{"tbody", []string{}, []string{}},
		{"td", []string{"colspan", "rowspan"}, []string{}},
		{"tfoot", []string{}, []string{}},
		{"th", []string{"colspan", "rowspan", "scope"}, []string{}},
		{"thead", []string{}, []string{}},
		{"tr", []string{}, []string{}},

		{"details", []string{"open"}, []string{}},
		{"summary", []string{}, []string{}},
	},
	GlobalAttr: []string{
		"class",
		"id",
	},
}

DefaultAllowList is the default allowlist for the HTML filter.

The allowlist contains most tags listed in https://developer.mozilla.org/en-US/docs/Web/HTML/Element . It is not recommended to modify the default list directly; use .Clone() and then modify the new one instead.
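
The advice to clone before modifying comes down to pointer sharing: a package-level list edited in place changes for every caller. The following is a minimal sketch of that idea using hypothetical stand-in types (tag, allowList) and a hypothetical deep-copying clone method. It mirrors the shape of the documented structs but is not the library's code, and the real Clone may copy differently.

```go
package main

import "fmt"

// tag and allowList are stand-ins that mirror the documented struct shapes.
type tag struct {
	Name    string
	Attr    []string
	URLAttr []string
}

type allowList struct {
	Tags       []*tag
	GlobalAttr []string
}

// clone returns a deep copy, so edits to the copy never reach the original.
func (l *allowList) clone() *allowList {
	c := &allowList{GlobalAttr: append([]string(nil), l.GlobalAttr...)}
	for _, t := range l.Tags {
		c.Tags = append(c.Tags, &tag{
			Name:    t.Name,
			Attr:    append([]string(nil), t.Attr...),
			URLAttr: append([]string(nil), t.URLAttr...),
		})
	}
	return c
}

// shared plays the role of a package-level default list.
var shared = &allowList{
	Tags:       []*tag{{Name: "a", Attr: []string{"rel"}, URLAttr: []string{"href"}}},
	GlobalAttr: []string{"class", "id"},
}

func main() {
	mine := shared.clone()
	mine.GlobalAttr = []string{"class"} // drop "id" in the copy only
	mine.Tags[0].Attr[0] = "target"     // edit a tag in the copy only

	// The shared list is unchanged.
	fmt.Println(shared.GlobalAttr, shared.Tags[0].Attr) // prints "[class id] [rel]"
}
```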

Functions

func DefaultURLSanitizer

func DefaultURLSanitizer(rawURL string) (sanitzed string, ok bool)

DefaultURLSanitizer is the default, strict URL sanitizer. It accepts only:

  • URLs with the scheme http or https
  • relative URLs, such as abc, abc?xxx=1, abc#123
  • absolute-path URLs, such as /abc, /abc?xxx=1, /abc#123
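
The three rules above can be approximated with the standard net/url package. This is only a sketch under stated assumptions, not the implementation of DefaultURLSanitizer: the function name strictURL is hypothetical, and rejecting protocol-relative URLs such as //host/path is an assumption made here because they are not in the accepted list.

```go
package main

import (
	"fmt"
	"net/url"
)

// strictURL is a hypothetical approximation of the documented rules.
// It has the same shape as the documented signature:
// func(rawURL string) (sanitized string, ok bool).
func strictURL(rawURL string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	switch u.Scheme { // url.Parse lowercases the scheme
	case "http", "https":
		return rawURL, true
	case "":
		// No scheme: relative or absolute-path URL.
		// Assumption: protocol-relative URLs (//host/...) are rejected.
		if u.Host == "" {
			return rawURL, true
		}
	}
	return "", false
}

func main() {
	for _, raw := range []string{
		"https://example.com/a",
		"abc?xxx=1",
		"/abc#123",
		"javascript:alert(1)",
		"//evil.example/x",
	} {
		_, ok := strictURL(raw)
		fmt.Printf("%-24s accepted=%v\n", raw, ok)
	}
}
```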

func NewWriter

func NewWriter(w io.Writer) io.Writer

NewWriter returns a new Writer, with DefaultAllowList, writing sanitized HTML content to w.

func Sanitize

func Sanitize(data []byte) ([]byte, error)

Sanitize uses the DefaultAllowList to sanitize the HTML data.

func SanitizeString

func SanitizeString(data string) (string, error)

SanitizeString uses the DefaultAllowList to sanitize the HTML string.

Types

type AllowList

type AllowList struct {
	// Tags specifies all the allowed tags.
	Tags []*Tag

	// GlobalAttr specifies the attributes allowed on all tags.
	// It's very useful for some common attributes, such as `class`, `id`.
	// For security reasons, it's not recommended to set a global attr for
	// any URL-related attribute.
	GlobalAttr []string
}

AllowList specifies all the allowed HTML tags and their attributes for the filter.

func (*AllowList) Clone

func (l *AllowList) Clone() *AllowList

Clone returns a new AllowList that is a clone of l.

func (*AllowList) FindTag

func (l *AllowList) FindTag(p []byte) *Tag

FindTag finds and returns tag by its name, case insensitive.
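
Because tag names in the allowlist must be lowercase while incoming markup may use any case, a lookup of this kind needs a case-insensitive comparison. A minimal sketch of such a lookup, assuming a plain slice of lowercase names and a hypothetical helper findTag (this is not the library's FindTag), could look like:

```go
package main

import (
	"bytes"
	"fmt"
)

// findTag returns the stored lowercase name that matches p, ignoring case.
func findTag(names []string, p []byte) (string, bool) {
	for _, n := range names {
		if bytes.EqualFold([]byte(n), p) {
			return n, true
		}
	}
	return "", false
}

func main() {
	names := []string{"div", "span"}
	n, ok := findTag(names, []byte("DIV"))
	fmt.Println(n, ok) // prints "div true"
}
```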

func (*AllowList) RemoveTag

func (l *AllowList) RemoveTag(name string)

RemoveTag removes all tags named `name`, which must be lowercase. It is not recommended to modify the default list directly; use .Clone() and then modify the new one instead.

type HTMLSanitizer

type HTMLSanitizer struct {
	*AllowList

	// URLSanitizer is a func used to sanitize all the URLAttr.
	// URLSanitizer returns a sanitized URL and a bool var indicating
	// whether the current attribute is acceptable. If not acceptable,
	// the current attribute will be ignored.
	// If the func is nil, then DefaultURLSanitizer will be used.
	URLSanitizer func(rawURL string) (sanitzed string, ok bool)
}

HTMLSanitizer is a super fast HTML sanitizer for arbitrary HTML content. It is an allowlist-based sanitizer whose time complexity is O(n).
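
The URLSanitizer field accepts any function with the signature func(rawURL string) (string, bool). As a sketch of what a stricter custom policy might look like, the hypothetical function httpsOnly below accepts only absolute https URLs with a host. The name and the policy are assumptions for illustration, and the code uses only the standard library.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// httpsOnly is a hypothetical custom URL policy: it accepts only absolute
// https URLs that carry a host, and rejects everything else.
func httpsOnly(rawURL string) (string, bool) {
	u, err := url.Parse(strings.TrimSpace(rawURL))
	if err != nil || u.Scheme != "https" || u.Host == "" {
		return "", false
	}
	return u.String(), true
}

func main() {
	for _, raw := range []string{
		"https://example.com/img.png",
		"http://example.com/img.png",
		"/relative/img.png",
	} {
		clean, ok := httpsOnly(raw)
		fmt.Printf("%q -> %q, ok=%v\n", raw, clean, ok)
	}
}
```

Given the field comment above, assigning such a function (for example s.URLSanitizer = httpsOnly on a sanitizer created with NewHTMLSanitizer) should make it take the place of DefaultURLSanitizer for every URLAttr.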

func NewHTMLSanitizer

func NewHTMLSanitizer() *HTMLSanitizer

NewHTMLSanitizer creates a new HTMLSanitizer with a clone of the DefaultAllowList.

func (*HTMLSanitizer) NewWriter

func (f *HTMLSanitizer) NewWriter(w io.Writer) io.Writer

NewWriter returns a new Writer writing sanitized HTML content to w.

func (*HTMLSanitizer) Sanitize

func (f *HTMLSanitizer) Sanitize(data []byte) ([]byte, error)

Sanitize sanitizes the HTML data and returns the sanitized HTML.

func (*HTMLSanitizer) SanitizeString

func (f *HTMLSanitizer) SanitizeString(data string) (string, error)

SanitizeString sanitizes the HTML string and returns the sanitized HTML.

type Tag

type Tag struct {
	// Name for current tag, must be lowercase.
	Name string

	// Attr specifies the allowed attributes for current tag,
	// must be lowercase.
	//
	// e.g. colspan, rowspan
	Attr []string

	// URLAttr specifies the allowed, URL-related attributes for current tag,
	// must be lowercase.
	//
	// e.g. src, href
	URLAttr []string
}

Tag with its attributes.
