sanitize

v0.0.0-...-2cee576 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 16, 2020 License: MIT Imports: 7 Imported by: 0

README

Sanitize

A dead simple Go HTML whitelist-sanitization library.

Goal

Efficiently support the following types of HTML sanitization through simple programmatic or JSON configuration:

  • Removal of all non-whitelisted elements
  • Unwrapping of all non-whitelisted elements

Examples

Given a whitelist configuration

{
    "elements": {
        "div": ["id", "class"],
        "b": [],
        "i": []
    }
}

and basic input

<div class="my-class" style="position:relative;">
    <i>Something emphasized</i>
    <p>
        here is a
        <i>paragraph</i>
    </p>
    <b>Something bold</b> 
</div>

Removal

Removal of non-whitelisted elements in the provided example would yield

<div class="my-class">
    <i>Something emphasized</i>
    <b>Something bold</b> 
</div>

Note how the style attribute was removed from the div element and the p element was removed entirely, along with its children.

Unwrapping

Unwrapping of non-whitelisted elements in the provided example would yield

<div class="my-class">
    <i>Something emphasized</i>
    here is a
    <i>paragraph</i>
    <b>Something bold</b> 
</div>

Note how the style attribute was still removed from the div element, while the p element was 'unwrapped' (i.e. its children were attached to its parent).
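The difference between the two modes can be sketched on a toy element tree. This is only an illustration under stated assumptions: the `node` type and the `sanitizeRemove`, `sanitizeUnwrap`, `render`, and `exampleDoc` helpers are hypothetical and are not part of this package, which works on parsed HTML. Attribute filtering is left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a toy tree used only for this illustration.
// A node with an empty tag is a text node.
type node struct {
	tag      string
	text     string
	children []*node
}

// sanitizeRemove drops a disallowed element together with its whole subtree.
func sanitizeRemove(n *node, allowed map[string]bool) []*node {
	if n.tag == "" {
		return []*node{n} // text nodes always pass through
	}
	if !allowed[n.tag] {
		return nil
	}
	out := &node{tag: n.tag}
	for _, c := range n.children {
		out.children = append(out.children, sanitizeRemove(c, allowed)...)
	}
	return []*node{out}
}

// sanitizeUnwrap drops a disallowed element but hands its sanitized
// children up to the caller, which attaches them to the element's parent.
func sanitizeUnwrap(n *node, allowed map[string]bool) []*node {
	if n.tag == "" {
		return []*node{n}
	}
	var kids []*node
	for _, c := range n.children {
		kids = append(kids, sanitizeUnwrap(c, allowed)...)
	}
	if !allowed[n.tag] {
		return kids
	}
	return []*node{{tag: n.tag, children: kids}}
}

// render serializes the toy tree without attributes or indentation.
func render(nodes []*node) string {
	var b strings.Builder
	for _, n := range nodes {
		if n.tag == "" {
			b.WriteString(n.text)
			continue
		}
		b.WriteString("<" + n.tag + ">" + render(n.children) + "</" + n.tag + ">")
	}
	return b.String()
}

// exampleDoc builds a tree shaped like the basic input above.
func exampleDoc() *node {
	return &node{tag: "div", children: []*node{
		{tag: "i", children: []*node{{text: "Something emphasized"}}},
		{tag: "p", children: []*node{
			{text: "here is a "},
			{tag: "i", children: []*node{{text: "paragraph"}}},
		}},
		{tag: "b", children: []*node{{text: "Something bold"}}},
	}}
}

func main() {
	allowed := map[string]bool{"div": true, "b": true, "i": true}
	fmt.Println(render(sanitizeRemove(exampleDoc(), allowed)))
	fmt.Println(render(sanitizeUnwrap(exampleDoc(), allowed)))
}
```

In this sketch the only difference between the two functions is what happens at a disallowed element: removal returns nothing, while unwrapping returns the already sanitized children.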

Usage

Create a JSON configuration. The currently supported options are:

| key | value type | default | description |
| --- | --- | --- | --- |
| stripComments | boolean | false | Whether or not to strip comment nodes |
| stripWhitespace | boolean | false | Whether or not to strip whitespace (leading and trailing tabs or spaces) |
| elements | Object | {} | A set of key-value pairs where the keys are whitelisted element tags and the values are arrays of whitelisted attributes for that element |

{
    "stripComments": true,
    "stripWhitespace": true,
    "elements": {
        "html": ["xmlns"],
        "head": [],
        "body": [],
        "div": ["id", "class"]
    }
}

Create a sanitize.Whitelist object from a JSON file with sanitize.WhitelistFromFile(filepath string) or from a []byte with sanitize.NewWhitelist(byteArray []byte) and use it to sanitize some HTML:

whitelist, err := sanitize.WhitelistFromFile("./path/to/file.json")
// or create from a JSON []byte
// whitelist, err := sanitize.NewWhitelist(byteArray)
if err != nil {
	log.Fatal(err)
}

f, err := os.Open("./path/to/example.html")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

sanitized, _ := whitelist.SanitizeRemove(f) // takes any io.Reader
fmt.Printf("sanitized html: %s", sanitized) // sanitized is a string
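To see how a JSON configuration maps onto the documented Whitelist fields, here is a minimal sketch that uses only encoding/json. It does not import this package: `whitelistConfig` is a local copy of the documented struct fields and `parseConfig` is a hypothetical helper, so treat it as an approximation of what NewWhitelist might do rather than its actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// whitelistConfig mirrors the fields of the documented Whitelist struct.
// It is a local stand-in for illustration, not the package's own type.
type whitelistConfig struct {
	StripWhitespace bool                `json:"stripWhitespace"`
	StripComments   bool                `json:"stripComments"`
	Elements        map[string][]string `json:"elements"`
}

// parseConfig is a hypothetical helper that decodes a JSON configuration.
func parseConfig(data []byte) (*whitelistConfig, error) {
	cfg := &whitelistConfig{}
	if err := json.Unmarshal(data, cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	data := []byte(`{
		"stripComments": true,
		"elements": {"div": ["id", "class"], "b": []}
	}`)
	cfg, err := parseConfig(data)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	// Keys missing from the JSON keep their zero value (false here).
	fmt.Println(cfg.StripComments, cfg.StripWhitespace, cfg.Elements["div"])
}
```

Note that encoding/json rejects trailing commas, so a configuration file must not end its last "elements" entry with one.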

Supported operations

whitelist, err := sanitize.WhitelistFromFile("./path/to/file.json")
if err != nil {
	log.Fatal(err)
}
f, _ := os.Open("./path/to/example.html")

// The calls below are alternatives. Each one consumes the
// reader it is given, so pass a fresh io.Reader to each call.

// sanitize a full HTML document by removing
// non-whitelisted elements and attributes
sanitized, _ := whitelist.SanitizeRemove(f)

// sanitize a full HTML document by reattaching
// the children of non-whitelisted elements to the
// parent of the unwrapped element; also removes
// non-whitelisted attributes from any element
sanitized, _ = whitelist.SanitizeUnwrap(f)

// sanitize an HTML document fragment (i.e. no html,
// head, or body tags) by removing
// non-whitelisted elements and attributes
sanitized, _ = whitelist.SanitizeRemoveFragment(f)

// sanitize an HTML document fragment (i.e. no html,
// head, or body tags) by reattaching
// the children of non-whitelisted elements to the
// parent of the unwrapped element; also removes
// non-whitelisted attributes from any element
sanitized, _ = whitelist.SanitizeUnwrapFragment(f)

Steps to 1.0

  • Support sanitization that unwraps non-whitelisted nodes, allowing the text and/or whitelisted subtree through
  • Whitelist-level configuration options (e.g. stripWhitespace)
  • Efficient attribute checking by not allocating a new slice on every whitelisted attribute for an element
  • Support sanitization of HTML fragments (instead of just full documents)
  • Support non string type attribute values
  • Refactor configuration parsing to have []byte interface instead of expecting a filepath
  • Create sane defaults
  • Usable godoc documentation

Known Issues

Contributing

Head over to the issues page or open a pull request. Please ensure your code is documented, all existing tests pass, and any new features have tests before submitting a pull request. If you want to check whether a pull request for a new feature would be accepted, feel free to open an issue.

License

MIT

Documentation

Types

type Whitelist

type Whitelist struct {
	StripWhitespace bool                `json:"stripWhitespace"`
	StripComments   bool                `json:"stripComments"`
	Elements        map[string][]string `json:"elements"`
}

func NewWhitelist

func NewWhitelist(jsonData []byte) (*Whitelist, error)

Create a new whitelist from JSON configuration

func WhitelistFromFile

func WhitelistFromFile(filepath string) (*Whitelist, error)

Load a new whitelist from a JSON file

func (*Whitelist) AddElement

func (w *Whitelist) AddElement(elementTag string, attributes []string)

func (*Whitelist) GetAttributesForElement

func (w *Whitelist) GetAttributesForElement(elementTag string) []string

func (*Whitelist) HasAttributeForElement

func (w *Whitelist) HasAttributeForElement(elementTag string, attributeName string) bool

func (*Whitelist) HasElement

func (w *Whitelist) HasElement(elementTag string) bool
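The lookup methods above carry no doc comments. One plausible reading, assuming they consult the Elements map directly, is sketched below; `hasElement` and `hasAttribute` are hypothetical free functions written for this illustration, not the package's code.

```go
package main

import "fmt"

// hasElement reports whether tag is a key of the elements map.
// The comma-ok form matters: an element with an empty attribute
// list (such as "b": []) is still whitelisted.
func hasElement(elements map[string][]string, tag string) bool {
	_, ok := elements[tag]
	return ok
}

// hasAttribute reports whether attr is listed for tag.
// Ranging over a missing key yields no iterations, so unknown
// tags simply return false.
func hasAttribute(elements map[string][]string, tag, attr string) bool {
	for _, a := range elements[tag] {
		if a == attr {
			return true
		}
	}
	return false
}

func main() {
	elements := map[string][]string{"div": {"id", "class"}, "b": {}}
	fmt.Println(hasElement(elements, "b"), hasElement(elements, "p"))
	fmt.Println(hasAttribute(elements, "div", "class"), hasAttribute(elements, "div", "style"))
}
```

The linear scan is fine for the short attribute lists typical of a whitelist; a map of sets would be an alternative if the lists grew large.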

func (*Whitelist) SanitizeRemove

func (w *Whitelist) SanitizeRemove(reader io.Reader) (string, error)

Remove non-whitelisted elements entirely from a full HTML document.

func (*Whitelist) SanitizeRemoveFragment

func (w *Whitelist) SanitizeRemoveFragment(reader io.Reader) (string, error)

Remove non-whitelisted elements in the provided document fragment.

Because the go.net/html library by default creates a document root with a head and body around the provided fragment, those wrapper elements are unwrapped before the sanitizeRemove function is applied to the remaining children.

func (*Whitelist) SanitizeUnwrap

func (w *Whitelist) SanitizeUnwrap(reader io.Reader) (string, error)

Unwrap non-whitelisted elements from a full HTML document.

func (*Whitelist) SanitizeUnwrapFragment

func (w *Whitelist) SanitizeUnwrapFragment(reader io.Reader) (string, error)

Unwrap non-whitelisted elements in the provided document fragment.

Because the go.net/html library by default creates a document root with a head and body around the provided fragment, those wrapper elements are unwrapped before the sanitizeUnwrap function is applied to the remaining children.

func (*Whitelist) ToJSON

func (w *Whitelist) ToJSON() (string, error)
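ToJSON is undocumented. Given the JSON tags on the struct, it presumably serializes the whitelist back to its configuration form. The sketch below shows that idea with encoding/json; `whitelistConfig` and `toJSON` are local, hypothetical stand-ins, and the real method may format its output differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// whitelistConfig mirrors the documented Whitelist struct fields.
// It is a local stand-in for illustration only.
type whitelistConfig struct {
	StripWhitespace bool                `json:"stripWhitespace"`
	StripComments   bool                `json:"stripComments"`
	Elements        map[string][]string `json:"elements"`
}

// toJSON is a hypothetical stand-in for the documented ToJSON method.
func toJSON(cfg *whitelistConfig) (string, error) {
	data, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	cfg := &whitelistConfig{
		StripComments: true,
		Elements:      map[string][]string{"div": {"id", "class"}, "b": {}},
	}
	out, err := toJSON(cfg)
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	fmt.Println(out)
}
```

json.Marshal writes struct fields in declaration order and map keys in sorted order, so the output of this sketch is deterministic.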
