html

package
v0.5.13
Warning

This package is not in the latest version of its module.

Published: May 1, 2024 | License: MIT | Imports: 10 | Imported by: 0

Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

func NewHTMLCommand

func NewHTMLCommand() (*cobra.Command, error)

Types

type HTMLSplitParser

type HTMLSplitParser struct {
	// contains filtered or unexported fields
}

HTMLSplitParser is a GlazeProcessor that splits an HTML document into sections. When it encounters one of the tags in splitTags, it extracts that tag's content as the Title (if extractTitle is true) and collects the following siblings, up to the next split tag, as the body.
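The splitting rule described above can be sketched on a simplified node type. This is a minimal illustration, not the package's implementation: `node`, `section`, `chain`, and `splitSiblings` are hypothetical names, the flat sibling chain stands in for a real `*html.Node` tree, and dropping content that precedes the first split tag is an assumption of the sketch.

```go
package main

import "strings"

// node is a hypothetical stand-in for an HTML element in a sibling chain.
type node struct {
	Tag         string // element name, e.g. "h1" or "p"
	Text        string // flattened text content of the element
	NextSibling *node
}

// section is a hypothetical output record: one title plus its body text.
type section struct {
	Title string
	Body  string
}

// chain links the given nodes as siblings and returns the first one.
func chain(nodes ...*node) *node {
	if len(nodes) == 0 {
		return nil
	}
	for i := 0; i+1 < len(nodes); i++ {
		nodes[i].NextSibling = nodes[i+1]
	}
	return nodes[0]
}

// splitSiblings starts a new section at every tag listed in splitTags.
// The split tag's text becomes the title when extractTitle is true; the
// siblings that follow, up to the next split tag, become the body.
// Assumption: nodes before the first split tag are ignored.
func splitSiblings(first *node, splitTags []string, extractTitle bool) []section {
	isSplit := make(map[string]bool, len(splitTags))
	for _, t := range splitTags {
		isSplit[t] = true
	}

	var sections []section
	var cur section
	var body []string
	started := false

	// flush stores the section collected so far, if any.
	flush := func() {
		if started {
			cur.Body = strings.Join(body, "\n")
			sections = append(sections, cur)
		}
	}

	for n := first; n != nil; n = n.NextSibling {
		if isSplit[n.Tag] {
			flush()
			cur, body, started = section{}, nil, true
			if extractTitle {
				cur.Title = n.Text
			}
			continue
		}
		if started {
			body = append(body, n.Text)
		}
	}
	flush()
	return sections
}
```

Calling `splitSiblings` with `[]string{"h1", "h2"}` on a chain of `h1, p, p, h2, p` yields two sections, each titled by its heading and carrying the text of the paragraphs that follow it.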

func NewHTMLHeadingSplitParser

func NewHTMLHeadingSplitParser(gp middlewares.Processor, removeTags []string) *HTMLSplitParser

NewHTMLHeadingSplitParser creates a new HTMLSplitParser that splits the document into sections at heading tags (h1, h2, h3...) and keeps the headings as titles.

func NewHTMLSplitParser

func NewHTMLSplitParser(gp middlewares.Processor, removeTags, splitTags []string, extractTitle bool) *HTMLSplitParser

func (*HTMLSplitParser) ProcessNode

func (hsp *HTMLSplitParser) ProcessNode(ctx context.Context, n *html.Node) (*html.Node, error)

ProcessNode extracts the content below a header tag and sends it to the GlazeProcessor. It extracts the header tag's content as the Title, and the following siblings, up to the next header tag, as the body.

It returns the next node to be parsed, because each split consumes a variable number of sibling nodes and the caller must resume after them.
