converter

package
v0.2.2
Warning

This package is not in the latest version of its module.

Published: Jun 28, 2022 License: MIT Imports: 6 Imported by: 0

Documentation

Overview

Converts HTML content into markdown content

Index

Constants

const DefaultSearchPattern = "p,span,hr,h1,h2,h3,h4,h5,h6,ul,ol,div,table"

DefaultSearchPattern defines a default pattern to search for elements that will contain content for the markdown document
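The value reads like a comma-separated group of tag selectors. As a minimal sketch, assuming that interpretation, the pattern can be split into the individual tag names it covers. The helper patternTags is hypothetical and not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// DefaultSearchPattern is copied from the constant documented above.
const DefaultSearchPattern = "p,span,hr,h1,h2,h3,h4,h5,h6,ul,ol,div,table"

// patternTags is a hypothetical helper (not part of the package) that splits
// a comma-separated selector list into its individual tag names.
func patternTags(pattern string) []string {
	parts := strings.Split(pattern, ",")
	tags := make([]string, 0, len(parts))
	for _, p := range parts {
		if t := strings.TrimSpace(p); t != "" {
			tags = append(tags, t)
		}
	}
	return tags
}

func main() {
	// Print one tag name per line.
	for _, tag := range patternTags(DefaultSearchPattern) {
		fmt.Println(tag)
	}
}
```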

Variables

This section is empty.

Functions

func CleanText

func CleanText(content string) string

CleanText removes newlines, replaces common unicode characters with ascii, removes any other non-common ascii values, and trims any whitespace.
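The following is a rough sketch of the kind of cleaning described here, not the package's implementation. The function cleanTextSketch and its replacement table are assumptions made for illustration; the actual set of characters that CleanText handles is not listed in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// replacements maps a few common Unicode characters to ASCII equivalents and
// removes line breaks. The set is an assumption for illustration only.
var replacements = strings.NewReplacer(
	"\n", "",
	"\r", "",
	"\u201c", `"`, // left double quotation mark
	"\u201d", `"`, // right double quotation mark
	"\u2018", "'", // left single quotation mark
	"\u2019", "'", // right single quotation mark
	"\u2013", "-", // en dash
	"\u00a0", " ", // no-break space
)

// cleanTextSketch is a hypothetical stand-in for CleanText.
func cleanTextSketch(content string) string {
	content = replacements.Replace(content)
	var b strings.Builder
	for _, r := range content {
		// Keep printable ASCII only; every other rune is dropped.
		if r >= 0x20 && r <= 0x7e {
			b.WriteRune(r)
		}
	}
	return strings.TrimSpace(b.String())
}

func main() {
	fmt.Printf("%q\n", cleanTextSketch("  \u201cHello\u201d \u2013 world\u00a0\n"))
}
```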

func PrintUnicodeRunes

func PrintUnicodeRunes(content string)

PrintUnicodeRunes finds all non-ASCII characters in the string and prints the Unicode code point of each. This is useful when debugging to find Unicode characters that the CleanText function still needs to handle.
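A minimal sketch of the same idea, using only the standard library. The helper nonASCIICodePoints is hypothetical; it returns the code points instead of printing them so the result is easy to inspect, and the exact output format of PrintUnicodeRunes may differ.

```go
package main

import (
	"fmt"
	"unicode"
)

// nonASCIICodePoints is a hypothetical helper that collects the code point of
// every non-ASCII rune in content, formatted in the usual "U+XXXX" notation.
func nonASCIICodePoints(content string) []string {
	var points []string
	for _, r := range content {
		if r > unicode.MaxASCII {
			points = append(points, fmt.Sprintf("%U", r))
		}
	}
	return points
}

func main() {
	for _, p := range nonASCIICodePoints("na\u00efve \u201cquote\u201d") {
		fmt.Println(p)
	}
}
```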

Types

type ConfluenceSelectionConverter

type ConfluenceSelectionConverter struct {
	Transformer            *Transformer
	RootElementFinder      FindDocumentSelection
	TitleFinder            FindText
	ContentSelector        FindSelection
	ContentSelectorHandler HandleSelection
}

ConfluenceSelectionConverter converts the Confluence HTML page to markdown. Tags controls which HTML tags will be searched when looking for content. If not set, then the defaultTags will be used.

func NewConfluenceSelectionConverter

func NewConfluenceSelectionConverter(conf SelectionConverterConfig) *ConfluenceSelectionConverter

NewConfluenceSelectionConverter initializes a ConfluenceSelectionConverter with default function calls.

func (*ConfluenceSelectionConverter) FindContentElements

func (c *ConfluenceSelectionConverter) FindContentElements(s *goquery.Selection) *goquery.Selection

FindContentElements finds the selections that should be iterated over for content.

func (*ConfluenceSelectionConverter) FindRootElement

func (c *ConfluenceSelectionConverter) FindRootElement(doc *goquery.Document) *goquery.Selection

FindRootElement finds the root element.

func (*ConfluenceSelectionConverter) FindTitle

func (c *ConfluenceSelectionConverter) FindTitle(doc *goquery.Document) string

FindTitle finds the title of the document.

func (*ConfluenceSelectionConverter) HandleMatchedSelection

func (c *ConfluenceSelectionConverter) HandleMatchedSelection(i int, elm *goquery.Selection, mdDoc *markdown.Doc, toMD SelectionToMD)

HandleMatchedSelection handles matched selections from FindContentElements.

type DocumentConverter

type DocumentConverter struct {
	SelectionConv SelectionConverter
}

DocumentConverter is a struct that can convert an HTML document into a markdown document

func (*DocumentConverter) DocumentToMarkdown

func (c *DocumentConverter) DocumentToMarkdown(doc *goquery.Document) *markdown.Doc

DocumentToMarkdown converts the HTML doc to markdown

func (*DocumentConverter) SelectionToMarkdown

func (c *DocumentConverter) SelectionToMarkdown(elm *goquery.Selection, docConf markdown.DocConfig) *markdown.Doc

SelectionToMarkdown creates a new markdown document, and searches for content to add to the markdown doc. It hands off handling of matched selections to the SelectionConverter since it depends heavily on the HTML structure of the original document.

type FindDocumentSelection

type FindDocumentSelection func(*goquery.Document) *goquery.Selection

FindDocumentSelection is a callable that finds DOM elements in the given Document.

type FindSelection

type FindSelection func(*goquery.Selection) *goquery.Selection

FindSelection is a callable that finds DOM elements in the given selection

type FindText

type FindText func(*goquery.Document) string

FindText is a callable that finds text in the given Document.

type GoogleSelectionConverter

type GoogleSelectionConverter struct {
	Transformer            *Transformer
	RootElementFinder      FindDocumentSelection
	TitleFinder            FindText
	ContentSelector        FindSelection
	ContentSelectorHandler HandleSelection
}

GoogleSelectionConverter converts the Google Doc HTML page to markdown

func NewGoogleSelectionConverter

func NewGoogleSelectionConverter(conf SelectionConverterConfig) *GoogleSelectionConverter

NewGoogleSelectionConverter initializes a GoogleSelectionConverter with default function calls.

func (*GoogleSelectionConverter) FindContentElements

func (c *GoogleSelectionConverter) FindContentElements(s *goquery.Selection) *goquery.Selection

FindContentElements finds the selections that should be iterated over for content.

func (*GoogleSelectionConverter) FindRootElement

func (c *GoogleSelectionConverter) FindRootElement(doc *goquery.Document) *goquery.Selection

FindRootElement finds the root element.

func (*GoogleSelectionConverter) FindTitle

func (c *GoogleSelectionConverter) FindTitle(doc *goquery.Document) string

FindTitle finds the title of the document.

func (*GoogleSelectionConverter) HandleMatchedSelection

func (c *GoogleSelectionConverter) HandleMatchedSelection(i int, elm *goquery.Selection, mdDoc *markdown.Doc, toMD SelectionToMD)

HandleMatchedSelection handles matched selections from FindContentElements.

type HTMLSelectionConverter

type HTMLSelectionConverter struct {
	Transformer            *Transformer
	RootElementFinder      FindDocumentSelection
	TitleFinder            FindText
	ContentSelector        FindSelection
	ContentSelectorHandler HandleSelection
}

HTMLSelectionConverter converts generic HTML pages to markdown

func NewHTMLSelectionConverter

func NewHTMLSelectionConverter(conf SelectionConverterConfig) *HTMLSelectionConverter

NewHTMLSelectionConverter initializes an HTMLSelectionConverter with default function calls.

func (*HTMLSelectionConverter) FindContentElements

func (c *HTMLSelectionConverter) FindContentElements(s *goquery.Selection) *goquery.Selection

FindContentElements finds the selections that should be iterated over for content.

func (*HTMLSelectionConverter) FindRootElement

func (c *HTMLSelectionConverter) FindRootElement(doc *goquery.Document) *goquery.Selection

FindRootElement finds the root element.

func (*HTMLSelectionConverter) FindTitle

func (c *HTMLSelectionConverter) FindTitle(doc *goquery.Document) string

FindTitle finds the title of the document.

func (*HTMLSelectionConverter) HandleMatchedSelection

func (c *HTMLSelectionConverter) HandleMatchedSelection(i int, elm *goquery.Selection, mdDoc *markdown.Doc, toMD SelectionToMD)

HandleMatchedSelection handles matched selections from FindContentElements.

type HandleSelection

type HandleSelection func(int, *goquery.Selection, *markdown.Doc, SelectionToMD)

HandleSelection is a callable that is given a selection, a markdown document to add to, and a callable to convert child elements to markdown documents

type SelectionCallback

type SelectionCallback = func(i int, s *goquery.Selection)

SelectionCallback is a function that handles a goquery.Selection

type SelectionConverter

type SelectionConverter interface {
	FindRootElement(*goquery.Document) *goquery.Selection
	FindTitle(*goquery.Document) string
	FindContentElements(*goquery.Selection) *goquery.Selection
	HandleMatchedSelection(int, *goquery.Selection, *markdown.Doc, SelectionToMD)
}

SelectionConverter is an interface that converts a style of HTML document to markdown. The interface allows for customization to handle a specific and known HTML structure.

type SelectionConverterConfig

type SelectionConverterConfig struct {
	Transformer            *Transformer
	RootElementFinder      FindDocumentSelection
	TitleFinder            FindText
	ContentSelector        FindSelection
	ContentSelectorHandler HandleSelection
}

SelectionConverterConfig contains parameters that a SelectionConverter can use to customize its behavior.
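The constructors are documented as initializing a converter "with default function calls", which suggests a config whose unset hooks fall back to defaults. The sketch below illustrates that pattern only. All names in it (findText, config, converter, defaultTitle, newConverter) are hypothetical stand-ins, the hooks take a plain HTML string instead of a *goquery.Document so that only the standard library is needed, and the nil-means-default behavior is an assumption about the package rather than something this page confirms.

```go
package main

import (
	"fmt"
	"strings"
)

// findText stands in for FindText; it takes a raw HTML string rather than a
// *goquery.Document so the sketch needs only the standard library.
type findText func(html string) string

// config stands in for SelectionConverterConfig, reduced to a single hook.
type config struct {
	TitleFinder findText
}

// converter stands in for a SelectionConverter implementation.
type converter struct {
	TitleFinder findText
}

// defaultTitle naively returns the text between <title> and </title>.
func defaultTitle(html string) string {
	start := strings.Index(html, "<title>")
	if start < 0 {
		return ""
	}
	start += len("<title>")
	end := strings.Index(html[start:], "</title>")
	if end < 0 {
		return ""
	}
	return strings.TrimSpace(html[start : start+end])
}

// newConverter copies hooks from conf and fills unset ones with defaults.
// That nil fields fall back to defaults is an assumption about the package.
func newConverter(conf config) *converter {
	c := &converter{TitleFinder: conf.TitleFinder}
	if c.TitleFinder == nil {
		c.TitleFinder = defaultTitle
	}
	return c
}

func main() {
	page := "<html><head><title> Release Notes </title></head></html>"

	// No hook supplied: the default title finder is used.
	fmt.Println(newConverter(config{}).TitleFinder(page))

	// Custom hook supplied: it replaces the default.
	custom := config{TitleFinder: func(string) string { return "Fixed title" }}
	fmt.Println(newConverter(custom).TitleFinder(page))
}
```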

type SelectionToMD

type SelectionToMD func(*goquery.Selection, markdown.DocConfig) *markdown.Doc

SelectionToMD is a callable that converts a selection to a markdown document

type Transformer

type Transformer struct {
	Format string
}

Transformer converts HTML DOM elements into markdown elements

func (*Transformer) RemoveScripts

func (t *Transformer) RemoveScripts(elm *goquery.Selection)

RemoveScripts removes any script, style, or link tags from the DOM element.

func (*Transformer) ReplaceAll

func (t *Transformer) ReplaceAll(elm *goquery.Selection)

ReplaceAll runs all the default replacement functions

func (*Transformer) ReplaceAnchor

func (t *Transformer) ReplaceAnchor(i int, s *goquery.Selection)

ReplaceAnchor replaces the DOM element in place with a markdown link.

func (*Transformer) ReplaceAnchors

func (t *Transformer) ReplaceAnchors(elm *goquery.Selection)

ReplaceAnchors finds all child "a" tags and replaces them in place with markdown links.
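For reference, a markdown link has the form [text](href). A minimal sketch of building one from an anchor's text and href follows; markdownLink is a hypothetical helper, and how the package handles empty text or missing href attributes is not documented here.

```go
package main

import "fmt"

// markdownLink is a hypothetical helper that formats an inline markdown link.
func markdownLink(text, href string) string {
	return fmt.Sprintf("[%s](%s)", text, href)
}

func main() {
	fmt.Println(markdownLink("Go", "https://go.dev"))
}
```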

func (*Transformer) ReplaceBold

func (t *Transformer) ReplaceBold(i int, s *goquery.Selection)

ReplaceBold replaces the DOM element in place with the text content wrapped in "**".

func (*Transformer) ReplaceBolds

func (t *Transformer) ReplaceBolds(elm *goquery.Selection)

ReplaceBolds finds all child "strong" tags and replaces them in place with markdown bold.

func (*Transformer) ReplaceImage

func (t *Transformer) ReplaceImage(i int, s *goquery.Selection)

ReplaceImage replaces the DOM element in place with a markdown image link. If the Transformer is rendering for Hugo, the element is replaced with a Hugo figure shortcode instead.

func (*Transformer) ReplaceImages

func (t *Transformer) ReplaceImages(elm *goquery.Selection)

ReplaceImages finds all child "img" tags and replaces them in place with markdown image links.
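The two image output forms might look like the sketch below. The helper imageMarkup is hypothetical, the Format value "hugo" is an assumed spelling that this page does not confirm, and the shortcode parameters the package actually emits may differ.

```go
package main

import "fmt"

// imageMarkup is a hypothetical helper that renders an image either as a
// Hugo figure shortcode or as a plain markdown image link.
func imageMarkup(format, alt, src string) string {
	// "hugo" is an assumed Format value, used here for illustration only.
	if format == "hugo" {
		return fmt.Sprintf(`{{< figure src=%q alt=%q >}}`, src, alt)
	}
	return fmt.Sprintf("![%s](%s)", alt, src)
}

func main() {
	fmt.Println(imageMarkup("", "Logo", "/img/logo.png"))
	fmt.Println(imageMarkup("hugo", "Logo", "/img/logo.png"))
}
```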

func (*Transformer) ReplaceInlineCode

func (t *Transformer) ReplaceInlineCode(i int, s *goquery.Selection)

ReplaceInlineCode replaces the DOM element in place with text content wrapped in "`".

func (*Transformer) ReplaceInlineCodes

func (t *Transformer) ReplaceInlineCodes(elm *goquery.Selection)

ReplaceInlineCodes finds all child "code" tags and replaces them in place with text content wrapped in "`".

func (*Transformer) ReplaceItalic

func (t *Transformer) ReplaceItalic(i int, s *goquery.Selection)

ReplaceItalic replaces the DOM element in place with the text content wrapped in "_".

func (*Transformer) ReplaceItalics

func (t *Transformer) ReplaceItalics(elm *goquery.Selection)

ReplaceItalics finds all child "em" tags and replaces them in place with markdown italics.
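The wrapping described for ReplaceBold, ReplaceInlineCode, and ReplaceItalic can be sketched on plain strings. The helpers below are hypothetical and operate on text only; the package's methods work on goquery selections and replace the element in place.

```go
package main

import "fmt"

// wrap surrounds text with the same marker on both sides.
func wrap(text, marker string) string {
	return marker + text + marker
}

// bold, italic, and inlineCode use the markers named in the documentation.
func bold(text string) string       { return wrap(text, "**") }
func italic(text string) string     { return wrap(text, "_") }
func inlineCode(text string) string { return wrap(text, "`") }

func main() {
	fmt.Println(bold("important"))
	fmt.Println(italic("emphasis"))
	fmt.Println(inlineCode("go build"))
}
```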

func (*Transformer) ToList

func (t *Transformer) ToList(list *goquery.Selection) markdown.List

ToList transforms the "ul" or "ol" DOM element to a markdown List.
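As a rough illustration of the target output, the sketch below renders list items as markdown text. The helper renderList is hypothetical; the fields and rendering of markdown.List are not shown on this page, and nested lists are not handled here.

```go
package main

import (
	"fmt"
	"strings"
)

// renderList is a hypothetical helper that renders items as an ordered
// ("1. item") or unordered ("- item") markdown list, one item per line.
func renderList(items []string, ordered bool) string {
	var b strings.Builder
	for i, item := range items {
		if ordered {
			fmt.Fprintf(&b, "%d. %s\n", i+1, item)
		} else {
			fmt.Fprintf(&b, "- %s\n", item)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderList([]string{"first", "second"}, true))
	fmt.Print(renderList([]string{"first", "second"}, false))
}
```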

func (*Transformer) ToTable

func (t *Transformer) ToTable(table *goquery.Selection) markdown.Table

ToTable transforms the "table" DOM element to a markdown Table.
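A markdown table consists of a header row, a separator row, and data rows. The sketch below shows that layout on plain strings. The helper renderTable is hypothetical; the structure of markdown.Table is not documented on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// renderTable is a hypothetical helper that renders a header and rows as a
// pipe-delimited markdown table.
func renderTable(header []string, rows [][]string) string {
	var b strings.Builder
	writeRow := func(cells []string) {
		b.WriteString("| " + strings.Join(cells, " | ") + " |\n")
	}

	writeRow(header)

	// The separator row needs one "---" cell per header column.
	sep := make([]string, len(header))
	for i := range sep {
		sep[i] = "---"
	}
	writeRow(sep)

	for _, row := range rows {
		writeRow(row)
	}
	return b.String()
}

func main() {
	fmt.Print(renderTable(
		[]string{"Name", "Age"},
		[][]string{{"Ann", "30"}, {"Bob", "25"}},
	))
}
```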

func (*Transformer) Transform

func (t *Transformer) Transform(pattern string, elm *goquery.Selection, callbacks ...SelectionCallback)

Transform finds all elements matching the pattern and calls each given callback on each child element.

func (*Transformer) Transforms

func (t *Transformer) Transforms(i int, s *goquery.Selection, callbacks ...SelectionCallback)

Transforms calls each callback on the given DOM element.
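The relationship between Transform and Transforms can be sketched as a variadic callback pipeline. Everything below is a stand-in: node replaces *goquery.Selection, a slice of nodes replaces the result of matching a pattern, and the function names are hypothetical. The order in which the package applies callbacks is assumed to be the order given.

```go
package main

import (
	"fmt"
	"strings"
)

// node stands in for *goquery.Selection in this sketch.
type node struct{ text string }

// callback mirrors the shape of SelectionCallback with the stand-in type.
type callback = func(i int, n *node)

// applyAll calls each callback on a single node, in the order given.
func applyAll(i int, n *node, callbacks ...callback) {
	for _, cb := range callbacks {
		cb(i, n)
	}
}

// transform runs the full callback chain on every matched node.
func transform(nodes []*node, callbacks ...callback) {
	for i, n := range nodes {
		applyAll(i, n, callbacks...)
	}
}

// trim and bold are example callbacks.
func trim(_ int, n *node) { n.text = strings.TrimSpace(n.text) }
func bold(_ int, n *node) { n.text = "**" + n.text + "**" }

func main() {
	nodes := []*node{{text: " alpha "}, {text: "beta"}}
	transform(nodes, trim, bold)
	for _, n := range nodes {
		fmt.Println(n.text)
	}
}
```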
