readability

package module
v1.0.1
Warning

This package is not in the latest version of its module.

Published: Aug 29, 2022 License: Apache-2.0 Imports: 10 Imported by: 0

README

Readability

Readability is a library written in Go (golang) to parse, analyze, and convert HTML pages into readable content.

Code is derived from this project

Some errors have been fixed and several improvements added.

Documentation

Types

type Article

type Article struct {
	// Title is the heading that precedes the article’s content, and the basis
	// for the article’s page name and URL. It indicates what the article is
	// about, and distinguishes it from other articles. The title may simply
	// be the name of the subject of the article, or it may be a description
	// of the topic.
	Title string

	// Byline is a printed line of text accompanying a news story, article, or
	// the like, giving the author’s name.
	Byline string

	// Dir is the direction of the text in the article.
	//
	// Either Left-to-Right (LTR) or Right-to-Left (RTL).
	Dir string

	// Content is the relevant text in the article with HTML tags.
	Content string

	// TextContent is the relevant text in the article without HTML tags.
	TextContent string

	// Excerpt is the summary for the relevant text in the article.
	Excerpt string

	// SiteName is the name of the original publisher website.
	SiteName string

	// Favicon (short for favorite icon) is a file containing one or more small
	// icons, associated with a particular website or web page. A web designer
	// can create such an icon and upload it to a website (or web page) by
	// several means, and graphical web browsers will then make use of it.
	Favicon string

	// Image is an image URL which represents the article’s content.
	Image string

	// Length is the number of characters in the article.
	Length int

	// Node is the first element in the HTML document.
	Node *html.Node
}

Article represents the metadata and content of the article.
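
As a rough illustration of how the metadata fields might be consumed, the sketch below builds a short plain-text header from an article. The `article` type is a hypothetical stand-in that copies a few of the documented field names so the example compiles without this package, and `header` is an illustrative helper, not part of the library. It also assumes that a field is left as an empty string when the page does not supply that piece of metadata, which the documentation does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// article is a hypothetical stand-in that copies a few documented
// Article field names; it is not the package's own type.
type article struct {
	Title    string
	Byline   string
	SiteName string
	Excerpt  string
}

// header builds a short plain-text header and skips empty fields.
// Assumption: missing metadata shows up as an empty string.
func header(a article) string {
	var b strings.Builder
	if a.Title != "" {
		b.WriteString(a.Title)
		b.WriteString("\n")
	}
	// Collect the author and publisher on one credit line.
	var credit []string
	if a.Byline != "" {
		credit = append(credit, a.Byline)
	}
	if a.SiteName != "" {
		credit = append(credit, a.SiteName)
	}
	if len(credit) > 0 {
		b.WriteString(strings.Join(credit, ", "))
		b.WriteString("\n")
	}
	if a.Excerpt != "" {
		b.WriteString(a.Excerpt)
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	fmt.Print(header(article{Title: "Go Readability", SiteName: "Example Blog"}))
}
```

With the real type, the same helper would take an Article value as returned by Parse.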

type Readability

type Readability struct {

	// MaxElemsToParse is the optional maximum number of HTML nodes to parse
	// from the document. If the number of elements in the document is higher
	// than this number, the operation immediately errors.
	MaxElemsToParse int

	// NTopCandidates is the number of top candidates to consider when the
	// parser is analysing how tight the competition is among candidates.
	NTopCandidates int

	// CharThresholds is the default number of chars an article must have in
	// order to return a result.
	CharThresholds int

	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string

	// TagsToScore lists the element tags to score by default.
	TagsToScore []string
	// contains filtered or unexported fields
}

Readability is an HTML parser that reads and extracts relevant content.

func New

func New() *Readability

New returns a new Readability with sane defaults for parsing simple documents.

func (*Readability) IsReadable

func (r *Readability) IsReadable(input io.Reader) bool

IsReadable decides whether the document is usable or not without parsing the whole thing. In the original `mozilla/readability` library, this method is located in `Readability-readable.js`.

func (*Readability) Parse

func (r *Readability) Parse(input io.Reader, pageURL string) (Article, error)

Parse parses input and finds the main readable content.
