readability

v1.0.0
Published: Jul 2, 2019 License: Apache-2.0 Imports: 10 Imported by: 0

README

Readability

Readability is a library written in Go (golang) to parse, analyze and convert HTML pages into readable content. Originally an Arc90 Experiment, it is now incorporated into Safari’s Reader View.

Despite the ubiquity of reading on the web, readers remain a neglected audience. Much of our talk about web design revolves around a sense of movement: users are thought to be finding, searching, skimming, looking. We measure how frequently they click but not how long they stay on the page. We concern ourselves with their travel and participation–how they move from page to page, who they talk to when they get there–but forget the needs of those whose purpose is to be still. Readers flourish when they have space–some distance from the hubbub of the crowds–and as web designers, there is yet much we can do to help them carve out that space.

In Defense Of Readers, by Mandy Brown

Evolution of Readability Web Engines

| Product | Year | Shutdown |
|---|---|---|
| Instapaper | 2008 | N/A |
| Arc90 Readability | 2009 | Sep 30, 2016 |
| Apple Readability | 2010 | N/A |
| Microsoft Reading View | 2014 | N/A |
| Mozilla Readability | 2015 | N/A |
| Mercury Reader | 2016 | Apr 15, 2019 |

Reader Mode Parser Diversity

All modern web browsers, except for Google Chrome, include an option to parse, analyze, and extract the main content from web pages to provide what is commonly known as “Reading Mode”. Reading Mode is a separate web rendering mode that strips out repeated and irrelevant content. This allows the web browser to extract the main content and display it cleanly and consistently to the user.

| Vendor | Product | Parser | Environments |
|---|---|---|---|
| Mozilla | Firefox | Mozilla Readability | Desktop and Android |
| GNOME | Web | Mozilla Readability | Desktop |
| Vivaldi | Vivaldi | Mozilla Readability | Desktop |
| Yandex | Browser | Mozilla Readability | Desktop |
| Samsung | Browser | Mozilla Readability | Android |
| Apple | Safari | Safari Reader | macOS and iOS |
| Maxthon | Maxthon | Maxthon Reader | Desktop |
| Microsoft | Edge | EdgeHTML | Windows and Windows Mobile |
| Microsoft | Edge Mobile | Chrome DOM Distiller | Android |
| Google | Chrome | Chrome DOM Distiller | Android |
| Postlight | Mercury Reader | Web Reader | Web / browser extension |
| Instant Paper | Instapaper | Instaparser | Web / browser extension |
| Mozilla | Pocket | Unknown | Web / browser extension |

Ref: https://web.archive.org/web/20150817073201/http://lab.arc90.com/2009/03/02/readability/

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Article

type Article struct {
	// Title is the heading that precedes the article’s content, and the basis
	// for the article’s page name and URL. It indicates what the article is
	// about, and distinguishes it from other articles. The title may simply
	// be the name of the subject of the article, or it may be a description
	// of the topic.
	Title string

	// Byline is a printed line of text accompanying a news story, article, or
	// the like, giving the author’s name.
	Byline string

	// Dir is the direction of the text in the article.
	//
	// Either Left-to-Right (LTR) or Right-to-Left (RTL).
	Dir string

	// Content is the relevant text in the article with HTML tags.
	Content string

	// TextContent is the relevant text in the article without HTML tags.
	TextContent string

	// Excerpt is the summary for the relevant text in the article.
	Excerpt string

	// SiteName is the name of the original publisher website.
	SiteName string

	// Favicon (short for favorite icon) is a file containing one or more small
	// icons, associated with a particular website or web page. A web designer
	// can create such an icon and upload it to a website (or web page) by
	// several means, and graphical web browsers will then make use of it.
	Favicon string

	// Image is an image URL which represents the article’s content.
	Image string

	// Length is the number of characters in the article.
	Length int

	// Node is the first element in the HTML document.
	Node *html.Node
}

Article represents the metadata and content of the article.

type Readability

type Readability struct {

	// MaxElemsToParse is the optional maximum number of HTML nodes to parse
	// from the document. If the number of elements in the document is higher
	// than this number, the operation immediately errors.
	MaxElemsToParse int

	// NTopCandidates is the number of top candidates to consider when the
	// parser is analysing how tight the competition is among candidates.
	NTopCandidates int

	// CharThresholds is the default number of chars an article must have in
	// order to return a result.
	CharThresholds int

	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string

	// TagsToScore are the element tags to score by default.
	TagsToScore []string
	// contains filtered or unexported fields
}

Readability is an HTML parser that reads and extracts relevant content.

func New

func New() *Readability

New returns a new Readability with sane defaults to parse simple documents.

func (*Readability) IsReadable

func (r *Readability) IsReadable(input io.Reader) bool

IsReadable decides whether the document is usable or not without parsing the whole thing. In the original `mozilla/readability` library, this method is located in `Readability-readable.js`.
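Because IsReadable takes an io.Reader, a stream handed to it can be expected to be consumed by the check, so a caller that wants to parse the same page afterwards needs a second reader over the same bytes. The following is a minimal sketch of that buffering pattern, not the library's own code. The `readableChecker` interface only mirrors the documented signature; `stubChecker`, `checkBuffered`, and `checkString` are hypothetical names, and the stub's byte-count rule is a placeholder rather than the real heuristic.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// readableChecker mirrors the documented IsReadable signature so this sketch
// compiles without importing the real package.
type readableChecker interface {
	IsReadable(input io.Reader) bool
}

// stubChecker is a hypothetical stand-in, NOT the library's heuristic: it
// reports a document as readable when it holds at least minBytes bytes.
type stubChecker struct{ minBytes int }

func (s stubChecker) IsReadable(input io.Reader) bool {
	n, _ := io.Copy(io.Discard, input) // drains the reader, as a parser would
	return int(n) >= s.minBytes
}

// checkBuffered reads src fully, runs the check on a fresh reader, and returns
// the buffered bytes so the caller can build another reader for parsing.
func checkBuffered(c readableChecker, src io.Reader) (bool, []byte, error) {
	data, err := io.ReadAll(src)
	if err != nil {
		return false, nil, err
	}
	return c.IsReadable(bytes.NewReader(data)), data, nil
}

// checkString is a convenience wrapper for documents already held in memory.
func checkString(c readableChecker, doc string) (bool, []byte, error) {
	return checkBuffered(c, strings.NewReader(doc))
}

func main() {
	doc := "<html><body><p>Hello, reader.</p></body></html>"
	ok, data, err := checkString(stubChecker{minBytes: 10}, doc)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	// data is still intact here and can back a new reader for a later parse.
	fmt.Println("readable:", ok, "buffered bytes:", len(data))
}
```

With the real package, the value returned by New would presumably take the place of `stubChecker`, since `*Readability` has an IsReadable method with this signature.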

func (*Readability) Parse

func (r *Readability) Parse(input io.Reader, pageURL string) (Article, error)

Parse parses input and finds the main readable content.
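Parse returns both an Article and an error, so callers should check the error before touching any field. The sketch below shows one way to wrap the call and attach the page URL to a failure. It is an illustration under stated assumptions: `Article` here copies only three of the documented fields, `articleParser` mirrors the documented signature, and `stubParser` and `extract` are hypothetical names. The stub performs no HTML analysis at all.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// Article mirrors a subset of the documented Article fields; the real struct
// has more (Byline, Dir, Content, Excerpt, Node, and others).
type Article struct {
	Title       string
	TextContent string
	Length      int
}

// articleParser mirrors the documented Parse signature.
type articleParser interface {
	Parse(input io.Reader, pageURL string) (Article, error)
}

// stubParser is a hypothetical stand-in used only to make the sketch runnable.
// It fails on empty input and otherwise echoes the raw input as TextContent.
type stubParser struct{}

func (stubParser) Parse(input io.Reader, pageURL string) (Article, error) {
	raw, err := io.ReadAll(input)
	if err != nil {
		return Article{}, err
	}
	if len(raw) == 0 {
		return Article{}, errors.New("empty document")
	}
	text := string(raw)
	// Length counts characters (runes), matching the documented field meaning.
	return Article{Title: pageURL, TextContent: text, Length: len([]rune(text))}, nil
}

// extract parses one buffered page and adds the page URL to any error.
func extract(p articleParser, page []byte, pageURL string) (Article, error) {
	art, err := p.Parse(bytes.NewReader(page), pageURL)
	if err != nil {
		return Article{}, fmt.Errorf("parse %s: %w", pageURL, err)
	}
	return art, nil
}

func main() {
	art, err := extract(stubParser{}, []byte("<p>Some article text.</p>"), "https://example.com/post")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q (%d chars)\n", art.Title, art.Length)

	// A failing parse surfaces as a wrapped error instead of a zero Article.
	if _, err := extract(stubParser{}, nil, "https://example.com/empty"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Wrapping with `%w` keeps the underlying error available to `errors.Is` and `errors.As`, which helps when a caller needs to tell an oversized document (see MaxElemsToParse) apart from an I/O failure.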
