readability

Version: v0.0.0-...-0f3b4a1
Published: Apr 22, 2019 License: MIT Imports: 15 Imported by: 1

README

goreadability


goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

As of v2.0, goreadability uses Open Graph tag values when they exist. To disable the Open Graph lookup and follow the traditional readability rules, set Option.LookupOpenGraphTags to false.
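
The effect of this flag can be pictured with a minimal sketch. It assumes a hypothetical helper, pickValue, that is not part of goreadability; the library's actual selection logic may differ in detail.

```go
package main

import "fmt"

// pickValue is a hypothetical helper (not part of goreadability). It mirrors
// the documented behavior: prefer the Open Graph value when lookup is enabled
// and the tag exists, otherwise fall back to the value found by the
// traditional readability rules.
func pickValue(lookupOpenGraph bool, ogValue, extracted string) string {
	if lookupOpenGraph && ogValue != "" {
		return ogValue
	}
	return extracted
}

func main() {
	// Lookup enabled and the tag exists: the Open Graph value is chosen.
	fmt.Println(pickValue(true, "Lego (from og:title)", "Lego - Wikipedia"))
	// Lookup disabled: the traditionally extracted value is chosen.
	fmt.Println(pickValue(false, "Lego (from og:title)", "Lego - Wikipedia"))
	// Lookup enabled but the tag is missing: fall back to extraction.
	fmt.Println(pickValue(true, "", "Lego - Wikipedia"))
}
```

With the real library, the corresponding switch is the LookupOpenGraphTags field on the value returned by NewOption.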

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
	"log"

	readability "github.com/philipjkim/goreadability"
)

func main() {
	// URL of the page to extract content from (title, description, images, ...)
	url := "https://en.wikipedia.org/wiki/Lego"

	// Start from the default options.
	opt := readability.NewOption()

	// Modify option values if needed.
	opt.ImageRequestTimeout = 3000 // ms

	content, err := readability.Extract(url, opt)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(content.Title)
	log.Println(content.Description)
	log.Println(content.Images)
}

Testing

go test

# or if you want to see verbose logs:
DEBUG=true go test -v

Command Line Tool

TODO

Related Projects

  • ruby-readability is the base of this project.
  • fastimage finds the type and/or size of a remote image given its URI, fetching as little data as needed.

Potential Issues

TODO

License

MIT

Documentation

Functions

func Debug

func Debug()

Debug enables debug logging of the operations performed by the library. Once it is called, a large amount of information is printed to stdout.

Types

type Content

type Content struct {
	Title       string
	Description string
	Author      string
	Images      []Image
}

Content contains the primary readable content of a webpage.

func Extract

func Extract(reqURL string, opt *Option) (*Content, error)

Extract sends a request to reqURL and returns the content extracted from the response.

func ExtractFromDocument

func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)

ExtractFromDocument returns a Content when extraction succeeds, otherwise an error. reqURL is required to convert relative image paths to absolute URLs.

If you already have a *goquery.Document from an earlier HTTP request, use this function; otherwise use Extract(reqURL, opt).
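
To illustrate why reqURL is needed, here is a minimal sketch of resolving an img src value against the page URL using only the standard library. The helper resolveImageURL is hypothetical and is not part of goreadability; the library may resolve paths differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImageURL is a hypothetical helper (not part of goreadability) that
// turns an img src value into an absolute URL, using the page URL as the base.
func resolveImageURL(reqURL, src string) (string, error) {
	base, err := url.Parse(reqURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://en.wikipedia.org/wiki/Lego"
	sources := []string{
		"/static/images/logo.png",          // root-relative path
		"images/brick.png",                 // path relative to the page
		"//upload.wikimedia.org/brick.png", // scheme-relative URL
		"https://example.com/a.png",        // already absolute
	}
	for _, src := range sources {
		abs, err := resolveImageURL(page, src)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(src, "=>", abs)
	}
}
```

Without the page URL, the first three src values above could not be fetched, which is why the function asks for reqURL even though the document is already parsed.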

type Image

type Image struct {
	URL  string
	Size *fastimage.ImageSize
}

Image contains the URL and Size (width and height in pixels) of an image.

func (Image) String

func (i Image) String() string

type OpenGraph

type OpenGraph struct {
	Title       string `json:"og:title,omitempty"`
	Description string `json:"og:description,omitempty"`
	ImageURL    string `json:"og:image,omitempty"`
}

OpenGraph contains opengraph meta values.

func (OpenGraph) IsEmpty

func (og OpenGraph) IsEmpty() bool

IsEmpty returns true if all fields of og are empty.

func (*OpenGraph) Set

func (og *OpenGraph) Set(key string, val string, urlStr string) error

Set assigns val to the field that corresponds to key.

type Option

type Option struct {
	// RetryLength is minimum length for a page description.
	// It will retry to extract page description with more liberal rule
	// if extracted description length is less than this value.
	RetryLength int

	// MinTextLength is minimum length of an inner text for a tag.
	// If a tag has short inner text (length is less than MinTextLength),
	// the text will be discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates is a flag whether to remove some tags
	// if they are considered relatively unimportant.
	RemoveUnlikelyCandidates bool

	// WeightClasses is a flag whether to give more/less weight to some tags
	// if they contain some positive/negative words in id/class value.
	WeightClasses bool

	// CleanConditionally is a flag whether to remove some tags
	// using various rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes is a flag whether to remove some tags which have empty inner text.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (pixel) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (pixel) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageLoopCount is the number of images
	// whose sizes are fetched with parallel requests.
	// For example, if this value is set to 10,
	// the first 10 img src URLs without width/height attributes
	// will be requested over the network.
	// (img tags with both width and height attributes (integer pixels) are not counted,
	// since their sizes are known without a network request.)
	CheckImageLoopCount uint

	// ImageRequestTimeout is timeout(ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is an array of strings used to ignore some images.
	// If an image URL contains at least one of the strings in this array, the image will be ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText is a flag whether to strip all tags in a description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is timeout(ms) for extracting description for a page.
	DescriptionExtractionTimeout uint

	// LookupOpenGraphTags is a flag whether to use Open Graph tag values for the title, description, and image if they exist.
	LookupOpenGraphTags bool
}

Option contains a variety of options for extracting page content and images.
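
As a rough illustration of how the image-related fields could interact, the sketch below filters candidate images by URL substring and minimum size. The imageFilter type and its keep method are hypothetical stand-ins, the threshold values are made up rather than library defaults, and treating the minimums as inclusive is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// imageFilter mirrors three documented Option fields. It is a local stand-in,
// not the library type.
type imageFilter struct {
	MinImageWidth     uint32
	MinImageHeight    uint32
	IgnoreImageFormat []string
}

// keep is a hypothetical helper: it rejects an image when its URL contains
// any ignored substring, or when it is smaller than the configured minimums
// (assumed here to be inclusive bounds).
func (f imageFilter) keep(imgURL string, width, height uint32) bool {
	for _, s := range f.IgnoreImageFormat {
		if strings.Contains(imgURL, s) {
			return false
		}
	}
	return width >= f.MinImageWidth && height >= f.MinImageHeight
}

func main() {
	// Illustrative values only; these are not the defaults from NewOption.
	f := imageFilter{
		MinImageWidth:     200,
		MinImageHeight:    100,
		IgnoreImageFormat: []string{"data:image", ".svg"},
	}

	fmt.Println(f.keep("https://example.com/photo.jpg", 640, 480)) // large enough, not ignored
	fmt.Println(f.keep("https://example.com/icon.svg", 640, 480))  // URL contains ".svg"
	fmt.Println(f.keep("https://example.com/thumb.jpg", 64, 64))   // below both minimums
}
```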

func NewOption

func NewOption() *Option

NewOption returns the default option.
