readability

v0.0.0-...-a3db0f1
Published: Oct 28, 2016 License: MIT Imports: 12 Imported by: 0

README

goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
	"log"

	readability "github.com/philipjkim/goreadability"
)

func main() {
	// URL to extract contents (title, description, images, ...)
	url := "https://en.wikipedia.org/wiki/Lego"

	// Default option
	opt := readability.NewOption()

	// You can modify some option values if needed.
	opt.ImageRequestTimeout = 3000 // ms

	content, err := readability.Extract(url, opt)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(content.Title)
	log.Println(content.Description)
	log.Println(content.Images)
}

Command Line Tool

TODO

Related Projects

  • ruby-readability is the base of this project.
  • fastimage finds the type and/or size of a remote image given its URI, fetching as little data as needed.

Potential Issues

TODO

License

This code is under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.

Documentation

Types

type Content

type Content struct {
	Title       string
	Description string
	Content     string
	Author      string
	Images      []Image
}

Content contains the primary readable content of a webpage.

func Extract

func Extract(reqURL string, opt *Option) (*Content, error)

Extract sends a request to reqURL and returns the content extracted from the response.

func ExtractFromDocument

func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)

ExtractFromDocument returns a Content when extraction succeeds and an error otherwise. reqURL is required for converting relative image paths to absolute URLs.

If you already have a *goquery.Document from your own HTTP request, use this function. Otherwise, use Extract(reqURL, opt).
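
The reason reqURL is needed can be illustrated with the standard library alone. The sketch below is not goreadability's implementation; absoluteImageURL is a hypothetical helper, and the example.com URLs are made up. It only shows how a relative img src can be resolved against the page URL with net/url.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteImageURL resolves src (as it might appear in an img tag) against
// pageURL. Hypothetical helper for illustration; not part of goreadability.
func absoluteImageURL(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parse page URL: %w", err)
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", fmt.Errorf("parse image src: %w", err)
	}
	// ResolveReference handles root-relative, path-relative,
	// scheme-relative, and already-absolute references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/articles/lego.html"
	sources := []string{
		"/img/a.png",
		"thumbs/b.jpg",
		"//cdn.example.com/c.gif",
		"https://other.example/d.png",
	}
	for _, src := range sources {
		abs, err := absoluteImageURL(page, src)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(src, "=>", abs)
	}
}
```

Without the page URL, a source such as thumbs/b.jpg cannot be requested at all, which is presumably why ExtractFromDocument asks for reqURL even though the document is already parsed.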

type Image

type Image struct {
	URL  string
	Size *fastimage.ImageSize
}

Image contains the URL and Size (width and height in pixels) of an image.

func (Image) String

func (i Image) String() string
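
A method with this signature satisfies fmt.Stringer, so the fmt and log packages use it when printing a value. The sketch below uses hypothetical local types (image, size) rather than the package's own, and the output format is an assumption; the format that goreadability's String method actually produces is not documented here.

```go
package main

import "fmt"

// size and image are hypothetical stand-ins for fastimage.ImageSize and
// readability.Image, declared locally so the sketch is self-contained.
type size struct {
	Width, Height uint32
}

type image struct {
	URL  string
	Size *size
}

// String makes image satisfy fmt.Stringer. The format is illustrative only.
func (i image) String() string {
	if i.Size == nil {
		// Size may be unknown, for example when size checking is skipped.
		return i.URL
	}
	return fmt.Sprintf("%s (%dx%d)", i.URL, i.Size.Width, i.Size.Height)
}

func main() {
	imgs := []image{
		{URL: "https://example.com/a.png", Size: &size{Width: 640, Height: 480}},
		{URL: "https://example.com/b.png"},
	}
	// fmt calls String on each element when printing the slice.
	fmt.Println(imgs)
}
```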

type Option

type Option struct {
	// RetryLength is the minimum length of a page description.
	// If the extracted description is shorter than this value,
	// extraction is retried with more liberal rules.
	RetryLength int

	// MinTextLength is the minimum length of a tag's inner text.
	// If a tag's inner text is shorter than MinTextLength,
	// the text is discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates controls whether tags considered
	// relatively unimportant are removed.
	RemoveUnlikelyCandidates bool

	// WeightClasses controls whether tags are given more or less weight
	// when their id/class values contain positive or negative words.
	WeightClasses bool

	// CleanConditionally controls whether tags are removed
	// according to the rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes controls whether tags with empty inner text are removed.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (pixel) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (pixel) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageSize controls whether image sizes are checked.
	CheckImageSize bool

	// CheckImageLoopCount is the number of images whose sizes are fetched with parallel requests.
	// For example, if this value is set to 10,
	// the first 10 image sources found in img tags are requested.
	CheckImageLoopCount uint

	// ImageRequestTimeout is the timeout (ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is a list of strings used to ignore images.
	// If an image URL contains at least one of these strings, the image is ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText controls whether all tags are stripped from the description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is the timeout (ms) for extracting a page's description.
	DescriptionExtractionTimeout uint
}

Option contains a variety of options for extracting page content and images.
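
To make the image-related fields concrete, here is a minimal sketch of how MinImageWidth, MinImageHeight, MaxImageCount, and IgnoreImageFormat could interact, based only on the field comments above. It is not the library's code: candidate, filterOptions, ignored, and selectImages are hypothetical names, the order in which the checks run is an assumption, and the image sizes are supplied by hand instead of being fetched.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a hypothetical stand-in for an image found on a page.
type candidate struct {
	URL           string
	Width, Height uint32
}

// filterOptions mirrors four Option fields by name only.
type filterOptions struct {
	MinImageWidth     uint32
	MinImageHeight    uint32
	MaxImageCount     int
	IgnoreImageFormat []string
}

// ignored reports whether u contains at least one of the patterns.
func ignored(u string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(u, p) {
			return true
		}
	}
	return false
}

// selectImages keeps candidates in page order, dropping ignored or
// too-small ones, and stops once MaxImageCount images are chosen.
func selectImages(cands []candidate, opt filterOptions) []candidate {
	var out []candidate
	for _, c := range cands {
		if len(out) >= opt.MaxImageCount {
			break
		}
		if ignored(c.URL, opt.IgnoreImageFormat) {
			continue
		}
		if c.Width < opt.MinImageWidth || c.Height < opt.MinImageHeight {
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	opt := filterOptions{
		MinImageWidth:     200,
		MinImageHeight:    100,
		MaxImageCount:     2,
		IgnoreImageFormat: []string{"sprite", ".gif"},
	}
	cands := []candidate{
		{"https://example.com/sprite.png", 400, 400},
		{"https://example.com/hero.jpg", 800, 600},
		{"https://example.com/icon.png", 32, 32},
		{"https://example.com/anim.gif", 500, 500},
		{"https://example.com/chart.png", 640, 480},
		{"https://example.com/extra.jpg", 640, 480},
	}
	for _, c := range selectImages(cands, opt) {
		fmt.Println(c.URL)
	}
}
```

With these made-up inputs, the sprite and GIF URLs match an ignore string, the icon is below the minimum size, and the last candidate is cut off by MaxImageCount, leaving hero.jpg and chart.png.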

func NewOption

func NewOption() *Option

NewOption returns the default option.
