readability

Version: v0.0.0-...-0f3b4a1 | Published: Apr 22, 2019 | License: MIT | Imports: 15 | Imported by: 1

Repository

github.com/philipjkim/goreadability

README ¶

goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

From v2.0, goreadability uses OpenGraph tag values when they exist. You can disable the OpenGraph lookup and follow the traditional readability rules by setting Option.LookupOpenGraphTags to false.
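The preference order can be pictured with a small standalone sketch. It does not call goreadability; the `pick` helper and its parameters are hypothetical names, used only to illustrate the rule "take the OpenGraph value when lookup is enabled and the tag has a value, otherwise fall back to what the traditional rules found".

```go
package main

import "fmt"

// pick returns ogValue when OpenGraph lookup is enabled and the tag had a
// value; otherwise it returns the value found by the traditional rules.
// Hypothetical helper for illustration, not part of goreadability.
func pick(lookupOpenGraph bool, ogValue, extracted string) string {
	if lookupOpenGraph && ogValue != "" {
		return ogValue
	}
	return extracted
}

func main() {
	fmt.Println(pick(true, "OG title", "Extracted title"))  // OG title
	fmt.Println(pick(true, "", "Extracted title"))          // Extracted title
	fmt.Println(pick(false, "OG title", "Extracted title")) // Extracted title
}
```

How the library merges the two sources internally is not documented here, so treat this only as a mental model of the flag.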

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
	"log"

	readability "github.com/philipjkim/goreadability"
)

func main() {
	// URL to extract contents from (title, description, images, ...)
	url := "https://en.wikipedia.org/wiki/Lego"

	// Start from the default options.
	opt := readability.NewOption()

	// Modify option values if needed.
	opt.ImageRequestTimeout = 3000 // ms

	content, err := readability.Extract(url, opt)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(content.Title)
	log.Println(content.Description)
	log.Println(content.Images)
}
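
The timeout options are plain integers in milliseconds rather than time.Duration values. As a hedged illustration of the unit, the sketch below converts such a value into a time.Duration using only the standard library; `msToDuration` is a hypothetical helper, not a function of this package.

```go
package main

import (
	"fmt"
	"time"
)

// msToDuration converts a millisecond count, as stored in fields such as
// ImageRequestTimeout, into a time.Duration.
// Hypothetical helper for illustration, not part of goreadability.
func msToDuration(ms uint) time.Duration {
	return time.Duration(ms) * time.Millisecond
}

func main() {
	fmt.Println(msToDuration(3000)) // 3s
}
```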

Testing

go test

# or if you want to see verbose logs:
DEBUG=true go test -v

Command Line Tool

TODO

Related Projects

- ruby-readability is the base of this project.
- fastimage finds the type and/or size of a remote image given its URI, by fetching as little as needed.

Potential Issues

TODO

License

MIT

Documentation ¶

Index ¶

- func Debug()
- type Content
  - func Extract(reqURL string, opt *Option) (*Content, error)
  - func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)
- type Image
  - func (i Image) String() string
- type OpenGraph
  - func (og OpenGraph) IsEmpty() bool
  - func (og *OpenGraph) Set(key string, val string, urlStr string) error
- type Option
  - func NewOption() *Option

Functions ¶

func Debug ¶

func Debug()

Debug enables debug logging of the operations performed by the library. Once called, a large amount of information is printed to stdout.

Types ¶

type Content ¶

type Content struct {
	Title       string
	Description string
	Author      string
	Images      []Image
}

Content contains the primary readable content of a webpage.

func Extract ¶

func Extract(reqURL string, opt *Option) (*Content, error)

Extract sends a request to reqURL and returns the content extracted from the response.

func ExtractFromDocument ¶

func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)

ExtractFromDocument returns a Content when extraction succeeds and an error otherwise. reqURL is required for converting relative image paths to absolute URLs.

If you have already fetched the page and hold a *goquery.Document, use this function; otherwise use Extract(reqURL, opt).
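
To show why reqURL matters, here is a minimal sketch of resolving relative image paths against a page URL with the standard net/url package. It illustrates the general technique only; `absoluteURL` and the example.com URLs are hypothetical, and the library's own resolution code may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL resolves src (for example an img src attribute) against the
// URL of the page it was found on.
// Hypothetical helper for illustration, not part of goreadability.
func absoluteURL(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/wiki/Lego"
	for _, src := range []string{"/static/logo.png", "images/a.jpg", "https://cdn.example.com/b.png"} {
		abs, err := absoluteURL(page, src)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Without the page URL, a path such as /static/logo.png cannot be turned into something fetchable, which is why the function needs reqURL even though the document is already parsed.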

type Image ¶

type Image struct {
	URL  string
	Size *fastimage.ImageSize
}

Image contains the URL and Size (width and height in pixels) of an image.

func (Image) String ¶

func (i Image) String() string

type OpenGraph ¶

type OpenGraph struct {
	Title       string `json:"og:title,omitempty"`
	Description string `json:"og:description,omitempty"`
	ImageURL    string `json:"og:image,omitempty"`
}

OpenGraph contains opengraph meta values.

func (OpenGraph) IsEmpty ¶

func (og OpenGraph) IsEmpty() bool

IsEmpty returns true if all fields of og are empty.

func (*OpenGraph) Set ¶

func (og *OpenGraph) Set(key string, val string, urlStr string) error

Set assigns val to the field of og that corresponds to key.
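
The following standalone sketch mirrors the documented fields to show how IsEmpty and a key-based setter could behave. It is not the library's code: the `openGraph` type and its methods are stand-ins, the key strings are assumed to match the json tags shown above, and the urlStr parameter of the real Set is left out because its role is not documented here.

```go
package main

import "fmt"

// openGraph mirrors the documented OpenGraph fields for this sketch only.
type openGraph struct {
	Title       string
	Description string
	ImageURL    string
}

// isEmpty reports whether every field is the empty string.
func (og openGraph) isEmpty() bool {
	return og.Title == "" && og.Description == "" && og.ImageURL == ""
}

// set stores val in the field matching key and returns an error for keys
// the sketch does not know. The key names are an assumption taken from
// the json tags of the documented struct.
func (og *openGraph) set(key, val string) error {
	switch key {
	case "og:title":
		og.Title = val
	case "og:description":
		og.Description = val
	case "og:image":
		og.ImageURL = val
	default:
		return fmt.Errorf("unsupported key %q", key)
	}
	return nil
}

func main() {
	var og openGraph
	fmt.Println(og.isEmpty()) // true
	if err := og.set("og:title", "Lego"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(og.isEmpty(), og.Title) // false Lego
}
```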

type Option ¶

type Option struct {
	// RetryLength is minimum length for a page description.
	// It will retry to extract page description with more liberal rule
	// if extracted description length is less than this value.
	RetryLength int

	// MinTextLength is minimum length of an inner text for a tag.
	// If a tag has short inner text (length is less than MinTextLength),
	// the text will be discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates is a flag whether to remove some tags
	// if they are considered relatively unimportant.
	RemoveUnlikelyCandidates bool

	// WeightClasses is a flag whether to give more/less weight to some tags
	// if they contain some positive/negative words in id/class value.
	WeightClasses bool

	// CleanConditionally is a flag whether to remove some tags
	// using various rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes is a flag whether to remove some tags which have empty inner text.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (pixel) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (pixel) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageLoopCount is the number of images
	// for parallel requests to fetch the image size.
	// For example, if this value is set to 10,
	// the first 10 img src URLs without width/height attributes
	// will be requested over network.
	// (img tags with both width/height attributes (pixels in int) are not counted,
	// since they are not requested over network to get image size.)
	CheckImageLoopCount uint

	// ImageRequestTimeout is timeout(ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is a list of substrings for ignoring some images.
	// If an image URL contains at least one of these strings, the image will be ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText is a flag whether to strip all tags in a description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is timeout(ms) for extracting description for a page.
	DescriptionExtractionTimeout uint

	// LookupOpenGraphTags is a flag whether to use opengraph tag values for title, description and image if they exist.
	LookupOpenGraphTags bool
}

Option contains a variety of options for extracting page content and images.
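
As a rough illustration of how the image options interact, the sketch below applies a minimum size, a list of ignored substrings, and a maximum count to a list of images. It is written under stated assumptions: the `image` type, `selectImages`, and `containsAny` are hypothetical stand-ins, and the order in which the library applies these checks is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// image is a simplified stand-in for the documented Image type.
type image struct {
	URL           string
	Width, Height uint32
}

// selectImages keeps images that are at least minW x minH pixels and whose
// URL contains none of the ignore substrings, stopping after maxCount images.
// Hypothetical helper for illustration, not part of goreadability.
func selectImages(imgs []image, minW, minH uint32, maxCount int, ignore []string) []image {
	var out []image
	for _, img := range imgs {
		if len(out) >= maxCount {
			break
		}
		if img.Width < minW || img.Height < minH || containsAny(img.URL, ignore) {
			continue
		}
		out = append(out, img)
	}
	return out
}

// containsAny reports whether s contains at least one of subs.
func containsAny(s string, subs []string) bool {
	for _, sub := range subs {
		if strings.Contains(s, sub) {
			return true
		}
	}
	return false
}

func main() {
	imgs := []image{
		{"https://example.com/hero.jpg", 800, 600},
		{"https://example.com/pixel.gif", 1, 1},
		{"https://example.com/sprite.png", 400, 400},
		{"https://example.com/photo.jpg", 640, 480},
	}
	for _, img := range selectImages(imgs, 200, 100, 2, []string{"sprite"}) {
		fmt.Println(img.URL)
	}
}
```

With these inputs the tracking pixel fails the size check and the sprite matches an ignored substring, so only hero.jpg and photo.jpg are printed.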

func NewOption ¶

func NewOption() *Option

NewOption returns the default option.
