goreadability: github.com/philipjkim/goreadability

package readability

import "github.com/philipjkim/goreadability"

Code:

// URL to extract contents (title, description, images, ...)
url := "https://en.wikipedia.org/wiki/Lego"

// Default option
opt := readability.NewOption()

// You can modify some option values if needed.
opt.ImageRequestTimeout = 3000 // ms

content, err := readability.Extract(url, opt)
if err != nil {
    log.Fatal(err)
}

log.Println(content.Title)
log.Println(content.Description)
log.Println(content.Images)

Package Files

logger.go opengraph.go readability.go

func Debug

func Debug()

Debug enables debug logging of the operations performed by the library. Once called, a large amount of information is printed to stdout.

type Content

type Content struct {
    Title       string
    Description string
    Author      string
    Images      []Image
}

Content contains the primary readable content of a webpage.

func Extract

func Extract(reqURL string, opt *Option) (*Content, error)

Extract sends a request to reqURL and returns the content extracted from the response.

func ExtractFromDocument

func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)

ExtractFromDocument returns a Content when extraction succeeds and an error otherwise. reqURL is required to convert relative image paths to absolute URLs.

If you already have a *goquery.Document from an earlier HTTP request, use this function; otherwise use Extract(reqURL, opt).
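
To illustrate why ExtractFromDocument needs reqURL, here is a minimal sketch of resolving an img src value against the page URL using only the standard library. resolveImageURL is a hypothetical helper written for this example, not a function of goreadability, and the library's own resolution logic may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImageURL is a hypothetical helper (not part of goreadability).
// It resolves src, the value of an img src attribute, against reqURL.
func resolveImageURL(reqURL, src string) (string, error) {
	base, err := url.Parse(reqURL)
	if err != nil {
		return "", fmt.Errorf("parse request URL: %w", err)
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", fmt.Errorf("parse image src: %w", err)
	}
	// ResolveReference returns absolute references unchanged and
	// resolves relative ones against base.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	reqURL := "https://en.wikipedia.org/wiki/Lego"
	srcs := []string{
		"/static/images/logo.png",         // root-relative path
		"thumb.jpg",                       // path relative to /wiki/
		"//upload.wikimedia.org/lego.jpg", // scheme-relative URL
		"https://example.com/a.png",       // already absolute
	}
	for _, src := range srcs {
		abs, err := resolveImageURL(reqURL, src)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Without the page URL, the first three src values above could not be turned into URLs that can be requested.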

type Image

type Image struct {
    URL  string
    Size *fastimage.ImageSize
}

Image contains the URL and Size (width and height in pixels) of an image.

func (Image) String

func (i Image) String() string

type OpenGraph

type OpenGraph struct {
    Title       string `json:"og:title,omitempty"`
    Description string `json:"og:description,omitempty"`
    ImageURL    string `json:"og:image,omitempty"`
}

OpenGraph contains Open Graph meta tag values.
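
The struct tags above determine how an OpenGraph value is encoded as JSON. The following sketch shows the effect with encoding/json. The openGraph type is a local copy of the struct made so the example builds without the library, and encode is a hypothetical helper; the sample values are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// openGraph is a local copy of the OpenGraph struct shown above,
// duplicated here so the sketch does not depend on the library.
type openGraph struct {
	Title       string `json:"og:title,omitempty"`
	Description string `json:"og:description,omitempty"`
	ImageURL    string `json:"og:image,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of og.
func encode(og openGraph) (string, error) {
	b, err := json.Marshal(og)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Description is left empty, so omitempty drops it from the output.
	s, err := encode(openGraph{
		Title:    "Lego",
		ImageURL: "https://example.com/lego.png",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s) // {"og:title":"Lego","og:image":"https://example.com/lego.png"}
}
```

Because every field carries omitempty, a value with all fields empty encodes as {}.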

func (OpenGraph) IsEmpty

func (og OpenGraph) IsEmpty() bool

IsEmpty returns true if all fields of og are empty.

func (*OpenGraph) Set

func (og *OpenGraph) Set(key string, val string, urlStr string) error

Set assigns val to the field that corresponds to key.

type Option

type Option struct {
    // RetryLength is the minimum length for a page description.
    // If the extracted description is shorter than this value,
    // extraction is retried with more liberal rules.
    RetryLength int

    // MinTextLength is the minimum length of a tag's inner text.
    // If a tag's inner text is shorter than MinTextLength,
    // the text is discarded from the page description candidates.
    MinTextLength int

    // RemoveUnlikelyCandidates is a flag whether to remove some tags
    // if they are considered relatively unimportant.
    RemoveUnlikelyCandidates bool

    // WeightClasses is a flag whether to give more/less weight to some tags
    // if they contain some positive/negative words in id/class value.
    WeightClasses bool

    // CleanConditionally is a flag whether to remove some tags
    // using various rules in conditionalCleanReason().
    CleanConditionally bool

    // RemoveEmptyNodes is a flag whether to remove some tags which have empty inner text.
    RemoveEmptyNodes bool

    // MinImageWidth is the minimum width (pixel) for choosing images.
    MinImageWidth uint32

    // MinImageHeight is the minimum height (pixel) for choosing images.
    MinImageHeight uint32

    // MaxImageCount is the maximum number of images for a web page.
    MaxImageCount int

    // CheckImageLoopCount is the number of images whose sizes
    // are fetched with parallel requests.
    // For example, if this value is set to 10,
    // the first 10 img src URLs without width/height attributes
    // will be requested over the network.
    // (img tags with both width and height attributes (pixels as int) are not counted,
    // since their sizes are known without a network request.)
    CheckImageLoopCount uint

    // ImageRequestTimeout is timeout(ms) for a single image request.
    ImageRequestTimeout uint

    // IgnoreImageFormat is an array of strings for ignoring some images.
    // If an image URL contains at least one of the strings in this array, the image will be ignored.
    IgnoreImageFormat []string

    // DescriptionAsPlainText is a flag whether to strip all tags in a description value.
    DescriptionAsPlainText bool

    // DescriptionExtractionTimeout is timeout(ms) for extracting description for a page.
    DescriptionExtractionTimeout uint

    // LookupOpenGraphTags is a flag whether to use Open Graph tag values for the title, description, and image when they exist.
    LookupOpenGraphTags bool
}

Option contains a variety of options for extracting page content and images.
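
As a rough illustration of how the image-related options interact, the sketch below applies IgnoreImageFormat, MinImageWidth, MinImageHeight, and MaxImageCount to a list of candidate images. The image and filterOption types and the selectImages function are hypothetical stand-ins written for this example. The order of the checks is an assumption, the library's actual selection logic may differ, and the values used are arbitrary rather than the defaults returned by NewOption.

```go
package main

import (
	"fmt"
	"strings"
)

// image is a hypothetical stand-in for a candidate image with a known size.
type image struct {
	URL           string
	Width, Height uint32
}

// filterOption mirrors only the image-selection fields of Option.
type filterOption struct {
	MinImageWidth     uint32
	MinImageHeight    uint32
	MaxImageCount     int
	IgnoreImageFormat []string
}

// ignored reports whether imgURL contains at least one of the patterns.
func ignored(imgURL string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(imgURL, p) {
			return true
		}
	}
	return false
}

// selectImages keeps, in input order, images that are not ignored and
// meet both minimum dimensions, stopping once MaxImageCount are kept.
// This is an assumed ordering of checks, not the library's algorithm.
func selectImages(imgs []image, opt filterOption) []image {
	var out []image
	for _, img := range imgs {
		if len(out) >= opt.MaxImageCount {
			break
		}
		if ignored(img.URL, opt.IgnoreImageFormat) {
			continue
		}
		if img.Width < opt.MinImageWidth || img.Height < opt.MinImageHeight {
			continue
		}
		out = append(out, img)
	}
	return out
}

func main() {
	imgs := []image{
		{"https://example.com/a.png", 800, 600},
		{"https://example.com/sprite.gif", 800, 600}, // matches ".gif"
		{"https://example.com/tiny.png", 10, 10},     // too small
		{"https://example.com/b.jpg", 640, 480},
		{"https://example.com/c.jpg", 1024, 768}, // beyond MaxImageCount
	}
	opt := filterOption{
		MinImageWidth:     100,
		MinImageHeight:    100,
		MaxImageCount:     2,
		IgnoreImageFormat: []string{".gif"},
	}
	for _, img := range selectImages(imgs, opt) {
		fmt.Println(img.URL)
	}
}
```

With these sample values the sketch keeps a.png and b.jpg. Note that IgnoreImageFormat is a substring match on the whole URL, so a pattern such as ".gif" also matches a URL that contains it outside the file extension.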

func NewOption

func NewOption() *Option

NewOption returns the default option.

Package readability imports 15 packages. Updated 2019-04-22.