distiller

package
v0.0.0-...-977eb4a
Published: Feb 10, 2023 License: MIT Imports: 14 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type LogFlag

type LogFlag uint

LogFlag is a bit-flag enum that specifies which kinds of log output are enabled.

const (
	// If LogEverything is set, DistillerLogger enables all logs.
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming

	// If LogExtraction is set, DistillerLogger prints info on each process when extracting an article.
	LogExtraction LogFlag = 1 << iota

	// If LogVisibility is set, DistillerLogger prints info on why an element is visible.
	LogVisibility

	// If LogPagination is set, DistillerLogger prints info on the pagination process.
	LogPagination

	// If LogTiming is set, DistillerLogger prints the duration of each process when extracting an article.
	LogTiming
)
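
The flags are bit masks, so they can be combined with `|` and tested with `&`. The following is a minimal, self-contained sketch: it copies the declaration above into a local `package main` rather than importing the real package, and the helpers `has` and `describe` are hypothetical names introduced only for illustration. Note that because `LogEverything` occupies the first position in the block, `LogExtraction` is evaluated with `iota == 1`.

```go
package main

import (
	"fmt"
	"strings"
)

// LogFlag and its constants are copied from the declaration above so that
// this example compiles on its own; real code would import the package.
type LogFlag uint

const (
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming
	LogExtraction LogFlag = 1 << iota
	LogVisibility
	LogPagination
	LogTiming
)

// has reports whether every bit of want is set in flags.
// Hypothetical helper, not part of the documented API.
func has(flags, want LogFlag) bool {
	return flags&want == want
}

// describe returns the names of the individual flags set in flags,
// joined by "|", or "none" if no known flag is set.
// Hypothetical helper, not part of the documented API.
func describe(flags LogFlag) string {
	known := []struct {
		flag LogFlag
		name string
	}{
		{LogExtraction, "LogExtraction"},
		{LogVisibility, "LogVisibility"},
		{LogPagination, "LogPagination"},
		{LogTiming, "LogTiming"},
	}
	var set []string
	for _, k := range known {
		if has(flags, k.flag) {
			set = append(set, k.name)
		}
	}
	if len(set) == 0 {
		return "none"
	}
	return strings.Join(set, "|")
}

func main() {
	// Enable only extraction and timing logs.
	flags := LogExtraction | LogTiming
	fmt.Println(describe(flags))
	fmt.Println(has(flags, LogVisibility))
	fmt.Println(has(LogEverything, flags))
}
```

A value built this way would be assigned to `Options.LogFlags`.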

type Options

type Options struct {
	// Flags to specify which info to dump to log.
	LogFlags LogFlag

	// Original URL of the page, which is used by the heuristics for detecting
	// next/prev page links. Will be ignored if Options is used in ApplyForURL.
	OriginalURL *nurl.URL

	// Set to true to skip the process of finding pagination.
	SkipPagination bool

	// Algorithm to use for next page detection.
	PaginationAlgo PaginationAlgo
}

Options is the configuration for the distiller.
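
As a rough illustration of how these fields fit together, the sketch below fills in an `Options` value from a raw URL string. To stay self-contained it mirrors the `Options` and `PaginationAlgo` declarations from this page locally instead of importing the package, and it assumes that `nurl` is an import alias for the standard `net/url` package. The helper `newOptions` is a hypothetical name, not part of the documented API.

```go
package main

import (
	"fmt"
	nurl "net/url" // assumption: nurl in the docs aliases net/url
)

// The types below mirror the declarations on this page so the example
// compiles on its own; real code would import the package instead.
type LogFlag uint

type PaginationAlgo uint

const (
	PrevNext PaginationAlgo = iota
	PageNumber
)

type Options struct {
	LogFlags       LogFlag
	OriginalURL    *nurl.URL
	SkipPagination bool
	PaginationAlgo PaginationAlgo
}

// newOptions parses rawURL and returns Options that use the given
// pagination algorithm. Hypothetical helper for illustration only.
func newOptions(rawURL string, algo PaginationAlgo) (*Options, error) {
	u, err := nurl.ParseRequestURI(rawURL)
	if err != nil {
		return nil, fmt.Errorf("parse original URL: %w", err)
	}
	return &Options{
		OriginalURL:    u,
		PaginationAlgo: algo,
	}, nil
}

func main() {
	opts, err := newOptions("https://example.com/article?page=1", PageNumber)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts.OriginalURL.Host)
	fmt.Println(opts.PaginationAlgo == PageNumber)
	fmt.Println(opts.SkipPagination)
}
```

Because `PrevNext` is declared with the value zero, an `Options` value that leaves `PaginationAlgo` unset selects `PrevNext`, which is consistent with its description as the default algorithm.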

type PaginationAlgo

type PaginationAlgo uint

PaginationAlgo is the algorithm to find the pagination links.

const (
	// PrevNext is the algorithm that finds pagination links by scoring each anchor in the
	// document using various heuristics on its href, text, class name and ID. It's quite
	// accurate and is used as the default algorithm. Unfortunately it uses a lot of regular
	// expressions, so it's a bit slow.
	PrevNext PaginationAlgo = iota

	// PageNumber is the algorithm that finds pagination links by collecting groups of adjacent
	// plain text numbers and outlinks with numeric anchor text. It's a lot faster than PrevNext,
	// but also less accurate.
	PageNumber
)

type Result

type Result struct {
	// URL is the URL of the processed page.
	URL string

	// Title is the title of the processed page.
	Title string

	// MarkupInfo is the metadata of the page. The metadata is extracted following three markup
	// specifications: OpenGraphProtocol, IEReadingView and SchemaOrg. For now, OpenGraphProtocol
	// takes precedence because it uses specific meta tags and is hence the fastest. The other
	// specifications are used as fallbacks in case some metadata is not found.
	MarkupInfo data.MarkupInfo

	// TimingInfo is the record of the time it takes to do each step in the process of content extraction.
	TimingInfo data.TimingInfo

	// PaginationInfo contains links to the previous and next partial pages. This is useful for long
	// articles that may be split into several partial pages by the webmaster.
	PaginationInfo data.PaginationInfo

	// WordCount is the count of words within the document.
	WordCount int

	// Node is the *html.Node which contains the distilled content.
	Node *html.Node

	// Text is the string which contains the distilled content in text format.
	Text string

	// ContentImages is the list of image URLs used within the distilled content.
	ContentImages []string
}

Result is the final output of the distiller.

func Apply

func Apply(doc *html.Node, opts *Options) (*Result, error)

Apply runs distiller for the specified parsed document.

func ApplyForFile

func ApplyForFile(path string, opts *Options) (*Result, error)

ApplyForFile runs distiller for the specified file.

func ApplyForReader

func ApplyForReader(r io.Reader, opts *Options) (*Result, error)

ApplyForReader runs distiller for the specified io.Reader.

func ApplyForURL

func ApplyForURL(url string, timeout time.Duration, opts *Options) (*Result, error)

ApplyForURL runs distiller for the specified URL.
