heuristic

package
v0.0.0-...-977eb4a
Published: Feb 10, 2023 | License: MIT | Imports: 8 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BlockProximityFusion

type BlockProximityFusion struct {
	// contains filtered or unexported fields
}

BlockProximityFusion fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only when an upstream filter has already removed some blocks.
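The fusion rule described above can be illustrated with a minimal, self-contained sketch. The `block` type, the `fuseByProximity` function, and the way distance is counted (number of missing original positions between two blocks) are assumptions made for illustration; they are not this package's types or its exact rule.

```go
package main

import "fmt"

// block is a hypothetical stand-in for a text block. first and last are the
// positions of the original blocks it covers.
type block struct {
	first, last int
	text        string
}

// fuseByProximity merges a block into its predecessor when the number of
// original positions missing between them is at most maxDist.
func fuseByProximity(blocks []block, maxDist int) []block {
	if len(blocks) == 0 {
		return nil
	}
	out := []block{blocks[0]}
	for _, b := range blocks[1:] {
		prev := &out[len(out)-1]
		if b.first-prev.last-1 <= maxDist {
			prev.text += "\n" + b.text
			prev.last = b.last
			continue
		}
		out = append(out, b)
	}
	return out
}

func main() {
	// Positions 1, 3 and 4 are assumed to have been removed upstream.
	blocks := []block{{0, 0, "a"}, {2, 2, "b"}, {5, 5, "c"}}
	for _, b := range fuseByProximity(blocks, 1) {
		fmt.Printf("%d-%d %q\n", b.first, b.last, b.text)
	}
}
```

With a limit of 1, the first two blocks are merged because only one position separates them, while the third stays separate.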

func NewBlockProximityFusion

func NewBlockProximityFusion(postFiltering bool) *BlockProximityFusion

func (*BlockProximityFusion) Process

func (f *BlockProximityFusion) Process(doc *webdoc.TextDocument) bool

type DocumentTitleMatch

type DocumentTitleMatch struct {
	// contains filtered or unexported fields
}

DocumentTitleMatch marks TextBlocks which contain parts of the HTML `title` tag, using some heuristics which are quite specific to the news domain.
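As a rough illustration of title matching, the sketch below splits a page title on a few common separators and checks whether a block's text equals one of the resulting parts. The separator list, the exact-match comparison, and the names `titleCandidates` and `matchesTitle` are all assumptions; the package's actual heuristics are not shown in this listing and are likely more elaborate.

```go
package main

import (
	"fmt"
	"strings"
)

// titleCandidates returns the normalized title plus the parts obtained by
// splitting it on a few separators often seen in news page titles.
func titleCandidates(title string) map[string]bool {
	cands := map[string]bool{}
	add := func(s string) {
		s = strings.ToLower(strings.TrimSpace(s))
		if s != "" {
			cands[s] = true
		}
	}
	add(title)
	for _, sep := range []string{" | ", " - ", ": "} {
		for _, part := range strings.Split(title, sep) {
			add(part)
		}
	}
	return cands
}

// matchesTitle reports whether the block text equals one of the candidates.
func matchesTitle(cands map[string]bool, text string) bool {
	return cands[strings.ToLower(strings.TrimSpace(text))]
}

func main() {
	cands := titleCandidates("Flood hits city - Example News")
	fmt.Println(matchesTitle(cands, "Flood hits city"))
	fmt.Println(matchesTitle(cands, "Sports"))
}
```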

func NewDocumentTitleMatch

func NewDocumentTitleMatch(wc stringutil.WordCounter, titles ...string) *DocumentTitleMatch

func (*DocumentTitleMatch) Process

func (f *DocumentTitleMatch) Process(doc *webdoc.TextDocument) bool

type ExpandTitleToContent

type ExpandTitleToContent struct{}

ExpandTitleToContent marks as "content" all TextBlocks that lie between the headline and the part that has already been marked as content, provided they are marked with label.MightBeContent. This filter is quite specific to the news domain.
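The behavior described above might look roughly like the following sketch. The `block` type with its three flags and the `expandTitleToContent` function are hypothetical stand-ins, and the returned boolean (whether anything changed) is an assumption.

```go
package main

import "fmt"

// block is a hypothetical stand-in; mightBeContent plays the role of
// label.MightBeContent.
type block struct {
	isTitle, isContent, mightBeContent bool
}

// expandTitleToContent marks blocks from the first title block up to the
// first content block as content if they carry mightBeContent.
func expandTitleToContent(blocks []block) bool {
	title, content := -1, -1
	for i, b := range blocks {
		if content < 0 && b.isContent {
			content = i
		}
		if title < 0 && b.isTitle {
			title = i
		}
	}
	if title < 0 || content <= title {
		return false
	}
	changed := false
	for i := title; i < content; i++ {
		if blocks[i].mightBeContent && !blocks[i].isContent {
			blocks[i].isContent = true
			changed = true
		}
	}
	return changed
}

func main() {
	blocks := []block{
		{},                                     // navigation
		{isTitle: true, mightBeContent: true},  // headline
		{mightBeContent: true},                 // byline
		{},                                     // advert
		{isContent: true},                      // article body
	}
	fmt.Println(expandTitleToContent(blocks), blocks)
}
```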

func NewExpandTitleToContent

func NewExpandTitleToContent() *ExpandTitleToContent

func (*ExpandTitleToContent) Process

func (f *ExpandTitleToContent) Process(doc *webdoc.TextDocument) bool

type HeadingFusion

type HeadingFusion struct{}

HeadingFusion fuses headings with the blocks after them. If the heading was marked as boilerplate, the fused block will be labeled to prevent BlockProximityFusion from merging through it.

func NewHeadingFusion

func NewHeadingFusion() *HeadingFusion

func (*HeadingFusion) Process

func (f *HeadingFusion) Process(doc *webdoc.TextDocument) bool

type KeepLargestBlock

type KeepLargestBlock struct {
	// contains filtered or unexported fields
}

KeepLargestBlock keeps the largest TextBlock only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as `label.MightBeContent`. Note that, by default, only TextBlocks marked as "content" are taken into consideration.
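The selection rule above (most words wins, first block wins ties, everything else is demoted) can be sketched as follows. The `block` type and `keepLargest` function are hypothetical, and the sketch ignores the `expandToSiblings` option, whose behavior this listing does not describe.

```go
package main

import "fmt"

// block is a hypothetical stand-in; mightBeContent plays the role of
// label.MightBeContent.
type block struct {
	numWords       int
	isContent      bool
	mightBeContent bool
}

// keepLargest keeps only the content block with the most words and returns
// its index, or -1 if no block is marked as content. A strict comparison
// means the first block wins a tie.
func keepLargest(blocks []block) int {
	largest := -1
	for i, b := range blocks {
		if b.isContent && (largest < 0 || b.numWords > blocks[largest].numWords) {
			largest = i
		}
	}
	if largest < 0 {
		return -1
	}
	for i := range blocks {
		if i != largest {
			blocks[i].isContent = false
			blocks[i].mightBeContent = true
		}
	}
	return largest
}

func main() {
	blocks := []block{
		{numWords: 10, isContent: true},
		{numWords: 30, isContent: true},
		{numWords: 30, isContent: true},
		{numWords: 99}, // not content, so not considered
	}
	fmt.Println(keepLargest(blocks), blocks)
}
```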

func NewKeepLargestBlock

func NewKeepLargestBlock(expandToSiblings bool) *KeepLargestBlock

func (*KeepLargestBlock) Process

func (f *KeepLargestBlock) Process(doc *webdoc.TextDocument) bool

type LargeBlockAroundTagLevelToContent

type LargeBlockAroundTagLevelToContent struct{}

LargeBlockAroundTagLevelToContent marks as content all blocks that:

  - are on the same or an adjacent tag level as the very likely main content (usually the level of the largest block)
  - have a significant number of words, currently at least 100
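The two conditions above can be combined in a short sketch. The `block` type, the `promoteAroundLevel` function, and passing the reference tag level as a parameter are assumptions for illustration; only the 100-word threshold comes from the description.

```go
package main

import "fmt"

// block is a hypothetical stand-in for a text block.
type block struct {
	tagLevel, numWords int
	isContent          bool
}

// minWords is the threshold named in the description.
const minWords = 100

// promoteAroundLevel marks non-content blocks as content when they sit on
// the reference tag level or one level above or below it and have at least
// minWords words. It reports whether any block changed.
func promoteAroundLevel(blocks []block, level int) bool {
	changed := false
	for i := range blocks {
		b := &blocks[i]
		diff := b.tagLevel - level
		if diff < 0 {
			diff = -diff
		}
		if !b.isContent && diff <= 1 && b.numWords >= minWords {
			b.isContent = true
			changed = true
		}
	}
	return changed
}

func main() {
	blocks := []block{
		{tagLevel: 5, numWords: 400, isContent: true}, // assumed main content
		{tagLevel: 6, numWords: 120},                  // adjacent level, large
		{tagLevel: 5, numWords: 40},                   // same level, too small
		{tagLevel: 8, numWords: 300},                  // large, but too far away
	}
	fmt.Println(promoteAroundLevel(blocks, 5), blocks)
}
```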

func NewLargeBlockAroundTagLevelToContent

func NewLargeBlockAroundTagLevelToContent() *LargeBlockAroundTagLevelToContent

func (*LargeBlockAroundTagLevelToContent) Process

func (f *LargeBlockAroundTagLevelToContent) Process(doc *webdoc.TextDocument) bool

type ListAtEnd

type ListAtEnd struct{}

ListAtEnd marks nested list-item blocks after the end of the main content.

func NewListAtEnd

func NewListAtEnd() *ListAtEnd

func (*ListAtEnd) Process

func (f *ListAtEnd) Process(doc *webdoc.TextDocument) bool

type SimilarSiblingContent

type SimilarSiblingContent struct {
	AllowCrossTitles   bool
	AllowCrossHeadings bool
	AllowMixedTags     bool
	MaxLinkDensity     float64
	MaxBlockDistance   int
}

SimilarSiblingContent marks "siblings" of content as content if they are "similar" enough.

This calculates "siblings" by finding a "canonical" DOM node for each TextBlock. This node is the highest ancestor of the TextBlock's first contained node that does not contain (in its subtree) the last node of the previous TextBlock or the first node of the next TextBlock.

If a content block and a non-content block are siblings and are "similar" enough, then the non-content block is marked as content. The "similarity" test is configurable in various ways.
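How the exported fields gate the similarity test is not documented in this listing. The sketch below shows one plausible reading for three of them, using a hypothetical `options` struct that borrows the field names, a hypothetical `block` type, and a hypothetical `similar` predicate. The cross-title and cross-heading options and the canonical-node computation are left out.

```go
package main

import "fmt"

// options mirrors three of the exported field names; how the real filter
// uses them is an assumption here.
type options struct {
	AllowMixedTags   bool
	MaxLinkDensity   float64
	MaxBlockDistance int
}

// block is a hypothetical stand-in for a text block and its canonical node.
type block struct {
	index       int     // position in the document
	tag         string  // tag name of the canonical node
	linkDensity float64 // share of the text that is link text
}

// similar reports whether candidate may inherit content status from the
// content block under the given options.
func similar(o options, content, candidate block) bool {
	dist := candidate.index - content.index
	if dist < 0 {
		dist = -dist
	}
	if dist > o.MaxBlockDistance {
		return false
	}
	if !o.AllowMixedTags && content.tag != candidate.tag {
		return false
	}
	return candidate.linkDensity <= o.MaxLinkDensity
}

func main() {
	o := options{MaxLinkDensity: 0.5, MaxBlockDistance: 2}
	content := block{index: 3, tag: "p", linkDensity: 0.0}
	fmt.Println(similar(o, content, block{index: 4, tag: "p", linkDensity: 0.1}))
	fmt.Println(similar(o, content, block{index: 4, tag: "div", linkDensity: 0.1}))
}
```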

func NewSimilarSiblingContentExpansion

func NewSimilarSiblingContentExpansion() *SimilarSiblingContent

func (*SimilarSiblingContent) Process

func (f *SimilarSiblingContent) Process(doc *webdoc.TextDocument) bool
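
Every type listed here exposes a Process method with the same shape, which suggests that filters can be chained behind a small interface. The sketch below shows that pattern with a hypothetical `document` type, a hypothetical `filter` interface, and a toy `dropEmpty` filter. It assumes the boolean result reports whether the document was modified, which this listing does not state.

```go
package main

import "fmt"

// document is a hypothetical stand-in for webdoc.TextDocument.
type document struct {
	blocks []string
}

// filter captures the method shape shared by the types in this package.
type filter interface {
	Process(doc *document) bool
}

// dropEmpty is a toy filter that removes empty blocks.
type dropEmpty struct{}

func (dropEmpty) Process(doc *document) bool {
	kept := doc.blocks[:0]
	for _, b := range doc.blocks {
		if b != "" {
			kept = append(kept, b)
		}
	}
	changed := len(kept) != len(doc.blocks)
	doc.blocks = kept
	return changed
}

// runAll applies the filters in order and reports whether any of them
// changed the document.
func runAll(doc *document, filters ...filter) bool {
	changed := false
	for _, f := range filters {
		if f.Process(doc) {
			changed = true
		}
	}
	return changed
}

func main() {
	doc := &document{blocks: []string{"a", "", "b"}}
	fmt.Println(runAll(doc, dropEmpty{}), doc.blocks)
}
```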
