html2article

Version: v0.0.0-...-3755e1c
Published: Jul 18, 2019 | License: MIT | Imports: 22 | Imported by: 0

README

A text-density-based html2article implementation [golang]

Install

go get -u -v github.com/sundy-li/html2article

Performance

  • Accuracy: >= 98%

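  • QPS: about 20,000/s, 0.06 ms/op. Output of go test -bench=.: BenchmarkExtract-4 20000 66341 ns/op

  • Note: compared with other open-source implementations, this may currently be the fastest html2article implementation. Our test dataset of about 30 million articles comes from WeChat official accounts and the historical articles of major Chinese tech media; accuracy on it currently exceeds 98%.

  • Apart from the necessary DOM parsing and time parsing, heavy regular-expression matching is avoided to keep extraction fast.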

Examples

See from_url.go in the examples directory:

package main

import (
	"github.com/sundy-li/html2article"
)

func main() {
	urlStr := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	ext, err := html2article.NewFromUrl(urlStr)
	if err != nil {
		panic(err)
	}
	article, err := ext.ToArticle()
	if err != nil {
		panic(err)
	}
	println("article title is =>", article.Title)
	println("article publishtime is =>", article.Publishtime) // in the UTC timezone
	println("article content is =>", article.Content)

	// convert the article content into a readable form
	article.Readable(urlStr)
	println("read=>", article.ReadContent)
}

Options

	ext.SetOption(&html2article.Option{
		AccurateTitle: true,  // find an accurate title instead of taking it from the title tag
		RemoveNoise:   false, // remove noise nodes such as footers
	})

Algorithm

Documentation

Overview

COPYRIGHT https://github.com/golang/tools/blob/master/cmd/html2article/conv.go


Variables

var (
	ERROR_NOTFOUND = errors.New("Content not found")
	DEFAULT_OPTION = &Option{
		RemoveNoise: true,
	}
)

Functions

func Compress

func Compress(str string) string

Compress compresses a string by collapsing runs of multiple space characters into a single space.
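The whitespace collapsing described above can be sketched with the standard library alone. This is an illustration of the idea, not the package's implementation: collapseSpaces is a hypothetical helper, and treating tabs and newlines as space characters is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseSpaces is a hypothetical helper (not part of html2article).
// It replaces every run of whitespace with one ASCII space and leaves
// all other characters untouched.
func collapseSpaces(s string) string {
	var b strings.Builder
	inSpace := false // true while we are inside a whitespace run
	for _, r := range s {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ') // emit one space per run
			}
			inSpace = true
			continue
		}
		b.WriteRune(r)
		inSpace = false
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("a  \t b\n\nc"))
}
```

Unlike strings.Fields followed by strings.Join, this version keeps a single leading or trailing space instead of trimming it.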

func CompressHtml

func CompressHtml(str string) string

Not used for now, because code tags are still hard to recognize.

func DecodeHtml

func DecodeHtml(header http.Header, word, src string) (dst string)

func DefCode

func DefCode(header http.Header, html string) string

func NewFromHtml

func NewFromHtml(htmlStr string) (ext *extractor, err error)

func NewFromNode

func NewFromNode(doc *html.Node) (ext *extractor, err error)

func NewFromReader

func NewFromReader(reader io.Reader) (ext *extractor, err error)

func NewFromUrl

func NewFromUrl(urlStr string) (ext *extractor, err error)

Types

type Article

type Article struct {
	// Basic
	Html        string `json:"content_html"`
	Content     string `json:"content"`
	Title       string `json:"title"`
	Publishtime int64  `json:"publish_time"`

	// Others
	Images      []string `json:"images"`
	ReadContent string   `json:"read_content"`
	// contains filtered or unexported fields
}

func (*Article) GetContentNode

func (a *Article) GetContentNode() *html.Node

func (*Article) Paragraphs

func (a *Article) Paragraphs() []string

func (*Article) ParseImage

func (a *Article) ParseImage(urlStr string)

ParseImage resolves image src values to absolute URLs.
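Resolving an image src against the page URL can be sketched with net/url. This shows the general technique only, not what ParseImage does internally; absoluteSrc and the example.com URLs are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteSrc is a hypothetical helper (not part of html2article).
// It resolves src relative to pageURL following RFC 3986 rules.
func absoluteSrc(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/news/a.html"
	for _, src := range []string{"../img/x.png", "//cdn.example.com/y.png", "z.png"} {
		abs, err := absoluteSrc(page, src)
		if err != nil {
			panic(err)
		}
		fmt.Println(src, "=>", abs)
	}
}
```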

func (*Article) ParseReadContent

func (a *Article) ParseReadContent()

ParseReadContent converts the content into a readable form, stored in ReadContent.

func (*Article) Readable

func (a *Article) Readable(urlStr string)

type Info

type Info struct {
	TextCount     int
	LinkTextCount int
	TagCount      int
	LinkTagCount  int
	LeafList      []int
	Density       float64
	DensitySum    float64
	Pcount        int
	InputCount    int
	ImageCount    int

	Data string
	// contains filtered or unexported fields
}

func NewInfo

func NewInfo() *Info

func (*Info) CalScore

func (info *Info) CalScore(sn_sum, swn_sum float64)
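The counters in Info hint at how a text-density score can separate article text from navigation. The sketch below is a simplified illustration under stated assumptions, not the formula used by CalScore: nodeStats is a hypothetical, trimmed-down stand-in for Info, and density is assumed to be non-link text length divided by non-link tag count.

```go
package main

import "fmt"

// nodeStats is a hypothetical stand-in for a few Info counters.
type nodeStats struct {
	TextCount     int // characters of text under the node
	LinkTextCount int // characters of text inside links
	TagCount      int // tags under the node
	LinkTagCount  int // link tags under the node
}

// density returns non-link text per non-link tag (an assumed formula).
// Link-heavy blocks such as menus score low; long prose scores high.
func density(n nodeStats) float64 {
	text := n.TextCount - n.LinkTextCount
	if text < 0 {
		text = 0
	}
	tags := n.TagCount - n.LinkTagCount
	if tags < 1 {
		tags = 1 // avoid division by zero for tag-free nodes
	}
	return float64(text) / float64(tags)
}

func main() {
	body := nodeStats{TextCount: 1200, LinkTextCount: 40, TagCount: 30, LinkTagCount: 1}
	nav := nodeStats{TextCount: 200, LinkTextCount: 180, TagCount: 25, LinkTagCount: 20}
	fmt.Printf("article body: %.1f\n", density(body))
	fmt.Printf("navigation:   %.1f\n", density(nav))
}
```

With these made-up counts the article body scores ten times higher than the navigation block, which is the kind of gap a density-based extractor relies on when picking the content node.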

type Option

type Option struct {
	RemoveNoise   bool // remove noise node
	AccurateTitle bool // find the accurate title node
	UserAgent     string
}

type Style

type Style string
