html2article

package module

Version: v0.0.0-...-3755e1c (latest) · Published: Jul 18, 2019 · License: MIT · Imports: 22 · Imported by: 0

Repository

github.com/ganeasy/html2article

README ¶

A text-density-based html2article implementation [golang]

Install

go get -u -v github.com/sundy-li/html2article

Performance

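Accuracy: >= 98%
QPS: about 20,000/s, 0.06 ms/op

go test -bench=.
BenchmarkExtract-4 20000 66341 ns/op

Note: compared with other open-source implementations, this may currently be the fastest html2article implementation. Our test dataset of about 30 million articles comes from WeChat official accounts and the historical articles of major Chinese technology media, and accuracy on it currently exceeds 98%.
Apart from the necessary DOM parsing and time parsing, excessive regular-expression matching is avoided for efficiency.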
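
To relate the ns/op figure in the benchmark line above to a throughput number, the conversion can be sketched as follows. The helper name `opsPerSecond` is hypothetical and not part of the package; the sketch assumes a sequential benchmark, so the result is per goroutine. For 66341 ns/op it gives roughly 15,000 operations per second.

```go
package main

import "fmt"

// opsPerSecond converts a `go test -bench` ns/op figure into
// operations per second for a single goroutine.
// Hypothetical helper for illustration; not part of html2article.
func opsPerSecond(nsPerOp float64) float64 {
	return 1e9 / nsPerOp
}

func main() {
	// 66341 ns/op is the figure reported for BenchmarkExtract-4.
	fmt.Printf("%.0f ops/s\n", opsPerSecond(66341))
}
```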

Examples

See from_url.go in the examples directory.

package main

import (
	"github.com/sundy-li/html2article"
)

func main() {
	urlStr := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	ext, err := html2article.NewFromUrl(urlStr)
	if err != nil {
		panic(err)
	}
	article, err := ext.ToArticle()
	if err != nil {
		panic(err)
	}
	println("article title is =>", article.Title)
	println("article publishtime is =>", article.Publishtime) //using UTC timezone
	println("article content is =>", article.Content)

	// convert the article content into a readable form
	article.Readable(urlStr)
	println("read=>", article.ReadContent)
}

Options

	ext.SetOption(&html2article.Option{
		AccurateTitle: true,  // find the accurate title instead of using the title tag
		RemoveNoise: false,  // remove noise nodes such as footers
	})

Algorithm

Documentation ¶

Overview ¶

COPYRIGHT https://github.com/golang/tools/blob/master/cmd/html2article/conv.go

Index ¶

Variables
func Compress(str string) string
func CompressHtml(str string) string
func DecodeHtml(header http.Header, word, src string) (dst string)
func DefCode(header http.Header, html string) string
func NewFromHtml(htmlStr string) (ext *extractor, err error)
func NewFromNode(doc *html.Node) (ext *extractor, err error)
func NewFromReader(reader io.Reader) (ext *extractor, err error)
func NewFromUrl(urlStr string) (ext *extractor, err error)
type Article
type Info
- func NewInfo() *Info
- func (info *Info) CalScore(sn_sum, swn_sum float64)
type Option
type Style

Constants ¶

This section is empty.

Variables ¶

var (
	ERROR_NOTFOUND = errors.New("Content not found")
	DEFAULT_OPTION = &Option{
		RemoveNoise: true,
	}
)

Functions ¶

func Compress ¶

func Compress(str string) string

Compress compresses a string by collapsing multiple space characters into a single space.
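
Whitespace collapsing of this kind can be sketched with the standard library alone. This is a minimal illustration, not the package's Compress: the helper name `collapseSpaces` is hypothetical, and treating every Unicode whitespace rune as a space (without trimming the ends) is an assumption. It uses a single pass and no regular expressions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseSpaces replaces every run of Unicode whitespace with one
// ASCII space. Hypothetical helper; the real Compress may differ in
// which characters it treats as spaces and whether it trims.
func collapseSpaces(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	inSpace := false
	for _, r := range s {
		if unicode.IsSpace(r) {
			// Emit a space only for the first rune of a whitespace run.
			if !inSpace {
				b.WriteByte(' ')
			}
			inSpace = true
			continue
		}
		b.WriteRune(r)
		inSpace = false
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("hello \t\n  world"))
}
```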

func CompressHtml ¶

func CompressHtml(str string) string

Not used for now, because code tags are still hard to recognize.

func DecodeHtml ¶

func DecodeHtml(header http.Header, word, src string) (dst string)

func DefCode ¶

func DefCode(header http.Header, html string) string

func NewFromHtml ¶

func NewFromHtml(htmlStr string) (ext *extractor, err error)

func NewFromNode ¶

func NewFromNode(doc *html.Node) (ext *extractor, err error)

func NewFromReader ¶

func NewFromReader(reader io.Reader) (ext *extractor, err error)

func NewFromUrl ¶

func NewFromUrl(urlStr string) (ext *extractor, err error)

Types ¶

type Article ¶

type Article struct {
	// Basic
	Html        string `json:"content_html"`
	Content     string `json:"content"`
	Title       string `json:"title"`
	Publishtime int64  `json:"publish_time"`

	// Others
	Images      []string `json:"images"`
	ReadContent string   `json:"read_content"`
	// contains filtered or unexported fields
}
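
The JSON tags above determine how an Article serializes. The sketch below mirrors only the exported fields in a local type, `articleJSON`, so that it runs without the package; that type and the helpers `toJSON` and `publishedAt` are hypothetical. Interpreting Publishtime as Unix seconds is an assumption; the README states only that the value uses the UTC timezone.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// articleJSON mirrors the exported fields and JSON tags of Article.
// It is a local stand-in for illustration, not the package type.
type articleJSON struct {
	Html        string   `json:"content_html"`
	Content     string   `json:"content"`
	Title       string   `json:"title"`
	Publishtime int64    `json:"publish_time"`
	Images      []string `json:"images"`
	ReadContent string   `json:"read_content"`
}

// toJSON serializes the article using the tags above.
func toJSON(a articleJSON) (string, error) {
	out, err := json.Marshal(a)
	return string(out), err
}

// publishedAt converts Publishtime to a time.Time in UTC.
// Assumption: Publishtime holds Unix seconds.
func publishedAt(a articleJSON) time.Time {
	return time.Unix(a.Publishtime, 0).UTC()
}

func main() {
	a := articleJSON{Title: "Example", Content: "Body text", Publishtime: 1455062400}
	s, err := toJSON(a)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	fmt.Println(publishedAt(a).Format(time.RFC3339))
}
```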

func (*Article) GetContentNode ¶

func (a *Article) GetContentNode() *html.Node

func (*Article) Paragraphs ¶

func (a *Article) Paragraphs() []string

func (*Article) ParseImage ¶

func (a *Article) ParseImage(urlStr string)

ParseImage resolves each image src to an absolute path.
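
Resolving a relative image src against the page URL can be sketched with net/url. This is an illustration of the general technique under the assumption that urlStr is the page's URL; the helper `absoluteSrc` is hypothetical and is not how ParseImage is necessarily implemented.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteSrc resolves an image src against the URL of the page it
// appeared on. Hypothetical helper; not part of html2article.
func absoluteSrc(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference follows RFC 3986 relative-reference resolution.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	for _, src := range []string{"/img/a.png", "b.jpg", "//cdn.example.com/c.gif"} {
		abs, err := absoluteSrc(page, src)
		if err != nil {
			panic(err)
		}
		fmt.Println(abs)
	}
}
```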

func (*Article) ParseReadContent ¶

func (a *Article) ParseReadContent()

ParseReadContent converts ReadContent into a readable form.

func (*Article) Readable ¶

func (a *Article) Readable(urlStr string)

type Info ¶

type Info struct {
	TextCount     int
	LinkTextCount int
	TagCount      int
	LinkTagCount  int
	LeafList      []int
	Density       float64
	DensitySum    float64
	Pcount        int
	InputCount    int
	ImageCount    int

	Data string
	// contains filtered or unexported fields
}
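
The counters in Info suggest a text-density measure. The sketch below shows one common, simplified form of that idea: non-link characters divided by non-link tags. It is an assumption for illustration only, not the formula the package or CalScore uses; the `nodeStats` type and `textDensity` function are hypothetical and reuse only the field names shown above.

```go
package main

import "fmt"

// nodeStats holds four counters named after the Info fields above.
// Hypothetical type for illustration; not part of html2article.
type nodeStats struct {
	TextCount     int // characters of text under the node
	LinkTextCount int // characters of text inside links
	TagCount      int // tags under the node
	LinkTagCount  int // link tags under the node
}

// textDensity returns non-link characters per non-link tag.
// Simplified measure; the package's actual scoring may differ.
func textDensity(s nodeStats) float64 {
	text := s.TextCount - s.LinkTextCount
	if text < 0 {
		text = 0
	}
	tags := s.TagCount - s.LinkTagCount
	if tags < 1 {
		tags = 1 // avoid division by zero for tag-free nodes
	}
	return float64(text) / float64(tags)
}

func main() {
	body := nodeStats{TextCount: 1200, LinkTextCount: 40, TagCount: 25, LinkTagCount: 5}
	nav := nodeStats{TextCount: 150, LinkTextCount: 140, TagCount: 30, LinkTagCount: 28}
	fmt.Printf("body=%.1f nav=%.1f\n", textDensity(body), textDensity(nav))
}
```

Under this measure a long article body scores far higher than a link-heavy navigation block, which is the intuition behind density-based extraction.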

func NewInfo ¶

func NewInfo() *Info

func (*Info) CalScore ¶

func (info *Info) CalScore(sn_sum, swn_sum float64)

type Option ¶

type Option struct {
	RemoveNoise   bool // remove noise node
	AccurateTitle bool // find the accurate title node
	UserAgent     string
}

type Style ¶

type Style string

Directories ¶

Path	Synopsis
examples
