crawler

command module

v0.0.0-...-278ce41 (latest) | Published: Jan 28, 2024 | License: Apache-2.0 | Imports: 2 | Imported by: 0

Repository

github.com/dapings/examples

README ¶

Crawler

Four ways to process web page text

Regular expressions

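// regexp.MustCompile parses the pattern once, up front (at package initialization when assigned to a package-level variable), and panics if the pattern is invalid; reusing the compiled value avoids re-parsing on every call.
// In [\s\S]*?, [\s\S] matches any single character including newlines, * repeats it zero or more times, and ? makes the repetition non-greedy, so the match stops at the first place the rest of the pattern can succeed.
// Go's regexp package does not backtrack, so matching time is linear in the input size, but complex patterns run over large pages can still consume noticeable CPU.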
var headerReg = regexp.MustCompile(`<div class="news_li"[\s\S]*?<h2>[\s\S]*?<a.*?target="_blank">([\s\S]*?)</a>`)

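// FindAllSubmatch returns a [][][]byte: one entry per match, then one entry per capture group (index 0 is the whole match), and the innermost []byte holds the matched text.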
matches := headerReg.FindAllSubmatch(body, -1)
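
As a rough illustration of how the two snippets above fit together, the following sketch runs the same pattern over a small, made-up HTML fragment and collects the first capture group of each match. The helper `extractTitles` and the constant `sampleHTML` are hypothetical names introduced here for the example; they are not part of the project.

```go
package main

import (
	"fmt"
	"regexp"
)

// headerReg is the pattern from the README. It captures the link text of an
// <a target="_blank"> that follows an <h2> inside a news_li <div>.
var headerReg = regexp.MustCompile(`<div class="news_li"[\s\S]*?<h2>[\s\S]*?<a.*?target="_blank">([\s\S]*?)</a>`)

// sampleHTML is made-up input used only for this illustration.
const sampleHTML = `
<div class="news_li">
  <h2><a href="/a/1" target="_blank">First headline</a></h2>
</div>
<div class="news_li">
  <h2><a href="/a/2" target="_blank">Second headline</a></h2>
</div>`

// extractTitles (hypothetical helper) returns the first capture group of
// every match of headerReg in body.
func extractTitles(body []byte) []string {
	matches := headerReg.FindAllSubmatch(body, -1)
	titles := make([]string, 0, len(matches))
	for _, m := range matches {
		// m[0] is the whole match; m[1] is the first capture group.
		titles = append(titles, string(m[1]))
	}
	return titles
}

func main() {
	for i, title := range extractTitles([]byte(sampleHTML)) {
		fmt.Printf("title %d: %s\n", i, title)
	}
}
```

Converting `m[1]` to a string copies the bytes, so the result stays valid even if the caller later reuses the body buffer.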

XPath (XML Path Language)

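XPath defines a flexible way to traverse the node hierarchy of an XML document and return the matching elements. The third-party library github.com/antchfx/htmlquery provides an engine for matching nodes in HTML documents with XPath expressions.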

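// XPath expression
var xpathReg = `//div[@class="news_li"]/h2/a[@target="_blank"]`
// Parse the HTML text
doc, err := htmlquery.Parse(bytes.NewReader(body))
if err != nil {
	log.Fatalf("parse html: %v", err)
}

// Find the nodes that match the XPath expression
nodes := htmlquery.Find(doc, xpathReg)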

CSS selectors

CSS is a language for defining the style of elements in an HTML document. The third-party library github.com/PuerkitoBio/goquery supports CSS selectors.

var cssReg = "div.news_li h2 a[target=_blank]"
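// Load the HTML text
doc, err := goquery.NewDocumentFromReader(bytes.NewReader(body))
if err != nil {
	log.Fatalf("load html: %v", err)
}
// Find the elements that match the CSS selector and print the text of each <a> element
doc.Find(cssReg).Each(func(i int, s *goquery.Selection) {
	// Get the text of the matched element
	title := s.Text()
	log.Printf("review %d: %s\n", i, title)
})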

Standard library and golang.org/x packages: strings, bytes, golang.org/x/text/encoding, golang.org/x/net/html/charset

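Summary: regular expressions tend to be complex and slow, so in practice structured queries are usually written with XPath or CSS selectors. XPath was designed for XML documents, whereas CSS selectors were designed specifically for HTML and are simpler and more widely used.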

https://github.com/dreamerjackson/crawler “聚沙万塔-Go语言构建高性能、分布式爬虫项目” (roughly: "Building a high-performance, distributed crawler project in Go")

Documentation ¶

There is no documentation for this package.

Source Files ¶

main.go

Directories ¶

Path	Synopsis
auth
client
cmd
internal
collect
engine
extensions
generator
limiter
log
master
parse
doubanbook
doubangroup
doubangroupjs
protos
crawler	Package crawler is a reverse proxy.
greeter	Package greeter is a reverse proxy.
proxy
spider
sqldb
storage
sqlstorage