spiderPlus

Version: v0.0.2
Published: Dec 10, 2020 | License: Apache-2.0 | Imports: 14 | Imported by: 0

README

spiderPlus

Scrapes articles from a website into an Excel file.

Installation

go get github.com/PeterYangs/spiderPlus

Usage


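package main

import "github.com/PeterYangs/spiderPlus"

func main() {
	spiderPlus.Rule(
		"https://www.azg168.cn", // host (domain)
		"/bazisuanming/index_[PAGE].html", // channel (section) path; [PAGE] stands in for the page number
		10, // number of pages to crawl
		2, // starting page
		"body > div.main.clearfix.w960 > div.main_left.fl.dream_box > ul > li", // list selector
		"a", // link (a element) selector, relative to each list item
		"body > div.main.clearfix.w960 > div.main_left.fl > div.art_con_left > h1", // title selector
		"#azgbody", // content selector
		"demo", // download path for images
		"ttt", // prefix for images in the content
	)
}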
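
The [PAGE] placeholder in the channel argument is replaced with a page number for each list page. The sketch below illustrates that expansion with the standard library only. pageURLs is a hypothetical helper and not part of spiderPlus, and it assumes that the page count is measured from the starting page; the package's actual behavior may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageURLs is a hypothetical helper, not part of spiderPlus. It replaces the
// [PAGE] placeholder in channel with each page number and prefixes host.
// Assumption: limit is the number of pages to visit, counted from pageStart.
func pageURLs(host, channel string, limit, pageStart int) []string {
	if limit <= 0 {
		return nil
	}
	urls := make([]string, 0, limit)
	for page := pageStart; page < pageStart+limit; page++ {
		path := strings.ReplaceAll(channel, "[PAGE]", strconv.Itoa(page))
		urls = append(urls, host+path)
	}
	return urls
}

func main() {
	// Same host and channel as the usage example, limited to three pages.
	for _, u := range pageURLs("https://www.azg168.cn", "/bazisuanming/index_[PAGE].html", 3, 2) {
		fmt.Println(u)
	}
}
```

Under that assumption, a starting page of 2 and a page count of 3 yield index_2.html, index_3.html, and index_4.html.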

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func PathExists

func PathExists(path string) (bool, error)

PathExists reports whether the folder at path exists.

func Rule

func Rule(
	host string,
	channel string,
	limit int, pageStart int,
	listSelector string,
	listHrefSelector string,
	titleSelector string,
	contentSelector string,
	dirs string,
	imagePrefix string)

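host: domain, e.g. https://www.d1xz.net/
channel: channel (section) path, with [PAGE] in place of the page number, e.g. bazi/list_[PAGE].html
limit: total number of pages to crawl
pageStart: starting page
listSelector: list selector
listHrefSelector: selector for the link (a element) within each list item
titleSelector: title selector
contentSelector: content selector
dirs: image folder
imagePrefix: image link prefix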
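
The example values here write host with a trailing slash and channel without a leading one, while the README usage does the opposite. Whether Rule normalizes the two is not documented, so a caller may prefer to join them explicitly. The sketch below is one way to do that; joinHostChannel is a hypothetical helper and not part of spiderPlus.

```go
package main

import (
	"fmt"
	"strings"
)

// joinHostChannel is a hypothetical helper, not part of spiderPlus. It joins
// host and channel with exactly one slash between them, regardless of which
// side (if either) already carries one.
func joinHostChannel(host, channel string) string {
	return strings.TrimRight(host, "/") + "/" + strings.TrimLeft(channel, "/")
}

func main() {
	// Convention used in the parameter list.
	fmt.Println(joinHostChannel("https://www.d1xz.net/", "bazi/list_[PAGE].html"))
	// Convention used in the README usage.
	fmt.Println(joinHostChannel("https://www.azg168.cn", "/bazisuanming/index_[PAGE].html"))
}
```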

Types

type Config

type Config struct {
	// contains filtered or unexported fields
}

type Task

type Task struct {
	Id            uint
	CreatedAt     time.Time
	UpdatedAt     time.Time
	CategoryId    int
	Content       string
	Img           string
	Title         string
	Desc          string
	Keyword       string
	WriteType     int
	Expand        string
	PushTime      time.Time
	AdminIdCreate int
}
