check_link

package
v0.0.0-...-fcc9234
Published: Apr 8, 2018 License: GPL-2.0 Imports: 15 Imported by: 0

Documentation

Constants

const (
	PATTERN_SRC          = `src=\"(.*?)\"`
	PATTERN_HERF         = `href=\"(.*?)\"`
	PATTERN_HTTP         = `^http(.*?)`
	PATTERN_LINK         = `^https?:\/\/[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)`
	PATTERN_SINGLE_SLASH = `^/([^/].*)?$`
	PATTERN_MORE_SLASH   = `^//(.*?)`
	ALLOW_DOMAIN         = `(qiniu.com)|(qiniu.com.cn)`
)
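
The slash and http patterns above lend themselves to sorting a raw attribute value into a link category. The following is a minimal sketch of that idea; `classify` and its category names are hypothetical and not part of check_link, and the order in which the patterns are tried is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied verbatim from PATTERN_HTTP, PATTERN_SINGLE_SLASH
// and PATTERN_MORE_SLASH.
var (
	reHTTP        = regexp.MustCompile(`^http(.*?)`)
	reSingleSlash = regexp.MustCompile(`^/([^/].*)?$`)
	reMoreSlash   = regexp.MustCompile(`^//(.*?)`)
)

// classify is a hypothetical helper (not in check_link) that names
// the kind of link a string looks like.
func classify(s string) string {
	switch {
	case reHTTP.MatchString(s):
		return "absolute"
	case reMoreSlash.MatchString(s):
		return "protocol-relative"
	case reSingleSlash.MatchString(s):
		return "root-relative"
	default:
		return "other"
	}
}

func main() {
	samples := []string{
		"https://www.qiniu.com",
		"//cdn.example.com/a.js",
		"/docs/index.html",
		"img/logo.png",
	}
	for _, s := range samples {
		fmt.Printf("%-26s %s\n", s, classify(s))
	}
}
```

Note that PATTERN_SINGLE_SLASH rejects a second leading slash through `[^/]`, so the two slash patterns never match the same string.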

Variables

This section is empty.

Functions

func Crawling

func Crawling(surl string) (ResponseBodyString string, StatusCode int, ContentType string)

Fetches the link and returns its response body, status code, and Content-Type.

func DomArrayToUrl

func DomArrayToUrl(cU CUrl, a [][]string, cH chan<- CUrl, tM map[string]int)

func ExtractBody

func ExtractBody(s string) ([][]string, [][]string)

Extracts the relative paths found in href and src attributes from the body.
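
Given the two `[][]string` return values, ExtractBody plausibly applies PATTERN_HERF and PATTERN_SRC with a submatch search. This is a sketch under that assumption; `extract` is a hypothetical stand-in, and which return value holds href matches versus src matches is a guess.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied verbatim from PATTERN_HERF and PATTERN_SRC.
var (
	reHref = regexp.MustCompile(`href=\"(.*?)\"`)
	reSrc  = regexp.MustCompile(`src=\"(.*?)\"`)
)

// extract is a hypothetical stand-in for ExtractBody. Each match is
// a slice: index 0 is the whole attribute, index 1 the quoted value.
func extract(body string) (hrefs, srcs [][]string) {
	hrefs = reHref.FindAllStringSubmatch(body, -1)
	srcs = reSrc.FindAllStringSubmatch(body, -1)
	return hrefs, srcs
}

func main() {
	body := `<a href="/docs">Docs</a><img src="/img/logo.png">`
	hrefs, srcs := extract(body)
	for _, m := range hrefs {
		fmt.Println("href:", m[1])
	}
	for _, m := range srcs {
		fmt.Println("src:", m[1])
	}
}
```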

func GetDomainHost

func GetDomainHost(u string) (string, string, error)

Extracts the domain and host from a link.
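
One way to obtain a domain and host from a link is the standard `net/url` parser. The sketch below assumes "domain" means scheme plus host (as the StitchDomain description suggests); `domainAndHost` is a hypothetical name and the package's actual return order is not confirmed by the documentation.

```go
package main

import (
	"fmt"
	"net/url"
)

// domainAndHost is a hypothetical helper. It returns scheme://host
// as the "domain" (an assumption) together with the bare host.
func domainAndHost(raw string) (string, string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", "", fmt.Errorf("not an absolute URL: %q", raw)
	}
	return u.Scheme + "://" + u.Host, u.Host, nil
}

func main() {
	domain, host, err := domainAndHost("https://www.qiniu.com/products?x=1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(domain)
	fmt.Println(host)
}
```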

func GetFromRedirectUrl

func GetFromRedirectUrl(lu string, rn int) (string, int, string)

Checks whether the redirect is correct.

func GetUrlFromLocation

func GetUrlFromLocation(resp http.Response) string

func IterCrawl

func IterCrawl(cu CUrl, tM map[string]int, cH chan<- CUrl, fA *[]CUrl, eA *[]CUrl, rdl []string)

func LanuchCrawl

func LanuchCrawl(rla []string, lp string, rp string)

func PutChannel

func PutChannel(cu CUrl, ch chan<- CUrl)

Puts a URL into the channel.

func ReArrayToUrl

func ReArrayToUrl(cU CUrl, a [][]string, cH chan<- CUrl, tM map[string]int)

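Reads the paths in the array and resolves each into a full URL; a URL that is not yet in the map is put into both the channel and the map.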
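
The resolve, deduplicate, enqueue flow can be sketched with plain strings in place of CUrl. Everything here is an illustration: `enqueueNew` is a hypothetical name, the map value `1` is a placeholder (the documentation does not say what the `int` means), and `url.ResolveReference` is swapped in for whatever joining logic the package uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// enqueueNew is a hypothetical helper. matches has the shape produced
// by FindAllStringSubmatch: m[1] is the captured path.
func enqueueNew(base string, matches [][]string, ch chan<- string, seen map[string]int) error {
	b, err := url.Parse(base)
	if err != nil {
		return err
	}
	for _, m := range matches {
		if len(m) < 2 {
			continue
		}
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip paths that do not parse
		}
		full := b.ResolveReference(ref).String()
		if _, ok := seen[full]; ok {
			continue // already queued once
		}
		seen[full] = 1 // placeholder value, meaning is an assumption
		ch <- full
	}
	return nil
}

func main() {
	ch := make(chan string, 8) // buffered so the sends do not block
	seen := map[string]int{}
	matches := [][]string{
		{`href="/x"`, "/x"},
		{`href="/x"`, "/x"},
		{`src="//cdn.example.com/y"`, "//cdn.example.com/y"},
	}
	if err := enqueueNew("https://example.com/a/", matches, ch, seen); err != nil {
		fmt.Println("error:", err)
		return
	}
	close(ch)
	for u := range ch {
		fmt.Println(u)
	}
}
```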

func ReDomainMatch

func ReDomainMatch(s string) bool

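Reports whether s matches a domain, presumably against ALLOW_DOMAIN. (The source comment, "returns the array of relative paths matching href=", duplicates the one on ReHrefSubMatch and does not fit the bool return type.)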

func ReHaveMoreSlash

func ReHaveMoreSlash(s string) bool

Matches multiple leading slashes.

func ReHaveSinlgeSlash

func ReHaveSinlgeSlash(s string) bool

Regex match for a single leading slash.

func ReHrefSubMatch

func ReHrefSubMatch(s string) [][]string

Returns the array of relative paths matching href=.

func ReIsHttp

func ReIsHttp(s string) bool

Regex match for http links.

func ReIsLink

func ReIsLink(s string) bool

Regex match for links.

func ReLinkSubMatch

func ReLinkSubMatch(s string) [][]string

func ReSrcSubMatch

func ReSrcSubMatch(s string) [][]string

Returns the array of relative paths matching src=.

func ReadJsonConfig

func ReadJsonConfig(tm map[string]int, rdl []string) []string

Reads the configuration items from the config file and applies them.

func SpaceMap

func SpaceMap(str string) string

Removes all spaces.
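
The name SpaceMap hints at `strings.Map`, which drops a character when the mapping function returns a negative value. The sketch below assumes that approach and assumes "spaces" covers all Unicode whitespace; `stripSpaces` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripSpaces is a hypothetical helper. Returning -1 from the mapping
// function tells strings.Map to drop that rune.
func stripSpaces(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", stripSpaces(" https://www.qiniu.com /docs\n"))
}
```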

func StatAndCreate

func StatAndCreate(p string) error

func StitchDomain

func StitchDomain(s string, h string) string

Joins the scheme and host into a domain.

func StitchUrl

func StitchUrl(DomainString string, PathString string) (UString string)

Joins the domain and path.
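
StitchDomain and StitchUrl together build a full URL from its parts. A rough sketch of how such joining might look follows; `stitchDomain` and `stitchURL` are hypothetical, and the slash normalization is an assumption, since the documentation does not say how the package handles duplicate or missing slashes.

```go
package main

import (
	"fmt"
	"strings"
)

// stitchDomain is a hypothetical helper joining scheme and host.
func stitchDomain(scheme, host string) string {
	return scheme + "://" + host
}

// stitchURL is a hypothetical helper. It trims slashes at the seam so
// exactly one separates domain and path (assumed behavior).
func stitchURL(domain, path string) string {
	return strings.TrimRight(domain, "/") + "/" + strings.TrimLeft(path, "/")
}

func main() {
	domain := stitchDomain("https", "www.qiniu.com")
	fmt.Println(stitchURL(domain, "/docs/index.html"))
}
```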

func UrlToChMAP

func UrlToChMAP(cu CUrl, ch chan<- CUrl, tm map[string]int)

Puts the link into the channel and the map.

Types

type CUrl

type CUrl struct {
	Id          bson.ObjectId `json:"id" bson:"_id"`
	CrawlUrl    string        `json:"CrawlUrl" bson:"crawl_url"`
	StatusCode  int           `json:"StatusCode" bson:"status_code"`
	Origin      string        `json:"Origin" bson:"origin"`
	Domain      string        `json:"Domain" bson:"domain"`
	RefUrl      string        `json:"RefUrl" bson:"ref_url"`
	ContentType string        `json:"ContentType" bson:"content_type"`

	QueryError string `json:"QueryError" bson:"query_error"`
	// contains filtered or unexported fields
}

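Special status values: -1 the link was put into the channel but has not been crawled; -2 the HTTP request failed; -3 reading from the channel timed out, which usually means no new links were put into the channel, so the crawl ends automatically.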
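
The three negative values read naturally as named sentinels. The sketch below assumes they are stored in the StatusCode field alongside real HTTP codes; the constant names and `describe` are hypothetical and do not exist in check_link.

```go
package main

import "fmt"

// Hypothetical names for the documented sentinel values.
const (
	statusQueued  = -1 // put into the channel, not yet crawled
	statusHTTPErr = -2 // the HTTP request returned an error
	statusTimeout = -3 // channel read timed out, crawl ends
)

// describe is a hypothetical helper that renders a status value.
func describe(code int) string {
	switch code {
	case statusQueued:
		return "queued, not yet crawled"
	case statusHTTPErr:
		return "HTTP request failed"
	case statusTimeout:
		return "channel read timed out"
	default:
		return fmt.Sprintf("HTTP status %d", code)
	}
}

func main() {
	for _, c := range []int{statusQueued, statusHTTPErr, statusTimeout, 200} {
		fmt.Println(c, describe(c))
	}
}
```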

func GetChannel

func GetChannel(ch chan CUrl) CUrl

Takes one URL from the channel.

type ConfigJson

type ConfigJson struct {
	WhiteLink      []string `json:"WhiteLink"`
	RestrictDomain []string `json:"RestrictDomain"`
}
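
The struct tags indicate a JSON config with two string arrays. A minimal decoding sketch follows; `configJSON` mirrors the documented fields, while `parseConfig` and the sample values are made up for illustration, and the documentation does not specify what either field controls.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configJSON mirrors the fields and tags of ConfigJson.
type configJSON struct {
	WhiteLink      []string `json:"WhiteLink"`
	RestrictDomain []string `json:"RestrictDomain"`
}

// parseConfig is a hypothetical helper that decodes raw JSON bytes.
func parseConfig(data []byte) (configJSON, error) {
	var c configJSON
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	// Sample values are invented for this example.
	raw := []byte(`{"WhiteLink":["https://www.qiniu.com"],"RestrictDomain":["qiniu.com"]}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.WhiteLink, cfg.RestrictDomain)
}
```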
