sitewalker

package module
v0.0.0-...-1f20d1c
Published: Sep 6, 2022 License: Apache-2.0 Imports: 10 Imported by: 0

README

site-walker

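A crawler for SEO webmasters that collects URL information from a website and performs some common classification and aggregation on the results.

Intended for the Chinese SEO environment.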

dependencies

colly

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetATagAnchor

func GetATagAnchor(a *goquery.Selection) string

Types

type DomainFilter

type DomainFilter interface {
	IsAllowed(domain string) bool
	Add(domain string)
}
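The interface above leaves the storage strategy open. As a minimal sketch, assuming a plain set of exact host names is enough, it could be satisfied as follows. `mapDomainFilter`, `newMapDomainFilter`, and `normalize` are hypothetical names for illustration and are not part of this package; the case-insensitive matching is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// DomainFilter mirrors the interface documented above.
type DomainFilter interface {
	IsAllowed(domain string) bool
	Add(domain string)
}

// mapDomainFilter is a hypothetical implementation backed by a set of names.
type mapDomainFilter struct {
	allowed map[string]struct{}
}

// newMapDomainFilter is a hypothetical constructor; it is not the package's
// NewDomainFilter, whose behavior is not documented here.
func newMapDomainFilter() DomainFilter {
	return &mapDomainFilter{allowed: make(map[string]struct{})}
}

// normalize trims whitespace and lower-cases the domain (an assumption of
// this sketch, since host names are case-insensitive).
func normalize(domain string) string {
	return strings.ToLower(strings.TrimSpace(domain))
}

// Add registers a domain as allowed.
func (f *mapDomainFilter) Add(domain string) {
	f.allowed[normalize(domain)] = struct{}{}
}

// IsAllowed reports whether the exact domain was added before.
func (f *mapDomainFilter) IsAllowed(domain string) bool {
	_, ok := f.allowed[normalize(domain)]
	return ok
}

func main() {
	f := newMapDomainFilter()
	f.Add("Example.com")
	fmt.Println(f.IsAllowed("example.com")) // true
	fmt.Println(f.IsAllowed("other.org"))   // false
}
```

This sketch is not safe for concurrent use; a filter shared by parallel crawl workers would need a mutex or similar guard around the map.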

func NewDomainFilter

func NewDomainFilter() DomainFilter

type Link

type Link struct {
	Href     string   `json:"href"`
	URL      *url.URL `json:"-"`
	Text     string   `json:"text"`
	LinkType LinkType `json:"link_type"`
}

Link holds the information for a single link on a website.

func ParseATag2Link

func ParseATag2Link(a *goquery.Selection, pageURL *url.URL) *Link
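The signature suggests that a link's `href` is interpreted relative to the page it was found on. The sketch below illustrates that idea with the standard library only: it resolves an `href` against a page URL with `ResolveReference` and then marshals the result, which shows that the `json:"-"` tag keeps the parsed `URL` out of the output. `buildLink` and `mustJSON` are hypothetical helpers, and whether `ParseATag2Link` works this way internally is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// LinkType and Link are local copies of the documented declarations.
type LinkType int

const (
	LinkTypeText LinkType = iota
	LinkTypeImg
)

type Link struct {
	Href     string   `json:"href"`
	URL      *url.URL `json:"-"`
	Text     string   `json:"text"`
	LinkType LinkType `json:"link_type"`
}

// buildLink is a hypothetical helper: it resolves href against the page URL
// and always marks the link as a text link.
func buildLink(href, text, pageRawURL string) (*Link, error) {
	page, err := url.Parse(pageRawURL)
	if err != nil {
		return nil, err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return nil, err
	}
	return &Link{
		Href:     href,
		URL:      page.ResolveReference(ref),
		Text:     text,
		LinkType: LinkTypeText,
	}, nil
}

// mustJSON is a hypothetical helper that marshals v or panics.
func mustJSON(v interface{}) string {
	out, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	link, err := buildLink("../about", "About", "https://example.com/blog/post.html")
	if err != nil {
		panic(err)
	}
	fmt.Println(link.URL.String()) // https://example.com/about
	fmt.Println(mustJSON(link))    // {"href":"../about","text":"About","link_type":0}
}
```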

type LinkType

type LinkType int

const (
	LinkTypeText LinkType = iota
	LinkTypeImg
)
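Because the constants are declared with `iota`, `LinkTypeText` is 0 and `LinkTypeImg` is 1, and those integers are what appear in the `link_type` JSON field. For readable logs, a `String` method could be layered on top; the method below is a hypothetical addition and is not documented as part of this package.

```go
package main

import "fmt"

// LinkType mirrors the documented declaration.
type LinkType int

const (
	LinkTypeText LinkType = iota // 0
	LinkTypeImg                  // 1
)

// String is a hypothetical helper that names each link type.
func (t LinkType) String() string {
	switch t {
	case LinkTypeText:
		return "text"
	case LinkTypeImg:
		return "img"
	default:
		return fmt.Sprintf("LinkType(%d)", int(t))
	}
}

func main() {
	fmt.Println(int(LinkTypeText), LinkTypeText) // 0 text
	fmt.Println(int(LinkTypeImg), LinkTypeImg)   // 1 img
}
```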

type Page

type Page struct {
	// SEO text information
	Title       string   `json:"title"`
	Description string   `json:"description"`
	Keywords    []string `json:"keywords"`
	// content of the h1 tag
	H1 string `json:"h1"`

	// original URL of the page
	RawURL string   `json:"raw_url"`
	URL    *url.URL `json:"url"`

	// links in the page
	Links []*Link `json:"links"`
	// external links in the page
	ExternalLinks []*Link `json:"external_links"`
	// HTML data of the page
	Html []byte `json:"html"`
	// contains filtered or unexported fields
}

Page holds the information for a single page of a website.
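The struct keeps external links in their own field. How the package decides that a link is external is not documented here; one plausible rule, shown below as a sketch, is to compare each link's host with the page's host. `splitLinks` and `mustLink` are hypothetical helpers, the `Link` type is reduced to the two fields the sketch needs, and returning separate internal and external slices is an assumption about the intended split.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Link is a reduced local copy of the documented type.
type Link struct {
	Href string
	URL  *url.URL
}

// mustLink is a hypothetical helper that parses raw into a Link or panics.
func mustLink(raw string) *Link {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return &Link{Href: raw, URL: u}
}

// splitLinks is a hypothetical helper: links whose host equals the page host
// (ignoring case and port) count as internal, all others as external.
func splitLinks(pageURL *url.URL, links []*Link) (internal, external []*Link) {
	for _, l := range links {
		if l == nil || l.URL == nil {
			continue // nothing to classify
		}
		if strings.EqualFold(l.URL.Hostname(), pageURL.Hostname()) {
			internal = append(internal, l)
		} else {
			external = append(external, l)
		}
	}
	return internal, external
}

func main() {
	page := mustLink("https://example.com/").URL
	links := []*Link{
		mustLink("https://example.com/a"),
		mustLink("https://EXAMPLE.com/b"),
		mustLink("https://other.org/c"),
	}
	internal, external := splitLinks(page, links)
	fmt.Println(len(internal), len(external)) // 2 1
}
```

Under this rule a subdomain such as `www.example.com` counts as external to `example.com`; a crawler that treats subdomains as part of the same site would need a looser comparison.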

type SiteWalker

type SiteWalker struct {
	// contains filtered or unexported fields
}

func NewSiteWalker

func NewSiteWalker(opts ...SiteWalkerOption) *SiteWalker

func (*SiteWalker) Walk

func (sw *SiteWalker) Walk(homeUrl string, allowedDomains []string) (*WebSite, error)

type SiteWalkerOption

type SiteWalkerOption func(sw *SiteWalker)
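`SiteWalkerOption` follows the functional options pattern: each option is a function that mutates the walker, and the constructor applies them in order. The sketch below reproduces the pattern with a stand-in `SiteWalker`; its field names and default values are hypothetical, since the real struct's fields are unexported.

```go
package main

import (
	"fmt"
	"time"
)

// SiteWalker is a stand-in with hypothetical fields.
type SiteWalker struct {
	timeout     time.Duration
	parallelism int
}

// SiteWalkerOption mirrors the documented option type.
type SiteWalkerOption func(sw *SiteWalker)

// WithTimeout returns an option that stores the timeout.
func WithTimeout(timeout time.Duration) SiteWalkerOption {
	return func(sw *SiteWalker) { sw.timeout = timeout }
}

// WithParallelism returns an option that stores the concurrency level.
func WithParallelism(n int) SiteWalkerOption {
	return func(sw *SiteWalker) { sw.parallelism = n }
}

// NewSiteWalker starts from defaults (hypothetical values) and then applies
// each option in the order given, so later options win over earlier ones.
func NewSiteWalker(opts ...SiteWalkerOption) *SiteWalker {
	sw := &SiteWalker{timeout: 10 * time.Second, parallelism: 1}
	for _, opt := range opts {
		opt(sw)
	}
	return sw
}

func main() {
	sw := NewSiteWalker(WithTimeout(30*time.Second), WithParallelism(4))
	fmt.Println(sw.timeout, sw.parallelism) // 30s 4
}
```

The pattern lets callers set only the options they care about and lets the package add new options later without changing the constructor's signature.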

func WithCacheDir

func WithCacheDir(dir string) SiteWalkerOption

WithCacheDir sets the cache directory.

func WithDelay

func WithDelay(randomDelay time.Duration, delay time.Duration) SiteWalkerOption

WithDelay sets the delay between requests. Note the parameter order: randomDelay comes first, then delay.
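The package depends on colly, whose `LimitRule` describes `RandomDelay` as an extra randomized duration added to the fixed `Delay`. Assuming `WithDelay` maps onto those two settings (not confirmed by this page), the wait before each request could be sketched as below. `nextWait` and `sampleWait` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextWait returns the fixed delay plus a random extra in [0, randomDelay).
// The parameter order follows WithDelay: randomDelay first, then delay.
func nextWait(randomDelay, delay time.Duration, rng *rand.Rand) time.Duration {
	if randomDelay <= 0 {
		return delay // no jitter requested
	}
	return delay + time.Duration(rng.Int63n(int64(randomDelay)))
}

// sampleWait is a convenience wrapper that seeds its own generator.
func sampleWait(randomDelay, delay time.Duration, seed int64) time.Duration {
	return nextWait(randomDelay, delay, rand.New(rand.NewSource(seed)))
}

func main() {
	// Each value falls between 1s (inclusive) and 1.5s (exclusive).
	for seed := int64(1); seed <= 3; seed++ {
		fmt.Println(sampleWait(500*time.Millisecond, time.Second, seed))
	}
}
```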

func WithDeviceType

func WithDeviceType(isMobile bool) SiteWalkerOption

WithDeviceType sets the device type, which is used to choose the User-Agent.

func WithParallelism

func WithParallelism(n int) SiteWalkerOption

WithParallelism sets the number of concurrent requests.

func WithTimeout

func WithTimeout(timeout time.Duration) SiteWalkerOption

WithTimeout sets the timeout.

func WithUserAgent

func WithUserAgent(ua string) SiteWalkerOption

WithUserAgent sets the User-Agent explicitly. It overrides the User-Agent that would otherwise be chosen from the device type set by WithDeviceType.
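The precedence between the two options can be sketched as a small resolution function: an explicit User-Agent wins, and the device type is only consulted when none was set. The `config` type, the `effectiveUserAgent` method, and both placeholder strings are hypothetical; the package's real default User-Agent values are not documented here.

```go
package main

import "fmt"

// Placeholder strings, not the package's real User-Agent values.
const (
	desktopUA = "example-desktop-agent"
	mobileUA  = "example-mobile-agent"
)

// config is a hypothetical holder for the two related settings.
type config struct {
	isMobile  bool
	userAgent string
}

// effectiveUserAgent returns the explicit User-Agent if one was set,
// otherwise the default for the configured device type.
func (c config) effectiveUserAgent() string {
	if c.userAgent != "" {
		return c.userAgent
	}
	if c.isMobile {
		return mobileUA
	}
	return desktopUA
}

func main() {
	fmt.Println(config{isMobile: true}.effectiveUserAgent())                        // example-mobile-agent
	fmt.Println(config{isMobile: true, userAgent: "my-bot/1.0"}.effectiveUserAgent()) // my-bot/1.0
}
```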

type WebSite

type WebSite struct {
	Protocol string  `json:"protocol"`
	Domain   string  `json:"domain"`
	HomePage *Page   `json:"home_page"`
	Pages    []*Page `json:"pages"`
}

WebSite holds the basic information for a website.
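The JSON tags make a crawl result straightforward to persist and reload. The sketch below round-trips a `WebSite` through `encoding/json` using local copies of the types, with `Page` reduced to three fields for brevity. `roundTrip` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Page is a reduced local copy of the documented type.
type Page struct {
	Title  string `json:"title"`
	RawURL string `json:"raw_url"`
	Html   []byte `json:"html"`
}

// WebSite mirrors the documented declaration.
type WebSite struct {
	Protocol string  `json:"protocol"`
	Domain   string  `json:"domain"`
	HomePage *Page   `json:"home_page"`
	Pages    []*Page `json:"pages"`
}

// roundTrip is a hypothetical helper that encodes site to JSON and decodes
// it again, returning the decoded copy and the raw JSON text.
func roundTrip(site *WebSite) (*WebSite, string, error) {
	data, err := json.Marshal(site)
	if err != nil {
		return nil, "", err
	}
	var out WebSite
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, "", err
	}
	return &out, string(data), nil
}

func main() {
	home := &Page{Title: "Home", RawURL: "https://example.com/", Html: []byte("<h1>Hi</h1>")}
	site := &WebSite{Protocol: "https", Domain: "example.com", HomePage: home, Pages: []*Page{home}}

	out, raw, err := roundTrip(site)
	if err != nil {
		panic(err)
	}
	fmt.Println(raw)
	fmt.Println(out.Domain, out.HomePage.Title, string(out.HomePage.Html))
}
```

Two details of `encoding/json` matter here: a `[]byte` field such as `Html` is written as a base64 string rather than as raw markup, and pointer identity is not preserved, so after decoding `HomePage` and the matching entry in `Pages` are separate copies.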

Directories

Path Synopsis
cmd
util
