robots

package
v0.14.0-alpha
Published: Jun 23, 2020 License: MIT Imports: 11 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func MatchURLRule

func MatchURLRule(rule, url string) bool

MatchURLRule returns true if the given robot exclusion rule matches the given URL. Wildcards ('*') and end-of-line anchors ('$') are supported.
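
A minimal sketch of the matcher in use (the package import path is elided, and the expected results assume standard robots.txt wildcard semantics rather than anything confirmed by this documentation):

	// '*' matches any run of characters within the path.
	robots.MatchURLRule("/private/*", "/private/data") // expected: true

	// '$' anchors the rule to the end of the URL, so a trailing
	// query string should defeat the match.
	robots.MatchURLRule("/*.php$", "/index.php")        // expected: true
	robots.MatchURLRule("/*.php$", "/index.php?page=1") // expected: false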

Types

type InvalidRobots

type InvalidRobots struct {
	Domain string
	Err    string
}

InvalidRobots indicates an invalid robots.txt file.

func (InvalidRobots) Error

func (e InvalidRobots) Error() string

type RobotDenied

type RobotDenied struct {
	URL url.URL
}

RobotDenied indicates a request was denied by a site's robots.txt file.

func (RobotDenied) Error

func (e RobotDenied) Error() string

type RobotFile

type RobotFile struct {
	// contains filtered or unexported fields
}

RobotFile holds all the information in a robots exclusion file.

func NewRobotFileFromReader

func NewRobotFileFromReader(in io.Reader) (*RobotFile, error)

NewRobotFileFromReader parses a robot exclusion file from an io.Reader. Returns an error if it encounters an invalid directive.
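
A hedged example of parsing a file held in memory (imports and the surrounding function are elided; the directives shown are ordinary robots.txt syntax):

	in := strings.NewReader("User-agent: *\nDisallow: /private/\nCrawl-delay: 2")
	rf, err := robots.NewRobotFileFromReader(in)
	if err != nil {
		log.Fatal(err) // e.g. the file contained an invalid directive
	}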

func NewRobotFileFromURL

func NewRobotFileFromURL(url *url.URL, client http.RoundTripper) (*RobotFile, error)

func (*RobotFile) Allowed

func (l *RobotFile) Allowed(userAgent, url string) bool

Allowed returns true if the user agent is allowed to access the given url.
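
Continuing the sketch above, a permission check might look like this (the expected results assume the usual interpretation of the Disallow rule parsed earlier):

	fmt.Println(rf.Allowed("mybot", "/public/index.html")) // expected: true
	fmt.Println(rf.Allowed("mybot", "/private/data"))      // expected: false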

func (*RobotFile) GetDelay

func (l *RobotFile) GetDelay(userAgent string, defaultDelay time.Duration) time.Duration

GetDelay returns the User-agent specific crawl-delay if it exists, otherwise the catch-all delay. Returns defaultDelay if neither a specific nor a global crawl-delay exists.
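
For example, pacing requests with the file from the sketch above, falling back to one second when the file declares no crawl-delay:

	delay := rf.GetDelay("mybot", 1*time.Second)
	time.Sleep(delay) // wait the advertised (or default) delay between requests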

func (*RobotFile) GetSitemap

func (l *RobotFile) GetSitemap(userAgent string, client http.RoundTripper) (*Sitemap, error)

GetSitemap fetches the sitemap for the given User-agent, falling back to the default sitemap if no User-agent specific sitemap was specified. Returns nil if neither was specified.
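
A sketch of retrieving and inspecting the sitemap (http.DefaultTransport satisfies http.RoundTripper; the nil check covers files that declare no sitemap at all):

	sm, err := rf.GetSitemap("mybot", http.DefaultTransport)
	if err != nil {
		log.Fatal(err)
	}
	if sm != nil {
		fmt.Println(len(sm.URLSet), "locations listed")
	}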

func (*RobotFile) GetUserAgentRules

func (l *RobotFile) GetUserAgentRules(userAgent string) *UserAgentRules

GetUserAgentRules gets the rules for the given userAgent. It returns the default (*) group if one is present and no other group applies, and nil if no group applies and no default group was supplied.
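
For example, extracting the group once and reusing it (a sketch; rf is the RobotFile parsed earlier, and the expected result follows from its Disallow rule):

	group := rf.GetUserAgentRules("mybot")
	if group == nil {
		// no matching group and no default (*) group in the file
		return
	}
	fmt.Println(group.Allowed("/private/data")) // expected: false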

type RobotRules

type RobotRules struct {
	// contains filtered or unexported fields
}

RobotRules holds the robot exclusions for multiple domains.

func NewRobotRules

func NewRobotRules() *RobotRules

NewRobotRules instantiates a new robot limit cache.

func (*RobotRules) AddLimits

func (c *RobotRules) AddLimits(robotFile *RobotFile, host string)

AddLimits adds or replaces the limits for a host.

func (*RobotRules) Allowed

func (c *RobotRules) Allowed(userAgent string, url *url.URL) (bool, error)

Allowed returns true if the userAgent is allowed to access the given path on the given domain. Returns an error if no robot file is cached for the given domain.
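
A sketch of the whole cache workflow, from fetching a robots file to checking a URL against it. The import path and the example.com host are placeholders, and the error handling is deliberately minimal:

	package main

	import (
		"fmt"
		"log"
		"net/http"
		"net/url"

		robots "example.com/robots" // placeholder import path for this package
	)

	func main() {
		cache := robots.NewRobotRules()

		u, _ := url.Parse("https://example.com/robots.txt")
		rf, err := robots.NewRobotFileFromURL(u, http.DefaultTransport)
		if err != nil {
			log.Fatal(err)
		}
		cache.AddLimits(rf, u.Host)

		target, _ := url.Parse("https://example.com/private/data")
		ok, err := cache.Allowed("mybot", target)
		if err != nil {
			log.Fatal(err) // no robots file cached for target's host
		}
		fmt.Println(ok)
	}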

func (*RobotRules) GetRulesForHost

func (c *RobotRules) GetRulesForHost(host string) (*RobotFile, error)

GetRulesForHost gets the rules for a host. Returns an error when no limits are cached for the given host.

type Sitemap

type Sitemap struct {
	Index  []SitemapLocation `xml:"sitemap"`
	URLSet []SitemapLocation `xml:"url"`
}

func NewSitemap

func NewSitemap() *Sitemap

func NewSitemapFromReader

func NewSitemapFromReader(reader io.Reader) (*Sitemap, error)

func NewSitemapFromURL

func NewSitemapFromURL(url string, client http.RoundTripper) (*Sitemap, error)

func (*Sitemap) GetLocations

func (s *Sitemap) GetLocations(client http.RoundTripper, limit int) ([]SitemapLocation, error)

GetLocations gets up to limit sitemap locations. Sitemaps usually come in pages of 50,000 entries, so the returned count may exceed the limit by up to 49,999 entries.
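
A sketch of walking a sitemap (the URL is a placeholder; because pages hold up to 50,000 entries, the final page may push the result past the requested limit):

	sm, err := robots.NewSitemapFromURL("https://example.com/sitemap.xml", http.DefaultTransport)
	if err != nil {
		log.Fatal(err)
	}
	locs, err := sm.GetLocations(http.DefaultTransport, 100000)
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range locs {
		fmt.Println(l.Loc, l.LastMod)
	}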

type SitemapLocation

type SitemapLocation struct {
	Loc        string    `xml:"loc"`
	LastMod    time.Time `xml:"lastmod"`
	ChangeFreq string    `xml:"changefreq"`
	Priority   float64   `xml:"priority"`
}

type UserAgentRules

type UserAgentRules struct {
	// contains filtered or unexported fields
}

UserAgentRules holds limits for a single user agent.

func (*UserAgentRules) Allowed

func (g *UserAgentRules) Allowed(url string) bool

Allowed returns true if the url is allowed by the group's rules. Check whether the group applies to the user agent first by using Applies.

func (*UserAgentRules) Applies

func (g *UserAgentRules) Applies(userAgent string) bool

Applies returns true if the group applies to the given userAgent.

func (*UserAgentRules) GetDelay

func (g *UserAgentRules) GetDelay(defaultDelay time.Duration) time.Duration

GetDelay returns the Crawl-delay. Returns defaultDelay if no crawl delay was specified.
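
Putting the three methods together in the order the docs suggest (a sketch; group is the non-nil *UserAgentRules obtained from GetUserAgentRules above, and the fetch step is hypothetical):

	if group.Applies("mybot") && group.Allowed("/some/path") {
		time.Sleep(group.GetDelay(1 * time.Second))
		// ...fetch /some/path...
	}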
