robots

package
v0.14.0-alpha
Published: Jun 23, 2020 License: MIT Imports: 11 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func MatchURLRule

func MatchURLRule(rule, url string) bool

MatchURLRule returns true if the given robot exclusion rule matches the given URL. Wildcards ('*') and end-of-line anchors ('$') are supported.
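
A minimal sketch of the matcher in use (the package import path is elided, and the expected results assume standard robots.txt wildcard semantics rather than anything confirmed by this documentation):

	// '*' matches any run of characters within the path.
	robots.MatchURLRule("/private/*", "/private/data") // expected: true

	// '$' anchors the rule to the end of the URL, so a trailing
	// query string should defeat the match.
	robots.MatchURLRule("/*.php$", "/index.php")        // expected: true
	robots.MatchURLRule("/*.php$", "/index.php?page=1") // expected: false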

Types

type InvalidRobots

type InvalidRobots struct {
	Domain string
	Err    string
}

InvalidRobots indicates an invalid robots.txt file.

func (InvalidRobots) Error

func (e InvalidRobots) Error() string

type RobotDenied

type RobotDenied struct {
	URL url.URL
}

RobotDenied indicates a request was denied by a site's robots.txt file.

func (RobotDenied) Error

func (e RobotDenied) Error() string

type RobotFile

type RobotFile struct {
	// contains filtered or unexported fields
}

RobotFile holds all the information in a robots exclusion file.

func NewRobotFileFromReader

func NewRobotFileFromReader(in io.Reader) (*RobotFile, error)

NewRobotFileFromReader parses a robot exclusion file from an io.Reader. Returns an error if it encounters an invalid directive.
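
A hedged example of parsing a file held in memory (imports and the surrounding function are elided; the directives shown are ordinary robots.txt syntax):

	in := strings.NewReader("User-agent: *\nDisallow: /private/\nCrawl-delay: 2")
	rf, err := robots.NewRobotFileFromReader(in)
	if err != nil {
		log.Fatal(err) // e.g. the file contained an invalid directive
	}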

func NewRobotFileFromURL

func NewRobotFileFromURL(url *url.URL, client http.RoundTripper) (*RobotFile, error)

func (*RobotFile) Allowed

func (l *RobotFile) Allowed(userAgent, url string) bool

Allowed returns true if the user agent is allowed to access the given url.
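
Continuing the sketch above, a permission check might look like this (the expected results assume the usual interpretation of the Disallow rule parsed earlier):

	fmt.Println(rf.Allowed("mybot", "/public/index.html")) // expected: true
	fmt.Println(rf.Allowed("mybot", "/private/data"))      // expected: false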

func (*RobotFile) GetDelay

func (l *RobotFile) GetDelay(userAgent string, defaultDelay time.Duration) time.Duration

GetDelay returns the User-agent specific crawl-delay if it exists, otherwise the catch-all delay. Returns defaultDelay if neither a specific nor a global crawl-delay exists.
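
For example, pacing requests with the file from the sketch above, falling back to one second when the file declares no crawl-delay:

	delay := rf.GetDelay("mybot", 1*time.Second)
	time.Sleep(delay) // wait the advertised (or default) delay between requests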

func (*RobotFile) GetSitemap

func (l *RobotFile) GetSitemap(userAgent string, client http.RoundTripper) (*Sitemap, error)

GetSitemap fetches the sitemap for the given User-agent, falling back to the default sitemap if no User-agent specific sitemap was specified. Returns nil if neither was specified.
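
A sketch of retrieving and inspecting the sitemap (http.DefaultTransport satisfies http.RoundTripper; the nil check covers files that declare no sitemap at all):

	sm, err := rf.GetSitemap("mybot", http.DefaultTransport)
	if err != nil {
		log.Fatal(err)
	}
	if sm != nil {
		fmt.Println(len(sm.URLSet), "locations listed")
	}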

func (*RobotFile) GetUserAgentRules

func (l *RobotFile) GetUserAgentRules(userAgent string) *UserAgentRules

GetUserAgentRules gets the rules for the given userAgent. It returns the default (*) group if one is present and no other group applies, and nil if no group applies and no default group was supplied.
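
For example, extracting the group once and reusing it (a sketch; rf is the RobotFile parsed earlier, and the expected result follows from its Disallow rule):

	group := rf.GetUserAgentRules("mybot")
	if group == nil {
		// no matching group and no default (*) group in the file
		return
	}
	fmt.Println(group.Allowed("/private/data")) // expected: false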

type RobotRules

type RobotRules struct {
	// contains filtered or unexported fields
}

RobotRules holds the robot exclusions for multiple domains.

func NewRobotRules

func NewRobotRules() *RobotRules

NewRobotRules instantiates a new robot limit cache.

func (*RobotRules) AddLimits

func (c *RobotRules) AddLimits(robotFile *RobotFile, host string)

AddLimits adds or replaces the limits for a host.

func (*RobotRules) Allowed

func (c *RobotRules) Allowed(userAgent string, url *url.URL) (bool, error)

Allowed returns true if the userAgent is allowed to access the given path on the given domain. Returns an error if no robot file is cached for the given domain.
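
A sketch of the whole cache workflow, from fetching a robots file to checking a URL against it. The import path and the example.com host are placeholders, and the error handling is deliberately minimal:

	package main

	import (
		"fmt"
		"log"
		"net/http"
		"net/url"

		robots "example.com/robots" // placeholder import path for this package
	)

	func main() {
		cache := robots.NewRobotRules()

		u, _ := url.Parse("https://example.com/robots.txt")
		rf, err := robots.NewRobotFileFromURL(u, http.DefaultTransport)
		if err != nil {
			log.Fatal(err)
		}
		cache.AddLimits(rf, u.Host)

		target, _ := url.Parse("https://example.com/private/data")
		ok, err := cache.Allowed("mybot", target)
		if err != nil {
			log.Fatal(err) // no robots file cached for target's host
		}
		fmt.Println(ok)
	}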

func (*RobotRules) GetRulesForHost

func (c *RobotRules) GetRulesForHost(host string) (*RobotFile, error)

GetRulesForHost gets the rules for a host. Returns an error when no limits are cached for the given host.

type Sitemap

type Sitemap struct {
	Index  []SitemapLocation `xml:"sitemap"`
	URLSet []SitemapLocation `xml:"url"`
}

func NewSitemap

func NewSitemap() *Sitemap

func NewSitemapFromReader

func NewSitemapFromReader(reader io.Reader) (*Sitemap, error)

func NewSitemapFromURL

func NewSitemapFromURL(url string, client http.RoundTripper) (*Sitemap, error)

func (*Sitemap) GetLocations

func (s *Sitemap) GetLocations(client http.RoundTripper, limit int) ([]SitemapLocation, error)

GetLocations gets up to limit sitemap locations. Sitemaps usually come in pages of 50,000 entries, so the returned count may exceed the limit by up to 49,999 entries.
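
A sketch of walking a sitemap (the URL is a placeholder; because pages hold up to 50,000 entries, the final page may push the result past the requested limit):

	sm, err := robots.NewSitemapFromURL("https://example.com/sitemap.xml", http.DefaultTransport)
	if err != nil {
		log.Fatal(err)
	}
	locs, err := sm.GetLocations(http.DefaultTransport, 100000)
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range locs {
		fmt.Println(l.Loc, l.LastMod)
	}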

type SitemapLocation

type SitemapLocation struct {
	Loc        string    `xml:"loc"`
	LastMod    time.Time `xml:"lastmod"`
	ChangeFreq string    `xml:"changefreq"`
	Priority   float64   `xml:"priority"`
}

type UserAgentRules

type UserAgentRules struct {
	// contains filtered or unexported fields
}

UserAgentRules holds limits for a single user agent.

func (*UserAgentRules) Allowed

func (g *UserAgentRules) Allowed(url string) bool

Allowed returns true if the url is allowed by the group's rules. Check whether the group applies to the user agent first by using Applies.

func (*UserAgentRules) Applies

func (g *UserAgentRules) Applies(userAgent string) bool

Applies returns true if the group applies to the given userAgent.

func (*UserAgentRules) GetDelay

func (g *UserAgentRules) GetDelay(defaultDelay time.Duration) time.Duration

GetDelay returns the Crawl-delay. Returns defaultDelay if no crawl delay was specified.
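
Putting the three methods together in the order the docs suggest (a sketch; group is the non-nil *UserAgentRules obtained from GetUserAgentRules above, and the fetch step is hypothetical):

	if group.Applies("mybot") && group.Allowed("/some/path") {
		time.Sleep(group.GetDelay(1 * time.Second))
		// ...fetch /some/path...
	}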
