Documentation ¶
Overview ¶
Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.
Specification ¶
A large portion of how this package handles the specification comes from https://developers.google.com/search/reference/robots_txt. In fact, this package is tested against all of the examples listed at https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values, plus many more.
Important Notes From the Spec ¶
1. User agents are case-insensitive, so "googlebot" and "Googlebot" are the same thing.
2. "Allow" and "Disallow" directive values are case-sensitive, so "/pricing" and "/Pricing" are not the same thing.
3. The entire file must be valid UTF-8; this package returns an error if it is not.
4. The most specific user agent wins.
5. Allow and disallow directives also follow the most specific match, and in the event of a tie the allow directive wins (a sketch illustrating these matching rules follows the example below).
6. Directives listed in the robots.txt file apply only to the host, protocol, and port number where the file was served, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity. This package validates the host, protocol, and port number every time "CanCrawl" is asked about a path that includes them, as in the following example:
robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", strings.NewReader(`
	User-agent: *
	Disallow: /wiki/
`))
robotsTxt.CanCrawl("googlebot", "/products/")                          // True
robotsTxt.CanCrawl("googlebot", "https://www.dumpsters.com/products/") // True
robotsTxt.CanCrawl("googlebot", "http://www.dumpsters.com/products/")  // False - the URL did not match the URL provided when "robotsTxt" was created
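To make the user-agent and path-matching rules above concrete, here is a minimal sketch in the same style. The robots.txt content and the results noted in the comments are illustrative assumptions based on notes 1, 2, 4, and 5, not examples taken from the package's test suite.

robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", strings.NewReader(`
	User-agent: *
	Disallow: /pricing/

	User-agent: googlebot
	Disallow: /cms/
	Allow: /cms/public/
`))
robotsTxt.CanCrawl("Googlebot", "/pricing/quote")  // true - user agents are case-insensitive, so the "googlebot" group applies and it does not disallow /pricing/
robotsTxt.CanCrawl("googlebot", "/cms/admin")      // false - matches "Disallow: /cms/"
robotsTxt.CanCrawl("googlebot", "/CMS/admin")      // true - paths are case-sensitive, so "/CMS/" does not match "/cms/"
robotsTxt.CanCrawl("bingbot", "/pricing/quote")    // false - no specific group for bingbot, so the "*" group applies
robotsTxt.CanCrawl("googlebot", "/cms/public/faq") // true - "Allow: /cms/public/" is more specific than "Disallow: /cms/", and the more specific rule wins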
Index ¶
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type RobotsTxt ¶
type RobotsTxt struct {
// contains filtered or unexported fields
}
RobotsTxt exposes everything you would want to know about a robots.txt file without giving direct access to the directives it defines. Allow and disallow directives are implementation details that a robot (user agent) does not need to know about; a robot only needs to know whether it may crawl a given path. This type therefore provides a "CanCrawl" method instead of exposing allow and disallow directly.
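For instance, a crawler loop might gate every fetch on "CanCrawl" and never inspect the parsed directives. The sketch below uses the constructors documented later in this section; "fetchPage" is a hypothetical helper, not part of this package.

robotsTxt, err := robotstxt.NewFromURL("https://www.dumpsters.com", http.Get)
if err != nil {
	log.Fatal(err)
}
for _, path := range []string{"/", "/pricing/roll-off-dumpsters", "/cms/admin"} {
	canCrawl, err := robotsTxt.CanCrawl("googlebot", path)
	if err != nil || !canCrawl {
		continue // skip anything this robot is not allowed to crawl
	}
	fetchPage(path) // hypothetical helper that downloads and processes the page
}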
func New ¶
New creates a RobotsTxt.
Example ¶
robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", strings.NewReader(`
	# Robots.txt test file
	# 06/04/2018
	# Indented comments are allowed
	User-agent : *
	Crawl-delay: 5
	Disallow: /cms/
	Disallow: /pricing/frontend
	Disallow: /pricing/admin/ # SPA application built into the site
	Disallow : *?s=lightbox
	Disallow: /se/en$
	Disallow:*/retail/*/frontend/*
	Allow: /be/fr_fr/retail/fr/

	# Multiple groups with all access
	User-agent: AdsBot-Google
	User-agent: AdsBot-Bing
	Allow: /

	# Multiple sitemaps
	Sitemap: https://www.dumpsters.com/sitemap.xml
	Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`))

canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
fmt.Println(canCrawl)
fmt.Println(err)
fmt.Println(robotsTxt.Sitemaps())
fmt.Println(robotsTxt.URL())
fmt.Println(robotsTxt.CrawlDelay("googlebot"))
Output:

false
<nil>
[https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
https://www.dumpsters.com:443
5s
func NewFromFile ¶
NewFromFile is a convenience function that creates a RobotsTxt from a local file.
Example ¶
filePath, err := filepath.Abs("./robots.txt")
fmt.Println(err)
robotsTxt, err := robotstxt.NewFromFile("https://www.dumpsters.com", filePath)
fmt.Println(err)
canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
fmt.Println(canCrawl)
fmt.Println(err)
fmt.Println(robotsTxt.Sitemaps())
fmt.Println(robotsTxt.URL())
fmt.Println(robotsTxt.CrawlDelay("googlebot"))
Output:

<nil>
<nil>
false
<nil>
[https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
https://www.dumpsters.com:443
5s
func NewFromURL ¶
func NewFromURL(url string, getFn func(url string) (resp *http.Response, err error)) (*RobotsTxt, error)
NewFromURL is a convenience function that retrieves the robots.txt file for a given scheme, host, and optional port number. According to the spec the robots.txt file must always live at the top-level directory, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity, so everything below the top level is ignored. The "getFn" passed in is expected to perform the HTTP request, usually http.Get or http.Client.Get (a sketch using a custom client follows the table below).

The following examples show how only the top level is checked for /robots.txt:
Given:                                                     Looks for:
https://www.dumpsters.com/pricing/roll-off-dumpsters  ->  https://www.dumpsters.com/robots.txt
https://www.dumpsters.com                             ->  https://www.dumpsters.com/robots.txt
https://www.dumpsters.com/robots.txt                  ->  https://www.dumpsters.com/robots.txt
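Because "getFn" is simply a func(url string) (*http.Response, error), a custom http.Client (for example, one with a timeout) can be supplied instead of http.Get. A minimal sketch, with the URL and path chosen purely for illustration:

client := &http.Client{Timeout: 10 * time.Second}
robotsTxt, err := robotstxt.NewFromURL("https://www.dumpsters.com", client.Get)
if err != nil {
	log.Fatal(err)
}
canCrawl, err := robotsTxt.CanCrawl("googlebot", "/pricing/roll-off-dumpsters")
fmt.Println(canCrawl, err)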
Example ¶
robotsTxt, err := robotstxt.NewFromURL("https://www.dumpsters.com", http.Get)
fmt.Println(err)
canCrawl, err := robotsTxt.CanCrawl("googlebot", "/bdso/pages")
fmt.Println(canCrawl)
fmt.Println(err)
fmt.Println(robotsTxt.Sitemaps())
fmt.Println(robotsTxt.URL())
fmt.Println(robotsTxt.CrawlDelay("googlebot"))

// <nil>
// false
// <nil>
// [https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
// https://www.dumpsters.com:443
// 5s
Output:
func (*RobotsTxt) CanCrawl ¶
CanCrawl determines whether or not a given robot (user-agent) is allowed to crawl a URL based on allow and disallow directives in the robots.txt.
func (*RobotsTxt) CrawlDelay ¶
CrawlDelay returns how long a robot should wait between accessing pages on a site.
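The outputs in the examples above ("5s") suggest the delay is returned as a time.Duration; assuming that, a crawler might throttle itself with a sketch like the following ("pagesToVisit" and "fetchPage" are hypothetical, not part of this package):

delay := robotsTxt.CrawlDelay("googlebot")
for _, path := range pagesToVisit { // pagesToVisit is a hypothetical []string of paths to crawl
	if canCrawl, err := robotsTxt.CanCrawl("googlebot", path); err == nil && canCrawl {
		fetchPage(path) // hypothetical helper
	}
	time.Sleep(delay) // assumes CrawlDelay returns a time.Duration, as the "5s" output suggests
}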
func (*RobotsTxt) URL ¶
URL returns the URL a particular robots.txt file is associated with, e.g. https://www.dumpsters.com:443. The port is inferred from the protocol if it is not provided during creation.