robots

package module
v2.0.5
Published: Nov 17, 2019 License: MIT Imports: 9 Imported by: 6

README

robots

Package robots implements robots.txt file parsing based on Google's specification:

https://developers.google.com/search/reference/robots_txt

Documentation

For installation, usage and description, please see the documentation:

https://godoc.org/github.com/benjaminestes/robots

License

MIT

Documentation

Overview

Package robots implements robots.txt parsing and matching based on Google's specification. For a robots.txt primer, please read the full specification at: https://developers.google.com/search/reference/robots_txt.

What clients need to think about

Clients of this package have one obligation: when testing whether a URL can be crawled, use the correct robots.txt file. The specification uses scheme, port, and punycode variations to define which URLs are in scope.

To get the right robots.txt file, use Locate. Locate takes as its only argument the URL you want to access. It returns the URL of the robots.txt file that governs access. Locate will always return a single unique robots.txt URL for all input URLs sharing a scope.

In practice, a client pattern for testing whether a URL is accessible would be:

a) Locate the robots.txt file for the URL;
b) check whether you have fetched data for that robots.txt file;
c) if yes, use the data to Test the URL against your user agent;
d) if no, fetch the robots.txt data and try again.

For details, see "File location & range of validity" in the specification: https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity.

How bad input is handled

The specification calls for a generous parser: a valid line is accepted, and an invalid line is silently discarded. This is true even if the content parsed is in an unexpected format, like HTML.

For details, see "File format" in the specification: https://developers.google.com/search/reference/robots_txt#file-format

Effect of robots.txt status code

The specification states that a crawler will assume all URLs are accessible if there is no robots.txt file, or if the body of the robots.txt file is empty. So a robots.txt request that returns a 404 status code will result in all URLs being crawlable. The exception is a 5xx status code, which is treated as a temporary "full disallow" of crawling.

For details, see "Handling HTTP result codes" in the specification: https://developers.google.com/search/reference/robots_txt#handling-http-result-codes

Constants

This section is empty.

Variables

This section is empty.

Functions

func Locate

func Locate(rawurl string) (string, error)

Locate takes a string representing an absolute URL and returns the absolute URL of the robots.txt that would govern its crawlability (assuming such a file exists).

Locate covers all special cases of the specification, including punycode domains, domain and protocol case-insensitivity, and default ports for certain protocols. It is guaranteed to produce the same robots.txt URL for any input URLs that share a scope.
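
For example (the expected output is illustrative, reflecting the scope rules described above):

package main

import (
	"fmt"

	"github.com/benjaminestes/robots"
)

func main() {
	inputs := []string{
		// Domain and protocol are case-insensitive, and :443 is the
		// default port for HTTPS, so all of these share a scope.
		"https://www.example.com/page.html",
		"HTTPS://WWW.EXAMPLE.COM/other.html",
		"https://www.example.com:443/",
	}
	for _, in := range inputs {
		robotsURL, err := robots.Locate(in)
		if err != nil {
			continue
		}
		// Expected for each input: https://www.example.com/robots.txt
		fmt.Println(robotsURL)
	}
}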

Types

type Robots

type Robots struct {
	// contains filtered or unexported fields
}

Robots represents an object whose methods govern access to URLs within the scope of a robots.txt file, and what sitemaps, if any, have been discovered during parsing.

Example
package main

import (
	"net/http"

	"github.com/benjaminestes/robots"
)

func main() {
	robotsURL, err := robots.Locate("https://www.example.com/page.html")
	if err != nil {
		// Handle error - couldn't parse input URL.
		return
	}

	resp, err := http.Get(robotsURL)
	if err != nil {
		// Handle error.
		return
	}
	defer resp.Body.Close()

	// Use the actual response status so that 404 and 5xx responses
	// are handled according to the specification.
	r, err := robots.From(resp.StatusCode, resp.Body)
	if err != nil {
		// Handle error - couldn't read from input.
		return
	}

	if r.Test("Crawlerbot", "/") {
		// You're good to crawl "/".
	}
	if r.Tester("Crawlerbot")("/page.html") {
		// You're good to crawl "/page.html".
	}

	for _, sitemap := range r.Sitemaps() {
		// As the caller, we are responsible for ensuring that
		// the sitemap URL is in scope of the robots.txt file
		// we used before we try to access it.
		sitemapRobotsURL, err := robots.Locate(sitemap)
		if err != nil {
			// Couldn't parse sitemap URL - probably we should skip.
			continue
		}
		if sitemapRobotsURL == robotsURL && r.Test("Crawlerbot", sitemap) {
			resp, err := http.Get(sitemap)
			if err != nil {
				// Handle error.
				continue
			}
			// ...do something with sitemap.
			// Close explicitly rather than defer, so response bodies
			// aren't held open for the rest of the loop.
			resp.Body.Close()
		}
	}
}
Output:

func From

func From(status int, in io.Reader) (*Robots, error)

From produces a Robots object from an HTTP status code and a robots.txt file represented as an io.Reader. The status code is required; a nil or empty io.Reader argument is handled gracefully, in which case behavior is determined solely by the response status.

The attitude of the specification is permissive concerning parser errors: all valid input is accepted, and invalid input is silently rejected without failing. Therefore, From will only signal an error condition if it fails to read from the input at all.
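
As a minimal sketch, a Robots object can also be built from a robots.txt body held in memory (the rules and agent names below are made up for illustration, and the expected results follow the specification's grouping rules):

package main

import (
	"strings"

	"github.com/benjaminestes/robots"
)

func main() {
	body := strings.NewReader(`User-agent: *
Disallow: /private/

User-agent: Crawlerbot
Allow: /
`)
	r, err := robots.From(200, body)
	if err != nil {
		// Couldn't read from the input.
		return
	}
	_ = r.Test("Otherbot", "/private/page.html") // expected: false
	_ = r.Test("Otherbot", "/public/page.html")  // expected: true
}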

func (*Robots) Sitemaps

func (r *Robots) Sitemaps() []string

Sitemaps returns a list of sitemap URLs discovered during parsing. The specification requires sitemap URLs in robots.txt files to be absolute, but that is the responsibility of the robots.txt author, not the parser.
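
Because absoluteness is not enforced by the parser, a cautious client might verify each sitemap URL before fetching it. A sketch using the standard library (the robots.txt body below is made up for illustration):

package main

import (
	"net/url"
	"strings"

	"github.com/benjaminestes/robots"
)

func main() {
	body := strings.NewReader("Sitemap: https://www.example.com/sitemap.xml\nSitemap: not-an-absolute-url\n")
	r, err := robots.From(200, body)
	if err != nil {
		return
	}
	for _, s := range r.Sitemaps() {
		u, err := url.Parse(s)
		if err != nil || !u.IsAbs() {
			// Skip sitemap URLs that are not absolute.
			continue
		}
		// s is an absolute URL; safe to fetch, subject to the scoping
		// check shown in the package example above.
	}
}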

func (*Robots) Test

func (r *Robots) Test(name, rawurl string) bool

Test takes an agent string and a rawurl string and checks whether r allows name to access the path component of rawurl.

Only the path of rawurl is used. For details, see method Tester.

func (*Robots) Tester

func (r *Robots) Tester(name string) func(rawurl string) bool

Tester takes a string naming a user agent. It returns a predicate with a single string parameter, rawurl, representing a URL. This predicate can be used to check whether r allows name to crawl the path component of rawurl.

Only the path part of rawurl is considered. Therefore, rawurl can be absolute or relative. It is the caller's responsibility to ensure that the Robots object is applicable to rawurl: no error can be provided if this is not the case. To ensure the Robots object is applicable to rawurl, use the Locate function.
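
A sketch of building the predicate once and reusing it across several URLs (the robots.txt body and URL list are made up for illustration; all candidate URLs are assumed to share the scope of the same robots.txt file):

package main

import (
	"strings"

	"github.com/benjaminestes/robots"
)

func main() {
	r, err := robots.From(200, strings.NewReader("User-agent: *\nDisallow: /private/\n"))
	if err != nil {
		return
	}

	// Build the predicate once for our user agent, then apply it to
	// each candidate URL.
	allowed := r.Tester("Crawlerbot")
	candidates := []string{"/", "/private/report.html", "/public/index.html"}
	for _, rawurl := range candidates {
		if allowed(rawurl) {
			// OK to crawl rawurl.
		}
	}
}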
