robots

package module
v2.0.5
Published: Nov 17, 2019 License: MIT Imports: 9 Imported by: 6

README

robots

Package robots implements robots.txt file parsing based on Google's specification:

https://developers.google.com/search/reference/robots_txt

Documentation

For installation, usage and description, please see the documentation:

https://godoc.org/github.com/benjaminestes/robots

License

MIT

Documentation

Overview

Package robots implements robots.txt parsing and matching based on Google's specification. For a robots.txt primer, please read the full specification at: https://developers.google.com/search/reference/robots_txt.

What clients need to think about

Clients of this package have one obligation: when testing whether a URL can be crawled, use the correct robots.txt file. The specification uses scheme, port, and punycode variations to define which URLs are in scope.

To get the right robots.txt file, use Locate. Locate takes as its only argument the URL you want to access. It returns the URL of the robots.txt file that governs access. Locate will always return a single unique robots.txt URL for all input URLs sharing a scope.

In practice, a client pattern for testing whether a URL is accessible would be:

a) Locate the robots.txt file for the URL;
b) check whether you have fetched data for that robots.txt file;
c) if yes, use the data to Test the URL against your user agent;
d) if no, fetch the robots.txt data and try again.

For details, see "File location & range of validity" in the specification: https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity.

How bad input is handled

The specification calls for a generous parser: a valid line is accepted, and an invalid line is silently discarded. This is true even if the content parsed is in an unexpected format, like HTML.

For details, see "File format" in the specification: https://developers.google.com/search/reference/robots_txt#file-format

Effect of robots.txt status code

The specification states that a crawler will assume all URLs are accessible if there is no robots.txt file, or if the body of the robots.txt file is empty. So a robots.txt request that returns a 404 status code will result in all URLs being crawlable. The exception is a 5xx status code, which is treated as a temporary "full disallow" of crawling.

For details, see "Handling HTTP result codes" in the specification: https://developers.google.com/search/reference/robots_txt#handling-http-result-codes

Constants

This section is empty.

Variables

This section is empty.

Functions

func Locate

func Locate(rawurl string) (string, error)

Locate takes a string representing an absolute URL and returns the absolute URL of the robots.txt that would govern its crawlability (assuming such a file exists).

Locate covers all special cases of the specification, including punycode domains, domain and protocol case-insensitivity, and default ports for certain protocols. It is guaranteed to produce the same robots.txt URL for any input URLs that share a scope.
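
For example (the expected output is illustrative, reflecting the scope rules described above):

package main

import (
	"fmt"

	"github.com/benjaminestes/robots"
)

func main() {
	inputs := []string{
		// Domain and protocol are case-insensitive, and :443 is the
		// default port for HTTPS, so all of these share a scope.
		"https://www.example.com/page.html",
		"HTTPS://WWW.EXAMPLE.COM/other.html",
		"https://www.example.com:443/",
	}
	for _, in := range inputs {
		robotsURL, err := robots.Locate(in)
		if err != nil {
			continue
		}
		// Expected for each input: https://www.example.com/robots.txt
		fmt.Println(robotsURL)
	}
}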

Types

type Robots

type Robots struct {
	// contains filtered or unexported fields
}

Robots represents an object whose methods govern access to URLs within the scope of a robots.txt file, and what sitemaps, if any, have been discovered during parsing.

Example
package main

import (
	"net/http"

	"github.com/benjaminestes/robots"
)

func main() {
	robotsURL, err := robots.Locate("https://www.example.com/page.html")
	if err != nil {
		// Handle error - couldn't parse input URL.
		return
	}

	resp, err := http.Get(robotsURL)
	if err != nil {
		// Handle error.
		return
	}
	defer resp.Body.Close()

	// Use the actual response status so that 404 and 5xx responses
	// are handled according to the specification.
	r, err := robots.From(resp.StatusCode, resp.Body)
	if err != nil {
		// Handle error - couldn't read from input.
		return
	}

	if r.Test("Crawlerbot", "/") {
		// You're good to crawl "/".
	}
	if r.Tester("Crawlerbot")("/page.html") {
		// You're good to crawl "/page.html".
	}

	for _, sitemap := range r.Sitemaps() {
		// As the caller, we are responsible for ensuring that
		// the sitemap URL is in scope of the robots.txt file
		// we used before we try to access it.
		sitemapRobotsURL, err := robots.Locate(sitemap)
		if err != nil {
			// Couldn't parse sitemap URL - probably we should skip.
			continue
		}
		if sitemapRobotsURL == robotsURL && r.Test("Crawlerbot", sitemap) {
			resp, err := http.Get(sitemap)
			if err != nil {
				// Handle error.
				continue
			}
			// ...do something with sitemap.
			// Close explicitly rather than defer, so response bodies
			// aren't held open for the rest of the loop.
			resp.Body.Close()
		}
	}
}
Output:

func From

func From(status int, in io.Reader) (*Robots, error)

From produces a Robots object from an HTTP status code and a robots.txt file represented as an io.Reader. The status code is required; a nil or empty io.Reader argument is handled gracefully, in which case behavior is determined solely by the response status.

The attitude of the specification is permissive concerning parser errors: all valid input is accepted, and invalid input is silently rejected without failing. Therefore, From will only signal an error condition if it fails to read from the input at all.
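
As a minimal sketch, a Robots object can also be built from a robots.txt body held in memory (the rules and agent names below are made up for illustration, and the expected results follow the specification's grouping rules):

package main

import (
	"strings"

	"github.com/benjaminestes/robots"
)

func main() {
	body := strings.NewReader(`User-agent: *
Disallow: /private/

User-agent: Crawlerbot
Allow: /
`)
	r, err := robots.From(200, body)
	if err != nil {
		// Couldn't read from the input.
		return
	}
	_ = r.Test("Otherbot", "/private/page.html") // expected: false
	_ = r.Test("Otherbot", "/public/page.html")  // expected: true
}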

func (*Robots) Sitemaps

func (r *Robots) Sitemaps() []string

Sitemaps returns a list of sitemap URLs discovered during parsing. The specification requires sitemap URLs in robots.txt files to be absolute, but that is the responsibility of the robots.txt author, not the parser.
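
Because absoluteness is not enforced by the parser, a cautious client might verify each sitemap URL before fetching it. A sketch using the standard library (the robots.txt body below is made up for illustration):

package main

import (
	"net/url"
	"strings"

	"github.com/benjaminestes/robots"
)

func main() {
	body := strings.NewReader("Sitemap: https://www.example.com/sitemap.xml\nSitemap: not-an-absolute-url\n")
	r, err := robots.From(200, body)
	if err != nil {
		return
	}
	for _, s := range r.Sitemaps() {
		u, err := url.Parse(s)
		if err != nil || !u.IsAbs() {
			// Skip sitemap URLs that are not absolute.
			continue
		}
		// s is an absolute URL; safe to fetch, subject to the scoping
		// check shown in the package example above.
	}
}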

func (*Robots) Test

func (r *Robots) Test(name, rawurl string) bool

Test takes an agent string and a rawurl string and checks whether r allows name to access the path component of rawurl.

Only the path of rawurl is used. For details, see method Tester.

func (*Robots) Tester

func (r *Robots) Tester(name string) func(rawurl string) bool

Tester takes a string naming a user agent. It returns a predicate with a single string parameter, rawurl, representing a URL. This predicate can be used to check whether r allows name to crawl the path component of rawurl.

Only the path part of rawurl is considered. Therefore, rawurl can be absolute or relative. It is the caller's responsibility to ensure that the Robots object is applicable to rawurl: no error can be provided if this is not the case. To ensure the Robots object is applicable to rawurl, use the Locate function.
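
A sketch of building the predicate once and reusing it across several URLs (the robots.txt body and URL list are made up for illustration; all candidate URLs are assumed to share the scope of the same robots.txt file):

package main

import (
	"strings"

	"github.com/benjaminestes/robots"
)

func main() {
	r, err := robots.From(200, strings.NewReader("User-agent: *\nDisallow: /private/\n"))
	if err != nil {
		return
	}

	// Build the predicate once for our user agent, then apply it to
	// each candidate URL.
	allowed := r.Tester("Crawlerbot")
	candidates := []string{"/", "/private/report.html", "/public/index.html"}
	for _, rawurl := range candidates {
		if allowed(rawurl) {
			// OK to crawl rawurl.
		}
	}
}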
