robotstxt

Version: v2.0.0
Published: May 16, 2019 License: MIT Imports: 13 Imported by: 0

README

robotstxt

Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.

Basic Examples

1. Creating a robotsTxt with a URL

This is the most common way to use this package.

package main

import (
	"fmt"
	"net/http"

	"github.com/itmayziii/robotstxt"
)

func main() {
	// http.Get performs the request for https://www.dumpsters.com/robots.txt.
	robotsTxt, err := robotstxt.NewFromURL("https://www.dumpsters.com", http.Get)
	fmt.Println(err)
	if err != nil {
		return
	}

	canCrawl, err := robotsTxt.CanCrawl("googlebot", "/bdso/pages")
	fmt.Println(canCrawl)
	fmt.Println(err)
	// <nil>
	// false
	// <nil>
}

2. Creating a robotsTxt Manually

You are unlikely to use this method, since you would need to retrieve the robots.txt from the server yourself.

package main

import (
	"fmt"
	"strings"

	"github.com/itmayziii/robotstxt"
)

func main() {
	// New reads the robots.txt content from an io.Reader.
	robotsTxt, err := robotstxt.New("https://www.dumpsters.com", strings.NewReader(`
# Robots.txt test file
# 06/04/2018
    # Indented comments are allowed

User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*

Allow: /be/fr_fr/retail/fr/

# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /

# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`))
	if err != nil {
		fmt.Println(err)
		return
	}

	canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
	fmt.Println(canCrawl)
	fmt.Println(err)
	// Output:
	// false
	// <nil>
}

Specification

A large portion of how this package handles the specification comes from https://developers.google.com/search/reference/robots_txt. In fact this package tests against all of the examples listed at https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values plus many more.

Important Notes From the Spec
  1. User Agents are case insensitive so "googlebot" and "Googlebot" are the same thing.

  2. Directive "Allow" and "Disallow" values are case sensitive so "/pricing" and "/Pricing" are not the same thing.

  3. The entire file must be valid UTF-8. This package returns an error if that is not the case.

  4. The most specific user agent wins.

  5. When several allow and disallow directives match, the most specific one (the longest) wins, and in the event of a tie the allow directive wins, e.g. disallow: /cms/ loses to allow: /cms/ and to allow: /cms* but not to allow: /cms.

  6. Directives listed in the robots.txt file apply only to a host, protocol, and port number, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity. This package validates the host, protocol, and port number every time it is asked if a robot "CanCrawl" a path and the path contains the host, protocol, and port.

 robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", strings.NewReader(`
     User-agent: *
     Disallow: /wiki/
 `))
 robotsTxt.CanCrawl("googlebot", "/products/") // true
 robotsTxt.CanCrawl("googlebot", "https://www.dumpsters.com/products/") // true
 robotsTxt.CanCrawl("googlebot", "http://www.dumpsters.com/products/") // false - the URL did not match the URL provided when "robotsTxt" was created
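
Notes 2 and 5 above can be illustrated with a small standalone sketch. It is not this package's implementation: the rule type and the allowed function are hypothetical names, only literal path prefixes are handled, and the * and $ wildcards are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical, simplified directive: a literal path prefix plus
// whether it came from an Allow or a Disallow line. Wildcards are not handled.
type rule struct {
	allow  bool
	prefix string
}

// allowed picks the longest matching prefix. An Allow beats a Disallow of the
// same length, and a path that matches no rule is allowed. Matching is case
// sensitive because strings.HasPrefix compares bytes exactly.
func allowed(rules []rule, path string) bool {
	best := rule{allow: true, prefix: ""}
	for _, r := range rules {
		if !strings.HasPrefix(path, r.prefix) {
			continue
		}
		longer := len(r.prefix) > len(best.prefix)
		tie := len(r.prefix) == len(best.prefix)
		if longer || (tie && r.allow) {
			best = r
		}
	}
	return best.allow
}

func main() {
	tie := []rule{{false, "/cms/"}, {true, "/cms/"}}
	shorterAllow := []rule{{false, "/cms/"}, {true, "/cms"}}

	fmt.Println(allowed(tie, "/cms/pages"))          // true: equal length, allow wins
	fmt.Println(allowed(shorterAllow, "/cms/pages")) // false: the disallow is longer
	fmt.Println(allowed(shorterAllow, "/CMS/pages")) // true: no rule matches this casing
}
```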

Roadmap

  • Respect a "noindex" meta tag and HTTP response header as described here. There are a couple of considerations to take into account before implementing this:
    • We need to leave the current CanCrawl method as is, since it is meant to determine whether a robot can crawl a page before actually loading the page. The "noindex" meta tag and HTTP response header, by nature of where they are located, are only available after the crawler has loaded the page.
    • Two methods may be needed to implement this. The first would retrieve the response for the user and hand back an instance of RobotsExclusionProtocol as well as the response itself, something like CanCrawlPage, which would also go through the robots.txt logic before even requesting the page. The second would take a response that the user has already retrieved and go through the same logic as the first. I'm not 100% sure we would need both methods, but I can see why some people would want to retrieve the HTTP response themselves.
  • Potentially support the Host directive as described here.

Documentation

Types

type RobotsTxt

type RobotsTxt struct {
	// contains filtered or unexported fields
}

RobotsTxt exposes everything you would want to know about a robots.txt file without giving direct access to the directives it defines. Directives such as allow and disallow are implementation details that a robot (user-agent) does not need to know about. A robot only needs to know whether it is allowed to crawl a given path, so this type provides a "CanCrawl" method instead of direct access to allow and disallow.

func New

func New(url string, robotsTxtReader io.Reader) (*RobotsTxt, error)

New creates a RobotsTxt for the given URL from the robots.txt content read from robotsTxtReader.

Example
robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", strings.NewReader(`
# Robots.txt test file
# 06/04/2018
      # Indented comments are allowed

User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*

Allow: /be/fr_fr/retail/fr/

# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /

# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`))
canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
fmt.Println(canCrawl)
fmt.Println(err)
fmt.Println(robotsTxt.Sitemaps())
fmt.Println(robotsTxt.URL())
fmt.Println(robotsTxt.CrawlDelay("googlebot"))
Output:

false
<nil>
[https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
https://www.dumpsters.com:443
5s

func NewFromFile

func NewFromFile(url, path string) (*RobotsTxt, error)

NewFromFile is a convenience function that creates a RobotsTxt from a local file.

Example
filePath, err := filepath.Abs("./robots.txt")
fmt.Println(err)

robotsTxt, err := robotstxt.NewFromFile("https://www.dumpsters.com", filePath)
fmt.Println(err)

canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
fmt.Println(canCrawl)
fmt.Println(err)
fmt.Println(robotsTxt.Sitemaps())
fmt.Println(robotsTxt.URL())
fmt.Println(robotsTxt.CrawlDelay("googlebot"))
Output:

<nil>
<nil>
false
<nil>
[https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
https://www.dumpsters.com:443
5s

func NewFromURL

func NewFromURL(url string, getFn func(url string) (resp *http.Response, err error)) (*RobotsTxt, error)

NewFromURL is a convenience function that retrieves a robots.txt for a given scheme, host, and optional port number. According to the spec the robots.txt file must always live at the top level directory, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity, so everything that is not the top level is ignored. The "getFn" passed in is expected to perform the HTTP request and is usually "http.Get" or the "Get" method of an "http.Client".

The following are examples of only looking at the top level for /robots.txt:

Given:                                                  Looks for:
https://www.dumpsters.com/pricing/roll-off-dumpsters -> https://www.dumpsters.com/robots.txt
https://www.dumpsters.com                            -> https://www.dumpsters.com/robots.txt
https://www.dumpsters.com/robots.txt                 -> https://www.dumpsters.com/robots.txt

Example
robotsTxt, err := robotstxt.NewFromURL("https://www.dumpsters.com", http.Get)
fmt.Println(err)

canCrawl, err := robotsTxt.CanCrawl("googlebot", "/bdso/pages")
fmt.Println(canCrawl)
fmt.Println(err)
fmt.Println(robotsTxt.Sitemaps())
fmt.Println(robotsTxt.URL())
fmt.Println(robotsTxt.CrawlDelay("googlebot"))
// <nil>
// false
// <nil>
// [https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
// https://www.dumpsters.com:443
// 5s

func (*RobotsTxt) CanCrawl

func (robotsTxt *RobotsTxt) CanCrawl(robotName, url string) (bool, error)

CanCrawl determines whether or not a given robot (user-agent) is allowed to crawl a URL based on allow and disallow directives in the robots.txt.

func (*RobotsTxt) CrawlDelay

func (robotsTxt *RobotsTxt) CrawlDelay(robotName string) time.Duration

CrawlDelay returns how long a robot should wait between accessing pages on a site.

func (*RobotsTxt) Sitemaps

func (robotsTxt *RobotsTxt) Sitemaps() []string

Sitemaps returns the sitemaps that are defined in the robots.txt.

func (*RobotsTxt) URL

func (robotsTxt *RobotsTxt) URL() string

URL returns the URL a particular robots.txt file is associated with, e.g. https://www.dumpsters.com:443. The port is assumed from the protocol if it is not provided during creation.
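
The port defaulting described here can be sketched with the standard library. This is an illustration only, not the package's code: withDefaultPort is a hypothetical helper, and it assumes port 80 for http and 443 for https.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// withDefaultPort is a hypothetical helper. It reduces rawURL to
// scheme://host:port and fills in the conventional port for http and https
// when the URL does not name one.
func withDefaultPort(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", fmt.Errorf("%q has no host", rawURL)
	}
	port := u.Port()
	if port == "" {
		switch u.Scheme {
		case "http":
			port = "80"
		case "https":
			port = "443"
		default:
			return "", fmt.Errorf("no default port known for scheme %q", u.Scheme)
		}
	}
	return u.Scheme + "://" + net.JoinHostPort(u.Hostname(), port), nil
}

func main() {
	fmt.Println(withDefaultPort("https://www.dumpsters.com"))             // https://www.dumpsters.com:443 <nil>
	fmt.Println(withDefaultPort("http://www.dumpsters.com"))              // http://www.dumpsters.com:80 <nil>
	fmt.Println(withDefaultPort("http://www.dumpsters.com:8080/pricing")) // http://www.dumpsters.com:8080 <nil>
}
```

Normalizing this way makes it plain why an http URL and an https URL for the same host are treated as different sites: they differ in both scheme and port.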
