robotstxt

package module
v1.0.2
Published: May 12, 2019 License: MIT Imports: 14 Imported by: 0

README

robotstxt

Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.


Full API documentation is available on GoDoc; see also the Documentation section below.

Basic Examples

1. Creating a robotsTxt with a URL

This is the most common way to use this package.

package main

import (
    "fmt"
    "github.com/itmayziii/robotstxt"
)

func main() {
    // NewFromURL runs in its own goroutine and sends the parsed robots.txt
    // (or an error) for https://www.dumpsters.com back over the channel.
    ch := make(chan robotstxt.ProtocolResult)
    go robotstxt.NewFromURL("https://www.dumpsters.com", ch)
    robotsTxt := <-ch

    fmt.Println(robotsTxt.Error)
    canCrawl, err := robotsTxt.Protocol.CanCrawl("googlebot", "/bdso/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
    // Output:
    // <nil>
    // false
    // <nil>
}
2. Creating a robotsTxt Manually

You will likely not use this method, since it requires you to retrieve the robots.txt file from the server yourself.

package main

import (
    "fmt"
    "github.com/itmayziii/robotstxt"
)

func main() {
    robotsTxt, _ := robotstxt.New("", `
# Robots.txt test file
# 06/04/2018
    # Indented comments are allowed
        
User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*
        
Allow: /be/fr_fr/retail/fr/
        
# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /
        
# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`)
    canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
    // Output:
    // false
    // <nil>
}

Specification

Much of how this package interprets the specification comes from https://developers.google.com/search/reference/robots_txt. In fact, this package tests against all of the examples listed at https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values plus many more.

Important Notes From the Spec

  1. User agents are case insensitive, so "googlebot" and "Googlebot" are the same thing.

  2. Directive "Allow" and "Disallow" values are case sensitive, so "/pricing" and "/Pricing" are not the same thing.

  3. The entire file must be valid UTF-8 encoded; this package returns an error if it is not.

  4. The most specific user agent wins.

  5. Allow and disallow directives likewise respect the most specific match, and in the event of a tie the allow directive wins (see the sketch after the example below).

  6. Directives listed in the robots.txt file apply only to a host, protocol, and port number, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity. This package validates the host, protocol, and port number every time it is asked whether a robot "CanCrawl" a path, provided the path includes them.

 robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", `
     User-agent: *
     Disallow: /wiki/
 `)
 robotsTxt.CanCrawl("googlebot", "/products/")                          // true
 robotsTxt.CanCrawl("googlebot", "https://www.dumpsters.com/products/") // true
 robotsTxt.CanCrawl("googlebot", "http://www.dumpsters.com/products/")  // false - the URL does not match the URL provided when "robotsTxt" was created
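
Notes 4 and 5 in practice: below is a minimal sketch in the same style as the example above. The user agents, paths, and directives are made up for illustration, and the commented results follow the specificity rules described in the notes.

 rep, _ := robotstxt.New("https://www.dumpsters.com", `
     User-agent: *
     Disallow: /

     # the more specific user agent group is the only one that applies to googlebot (note 4)
     User-agent: googlebot
     Allow: /pricing/public/
     Disallow: /pricing/
 `)
 rep.CanCrawl("googlebot", "/pricing/admin/")  // false - only Disallow: /pricing/ matches
 rep.CanCrawl("googlebot", "/pricing/public/") // true  - Allow: /pricing/public/ is the longer, more specific match (note 5)
 rep.CanCrawl("googlebot", "/blog/")           // true  - the "User-agent: *" group is ignored once a more specific group matches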

Documentation

Overview

Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.



Constants

This section is empty.

Variables

This section is empty.

Functions

func NewFromFile

func NewFromFile(url, filePath string, ch chan ProtocolResult)

NewFromFile creates an implementation of RobotsExclusionProtocol from a local file.

Example
package main

import (
	"fmt"
	"github.com/itmayziii/robotstxt"
	"path/filepath"
)

func main() {
	filePath, err := filepath.Abs("./robots.txt")
	fmt.Println(err)

	ch := make(chan robotstxt.ProtocolResult)
	go robotstxt.NewFromFile("https://www.dumpsters.com", filePath, ch)
	protocolResult := <-ch

	fmt.Println(protocolResult.Error)
	canCrawl, err := protocolResult.Protocol.CanCrawl("googlebot", "/cms/pages")
	fmt.Println(canCrawl)
	fmt.Println(err)
	fmt.Println(protocolResult.Protocol.Sitemaps())
	fmt.Println(protocolResult.Protocol.URL())
	fmt.Println(protocolResult.Protocol.CrawlDelay("googlebot"))
}
Output:

<nil>
<nil>
false
<nil>
[https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
https://www.dumpsters.com:443
5s
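
Since CrawlDelay returns a time.Duration (the Crawl-delay: 5 above is reported as 5s), a crawler can use it directly to pace requests. Below is a minimal sketch reusing protocolResult from the example above; the paths are placeholders, the HTTP request itself is left to your own client, and the standard library time package is assumed to be imported.

delay := protocolResult.Protocol.CrawlDelay("googlebot")
for _, path := range []string{"/", "/pricing/", "/blog/"} { // placeholder paths
	canCrawl, err := protocolResult.Protocol.CanCrawl("googlebot", path)
	if err != nil || !canCrawl {
		continue
	}
	// issue the request for path with your own HTTP client here, then wait
	time.Sleep(delay)
}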

func NewFromURL added in v1.0.0

func NewFromURL(url string, ch chan ProtocolResult)

NewFromURL retrieves a robots.txt file for a given scheme, host, and optional port number. According to the spec, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity, the robots.txt file must always live at the top-level directory, so anything below the top level of the given URL is ignored.

The following are examples of only looking at the top level for /robots.txt:

Given:                                                  Looks for:
https://www.dumpsters.com/pricing/roll-off-dumpsters -> https://www.dumpsters.com/robots.txt
https://www.dumpsters.com                            -> https://www.dumpsters.com/robots.txt
https://www.dumpsters.com/robots.txt                 -> https://www.dumpsters.com/robots.txt
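
This mapping is what NewFromURL performs internally. As an illustration only (not part of this package's API), the standard library net/url package expresses the same transformation in a few lines:

u, err := url.Parse("https://www.dumpsters.com/pricing/roll-off-dumpsters")
if err != nil {
	// handle the error
}
robotsTxtURL := u.Scheme + "://" + u.Host + "/robots.txt" // https://www.dumpsters.com/robots.txt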
Example
package main

import (
	"fmt"
	"github.com/itmayziii/robotstxt"
)

func main() {
	ch := make(chan robotstxt.ProtocolResult)
	go robotstxt.NewFromURL("https://www.dumpsters.com", ch)
	protocolResult := <-ch

	fmt.Println(protocolResult.Error)
	canCrawl, err := protocolResult.Protocol.CanCrawl("googlebot", "/bdso/pages")
	fmt.Println(canCrawl)
	fmt.Println(err)
	fmt.Println(protocolResult.Protocol.Sitemaps())
	fmt.Println(protocolResult.Protocol.URL())
	fmt.Println(protocolResult.Protocol.CrawlDelay("googlebot"))
	// <nil>
	// false
	// <nil>
	// [https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
	// https://www.dumpsters.com:443
	// 5s
}

Types

type ProtocolResult added in v1.0.0

type ProtocolResult struct {
	Protocol RobotsExclusionProtocol
	Error    error
}

ProtocolResult is used for concurrent operations such as NewFromFile and NewFromURL.
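
Because both NewFromFile and NewFromURL deliver a ProtocolResult over a channel, several robots.txt files can be fetched concurrently and collected from a single channel. Below is a minimal sketch; the URLs are placeholders, and Protocol is assumed to be usable only when Error is nil.

urls := []string{"https://www.dumpsters.com", "https://example.com"} // placeholder URLs
ch := make(chan robotstxt.ProtocolResult)
for _, u := range urls {
	go robotstxt.NewFromURL(u, ch)
}
for range urls {
	result := <-ch
	if result.Error != nil {
		fmt.Println(result.Error)
		continue
	}
	fmt.Println(result.Protocol.URL()) // e.g. https://www.dumpsters.com:443
}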

type RobotsExclusionProtocol

type RobotsExclusionProtocol interface {
	// CanCrawl determines whether or not a given robot (user-agent) is allowed to crawl a URL based on allow and disallow directives in the
	// robots.txt.
	CanCrawl(robotName, url string) (bool, error)
	// Returns the sitemaps that are defined in the robots.txt.
	Sitemaps() []string
	// Getter that returns the URL a particular robots.txt file is associated with,
	// i.e. https://www.dumpsters.com:443. The port is assumed from the protocol if it is not provided during creation.
	URL() string
	// CrawlDelay returns how long a robot should wait between accessing pages on a site.
	CrawlDelay(robotName string) time.Duration
}

RobotsExclusionProtocol exposes everything you would want to know about a robots.txt file without giving direct access to the directives it defines. Directives such as allow and disallow are implementation details that a robot (user-agent) does not need to know about; a robot only needs to know whether it may crawl a given path, so this interface provides a "CanCrawl" method rather than direct access to allow and disallow.
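
Because the directives stay hidden, crawler code only needs to depend on this interface. Below is a minimal sketch of a hypothetical gate helper (shouldFetch is not part of this package) that treats any evaluation error as "do not crawl".

// shouldFetch consults the protocol before a crawler issues a request and
// is conservative when the path cannot be evaluated.
func shouldFetch(protocol robotstxt.RobotsExclusionProtocol, robotName, path string) bool {
	canCrawl, err := protocol.CanCrawl(robotName, path)
	if err != nil {
		return false
	}
	return canCrawl
}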

func New

func New(url, robotsTxtContent string) (RobotsExclusionProtocol, error)

New creates an implementation of RobotsExclusionProtocol.

Example
package main

import (
	"fmt"
	"github.com/itmayziii/robotstxt"
)

func main() {
	robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", `
# Robots.txt test file
# 06/04/2018
      # Indented comments are allowed

User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*

Allow: /be/fr_fr/retail/fr/

# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /

# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`)
	canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
	fmt.Println(canCrawl)
	fmt.Println(err)
	fmt.Println(robotsTxt.Sitemaps())
	fmt.Println(robotsTxt.URL())
	fmt.Println(robotsTxt.CrawlDelay("googlebot"))
}
Output:

false
<nil>
[https://www.dumpsters.com/sitemap.xml https://www.dumpsters.com/sitemap-launch-index.xml]
https://www.dumpsters.com:443
5s
