grobotstxt


README

grobotstxt


grobotstxt is a native Go port of Google's robots.txt parser and matcher C++ library.

  • Direct function-for-function conversion/port
  • Preserves all behaviour of the original library
  • Retains 100% of the original test suite's functionality
  • Minor language-specific cleanups
  • Added a helper to extract Sitemap URIs
  • Super simple API

As with Google's original library, we include a small standalone binary executable for webmasters, allowing a single URL and user-agent to be tested against a robots.txt file. Ours is called icanhasrobot, and its inputs and outputs are compatible with the original tool.

About

Quoting the README from Google's robots.txt parser and matcher repo:

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files. The library is released open-source to help developers build tools that better reflect Google's robots.txt parsing and matching.

Package grobotstxt aims to be a faithful conversion, from C++ to Go, of Google's robots.txt parser and matcher.

Quickstart

Installation
For developers

Get the package (only needed if not using modules):

go get github.com/jimsmart/grobotstxt

Use the package within your code (see examples below):

import "github.com/jimsmart/grobotstxt"
For webmasters

This assumes Go is installed and its environment is already set up.

Fetch the package:

go get github.com/jimsmart/grobotstxt

Build and install the standalone binary executable:

go install github.com/jimsmart/grobotstxt/...

By default, the resulting binary executable will be ~/go/bin/icanhasrobot (assuming no customisation has been made to $GOPATH or $GOBIN).

Use the tool:

$ icanhasrobot ~/local/path/to/robots.txt YourBot https://example.com/url
user-agent 'YourBot' with URI 'https://example.com/url': ALLOWED

Additionally, one can pass multiple user-agent names to the tool, using comma-separated values, e.g.

$ icanhasrobot ~/local/path/to/robots.txt Googlebot,Googlebot-image https://example.com/url
user-agent 'Googlebot,Googlebot-image' with URI 'https://example.com/url': ALLOWED

If $GOBIN is not included in your environment's $PATH, use the full path ~/go/bin/icanhasrobot when invoking the executable.

Example Code
AgentAllowed
import "github.com/jimsmart/grobotstxt"

// Contents of robots.txt file.
robotsTxt := `
    # robots.txt with restricted area

    User-agent: *
    Disallow: /members/*

    Sitemap: http://example.net/sitemap.xml
`

// Target URI.
uri := "http://example.net/members/index.html"


// Is bot allowed to visit this page?
ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", uri)

See also AgentsAllowed.
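
To check several user-agent names at once, AgentsAllowed can be used instead. A minimal sketch, reusing robotsTxt and uri from above (the bot names here are arbitrary):

agents := []string{"FooBot/1.0", "BarBot/1.0"}

// Is any of these bots allowed to visit this page?
anyAllowed := grobotstxt.AgentsAllowed(robotsTxt, agents, uri)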

Sitemaps

Additionally, one can extract all Sitemap URIs from a given robots.txt file:

sitemaps := grobotstxt.Sitemaps(robotsTxt)

Documentation

GoDocs https://godoc.org/github.com/jimsmart/grobotstxt

Testing

To run the tests, execute go test inside the project folder.

For a full coverage report, try:

go test -coverprofile=coverage.out && go tool cover -html=coverage.out

Notes

The original library required that the URI passed to the AgentAllowed and AgentsAllowed functions, or to the URI parameter of the standalone binary tool, follow the encoding/escaping format specified by RFC 3986, because the library itself does not perform URI normalisation.

In Go, with its native UTF-8 strings, this requirement is out of step with other commonly used APIs, and is somewhat surprising/unexpected behaviour for Go developers.

Because of this, the Go API presented here has been amended to handle UTF-8 URIs automatically, performing any necessary normalisation internally.

This is the only behavioural change between grobotstxt and the original C++ library.
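
For example, a sketch of this behaviour, passing the same path both as raw UTF-8 and in percent-encoded form (the robots.txt content and user-agent name are placeholders; both calls are expected to behave identically, as normalisation happens internally):

// The raw UTF-8 URI is normalised internally before matching.
allowedRaw := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/café/menu.html")

// The pre-encoded RFC 3986 form should give the same result.
allowedEncoded := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/caf%C3%A9/menu.html")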

License

Like the original library, package grobotstxt is licensed under the terms of the Apache License, Version 2.0.

See LICENSE for more information.

History

  • v1.0.3 (2022-03-16) Updates from upstream: allow an additional misspelling of 'disallow'; additional tests. Make the icanhasrobot tool return better exit codes and work with multiple user-agents.
  • v1.0.2 (2022-03-16) Bugfix: Allow wider range of characters for user-agent.
  • v1.0.1 (2021-04-19) Updated modules. Switch from Travis CI to GitHub Actions.
  • v1.0.0 (2021-04-18) Tagged as stable.
  • v0.2.1 (2021-01-16) Expose more methods of RobotsMatcher as public. Thanks to anatolym
  • v0.2.0 (2020-04-24) Removed requirement for pre-encoded RFC3986 URIs on front-facing API.
  • v0.1.0 (2020-04-23) Initial release.

Documentation

Overview

Package grobotstxt is a Go port of Google's robots.txt parser and matcher C++ library.

See: https://github.com/google/robotstxt

Constants

This section is empty.

Variables

var AllowFrequentTypos = true

AllowFrequentTypos enables the parsing of common typos in robots.txt, such as DISALOW.
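
Because AllowFrequentTypos is an exported package variable, it can be changed before parsing or matching. A minimal sketch (robotsTxt and uri are assumed to be defined elsewhere):

// Stop accepting misspelled directives such as "Disalow:".
grobotstxt.AllowFrequentTypos = false

ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", uri)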

Functions

func AgentAllowed

func AgentAllowed(robotsBody string, userAgent string, uri string) bool

AgentAllowed parses the given robots.txt content, matching it against the given userAgent and URI, and returns true if the given URI is allowed to be fetched by the given user agent.

AgentAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).

Example
package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

func main() {

	robotsTxt := `
	# robots.txt with restricted area

	User-agent: *
	Disallow: /members/*
`
	ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html")
	fmt.Println(ok)

}
Output:

false

func AgentsAllowed

func AgentsAllowed(robotsBody string, userAgents []string, uri string) bool

AgentsAllowed parses the given robots.txt content, matching it against the given userAgents and URI, and returns true if the given URI is allowed to be fetched by any user agent in the list.

AgentsAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).
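
A minimal sketch of AgentsAllowed in use (the user-agent names are arbitrary):

package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

func main() {

	robotsTxt := `
	# robots.txt with restricted area

	User-agent: *
	Disallow: /members/*
`
	agents := []string{"FooBot/1.0", "BarBot/1.0"}

	// Both bots fall under the wildcard group, which disallows /members/*,
	// so this should print false.
	ok := grobotstxt.AgentsAllowed(robotsTxt, agents, "http://example.net/members/index.html")
	fmt.Println(ok)
}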

func Matches

func Matches(path, pattern string) bool

Matches implements robots.txt pattern matching.

Returns true if URI path matches the specified pattern. Pattern is anchored at the beginning of path. '$' is special only at the end of pattern.

Since both path and pattern are externally determined (by the webmaster), we make sure to have acceptable worst-case performance.
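
A few illustrative calls (a sketch; the expected results are noted in the comments):

grobotstxt.Matches("/members/index.html", "/members/*") // true: '*' matches any sequence of characters
grobotstxt.Matches("/members/index.html", "/admin/*")   // false: the pattern is anchored at the start of the path
grobotstxt.Matches("/index.html", "/index.htm$")        // false: '$' anchors the pattern to the end of the path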

func Parse

func Parse(robotsBody string, handler ParseHandler)

Parse uses the given robots.txt body and ParseHandler to create a Parser, and calls its Parse method.

func Sitemaps

func Sitemaps(robotsBody string) []string

Sitemaps extracts all "Sitemap:" values from the given robots.txt content.

Example
package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

func main() {

	robotsTxt := `
	# robots.txt with sitemaps

	User-agent: *
	Disallow: /members/*

	Sitemap: http://example.net/sitemap.xml
	Sitemap: http://example.net/sitemap2.xml
`
	sitemaps := grobotstxt.Sitemaps(robotsTxt)
	fmt.Println(sitemaps)

}
Output:

[http://example.net/sitemap.xml http://example.net/sitemap2.xml]

Types

type LongestMatchStrategy

type LongestMatchStrategy struct{}

LongestMatchStrategy implements the default robots.txt matching strategy.

The maximum number of characters matched by a pattern is returned as its match priority.

func (LongestMatchStrategy) MatchAllow

func (s LongestMatchStrategy) MatchAllow(path, pattern string) int

func (LongestMatchStrategy) MatchDisallow

func (s LongestMatchStrategy) MatchDisallow(path, pattern string) int

type MatchStrategy

type MatchStrategy interface {
	MatchAllow(path, pattern string) int
	MatchDisallow(path, pattern string) int
}

A MatchStrategy defines a strategy for matching individual lines in a robots.txt file.

Each Match* method should return a match priority, which is interpreted as:

match priority < 0:  No match.

match priority == 0: Match, but treat it as if matched an empty pattern.

match priority > 0:  Match.
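
As an illustration of this contract, a hypothetical strategy might return the pattern length as the priority of any match. This is only a sketch, not part of the package; note that the built-in LongestMatchStrategy instead returns the maximum number of characters matched by the pattern.

package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

// patternLengthStrategy is a hypothetical MatchStrategy, shown only to
// illustrate the priority contract: len(pattern) for a match, -1 otherwise.
type patternLengthStrategy struct{}

func (patternLengthStrategy) MatchAllow(path, pattern string) int {
	if grobotstxt.Matches(path, pattern) {
		return len(pattern)
	}
	return -1
}

func (patternLengthStrategy) MatchDisallow(path, pattern string) int {
	if grobotstxt.Matches(path, pattern) {
		return len(pattern)
	}
	return -1
}

func main() {
	robotsTxt := `
	User-agent: *
	Allow: /members/index.html
	Disallow: /members/*
`
	// Plug the custom strategy into a RobotsMatcher via its exported field.
	m := grobotstxt.NewRobotsMatcher()
	m.MatchStrategy = patternLengthStrategy{}

	fmt.Println(m.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html"))
}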

type ParseHandler

type ParseHandler interface {
	HandleRobotsStart()
	HandleRobotsEnd()
	HandleUserAgent(lineNum int, value string)
	HandleAllow(lineNum int, value string)
	HandleDisallow(lineNum int, value string)
	HandleSitemap(lineNum int, value string)
	HandleUnknownAction(lineNum int, action, value string)
}

ParseHandler is a handler for directives found in robots.txt. These callbacks are called by Parse() in the sequence they have been found in the file.
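
For example, a minimal sketch of a custom ParseHandler that simply prints each directive it encounters (the printHandler type and its behaviour are illustrative, not part of the package):

package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

// printHandler prints every directive reported by Parse.
type printHandler struct{}

func (printHandler) HandleRobotsStart() { fmt.Println("start of robots.txt") }
func (printHandler) HandleRobotsEnd()   { fmt.Println("end of robots.txt") }

func (printHandler) HandleUserAgent(lineNum int, value string) {
	fmt.Printf("%d: User-agent: %s\n", lineNum, value)
}

func (printHandler) HandleAllow(lineNum int, value string) {
	fmt.Printf("%d: Allow: %s\n", lineNum, value)
}

func (printHandler) HandleDisallow(lineNum int, value string) {
	fmt.Printf("%d: Disallow: %s\n", lineNum, value)
}

func (printHandler) HandleSitemap(lineNum int, value string) {
	fmt.Printf("%d: Sitemap: %s\n", lineNum, value)
}

func (printHandler) HandleUnknownAction(lineNum int, action, value string) {
	fmt.Printf("%d: %s: %s\n", lineNum, action, value)
}

func main() {
	robotsTxt := `
	User-agent: *
	Disallow: /members/*

	Sitemap: http://example.net/sitemap.xml
`
	grobotstxt.Parse(robotsTxt, printHandler{})
}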

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

func NewParser

func NewParser(robotsBody string, handler ParseHandler) *Parser

func (*Parser) Parse

func (p *Parser) Parse()

Parse parses the body of this Parser's robots.txt and emits parse callbacks. It accepts typical typos found in robots.txt, such as 'disalow'.

Note that this function accepts all kinds of input, but skips everything that does not look like a robots.txt directive.

type RobotsMatcher

type RobotsMatcher struct {
	MatchStrategy MatchStrategy
	// contains filtered or unexported fields
}

RobotsMatcher matches robots.txt content against URIs.

The RobotsMatcher uses a default match strategy for Allow/Disallow patterns, which is the official way Google's crawler matches robots.txt. It is also possible to provide a custom match strategy.

The entry point for the user is to call one of the AgentAllowed() methods, which return directly whether a URI is allowed according to the robots.txt and the crawl agent.

The RobotsMatcher can be re-used across URIs and robots.txt bodies, but it is not concurrency-safe.
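
A minimal sketch of using a RobotsMatcher directly (robotsTxt is assumed to hold robots.txt content; the accessor methods shown are assumed to report the result of the most recent call):

m := grobotstxt.NewRobotsMatcher()

// Is FooBot allowed to fetch this URI?
ok := m.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html")

// Details of the most recent match.
blocked := m.Disallowed() // true if a matching Disallow rule applied
line := m.MatchingLine()  // line number of the matching rule, or 0 if none matched

// The same matcher may be re-used for further checks, but not concurrently.
ok = m.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/index.html")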

func NewRobotsMatcher

func NewRobotsMatcher() *RobotsMatcher

NewRobotsMatcher creates a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the first-match strategy provisioned by the former internet draft. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives: in case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. For example, given a robots.txt file with the following rules

Allow: /
Disallow: /cgi-bin

it is pretty obvious what the webmaster wants: to allow crawling of every URI except those under /cgi-bin. However, under the expired internet draft's first-match strategy, such rules would allow crawlers to crawl everything.
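
Expressed in code, a sketch of the scenario above (with a User-agent line added so that the rules form a group):

robotsTxt := `
User-agent: *
Allow: /
Disallow: /cgi-bin
`

// Under longest-match, "Disallow: /cgi-bin" outranks "Allow: /" for this URI,
// so this call should return false.
ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.com/cgi-bin/script")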

func (*RobotsMatcher) AgentAllowed

func (m *RobotsMatcher) AgentAllowed(robotsBody string, userAgent string, uri string) bool

AgentAllowed parses the given robots.txt content, matching it against the given userAgent and URI, and returns true if the given URI is allowed to be fetched by the given user agent.

AgentAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).

func (*RobotsMatcher) AgentsAllowed

func (m *RobotsMatcher) AgentsAllowed(robotsBody string, userAgents []string, uri string) bool

AgentsAllowed parses the given robots.txt content, matching it against the given userAgents and URI, and returns true if the given URI is allowed to be fetched by any user agent in the list.

AgentsAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).

func (*RobotsMatcher) Disallowed added in v0.2.1

func (m *RobotsMatcher) Disallowed() bool

Disallowed returns true if we are disallowed from crawling a matching URI.

func (*RobotsMatcher) DisallowedIgnoreGlobal added in v0.2.1

func (m *RobotsMatcher) DisallowedIgnoreGlobal() bool

DisallowedIgnoreGlobal returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent, and bases its results only on the specified user agents.

func (*RobotsMatcher) EverSeenSpecificAgent added in v0.2.1

func (m *RobotsMatcher) EverSeenSpecificAgent() bool

EverSeenSpecificAgent returns true iff, when AgentsAllowed() was called, the robots file referred explicitly to one of the specified user agents.

func (*RobotsMatcher) HandleAllow

func (m *RobotsMatcher) HandleAllow(lineNum int, value string)

HandleAllow is called for every "Allow:" line in robots.txt.

func (*RobotsMatcher) HandleDisallow

func (m *RobotsMatcher) HandleDisallow(lineNum int, value string)

HandleDisallow is called for every "Disallow:" line in robots.txt.

func (*RobotsMatcher) HandleRobotsEnd

func (m *RobotsMatcher) HandleRobotsEnd()

HandleRobotsEnd is called at the end of parsing the robots.txt file.

For RobotsMatcher, this does nothing.

func (*RobotsMatcher) HandleRobotsStart

func (m *RobotsMatcher) HandleRobotsStart()

HandleRobotsStart is called at the start of parsing a robots.txt file, and resets all instance member variables.

func (*RobotsMatcher) HandleSitemap

func (m *RobotsMatcher) HandleSitemap(lineNum int, value string)

HandleSitemap is called for every "Sitemap:" line in robots.txt.

For RobotsMatcher, this does nothing.

func (*RobotsMatcher) HandleUnknownAction

func (m *RobotsMatcher) HandleUnknownAction(lineNum int, action, value string)

HandleUnknownAction is called for every unrecognized line in robots.txt.

For RobotsMatcher, this does nothing.

func (*RobotsMatcher) HandleUserAgent

func (m *RobotsMatcher) HandleUserAgent(lineNum int, userAgent string)

HandleUserAgent is called for every "User-Agent:" line in robots.txt.

func (*RobotsMatcher) MatchingLine added in v0.2.1

func (m *RobotsMatcher) MatchingLine() int

MatchingLine returns the line that matched or 0 if none matched.

Directories

Path Synopsis
cmd
