grobotstxt


README

grobotstxt


grobotstxt is a native Go port of Google's robots.txt parser and matcher C++ library.

  • Direct function-for-function conversion/port
  • Preserves all behaviour of the original library
  • Retains 100% of the original test suite's functionality
  • Minor language-specific cleanups
  • Added a helper to extract Sitemap URIs
  • Super simple API

As with Google's original library, we include a small standalone binary executable for webmasters, allowing a single URL and user-agent to be tested against a robots.txt file. Ours is called icanhasrobot, and its inputs and outputs are compatible with the original tool.

About

Quoting the README from Google's robots.txt parser and matcher repo:

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files. The library is released open-source to help developers build tools that better reflect Google's robots.txt parsing and matching.

Package grobotstxt aims to be a faithful conversion, from C++ to Go, of Google's robots.txt parser and matcher.

Quickstart

Installation
For developers

Get the package (only needed if not using modules):

go get github.com/jimsmart/grobotstxt

Use the package within your code (see examples below):

import "github.com/jimsmart/grobotstxt"
For webmasters

This assumes Go is installed and its environment is already set up.

Fetch the package:

go get github.com/jimsmart/grobotstxt

Build and install the standalone binary executable:

go install github.com/jimsmart/grobotstxt/...

By default, the resulting binary executable will be ~/go/bin/icanhasrobot (assuming no customisation has been made to $GOPATH or $GOBIN).

Use the tool:

$ icanhasrobot ~/local/path/to/robots.txt YourBot https://example.com/url
user-agent 'YourBot' with URI 'https://example.com/url': ALLOWED

Additionally, one can pass multiple user-agent names to the tool, using comma-separated values, e.g.

$ icanhasrobot ~/local/path/to/robots.txt Googlebot,Googlebot-image https://example.com/url
user-agent 'Googlebot,Googlebot-image' with URI 'https://example.com/url': ALLOWED

If $GOBIN is not included in your environment's $PATH, use the full path ~/go/bin/icanhasrobot when invoking the executable.

Example Code
AgentAllowed
import "github.com/jimsmart/grobotstxt"

// Contents of robots.txt file.
robotsTxt := `
    # robots.txt with restricted area

    User-agent: *
    Disallow: /members/*

    Sitemap: http://example.net/sitemap.xml
`

// Target URI.
uri := "http://example.net/members/index.html"


// Is bot allowed to visit this page?
ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", uri)

See also AgentsAllowed.
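
To check several user-agent names at once, AgentsAllowed can be used instead. A minimal sketch, reusing robotsTxt and uri from above (the bot names here are arbitrary):

agents := []string{"FooBot/1.0", "BarBot/1.0"}

// Is any of these bots allowed to visit this page?
anyAllowed := grobotstxt.AgentsAllowed(robotsTxt, agents, uri)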

Sitemaps

Additionally, one can extract all Sitemap URIs from a given robots.txt file:

sitemaps := grobotstxt.Sitemaps(robotsTxt)

Documentation

GoDocs https://godoc.org/github.com/jimsmart/grobotstxt

Testing

To run the tests, execute go test inside the project folder.

For a full coverage report, try:

go test -coverprofile=coverage.out && go tool cover -html=coverage.out

Notes

The original library required that the URI passed to the AgentAllowed and AgentsAllowed functions, or to the URI parameter of the standalone binary tool, follow the encoding/escaping format specified by RFC 3986, because the library itself does not perform URI normalisation.

In Go, with its native UTF-8 strings, this requirement is out of step with other commonly used APIs, and is somewhat surprising/unexpected behaviour for Go developers.

Because of this, the Go API presented here has been amended to handle UTF-8 URIs automatically, performing any necessary normalisation internally.

This is the only behavioural change between grobotstxt and the original C++ library.
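
For example, a sketch of this behaviour, passing the same path both as raw UTF-8 and in percent-encoded form (the robots.txt content and user-agent name are placeholders; both calls are expected to behave identically, as normalisation happens internally):

// The raw UTF-8 URI is normalised internally before matching.
allowedRaw := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/café/menu.html")

// The pre-encoded RFC 3986 form should give the same result.
allowedEncoded := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/caf%C3%A9/menu.html")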

License

Like the original library, package grobotstxt is licensed under the terms of the Apache License, Version 2.0.

See LICENSE for more information.

History

  • v1.0.3 (2022-03-16) Updates from upstream: allow an additional misspelling of 'disallow'; additional tests. Make the icanhasrobot tool return better exit codes and work with multiple user-agents.
  • v1.0.2 (2022-03-16) Bugfix: Allow wider range of characters for user-agent.
  • v1.0.1 (2021-04-19) Updated modules. Switch from Travis CI to GitHub Actions.
  • v1.0.0 (2021-04-18) Tagged as stable.
  • v0.2.1 (2021-01-16) Expose more methods of RobotsMatcher as public. Thanks to anatolym
  • v0.2.0 (2020-04-24) Removed requirement for pre-encoded RFC3986 URIs on front-facing API.
  • v0.1.0 (2020-04-23) Initial release.

Documentation

Overview

Package grobotstxt is a Go port of Google's robots.txt parser and matcher C++ library.

See: https://github.com/google/robotstxt

Constants

This section is empty.

Variables

var AllowFrequentTypos = true

AllowFrequentTypos enables the parsing of common typos in robots.txt, such as DISALOW.
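
Because AllowFrequentTypos is an exported package variable, it can be changed before parsing or matching. A minimal sketch (robotsTxt and uri are assumed to be defined elsewhere):

// Stop accepting misspelled directives such as "Disalow:".
grobotstxt.AllowFrequentTypos = false

ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", uri)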

Functions

func AgentAllowed

func AgentAllowed(robotsBody string, userAgent string, uri string) bool

AgentAllowed parses the given robots.txt content, matching it against the given userAgent and URI, and returns true if the given URI is allowed to be fetched by the given user agent.

AgentAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).

Example
package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

func main() {

	robotsTxt := `
	# robots.txt with restricted area

	User-agent: *
	Disallow: /members/*
`
	ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html")
	fmt.Println(ok)

}
Output:

false

func AgentsAllowed

func AgentsAllowed(robotsBody string, userAgents []string, uri string) bool

AgentsAllowed parses the given robots.txt content, matching it against the given userAgents and URI, and returns true if the given URI is allowed to be fetched by any user agent in the list.

AgentsAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).
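
A minimal sketch of AgentsAllowed in use (the user-agent names are arbitrary):

package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

func main() {

	robotsTxt := `
	# robots.txt with restricted area

	User-agent: *
	Disallow: /members/*
`
	agents := []string{"FooBot/1.0", "BarBot/1.0"}

	// Both bots fall under the wildcard group, which disallows /members/*,
	// so this should print false.
	ok := grobotstxt.AgentsAllowed(robotsTxt, agents, "http://example.net/members/index.html")
	fmt.Println(ok)
}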

func Matches

func Matches(path, pattern string) bool

Matches implements robots.txt pattern matching.

Returns true if URI path matches the specified pattern. Pattern is anchored at the beginning of path. '$' is special only at the end of pattern.

Since both path and pattern are externally determined (by the webmaster), we make sure to have acceptable worst-case performance.
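
A few illustrative calls (a sketch; the expected results are noted in the comments):

grobotstxt.Matches("/members/index.html", "/members/*") // true: '*' matches any sequence of characters
grobotstxt.Matches("/members/index.html", "/admin/*")   // false: the pattern is anchored at the start of the path
grobotstxt.Matches("/index.html", "/index.htm$")        // false: '$' anchors the pattern to the end of the path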

func Parse

func Parse(robotsBody string, handler ParseHandler)

Parse uses the given robots.txt body and ParseHandler to create a Parser, and calls its Parse method.

func Sitemaps

func Sitemaps(robotsBody string) []string

Sitemaps extracts all "Sitemap:" values from the given robots.txt content.

Example
package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

func main() {

	robotsTxt := `
	# robots.txt with sitemaps

	User-agent: *
	Disallow: /members/*

	Sitemap: http://example.net/sitemap.xml
	Sitemap: http://example.net/sitemap2.xml
`
	sitemaps := grobotstxt.Sitemaps(robotsTxt)
	fmt.Println(sitemaps)

}
Output:

[http://example.net/sitemap.xml http://example.net/sitemap2.xml]

Types

type LongestMatchStrategy

type LongestMatchStrategy struct{}

LongestMatchStrategy implements the default robots.txt matching strategy.

The maximum number of characters matched by a pattern is returned as its match priority.

func (LongestMatchStrategy) MatchAllow

func (s LongestMatchStrategy) MatchAllow(path, pattern string) int

func (LongestMatchStrategy) MatchDisallow

func (s LongestMatchStrategy) MatchDisallow(path, pattern string) int

type MatchStrategy

type MatchStrategy interface {
	MatchAllow(path, pattern string) int
	MatchDisallow(path, pattern string) int
}

A MatchStrategy defines a strategy for matching individual lines in a robots.txt file.

Each Match* method should return a match priority, which is interpreted as:

match priority < 0:  No match.

match priority == 0: Match, but treat it as if matched an empty pattern.

match priority > 0:  Match.
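
As an illustration of this contract, a hypothetical strategy might return the pattern length as the priority of any match. This is only a sketch, not part of the package; note that the built-in LongestMatchStrategy instead returns the maximum number of characters matched by the pattern.

package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

// patternLengthStrategy is a hypothetical MatchStrategy, shown only to
// illustrate the priority contract: len(pattern) for a match, -1 otherwise.
type patternLengthStrategy struct{}

func (patternLengthStrategy) MatchAllow(path, pattern string) int {
	if grobotstxt.Matches(path, pattern) {
		return len(pattern)
	}
	return -1
}

func (patternLengthStrategy) MatchDisallow(path, pattern string) int {
	if grobotstxt.Matches(path, pattern) {
		return len(pattern)
	}
	return -1
}

func main() {
	robotsTxt := `
	User-agent: *
	Allow: /members/index.html
	Disallow: /members/*
`
	// Plug the custom strategy into a RobotsMatcher via its exported field.
	m := grobotstxt.NewRobotsMatcher()
	m.MatchStrategy = patternLengthStrategy{}

	fmt.Println(m.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html"))
}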

type ParseHandler

type ParseHandler interface {
	HandleRobotsStart()
	HandleRobotsEnd()
	HandleUserAgent(lineNum int, value string)
	HandleAllow(lineNum int, value string)
	HandleDisallow(lineNum int, value string)
	HandleSitemap(lineNum int, value string)
	HandleUnknownAction(lineNum int, action, value string)
}

ParseHandler is a handler for directives found in robots.txt. These callbacks are called by Parse() in the sequence they have been found in the file.
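
For example, a minimal sketch of a custom ParseHandler that simply prints each directive it encounters (the printHandler type and its behaviour are illustrative, not part of the package):

package main

import (
	"fmt"

	"github.com/jimsmart/grobotstxt"
)

// printHandler prints every directive reported by Parse.
type printHandler struct{}

func (printHandler) HandleRobotsStart() { fmt.Println("start of robots.txt") }
func (printHandler) HandleRobotsEnd()   { fmt.Println("end of robots.txt") }

func (printHandler) HandleUserAgent(lineNum int, value string) {
	fmt.Printf("%d: User-agent: %s\n", lineNum, value)
}

func (printHandler) HandleAllow(lineNum int, value string) {
	fmt.Printf("%d: Allow: %s\n", lineNum, value)
}

func (printHandler) HandleDisallow(lineNum int, value string) {
	fmt.Printf("%d: Disallow: %s\n", lineNum, value)
}

func (printHandler) HandleSitemap(lineNum int, value string) {
	fmt.Printf("%d: Sitemap: %s\n", lineNum, value)
}

func (printHandler) HandleUnknownAction(lineNum int, action, value string) {
	fmt.Printf("%d: %s: %s\n", lineNum, action, value)
}

func main() {
	robotsTxt := `
	User-agent: *
	Disallow: /members/*

	Sitemap: http://example.net/sitemap.xml
`
	grobotstxt.Parse(robotsTxt, printHandler{})
}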

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

func NewParser

func NewParser(robotsBody string, handler ParseHandler) *Parser

func (*Parser) Parse

func (p *Parser) Parse()

Parse parses the body of this Parser's robots.txt and emits parse callbacks. It accepts typical typos found in robots.txt, such as 'disalow'.

Note that this function accepts all kinds of input, but skips everything that does not look like a robots.txt directive.

type RobotsMatcher

type RobotsMatcher struct {
	MatchStrategy MatchStrategy
	// contains filtered or unexported fields
}

RobotsMatcher matches robots.txt content against URIs.

The RobotsMatcher uses a default match strategy for Allow/Disallow patterns, which is the official way Google's crawler matches robots.txt. It is also possible to provide a custom match strategy.

The entry point for the user is to call one of the AgentAllowed() methods, which return directly whether a URI is allowed according to the robots.txt and the crawl agent.

The RobotsMatcher can be re-used across URIs and robots.txt bodies, but it is not concurrency-safe.
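
A minimal sketch of using a RobotsMatcher directly (robotsTxt is assumed to hold robots.txt content; the accessor methods shown are assumed to report the result of the most recent call):

m := grobotstxt.NewRobotsMatcher()

// Is FooBot allowed to fetch this URI?
ok := m.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html")

// Details of the most recent match.
blocked := m.Disallowed() // true if a matching Disallow rule applied
line := m.MatchingLine()  // line number of the matching rule, or 0 if none matched

// The same matcher may be re-used for further checks, but not concurrently.
ok = m.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/index.html")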

func NewRobotsMatcher

func NewRobotsMatcher() *RobotsMatcher

NewRobotsMatcher creates a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the first-match strategy provisioned by the former internet draft. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives: in case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. For example, given a robots.txt file with the following rules

Allow: /
Disallow: /cgi-bin

it is pretty obvious what the webmaster wants: to allow crawling of every URI except those under /cgi-bin. However, under the expired internet draft's first-match strategy, such rules would allow crawlers to crawl everything.
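
Expressed in code, a sketch of the scenario above (with a User-agent line added so that the rules form a group):

robotsTxt := `
User-agent: *
Allow: /
Disallow: /cgi-bin
`

// Under longest-match, "Disallow: /cgi-bin" outranks "Allow: /" for this URI,
// so this call should return false.
ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.com/cgi-bin/script")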

func (*RobotsMatcher) AgentAllowed

func (m *RobotsMatcher) AgentAllowed(robotsBody string, userAgent string, uri string) bool

AgentAllowed parses the given robots.txt content, matching it against the given userAgent and URI, and returns true if the given URI is allowed to be fetched by the given user agent.

AgentAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).

func (*RobotsMatcher) AgentsAllowed

func (m *RobotsMatcher) AgentsAllowed(robotsBody string, userAgents []string, uri string) bool

AgentsAllowed parses the given robots.txt content, matching it against the given userAgents and URI, and returns true if the given URI is allowed to be fetched by any user agent in the list.

AgentsAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).

func (*RobotsMatcher) Disallowed added in v0.2.1

func (m *RobotsMatcher) Disallowed() bool

Disallowed returns true if we are disallowed from crawling a matching URI.

func (*RobotsMatcher) DisallowedIgnoreGlobal added in v0.2.1

func (m *RobotsMatcher) DisallowedIgnoreGlobal() bool

DisallowedIgnoreGlobal returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent, and bases its results only on the specified user agents.

func (*RobotsMatcher) EverSeenSpecificAgent added in v0.2.1

func (m *RobotsMatcher) EverSeenSpecificAgent() bool

EverSeenSpecificAgent returns true iff, when AgentsAllowed() was called, the robots file referred explicitly to one of the specified user agents.

func (*RobotsMatcher) HandleAllow

func (m *RobotsMatcher) HandleAllow(lineNum int, value string)

HandleAllow is called for every "Allow:" line in robots.txt.

func (*RobotsMatcher) HandleDisallow

func (m *RobotsMatcher) HandleDisallow(lineNum int, value string)

HandleDisallow is called for every "Disallow:" line in robots.txt.

func (*RobotsMatcher) HandleRobotsEnd

func (m *RobotsMatcher) HandleRobotsEnd()

HandleRobotsEnd is called at the end of parsing the robots.txt file.

For RobotsMatcher, this does nothing.

func (*RobotsMatcher) HandleRobotsStart

func (m *RobotsMatcher) HandleRobotsStart()

HandleRobotsStart is called at the start of parsing a robots.txt file, and resets all instance member variables.

func (*RobotsMatcher) HandleSitemap

func (m *RobotsMatcher) HandleSitemap(lineNum int, value string)

HandleSitemap is called for every "Sitemap:" line in robots.txt.

For RobotsMatcher, this does nothing.

func (*RobotsMatcher) HandleUnknownAction

func (m *RobotsMatcher) HandleUnknownAction(lineNum int, action, value string)

HandleUnknownAction is called for every unrecognized line in robots.txt.

For RobotsMatcher, this does nothing.

func (*RobotsMatcher) HandleUserAgent

func (m *RobotsMatcher) HandleUserAgent(lineNum int, userAgent string)

HandleUserAgent is called for every "User-Agent:" line in robots.txt.

func (*RobotsMatcher) MatchingLine added in v0.2.1

func (m *RobotsMatcher) MatchingLine() int

MatchingLine returns the line that matched or 0 if none matched.

Directories

Path Synopsis
cmd
