Documentation ¶
Overview ¶
Package grobotstxt is a Go port of Google's robots.txt parser and matcher C++ library.
Index ¶
- Variables
- func AgentAllowed(robotsBody string, userAgent string, uri string) bool
- func AgentsAllowed(robotsBody string, userAgents []string, uri string) bool
- func Matches(path, pattern string) bool
- func Parse(robotsBody string, handler ParseHandler)
- func Sitemaps(robotsBody string) []string
- type LongestMatchStrategy
- type MatchStrategy
- type ParseHandler
- type Parser
- type RobotsMatcher
- func (m *RobotsMatcher) AgentAllowed(robotsBody string, userAgent string, uri string) bool
- func (m *RobotsMatcher) AgentsAllowed(robotsBody string, userAgents []string, uri string) bool
- func (m *RobotsMatcher) Disallowed() bool
- func (m *RobotsMatcher) DisallowedIgnoreGlobal() bool
- func (m *RobotsMatcher) EverSeenSpecificAgent() bool
- func (m *RobotsMatcher) HandleAllow(lineNum int, value string)
- func (m *RobotsMatcher) HandleDisallow(lineNum int, value string)
- func (m *RobotsMatcher) HandleRobotsEnd()
- func (m *RobotsMatcher) HandleRobotsStart()
- func (m *RobotsMatcher) HandleSitemap(lineNum int, value string)
- func (m *RobotsMatcher) HandleUnknownAction(lineNum int, action, value string)
- func (m *RobotsMatcher) HandleUserAgent(lineNum int, userAgent string)
- func (m *RobotsMatcher) MatchingLine() int
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var AllowFrequentTypos = true
AllowFrequentTypos enables the parsing of common typos in robots.txt, such as DISALOW.
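As an illustrative sketch (the robots.txt content, user agent and URI below are hypothetical), toggling this variable changes whether a misspelled directive is honoured:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    // Hypothetical robots.txt containing a common misspelling of "Disallow".
    robotsTxt := "User-agent: *\nDisalow: /private/\n"

    // With AllowFrequentTypos at its default (true), the typo should be treated
    // as a Disallow rule, so this URI should not be allowed.
    fmt.Println(grobotstxt.AgentAllowed(robotsTxt, "FooBot", "http://example.net/private/a.html"))

    // With typo handling disabled, the misspelled line should be ignored,
    // so the same URI should be allowed.
    grobotstxt.AllowFrequentTypos = false
    fmt.Println(grobotstxt.AgentAllowed(robotsTxt, "FooBot", "http://example.net/private/a.html"))
}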
Functions ¶
func AgentAllowed ¶
func AgentAllowed(robotsBody string, userAgent string, uri string) bool
AgentAllowed parses the given robots.txt content, matching it against the given userAgent and URI, and returns true if the given URI is allowed to be fetched by the given user agent.
AgentAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).
Example ¶
package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    robotsTxt := `
# robots.txt with restricted area

User-agent: *
Disallow: /members/*
`

    ok := grobotstxt.AgentAllowed(robotsTxt, "FooBot/1.0", "http://example.net/members/index.html")
    fmt.Println(ok)
}
Output: false
func AgentsAllowed ¶
func AgentsAllowed(robotsBody string, userAgents []string, uri string) bool
AgentsAllowed parses the given robots.txt content, matching it against the given userAgents and URI, and returns true if the given URI is allowed to be fetched by any user agent in the list.
AgentsAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).
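A minimal sketch of calling AgentsAllowed with more than one user agent; the robots.txt content, agent names and URI are illustrative:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    robotsTxt := `
User-agent: FooBot
Disallow: /members/

User-agent: BarBot
Disallow: /private/
`

    // Neither group disallows this URI, so it should be allowed
    // for this list of agents.
    ok := grobotstxt.AgentsAllowed(robotsTxt, []string{"FooBot", "BarBot"}, "http://example.net/index.html")
    fmt.Println(ok)
}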
func Matches ¶
func Matches(path, pattern string) bool
Matches implements robots.txt pattern matching.
Returns true if URI path matches the specified pattern. Pattern is anchored at the beginning of path. '$' is special only at the end of pattern.
Since both path and pattern are externally determined (by the webmaster), we make sure to have acceptable worst-case performance.
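A small sketch of calling Matches directly; the paths and patterns are illustrative:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    // '*' matches any sequence of characters within the path.
    fmt.Println(grobotstxt.Matches("/members/index.html", "/members/*")) // true

    // Patterns are anchored at the beginning of the path.
    fmt.Println(grobotstxt.Matches("/public/members/", "/members/")) // false

    // '$' is special only at the end of a pattern, where it anchors the match
    // to the end of the path.
    fmt.Println(grobotstxt.Matches("/file.html", "/*.html$"))     // true
    fmt.Println(grobotstxt.Matches("/file.html?x=1", "/*.html$")) // false
}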
func Parse ¶
func Parse(robotsBody string, handler ParseHandler)
Parse uses the given robots.txt body and ParseHandler to create a Parser, and calls its Parse method.
func Sitemaps ¶
func Sitemaps(robotsBody string) []string
Sitemaps extracts all "Sitemap:" values from the given robots.txt content.
Example ¶
package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    robotsTxt := `
# robots.txt with sitemaps

User-agent: *
Disallow: /members/*

Sitemap: http://example.net/sitemap.xml
Sitemap: http://example.net/sitemap2.xml
`

    sitemaps := grobotstxt.Sitemaps(robotsTxt)
    fmt.Println(sitemaps)
}
Output: [http://example.net/sitemap.xml http://example.net/sitemap2.xml]
Types ¶
type LongestMatchStrategy ¶
type LongestMatchStrategy struct{}
LongestMatchStrategy implements the default robots.txt matching strategy.
The maximum number of characters matched by a pattern is returned as its match priority.
func (LongestMatchStrategy) MatchAllow ¶
func (s LongestMatchStrategy) MatchAllow(path, pattern string) int
func (LongestMatchStrategy) MatchDisallow ¶
func (s LongestMatchStrategy) MatchDisallow(path, pattern string) int
type MatchStrategy ¶
type MatchStrategy interface {
    MatchAllow(path, pattern string) int
    MatchDisallow(path, pattern string) int
}
A MatchStrategy defines a strategy for matching individual lines in a robots.txt file.
Each Match* method should return a match priority, which is interpreted as:
match priority < 0:  No match.
match priority == 0: Match, but treat it as if matched an empty pattern.
match priority > 0:  Match.
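A custom strategy can be supplied to a RobotsMatcher through its exported MatchStrategy field (see RobotsMatcher below). The following sketch, with a hypothetical type name, reports a matching pattern's length as its priority, roughly mirroring the longest-match behaviour:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

// lengthStrategy is a hypothetical MatchStrategy: a matching pattern's
// priority is its length; a non-matching pattern reports no match.
type lengthStrategy struct{}

func (lengthStrategy) MatchAllow(path, pattern string) int {
    if grobotstxt.Matches(path, pattern) {
        return len(pattern) // match priority > 0: match
    }
    return -1 // match priority < 0: no match
}

func (lengthStrategy) MatchDisallow(path, pattern string) int {
    if grobotstxt.Matches(path, pattern) {
        return len(pattern)
    }
    return -1
}

func main() {
    m := grobotstxt.NewRobotsMatcher()
    m.MatchStrategy = lengthStrategy{}

    robotsTxt := "User-agent: *\nDisallow: /members/\n"
    fmt.Println(m.AgentAllowed(robotsTxt, "FooBot", "http://example.net/members/index.html"))
}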
type ParseHandler ¶
type ParseHandler interface {
    HandleRobotsStart()
    HandleRobotsEnd()
    HandleUserAgent(lineNum int, value string)
    HandleAllow(lineNum int, value string)
    HandleDisallow(lineNum int, value string)
    HandleSitemap(lineNum int, value string)
    HandleUnknownAction(lineNum int, action, value string)
}
ParseHandler is a handler for directives found in robots.txt. These callbacks are called by Parse() in the sequence in which the directives were found in the file.
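A minimal sketch of a custom ParseHandler driven by Parse; the handler type and the robots.txt content are illustrative:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

// directivePrinter is a hypothetical ParseHandler that simply prints
// each directive as Parse emits it.
type directivePrinter struct{}

func (directivePrinter) HandleRobotsStart() { fmt.Println("start of robots.txt") }
func (directivePrinter) HandleRobotsEnd()   { fmt.Println("end of robots.txt") }

func (directivePrinter) HandleUserAgent(lineNum int, value string) {
    fmt.Printf("line %d: User-agent: %s\n", lineNum, value)
}

func (directivePrinter) HandleAllow(lineNum int, value string) {
    fmt.Printf("line %d: Allow: %s\n", lineNum, value)
}

func (directivePrinter) HandleDisallow(lineNum int, value string) {
    fmt.Printf("line %d: Disallow: %s\n", lineNum, value)
}

func (directivePrinter) HandleSitemap(lineNum int, value string) {
    fmt.Printf("line %d: Sitemap: %s\n", lineNum, value)
}

func (directivePrinter) HandleUnknownAction(lineNum int, action, value string) {
    fmt.Printf("line %d: unknown action %q: %s\n", lineNum, action, value)
}

func main() {
    robotsTxt := "User-agent: *\nDisallow: /members/\nSitemap: http://example.net/sitemap.xml\n"
    grobotstxt.Parse(robotsTxt, directivePrinter{})
}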
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
func NewParser ¶
func NewParser(robotsBody string, handler ParseHandler) *Parser
func (*Parser) Parse ¶
func (p *Parser) Parse()
Parse parses the body of this Parser's robots.txt and emits parse callbacks. It will accept typical typos found in robots.txt, such as 'disalow'.
Note: this function will accept all kinds of input, but will skip everything that does not look like a robots.txt directive.
type RobotsMatcher ¶
type RobotsMatcher struct {
    MatchStrategy MatchStrategy
    // contains filtered or unexported fields
}
RobotsMatcher matches robots.txt rules against URIs.
The RobotsMatcher uses a default match strategy for Allow/Disallow patterns, which is the official way Google's crawler matches robots.txt. It is also possible to provide a custom match strategy.
The entry point for the user is to call one of the AgentAllowed() methods, which report directly whether a URI is allowed according to the robots.txt and the crawl agent.
A RobotsMatcher can be re-used across URIs and robots.txt bodies, but it is not concurrency-safe.
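A minimal sketch of using a RobotsMatcher directly and then inspecting the outcome with Disallowed and MatchingLine; the robots.txt content, user agent and URIs are illustrative:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    m := grobotstxt.NewRobotsMatcher()

    robotsTxt := "User-agent: *\nDisallow: /members/\n"

    // This URI falls inside the disallowed area, so the call should return false.
    ok := m.AgentAllowed(robotsTxt, "FooBot", "http://example.net/members/index.html")
    fmt.Println(ok)
    fmt.Println(m.Disallowed())   // expected true after the disallowed match above
    fmt.Println(m.MatchingLine()) // line number of the rule that matched, or 0

    // The matcher may be re-used for another robots.txt/URI pair,
    // but not concurrently from multiple goroutines.
    ok = m.AgentAllowed(robotsTxt, "FooBot", "http://example.net/index.html")
    fmt.Println(ok)
}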
func NewRobotsMatcher ¶
func NewRobotsMatcher() *RobotsMatcher
NewRobotsMatcher creates a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the earlier internet draft, which specified a first-match strategy. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives. In case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. For example, given a robots.txt file with the following rules
Allow: /
Disallow: /cgi-bin
it's pretty obvious what the webmaster wants: they want to allow crawling of every URI except /cgi-bin. However, according to the expired internet draft, such a rule means crawlers should be allowed to crawl everything.
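A small sketch of the longest-match behaviour described above, using the package-level AgentAllowed helper with illustrative URIs:

package main

import (
    "fmt"

    "github.com/jimsmart/grobotstxt"
)

func main() {
    robotsTxt := "User-agent: *\nAllow: /\nDisallow: /cgi-bin\n"

    // "Disallow: /cgi-bin" is the longer match for this URI, so it wins.
    fmt.Println(grobotstxt.AgentAllowed(robotsTxt, "FooBot", "http://example.net/cgi-bin/script")) // false

    // Elsewhere only "Allow: /" matches, so crawling is allowed.
    fmt.Println(grobotstxt.AgentAllowed(robotsTxt, "FooBot", "http://example.net/index.html")) // true
}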
func (*RobotsMatcher) AgentAllowed ¶
func (m *RobotsMatcher) AgentAllowed(robotsBody string, userAgent string, uri string) bool
AgentAllowed parses the given robots.txt content, matching it against the given userAgent and URI, and returns true if the given URI is allowed to be fetched by the given user agent.
AgentAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).
func (*RobotsMatcher) AgentsAllowed ¶
func (m *RobotsMatcher) AgentsAllowed(robotsBody string, userAgents []string, uri string) bool
AgentsAllowed parses the given robots.txt content, matching it against the given userAgents and URI, and returns true if the given URI is allowed to be fetched by any user agent in the list.
AgentsAllowed will also return false if the given URI is invalid (cannot successfully be parsed by url.Parse).
func (*RobotsMatcher) Disallowed ¶ added in v0.2.1
func (m *RobotsMatcher) Disallowed() bool
Disallowed returns true if we are disallowed from crawling a matching URI.
func (*RobotsMatcher) DisallowedIgnoreGlobal ¶ added in v0.2.1
func (m *RobotsMatcher) DisallowedIgnoreGlobal() bool
DisallowedIgnoreGlobal returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent, and bases its results only on the specified user agents.
func (*RobotsMatcher) EverSeenSpecificAgent ¶ added in v0.2.1
func (m *RobotsMatcher) EverSeenSpecificAgent() bool
EverSeenSpecificAgent returns true iff, when AgentsAllowed() was called, the robots file referred explicitly to one of the specified user agents.
func (*RobotsMatcher) HandleAllow ¶
func (m *RobotsMatcher) HandleAllow(lineNum int, value string)
HandleAllow is called for every "Allow:" line in robots.txt.
func (*RobotsMatcher) HandleDisallow ¶
func (m *RobotsMatcher) HandleDisallow(lineNum int, value string)
HandleDisallow is called for every "Disallow:" line in robots.txt.
func (*RobotsMatcher) HandleRobotsEnd ¶
func (m *RobotsMatcher) HandleRobotsEnd()
HandleRobotsEnd is called at the end of parsing the robots.txt file.
For RobotsMatcher, this does nothing.
func (*RobotsMatcher) HandleRobotsStart ¶
func (m *RobotsMatcher) HandleRobotsStart()
HandleRobotsStart is called at the start of parsing a robots.txt file, and resets all instance member variables.
func (*RobotsMatcher) HandleSitemap ¶
func (m *RobotsMatcher) HandleSitemap(lineNum int, value string)
HandleSitemap is called for every "Sitemap:" line in robots.txt.
For RobotsMatcher, this does nothing.
func (*RobotsMatcher) HandleUnknownAction ¶
func (m *RobotsMatcher) HandleUnknownAction(lineNum int, action, value string)
HandleUnknownAction is called for every unrecognized line in robots.txt.
For RobotsMatcher, this does nothing.
func (*RobotsMatcher) HandleUserAgent ¶
func (m *RobotsMatcher) HandleUserAgent(lineNum int, userAgent string)
HandleUserAgent is called for every "User-Agent:" line in robots.txt.
func (*RobotsMatcher) MatchingLine ¶ added in v0.2.1
func (m *RobotsMatcher) MatchingLine() int
MatchingLine returns the line that matched or 0 if none matched.