crawler

command

v0.0.0-...-33f3de0 Published: Dec 27, 2018 License: MIT Imports: 24 Imported by: 0


This nearly complete rewrite of the old bot has some nice new features:

Flexible crawling backend support: Make it easy to implement additional protocols like HTTP, NFS, …
HTTP crawling backend: Finally get rid of old and busted FTP!
Crawling rate limiting: Limit the number of requests so file servers can breathe again
Last Seen Timestamp: Files which have not been seen for a long time can be prioritized lower in results
Elasticsearch >5 support
Dependency management: Get rid of shitty gopm

Dependencies are managed using dep.

To add a new crawling backend:

Create a new crawler implementing the Crawler interface (see crawler.go).
Make sure your crawler supports:
- robots.txt parsing (see existing crawlers for examples)
- rate limiting
Add the initialization of your crawler to the switch in crawler.go.
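
The steps above might look roughly like the following sketch. The real interface lives in crawler.go and is not reproduced here, so the `Crawler` method set, the `dummyCrawler` type, the `newCrawler` function, and the `dummy` scheme are all hypothetical. The sketch also assumes that the switch selects a backend by the scheme of the entry URI.

```go
package main

import (
	"fmt"
	"net/url"
)

// Crawler is a HYPOTHETICAL stand-in for the interface in crawler.go;
// the real method set may differ.
type Crawler interface {
	// Crawl walks the server once and reports every file path it finds.
	Crawl(found func(path string)) error
}

// dummyCrawler is an illustrative backend that "finds" a fixed list of files.
type dummyCrawler struct {
	files []string
}

// Crawl reports each known file to the callback.
func (d *dummyCrawler) Crawl(found func(path string)) error {
	for _, f := range d.files {
		found(f)
	}
	return nil
}

// newCrawler mirrors the idea of the switch in crawler.go: choose a backend
// based on the entry URI. Switching on the scheme is an assumption.
func newCrawler(entry string) (Crawler, error) {
	u, err := url.Parse(entry)
	if err != nil {
		return nil, err
	}
	switch u.Scheme {
	case "dummy": // made-up scheme for this example
		return &dummyCrawler{files: []string{"/a.txt", "/b/c.txt"}}, nil
	default:
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

func main() {
	c, err := newCrawler("dummy://host/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	_ = c.Crawl(func(path string) { fmt.Println("found", path) })
}
```

A real backend would additionally consult robots.txt and throttle its requests inside `Crawl`, as the checklist requires.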

FTP crawler options:

entry
- string
- required
- no default
- URI of the entry point, optionally including path, port, username, and password, e.g. ftp://hans:geheim@foo:21/bar/baz/
- Default username and password: anonymous:anonymous
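
As an illustration of how such an entry URI breaks down, here is a minimal sketch using Go's standard `net/url` package. The `entryInfo` type and `parseEntry` function are hypothetical and not taken from the crawler's source. Falling back to port 21 (the standard FTP port) and keeping the default password when only a username is given are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// entryInfo holds the parts of an entry URI. Hypothetical type.
type entryInfo struct {
	Host, Port, Path, User, Password string
}

// parseEntry splits an entry URI and applies the documented
// anonymous:anonymous default when no credentials are given.
func parseEntry(entry string) (entryInfo, error) {
	u, err := url.Parse(entry)
	if err != nil {
		return entryInfo{}, err
	}
	info := entryInfo{
		Host:     u.Hostname(),
		Port:     u.Port(),
		Path:     u.Path,
		User:     "anonymous",
		Password: "anonymous",
	}
	if info.Port == "" {
		info.Port = "21" // assumption: fall back to the standard FTP port
	}
	if u.User != nil {
		info.User = u.User.Username()
		if pw, ok := u.User.Password(); ok {
			info.Password = pw
		}
	}
	return info, nil
}

func main() {
	for _, entry := range []string{"ftp://hans:geheim@foo:21/bar/baz/", "ftp://foo"} {
		info, err := parseEntry(entry)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", info)
	}
}
```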
turnDelay
- integer
- optional
- default: 10
- Number of seconds to wait after the whole server has been crawled before starting again
maxRequestPerSecond
- integer
- optional
- default: 0
- Maximum number of requests per second. 0 means unlimited.
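
One way to honor such a setting is a ticker-based limiter, sketched below with only the standard library. The `limiter` type and the `newLimiter` and `timeRequests` functions are hypothetical, and the crawler itself may implement rate limiting differently. A value of 0 disables throttling, matching the description above.

```go
package main

import (
	"fmt"
	"time"
)

// limiter is a hypothetical helper, not taken from the crawler's source.
type limiter struct {
	tick *time.Ticker // nil means "unlimited"
}

// newLimiter returns a limiter allowing at most maxRequestPerSecond
// requests per second; 0 (or less) means unlimited.
func newLimiter(maxRequestPerSecond int) *limiter {
	if maxRequestPerSecond <= 0 {
		return &limiter{}
	}
	interval := time.Second / time.Duration(maxRequestPerSecond)
	if interval <= 0 {
		// Rates above one request per nanosecond cannot be throttled.
		return &limiter{}
	}
	return &limiter{tick: time.NewTicker(interval)}
}

// Wait blocks until the next request may be sent.
func (l *limiter) Wait() {
	if l.tick != nil {
		<-l.tick.C
	}
}

// Stop releases the underlying ticker.
func (l *limiter) Stop() {
	if l.tick != nil {
		l.tick.Stop()
	}
}

// timeRequests measures how long n throttled "requests" take.
func timeRequests(maxRequestPerSecond, n int) time.Duration {
	l := newLimiter(maxRequestPerSecond)
	defer l.Stop()
	start := time.Now()
	for i := 0; i < n; i++ {
		l.Wait() // a real crawler would issue its request after this returns
	}
	return time.Since(start)
}

func main() {
	fmt.Println("limited:  ", timeRequests(100, 5))
	fmt.Println("unlimited:", timeRequests(0, 5))
}
```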
robotName
- string
- optional
- default: TortureBot
- Name of the robot to check against in /robots.txt
obeyRobotsTxt
- boolean
- optional
- default: true
- Whether to check paths against /robots.txt

HTTP crawler options:

entry
- string
- required
- no default
- URI of the entry point, optionally including path, port, username, and password, e.g. http://hans:geheim@foo:21/bar/baz/
turnDelay
- integer
- optional
- default: 10
- Number of seconds to wait after the whole server has been crawled before starting again
maxRequestPerSecond
- integer
- optional
- default: 0
- Maximum number of requests per second. 0 means unlimited.
robotName
- string
- optional
- default: TortureBot
- Name of the robot to check against in /robots.txt
obeyRobotsTxt
- boolean
- optional
- default: true
- Whether to check paths against /robots.txt
maxBodySize
- integer
- optional
- default: 10000000 (bytes, about 10 MB)
- Maximum downloaded body size. To find links, the crawler has to download pages.
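
A body size cap like this is commonly enforced with `io.LimitReader` from the standard library. The sketch below is an assumption about how it could be done, not the crawler's actual code; `readLimited` and `truncatePage` are hypothetical names, and a string stands in for an HTTP response body.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readLimited stops reading after maxBodySize bytes, so an oversized
// page cannot exhaust memory while it is scanned for links.
func readLimited(body io.Reader, maxBodySize int64) ([]byte, error) {
	return io.ReadAll(io.LimitReader(body, maxBodySize))
}

// truncatePage is a demo wrapper that treats a string as the response body.
func truncatePage(page string, maxBodySize int64) (string, error) {
	data, err := readLimited(strings.NewReader(page), maxBodySize)
	return string(data), err
}

func main() {
	page := strings.Repeat("x", 100)
	got, err := truncatePage(page, 10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(got), "of", len(page), "bytes kept")
}
```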
maxPathDepth
- integer
- optional
- default: 20
- Maximum path depth. Used to "catch" symlink loops
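
To show why a depth limit stops symlink loops, here is a small sketch that counts path segments and rejects paths deeper than the limit. The `pathDepth` and `tooDeep` functions are hypothetical, and whether the real crawler compares with "greater than" or "greater than or equal" is an assumption.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pathDepth counts the non-empty segments of a cleaned, slash-separated path.
func pathDepth(p string) int {
	depth := 0
	for _, seg := range strings.Split(path.Clean("/"+p), "/") {
		if seg != "" {
			depth++
		}
	}
	return depth
}

// tooDeep reports whether p exceeds maxPathDepth.
// Assumption: a path exactly at the limit is still crawled.
func tooDeep(p string, maxPathDepth int) bool {
	return pathDepth(p) > maxPathDepth
}

func main() {
	// A symlink pointing at its own parent keeps producing deeper paths.
	loop := "/pub" + strings.Repeat("/self", 25)
	fmt.Println(pathDepth("/bar/baz/"), tooDeep("/bar/baz/", 20))
	fmt.Println(pathDepth(loop), tooDeep(loop, 20))
}
```

Because every pass through the loop adds one more segment, the path eventually exceeds the limit and the crawler can stop descending.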
