Documentation ¶
Overview ¶
geomi, 거미, is a basic spider for crawling specific webs. It is designed to crawl a site, or a subset of a site, indexing each page and the pages it links to. Any links that go outside of the provided base URL are noted but not followed.
Even though geomi is intended for crawling one's own site, or a client's site, it comes with some basic behavior configuration to make it a friendly bot:
- respects robots.txt TODO
- configurable concurrent walkers TODO
- configurable wait interval range TODO
- configurable max requests per: TODO
  - minute
  - hour
  - day
  - month
To start, go get the geomi package:
go get github.com/mohae/geomi
Import it into your code:
import "github.com/mohae/geomi"
Set up the site for the spider:
s := geomi.NewSite("http://example.com")
The URL passed to NewSite() is the base URL for the crawl; from there, pages are traversed breadth-first (BFS).
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	DefaultFetchInterval  time.Duration = time.Second // default min. time between fetches
	DefaultJitter         time.Duration = time.Second // default max additional, random fetch delay
	DefaultRobotUserAgent string        = "Googlebot (geomi)" // default user agent identifier for the bot
	DefaultUserAgent      string        = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" // the default user agent
)
Defaults
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
	CheckExternalLinks bool          // whether a HEAD should be performed on external links
	FetchInterval      time.Duration // the minimum time between fetching URLs
	Jitter             time.Duration // the max amount of jitter to add to the FetchInterval; the actual jitter is random
	RespectRobots      bool          // whether robots.txt should be respected
	RestrictToScheme   bool          // whether the crawl should be restricted to the base URL's scheme
	RobotUserAgent     string        // the user agent for the robot
	UserAgent          string        // the user agent to use
}
func NewConfig ¶
func NewConfig() *Config
NewConfig returns a Config struct with Geomi defaults applied.
func (*Config) SetFetchInterval ¶
SetFetchInterval sets both the FetchInterval and the Jitter to the passed value. The minimum delay between fetches will always be t, the maximum will always be 2t, and most fetches will wait a random value between t and 2t.
type Fetcher ¶
type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, r ResponseInfo, urls []string)
}
Fetcher is an interface that makes it easier to test. In the future, it may be useful for other purposes, but that would require exporting the method that uses it first.
type Page ¶
A Page represents a URL. This is usually some content with a number of elements, but it could be something as simple as the URL for cat.jpg on your site, representing that image.
type ResponseInfo ¶
ResponseInfo contains the status and error information from a GET.
TODO:
- Add expire date
- Add time for the request to return
- Should the body be in here?
type Spider ¶
type Spider struct {
	*queue.Queue
	sync.Mutex
	*url.URL // the start url
	Config *Config
	Pages  map[string]Page
	// contains filtered or unexported fields
}
Spider crawls the target. It contains all information needed to manage the crawl including keeping track of work to do, work done, and the results.
func NewSpider ¶
NewSpider returns a Spider with its site's baseUrl set. The baseUrl is the start point for the crawl and also restricts it: links outside the base URL are noted but not followed.
func NewSpiderFromConfig ¶
NewSpiderFromConfig returns a spider with its configuration set to the passed config.
func (*Spider) Crawl ¶
Crawl is the exposed method for starting a crawl at baseURL. The crawl private method does the actual work. The depth is the maximum depth, or distance, to crawl from the baseURL. If depth == -1, no limits are set and it is expected that the entire site will be crawled.
func (*Spider) ExternalHosts ¶
ExternalHosts returns a sorted list of external hosts.
func (*Spider) ExternalLinks ¶
ExternalLinks returns a sorted list of external links.