geomi

package module
v0.0.0-...-ad738db
Published: Sep 25, 2015 License: MIT Imports: 14 Imported by: 1

README

geomi

거미: Korean for spider, geomi is a spider that restricts itself to the provided web.

In progress

Geomi currently crawls all linked children of the provided URL that are not on external sites. Any found link that is part of the domain being crawled but whose path falls outside of the root node's URL will not be crawled; e.g., for a start point of http://golang.org/cmd/, if a link to http://golang.org/pkg/ is found, it will not be crawled because its path is not within http://golang.org/cmd/.

Geomi currently supports waiting between fetches and respecting the site's robots.txt, if it exists. Geomi does not yet support concurrent fetchers for crawling a site; at the moment, all crawling is done by a single process.

About

The purpose of geomi is to provide a package that crawls a specific site, or a subset of a site, indexing its links and content. This is accomplished by creating a spider with a base URL, including the scheme.

The base URL is the point from which the spider starts crawling. Any links that are either external to the site or outside of the base URL will be indexed but not crawled by geomi.

Any node that geomi crawls will have its response body saved, along with all links found on the page. The node's distance from the base URL is also recorded.

The depth to which geomi crawls is configurable. If no limit is desired, pass the spider a depth value of -1; this results in all children of the base URL that have links being crawled and indexed.

The amount of time geomi waits between fetches is configurable. By default, geomi does not wait between fetches. To set the amount of time geomi should wait after fetching a URL before fetching another, use Spider.SetFetchInterval(n), where n is an int64 representing the wait time in milliseconds. Geomi also adds a random amount of jitter to the wait, with a maximum additional wait equal to 20% of the passed fetch interval; e.g., setting the fetch interval to 1000ms (1 second) results in a random additional wait of 0-200ms, so the maximum wait between fetches would be 1200ms (1.2 seconds). The jitter exists mainly to prevent a thundering herd in concurrent fetching situations; currently, geomi does not support concurrent fetchers.
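As an illustration of the wait calculation described above, assuming the time and math/rand packages (this is not geomi's internal code; waitWithJitter is a hypothetical helper shown only to make the arithmetic concrete):

// waitWithJitter sketches the wait described above: the base fetch interval
// plus a random jitter of up to 20% of that interval.
func waitWithJitter(interval time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(interval)/5 + 1)) // 0-20% of interval
	return interval + jitter
}

// e.g. waitWithJitter(1000*time.Millisecond) returns a value between 1000ms and 1200ms.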

Geomi will respect a site's robots.txt unless it is explicitly told not to.

Geomi tracks which URLs have been fetched, the response status code, any error, the content body, and any non-fragment (#) links found in the body.

For an example of an implementation, see kraul. Its implementation may not be totally up to date, but I do my best to keep it current. Kraul may not use all of geomi's functionality.

Usage

TODO

  • add support for recording how long the response took.
  • record crawl time for historical purposes
  • add support for getting information out of the spider. Supplying a custom fetcher may be the route taken instead; I haven't decided yet. In general, support needs to be added for making the information that the spider gathers useful.

In Process

  • add concurrent fetching support with a configurable limit on the number of concurrent fetchers. The fetch interval and jitter functionality that geomi currently has was done with concurrency in mind.

Possible functionality

This is a list of functionality that may be added to geomi, but is not guaranteed. It is in addition to the core functionality that geomi will have once it is completed.

  • optional support for retrieving links outside of base. If enabled, geomi would retrieve the otherwise excluded link and save both the response body and response code. This would enable detection of changes to linked content, content that has moved, and dead links. Links on the retrieved page would not be extracted and geomi would not do additional crawling from the node in question.

Licensing

Copyright 2015 by Joel Scoble, rights reserved. Geomi is provided under the MIT license with no additional warranty or support provided, implicitly or explicitly. For more information, please refer to the included LICENSE file.

Documentation

Overview

geomi, 거미, is a basic spider for crawling specific webs. It was designed to crawl either a site, or a subset of a site, indexing the pages and where they link to. Any links that go outside of the provided base URL are noted but not followed.

Even though geomi was designed with the intended use of crawling one's own site, or a client's site, it does come with some basic behavior configuration to make it a friendly bot:

  • respects robots.txt TODO
  • configurable concurrent walkers TODO
  • configurable wait interval range TODO
  • configurable max requests per minute, hour, day, or month TODO

To start, go get the geomi package:

go get github.com/mohae/geomi

Import it into your code:

import "github.com/mohae/geomi"

Set up the site for the spider:

s := geomi.NewSite("http://example.com")

The URL passed to the NewSite() function is the base case; from there, a BFS traversal is done.
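A fuller usage sketch, based on the Spider API documented below (assuming the fmt and log packages; error handling is shortened and the exact contents of the returned message are not specified here):

// Create a spider rooted at, and restricted to, the base URL.
sp, err := geomi.NewSpider("http://example.com/")
if err != nil {
	log.Fatal(err)
}

// Crawl with no depth limit; -1 means crawl everything under the base URL.
msg, err := sp.Crawl(-1)
if err != nil {
	log.Fatal(err)
}
fmt.Println(msg)

// Pages maps each fetched URL to its crawl result.
fmt.Printf("crawled %d pages\n", len(sp.Pages))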

Index

Constants

This section is empty.

Variables

var (
	DefaultFetchInterval  time.Duration = time.Second                                                                                            // default min. time between fetches
	DefaultJitter         time.Duration = time.Second                                                                                            // default max additional, random, fetch delay
	DefaultRobotUserAgent string        = "Googlebot (geomi)"                                                                                    // default user agent identifier for the bot.
	DefaultUserAgent      string        = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" // the default user agent
)

Defaults

Functions

This section is empty.

Types

type Config

type Config struct {
	CheckExternalLinks bool          // Whether a HEAD should be performed on external links
	FetchInterval      time.Duration // The minimum time between fetching URLs
	Jitter             time.Duration // The max amount of jitter to add to the FetchInterval, the actual jitter is random.
	RespectRobots      bool          // Whether the robots.txt should be respected
	RestrictToScheme   bool          // Whether the crawl should be restricted to the base URL's scheme
	RobotUserAgent     string        // The user agent for the robot
	UserAgent          string        // The user agent to use.
}

func NewConfig

func NewConfig() *Config

NewConfig returns a Config struct with Geomi defaults applied.

func (*Config) SetFetchInterval

func (c *Config) SetFetchInterval(t time.Duration)

SetFetchInterval sets both the FetchInterval and the Jitter to the passed value. The minimum fetch interval will always be t, while the maximum will always be 2t, with most fetches waiting a random value between t and 2t.
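For example, assuming the time and log packages are imported, a spider could be given a half-second fetch interval like this sketch:

c := geomi.NewConfig()
c.SetFetchInterval(500 * time.Millisecond) // waits will fall between 500ms and 1s

sp, err := geomi.NewSpiderFromConfig("http://example.com/", c)
if err != nil {
	log.Fatal(err)
}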

type Fetcher

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page
	Fetch(url string) (body string, r ResponseInfo, urls []string)
}

Fetcher is an interface that makes testing easier. In the future, it may be useful for other purposes, but that would require exporting the method that uses it first.
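As a sketch, a test double that satisfies Fetcher might look like the following; geomi does not currently expose a way to inject a custom Fetcher into a Spider, so this only illustrates the interface (stubFetcher is a hypothetical name):

// stubFetcher returns canned bodies for known URLs and a 404 otherwise.
type stubFetcher struct {
	pages map[string]string // url -> canned response body
}

func (f stubFetcher) Fetch(url string) (body string, r geomi.ResponseInfo, urls []string) {
	body, ok := f.pages[url]
	if !ok {
		return "", geomi.ResponseInfo{Status: "404 Not Found", StatusCode: 404}, nil
	}
	return body, geomi.ResponseInfo{Status: "200 OK", StatusCode: 200}, nil
}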

type Page

type Page struct {
	*url.URL
	// contains filtered or unexported fields
}

A Page is a URL. It is usually some content with a number of elements, but it could be something as simple as the URL for cat.jpg on your site, representing that image.

type ResponseInfo

type ResponseInfo struct {
	Status     string
	StatusCode int
	Err        error
}

ResponseInfo contains the status and error information from a GET request. TODO:

Add Expire date
Add time for request to return
Should the body be in here?

type Site

type Site struct {
	*url.URL
}

Site is a type that implements Fetcher.

func (Site) Fetch

func (s Site) Fetch(url string) (body string, r ResponseInfo, urls []string)

Fetch implements Fetcher. TODO: make the design cleaner.

type Spider

type Spider struct {
	*queue.Queue
	sync.Mutex

	*url.URL // the start url
	Config   *Config

	Pages map[string]Page
	// contains filtered or unexported fields
}

Spider crawls the target. It contains all information needed to manage the crawl including keeping track of work to do, work done, and the results.

func NewSpider

func NewSpider(start string) (*Spider, error)

NewSpider returns a Spider with its site's baseUrl set. The baseUrl is the start point for the crawl; it is also the restriction on the crawl.

func NewSpiderFromConfig

func NewSpiderFromConfig(start string, c *Config) (*Spider, error)

NewSpiderFromConfig returns a spider with its configuration set to the passed config.

func (*Spider) Crawl

func (s *Spider) Crawl(depth int) (message string, err error)

Crawl is the exposed method for starting a crawl at baseURL. The crawl private method does the actual work. The depth is the maximum depth, or distance, to crawl from the baseURL. If depth == -1, no limits are set and it is expected that the entire site will be crawled.

func (*Spider) ExternalHosts

func (s *Spider) ExternalHosts() []string

ExternalHosts returns a sorted list of external hosts

func (*Spider) ExternalLinks

func (s *Spider) ExternalLinks() []string

ExternalLinks returns a sorted list of external links
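For example, after a crawl completes, the external references a site makes could be reported like this sketch (assuming sp is a crawled *Spider and fmt is imported):

// List every external host and link the crawl encountered.
for _, host := range sp.ExternalHosts() {
	fmt.Println("external host:", host)
}
for _, link := range sp.ExternalLinks() {
	fmt.Println("external link:", link)
}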
