geomi

package module
v0.0.0-...-ad738db
Published: Sep 25, 2015 License: MIT Imports: 14 Imported by: 1

README

geomi

거미: Korean for spider, geomi is a spider that restricts itself to the provided web.

In progress

Geomi currently crawls all linked children of the provided URL that are not on external sites. Any found link that is part of the domain being crawled but whose path falls outside of the root node's URL will not be crawled; e.g., for a start point of http://golang.org/cmd/, if a link to http://golang.org/pkg/ is found, it will not be crawled because its path is not within http://golang.org/cmd/.

Geomi currently supports waiting between fetches and respecting the site's robots.txt, if it exists. Geomi does not yet support concurrent fetchers for crawling a site; at the moment, all crawling is done by a single process.

About

The purpose of geomi is to provide a package that crawls a specific site, or a subset of a site, indexing its links and content. This is accomplished by creating a spider with a base URL, including the scheme.

The base URL is the point from which the spider starts crawling. Any links that are either external to the site or outside of the base URL will be indexed but not crawled by geomi.

Any node that geomi crawls will have its response body saved, along with all links found on the page. The node's distance from the base URL is also recorded.

The depth to which geomi crawls is configurable. If no limit is desired, pass the spider a depth value of -1; this results in all children of the base URL that have links being crawled and indexed.

The amount of time geomi waits between fetches is configurable. By default, geomi does not wait between fetches. To set the amount of time geomi should wait after fetching a URL before fetching another, use Spider.SetFetchInterval(n), where n is an int64 representing the wait time in milliseconds. Geomi also adds a random amount of jitter to the wait, with a maximum additional wait equal to 20% of the passed fetch interval; e.g., setting the fetch interval to 1000ms (1 second) results in a random additional wait of 0-200ms, so the maximum wait between fetches would be 1200ms (1.2 seconds). The jitter exists mainly to prevent a thundering herd in concurrent fetching situations; currently, geomi does not support concurrent fetchers.
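As an illustration of the wait calculation described above, assuming the time and math/rand packages (this is not geomi's internal code; waitWithJitter is a hypothetical helper shown only to make the arithmetic concrete):

// waitWithJitter sketches the wait described above: the base fetch interval
// plus a random jitter of up to 20% of that interval.
func waitWithJitter(interval time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(interval)/5 + 1)) // 0-20% of interval
	return interval + jitter
}

// e.g. waitWithJitter(1000*time.Millisecond) returns a value between 1000ms and 1200ms.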

Geomi will respect a site's robots.txt unless it is explicitly told not to.

Geomi tracks which URLs have been fetched, the response status code, any error, the content body, and any non-fragment (#) links found in the body.

For an example of an implementation, see kraul. Its implementation may not be totally up to date, but I do my best to keep it current. Kraul may not use all of geomi's functionality.

Usage

TODO

  • add support for recording how long the response took.
  • record crawl time for historical purposes
  • add support for getting information out of the spider. Supplying a custom fetcher may be the route taken instead; I haven't decided yet. In general, support needs to be added for making the information that the spider gathers useful.

In Process

  • add concurrent fetching support with a configurable limit on the number of concurrent fetchers. The fetch interval and jitter functionality that geomi currently has was done with concurrency in mind.

Possible functionality

This is a list of functionality that may be added to geomi, but is not guaranteed. It is in addition to the core functionality that geomi will have once it is completed.

  • optional support for retrieving links outside of base. If enabled, geomi would retrieve the otherwise excluded link and save both the response body and response code. This would enable detection of changes to linked content, content that has moved, and dead links. Links on the retrieved page would not be extracted and geomi would not do additional crawling from the node in question.

Licensing

Copyright 2015 by Joel Scoble, rights reserved. Geomi is provided under the MIT license with no additional warranty or support provided, implicitly or explicitly. For more information, please refer to the included LICENSE file.

Documentation

Overview

geomi, 거미, is a basic spider for crawling specific webs. It was designed to crawl either a site, or a subset of a site, indexing the pages and where they link to. Any links that go outside of the provided base URL are noted but not followed.

Even though geomi was designed with the intended use of crawling one's own site, or a client's site, it does come with some basic behavior configuration to make it a friendly bot:

  • respects robots.txt TODO
  • configurable concurrent walkers TODO
  • configurable wait interval range TODO
  • configurable max requests per minute, hour, day, or month TODO

To start, go get the geomi package:

go get github.com/mohae/geomi

Import it into your code:

import "github.com/mohae/geomi"

Set up the site for the spider:

s := geomi.NewSite("http://example.com")

The URL passed to the NewSite() function is the base case; from there, a BFS traversal is done.
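A fuller usage sketch, based on the Spider API documented below (assuming the fmt and log packages; error handling is shortened and the exact contents of the returned message are not specified here):

// Create a spider rooted at, and restricted to, the base URL.
sp, err := geomi.NewSpider("http://example.com/")
if err != nil {
	log.Fatal(err)
}

// Crawl with no depth limit; -1 means crawl everything under the base URL.
msg, err := sp.Crawl(-1)
if err != nil {
	log.Fatal(err)
}
fmt.Println(msg)

// Pages maps each fetched URL to its crawl result.
fmt.Printf("crawled %d pages\n", len(sp.Pages))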

Index

Constants

This section is empty.

Variables

var (
	DefaultFetchInterval  time.Duration = time.Second                                                                                            // default min. time between fetches
	DefaultJitter         time.Duration = time.Second                                                                                            // default max additional, random, fetch delay
	DefaultRobotUserAgent string        = "Googlebot (geomi)"                                                                                    // default user agent identifier for the bot.
	DefaultUserAgent      string        = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" // the default user agent
)

Defaults

Functions

This section is empty.

Types

type Config

type Config struct {
	CheckExternalLinks bool          // Whether a HEAD should be performed on external links
	FetchInterval      time.Duration // The minimum time between fetching URLs
	Jitter             time.Duration // The max amount of jitter to add to the FetchInterval, the actual jitter is random.
	RespectRobots      bool          // Whether the robots.txt should be respected
	RestrictToScheme   bool          // Whether the crawl should be restricted to the base URL's scheme
	RobotUserAgent     string        // The user agent for the robot
	UserAgent          string        // The user agent to use.
}

func NewConfig

func NewConfig() *Config

NewConfig returns a Config struct with Geomi defaults applied.

func (*Config) SetFetchInterval

func (c *Config) SetFetchInterval(t time.Duration)

SetFetchInterval sets both the FetchInterval and the Jitter to the passed value. The minimum fetch interval will always be t, while the maximum will always be 2t, with most fetches waiting a random value between t and 2t.
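For example, assuming the time and log packages are imported, a spider could be given a half-second fetch interval like this sketch:

c := geomi.NewConfig()
c.SetFetchInterval(500 * time.Millisecond) // waits will fall between 500ms and 1s

sp, err := geomi.NewSpiderFromConfig("http://example.com/", c)
if err != nil {
	log.Fatal(err)
}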

type Fetcher

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page
	Fetch(url string) (body string, r ResponseInfo, urls []string)
}

Fetcher is an interface that makes testing easier. In the future, it may be useful for other purposes, but that would require exporting the method that uses it first.
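As a sketch, a test double that satisfies Fetcher might look like the following; geomi does not currently expose a way to inject a custom Fetcher into a Spider, so this only illustrates the interface (stubFetcher is a hypothetical name):

// stubFetcher returns canned bodies for known URLs and a 404 otherwise.
type stubFetcher struct {
	pages map[string]string // url -> canned response body
}

func (f stubFetcher) Fetch(url string) (body string, r geomi.ResponseInfo, urls []string) {
	body, ok := f.pages[url]
	if !ok {
		return "", geomi.ResponseInfo{Status: "404 Not Found", StatusCode: 404}, nil
	}
	return body, geomi.ResponseInfo{Status: "200 OK", StatusCode: 200}, nil
}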

type Page

type Page struct {
	*url.URL
	// contains filtered or unexported fields
}

A Page is a URL. It is usually some content with a number of elements, but it could be something as simple as the URL for cat.jpg on your site, representing that image.

type ResponseInfo

type ResponseInfo struct {
	Status     string
	StatusCode int
	Err        error
}

ResponseInfo contains the status and error information from a GET request. TODO:

Add Expire date
Add time for request to return
Should the body be in here?

type Site

type Site struct {
	*url.URL
}

Site is a type that implements Fetcher.

func (Site) Fetch

func (s Site) Fetch(url string) (body string, r ResponseInfo, urls []string)

Fetch implements Fetcher. TODO: make the design cleaner.

type Spider

type Spider struct {
	*queue.Queue
	sync.Mutex

	*url.URL // the start url
	Config   *Config

	Pages map[string]Page
	// contains filtered or unexported fields
}

Spider crawls the target. It contains all information needed to manage the crawl including keeping track of work to do, work done, and the results.

func NewSpider

func NewSpider(start string) (*Spider, error)

NewSpider returns a Spider with its site's baseUrl set. The baseUrl is the start point for the crawl; it is also the restriction on the crawl.

func NewSpiderFromConfig

func NewSpiderFromConfig(start string, c *Config) (*Spider, error)

NewSpiderFromConfig returns a spider with its configuration set to the passed config.

func (*Spider) Crawl

func (s *Spider) Crawl(depth int) (message string, err error)

Crawl is the exposed method for starting a crawl at baseURL. The crawl private method does the actual work. The depth is the maximum depth, or distance, to crawl from the baseURL. If depth == -1, no limits are set and it is expected that the entire site will be crawled.

func (*Spider) ExternalHosts

func (s *Spider) ExternalHosts() []string

ExternalHosts returns a sorted list of external hosts

func (*Spider) ExternalLinks

func (s *Spider) ExternalLinks() []string

ExternalLinks returns a sorted list of external links
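For example, after a crawl completes, the external references a site makes could be reported like this sketch (assuming sp is a crawled *Spider and fmt is imported):

// List every external host and link the crawl encountered.
for _, host := range sp.ExternalHosts() {
	fmt.Println("external host:", host)
}
for _, link := range sp.ExternalLinks() {
	fmt.Println("external link:", link)
}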
