walker

Published: Oct 27, 2014 License: BSD-3-Clause Imports: 27 Imported by: 0

README

Walker

An efficient, scalable, continuous crawler leveraging Go/Cassandra

Alpha Warning

This project is a work in progress and not ready for production release. Much of the design described below is pending development. Stay tuned for an Alpha release.

Overview

Walker is a web crawler on its feet. It has been built from the start to be horizontally scalable, smart about recrawling, lean on storage, flexible about what can be done with data, and easy to set up. Use it if you:

  • Want a broad or scalable focused crawl of the web
  • Want to prioritize what you (re)crawl, and how often
  • Want control over where you store crawled data and what you use it for (walker stores links and metadata internally, passing pages and files on to you)
  • Want a smart crawler that will avoid junk (e.g. crawler traps)
  • Want the performance of Cassandra and flexibility to do batch processing
  • Want to crawl non-html file types
  • Aren't interested in built-in web graph generation and search indexing (or want to do it yourself)

Architecture in brief

Walker takes advantage of Cassandra's distributed nature to store all links it has crawled and still needs to crawl. The database holds these links, all domains we've seen (with metadata), and new segments (groups of links) to crawl for a given domain.

The fetcher manager component claims domains (meaning: fetchers can be distributed to anywhere they can connect to Cassandra), reads in their segments, and crawls pages politely, respecting robots.txt rules. It will parse pages for new links to feed into the system and output crawled content. You can add your own content processor or use a built-in one like writing pages to local files.

The dispatcher runs batch jobs looking for domains that don't yet have segments generated, reads the links we already have, and intelligently chooses a subset to crawl next.

Note: the fetchers use a pluggable datastore component to tell them what to crawl (see the Datastore interface). Though the Cassandra datastore is the primary supported implementation, the fetchers could be backed by alternative implementations (in-memory, classic SQL, etc.) that may not need a dispatcher to run at all.

Console

Walker comes with a friendly console accessible from the browser. It provides an easy way to add new links to your crawl and see information about what you have crawled so far.

TODO: console screenshot

Getting started

Setup

Make sure you have go installed and a GOPATH set:

go get github.com/iParadigms/walker

To get going quickly, you need to install Cassandra. A simple install of Cassandra on CentOS 6 is demonstrated below. See the DataStax documentation for non-RHEL-based installs and recommended settings (Oracle Java is recommended but not required).

echo "[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0" | sudo tee /etc/yum.repos.d/datastax.repo

sudo yum install java-1.7.0-openjdk dsc20

sudo service cassandra start # it can take a few minutes for this to actually start up

To run walker and Cassandra on the same local machine, you may need to make the following changes to cassandra.yaml:

  • Change listen_address to empty
  • Change rpc_address to 0.0.0.0
  • sudo service cassandra restart

Basic crawl

Once you've built a walker binary, you can easily crawl with the default handler, which simply writes pages to a directory structure in $PWD.

# These assume walker is in your $PATH
walker crawl # start crawling; runs a fetch manager, dispatcher, and console all-in-one
walker seed -u http://<test_site>.com # give it a seed URL
# Visit http://<your_machine>:3000 in your browser to see the console

# See more help info and other commands:
walker help

Writing your own handler

In most cases you will want to use walker for some kind of processing. The easiest way is to create a new Go project that implements your own handler. You can still take advantage of walker's command-line interface if you don't need to change it. For example:

package main

import "github.com/iParadigms/walker/cmd"

type MyHandler struct{}

func (h *MyHandler) HandleResponse(res *walker.FetchResults) {
	// Do something with the response...
}

func main() {
	cmd.Handler(&MyHandler{})
	cmd.Execute()
}

You can then run walker using your own handler easily:

go run main.go # Has the same CLI as the walker binary

Advanced features and configuration

See walker.yaml for extensive descriptions of the various configuration parameters available for walker. This file is the primary way of configuring your crawl. It is not required to exist, but it will be read if it is in the working directory of the walker process or if its path is given with a command-line parameter.

A small sampling of common configuration items:

# Whether to dynamically add new-found domains (or their links) to the crawl (a
# broad crawl) or discard them, assuming desired domains are manually seeded.
add_new_domains: false

# Configure the User-Agent header
user_agent: Walker (http://github.com/iParadigms/walker)

# Configure which formats this crawler Accepts
accept_formats: ["text/html", "text/*"]

# Which links to accept based on protocol (a.k.a. scheme)
accept_protocols: ["http", "https"]

# Cassandra configuration for the datastore.
# Generally these are used to create a gocql.ClusterConfig object
# (https://godoc.org/github.com/gocql/gocql#ClusterConfig).
#
# keyspace shouldn't generally need to be changed; it is mainly changed in
# testing as an extra layer of safety.
#
# replication_factor is used when defining the keyspace.
cassandra:
    hosts: ["localhost"]
    keyspace: "walker"
    replication_factor: 3

Documentation

TODO: add url of documentation site (walker.github.io?)

Also see our FAQ

Contributing

See contributing for information about development, logging, and running tests.

Documentation

Overview

Package walker is an efficient, scalable, continuous crawler leveraging Go and Cassandra

This package provides the core walker libraries. The development API is documented here. See http://github.com/iParadigms/walker or README.md for an overview of the project.

Index

Constants

This section is empty.

Variables

View Source
var ConfigName string = "walker.yaml"

ConfigName is the path (can be relative or absolute) to the config file that should be read.

View Source
var NotYetCrawled time.Time

NotYetCrawled is a convenience for time.Unix(0, 0), used as a crawl time in Walker for links that have not yet been fetched.

Functions

func CreateCassandraSchema

func CreateCassandraSchema() error

CreateCassandraSchema creates the walker schema in the configured Cassandra database. It requires that the keyspace not already exist (so as to avoid losing non-test data), with the exception of the walker_test schema, which it will drop automatically.
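
For example, a one-time setup step before the first crawl might look like this (a sketch only):

if err := walker.CreateCassandraSchema(); err != nil {
	log.Fatalf("failed to create walker schema: %v", err)
}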

func DNSCachingDial

func DNSCachingDial(dial func(network, addr string) (net.Conn, error), maxEntries int) func(network, addr string) (net.Conn, error)

DNSCachingDial wraps the given dial function with caching of DNS resolutions. When a hostname is found in the cache, it calls the provided dial with the IP address instead of the hostname, so no DNS lookup need be performed. It also caches DNS failures.
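
For example, a caching dial function can be plugged into an http.Transport (a sketch; the dialer settings and cache size here are arbitrary):

dialer := &net.Dialer{Timeout: 30 * time.Second}
transport := &http.Transport{
	Dial: walker.DNSCachingDial(dialer.Dial, 1024),
}
client := &http.Client{Transport: transport}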

func GetCassandraConfig

func GetCassandraConfig() *gocql.ClusterConfig

func GetCassandraSchema

func GetCassandraSchema() (string, error)

func ReadConfigFile

func ReadConfigFile(path string) error

ReadConfigFile sets a new path to find the walker yaml config file and forces a reload of the config.

func SetDefaultConfig

func SetDefaultConfig()

SetDefaultConfig resets the Config object to default values, regardless of what was set by any configuration file.

Types

type CassandraDatastore

type CassandraDatastore struct {
	// contains filtered or unexported fields
}

CassandraDatastore is the primary Datastore implementation, using Apache Cassandra as a highly scalable backend.

func NewCassandraDatastore

func NewCassandraDatastore() (*CassandraDatastore, error)

func (*CassandraDatastore) ClaimNewHost

func (ds *CassandraDatastore) ClaimNewHost() string

func (*CassandraDatastore) LinksForHost

func (ds *CassandraDatastore) LinksForHost(domain string) <-chan *URL

func (*CassandraDatastore) StoreParsedURL

func (ds *CassandraDatastore) StoreParsedURL(u *URL, fr *FetchResults)

func (*CassandraDatastore) StoreURLFetchResults

func (ds *CassandraDatastore) StoreURLFetchResults(fr *FetchResults)

func (*CassandraDatastore) UnclaimHost

func (ds *CassandraDatastore) UnclaimHost(host string)

type CassandraDispatcher

type CassandraDispatcher struct {
	// contains filtered or unexported fields
}

CassandraDispatcher analyzes what we've crawled so far (generally on a per-domain basis) and updates the database. At minimum this means generating new segments to crawl in the `segments` table, but it can also mean updating domain_info if we find out new things about a domain.

This dispatcher has been designed to run simultaneously with the fetchmanager. Fetchers and dispatchers claim domains in Cassandra, so the dispatcher can operate on the domains not currently being crawled (and vice versa).

func (*CassandraDispatcher) StartDispatcher

func (d *CassandraDispatcher) StartDispatcher() error

func (*CassandraDispatcher) StopDispatcher

func (d *CassandraDispatcher) StopDispatcher() error

type Datastore

type Datastore interface {
	// ClaimNewHost returns a hostname that is now claimed for this crawler to
	// crawl. A segment of links for this host is assumed to be available.
	// Returns the domain of the segment it claimed, or "" if there are none
	// available.
	ClaimNewHost() string

	// UnclaimHost indicates that all links from `LinksForHost` have been
	// processed, so other work may be done with this host. For example the
	// dispatcher will be free to analyze the links and generate a new segment.
	UnclaimHost(host string)

	// LinksForHost returns a channel that will feed URLs for a given host.
	LinksForHost(host string) <-chan *URL

	// StoreURLFetchResults takes the return data/metadata from a fetch and
	// stores the visit. Fetchers will call this once for each link in the
	// segment being crawled.
	StoreURLFetchResults(fr *FetchResults)

	// StoreParsedURL stores a URL parsed out of a page (i.e. a URL we may not
	// have crawled yet). `u` is the URL to store. `fr` is the FetchResults
	// object for the fetch from which we got the URL, for any context the
	// datastore may want. A datastore implementation should handle `fr` being
	// nil, so links can be seeded without a fetch having occurred.
	//
	// URLs passed to StoreParsedURL should be absolute.
	//
	// This layer should efficiently deduplicate links (i.e. a fetcher should
	// be safe feeding the same URL many times).
	StoreParsedURL(u *URL, fr *FetchResults)
}

Datastore defines the interface for an object to be used as walker's datastore.

Note that this is for link and metadata storage required to make walker function properly. It has nothing to do with storing fetched content (see `Handler` for that).
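
As an illustration of the interface only (walker does not ship this type), a minimal in-memory Datastore might look roughly like the sketch below. The MemoryDatastore name and its fields are hypothetical; it is written from an external package and assumes walker and "sync" are imported.

type MemoryDatastore struct {
	mu      sync.Mutex
	links   map[string][]*walker.URL // hostname -> pending links
	claimed map[string]bool          // hosts currently claimed by a fetcher
}

func (ds *MemoryDatastore) ClaimNewHost() string {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	for host, urls := range ds.links {
		if len(urls) > 0 && !ds.claimed[host] {
			ds.claimed[host] = true
			return host
		}
	}
	return "" // nothing available to crawl
}

func (ds *MemoryDatastore) UnclaimHost(host string) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	ds.claimed[host] = false
}

func (ds *MemoryDatastore) LinksForHost(host string) <-chan *walker.URL {
	ds.mu.Lock()
	urls := ds.links[host]
	ds.links[host] = nil
	ds.mu.Unlock()

	ch := make(chan *walker.URL, len(urls))
	for _, u := range urls {
		ch <- u
	}
	close(ch)
	return ch
}

func (ds *MemoryDatastore) StoreURLFetchResults(fr *walker.FetchResults) {
	// An in-memory implementation might record fr.URL and fr.FetchTime here.
}

func (ds *MemoryDatastore) StoreParsedURL(u *walker.URL, fr *walker.FetchResults) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	ds.links[u.Host] = append(ds.links[u.Host], u)
}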

type Dispatcher

type Dispatcher interface {
	// StartDispatcher should be a blocking call that starts the dispatcher. It
	// should return an error if it could not start or stop properly and nil
	// when it has safely shut down and stopped all internal processing.
	StartDispatcher() error

	// Stop signals the dispatcher to stop. It should block until all internal
	// goroutines have stopped.
	StopDispatcher() error
}

Dispatcher defines the calls a dispatcher should respond to. A dispatcher would typically be paired with a particular Datastore, and not all Datastore implementations may need a Dispatcher.

A basic crawl will likely run the dispatcher in the same process as the fetchers, but higher-scale crawl setups may run dispatchers separately.

type FetchManager

type FetchManager struct {
	// Handler must be set to handle fetch responses.
	Handler Handler

	// Datastore must be set to drive the fetching.
	Datastore Datastore

	// Transport can be set to override the default network transport the
	// FetchManager is going to use. Good for faking remote servers for
	// testing.
	Transport http.RoundTripper
	// contains filtered or unexported fields
}

FetchManager configures and runs the crawl.

The calling code must create a FetchManager, set a Datastore and handlers, then call `Start()`.
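
For example, wiring a FetchManager by hand might look like this (a sketch; it assumes the configuration has been read and the Cassandra schema already exists):

ds, err := walker.NewCassandraDatastore()
if err != nil {
	log.Fatal(err)
}
fm := &walker.FetchManager{
	Datastore: ds,
	Handler:   &walker.SimpleWriterHandler{},
}
go fm.Start() // Start blocks, so run it in a goroutine if needed
// ... when the crawl should end:
fm.Stop()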

func (*FetchManager) Start

func (fm *FetchManager) Start()

Start begins processing, assuming that the datastore and any handlers have been set. This is a blocking call (run it in a goroutine if you want to do other things).

You cannot change the datastore or handlers after starting.

func (*FetchManager) Stop

func (fm *FetchManager) Stop()

Stop notifies the fetchers to finish their current requests. It blocks until all fetchers have finished.

type FetchResults

type FetchResults struct {

	// URL that was requested; will always be populated. If this URL redirects,
	// RedirectedFrom will contain a list of all requested URLs.
	URL *URL

	// A list of redirects. During this request cycle, the first requested URL
	// is stored in URL, the second request (the first redirect) is stored in
	// RedirectedFrom[0], and the Nth request (the (N-1)th redirect) is stored
	// in RedirectedFrom[N-2]. The last of these is the URL that furnished the
	// http.Response.
	RedirectedFrom []*URL

	// Response object; nil if there was a FetchError or ExcludedByRobots is
	// true. Response.Body may not be the same object the HTTP request actually
	// returns; the fetcher may have read in the response to parse out links,
	// replacing Response.Body with an alternate reader.
	Response *http.Response

	// FetchError if the net/http request had an error (non-2XX HTTP response
	// codes are not considered errors)
	FetchError error

	// Time at the beginning of the request (if a request was made)
	FetchTime time.Time

	// True if we did not request this link because it is excluded by
	// robots.txt rules
	ExcludedByRobots bool

	// The Content-Type of the fetched page.
	MimeType string
}

FetchResults contains all relevant context and return data from an individual fetch. Handlers receive this to process results.
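
As a sketch of how a handler might use these fields (MyHandler is a hypothetical handler type like the one in the README example; "io/ioutil" and "log" are assumed to be imported):

func (h *MyHandler) HandleResponse(fr *walker.FetchResults) {
	if fr.ExcludedByRobots || fr.FetchError != nil || fr.Response == nil {
		return // nothing was fetched for this link
	}
	body, err := ioutil.ReadAll(fr.Response.Body)
	if err != nil {
		return
	}
	log.Printf("fetched %v (%s, %d bytes) at %v", fr.URL, fr.MimeType, len(body), fr.FetchTime)
}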

type Handler

type Handler interface {
	// HandleResponse will be called by fetchers as they make requests.
	// Handlers can do whatever they want with responses. HandleResponse will
	// be called as long as the request successfully reached the remote server
	// and got an HTTP code. This means there should never be a FetchError set
	// on the FetchResults.
	HandleResponse(res *FetchResults)
}

Handler defines the interface for objects that will be set as handlers on a FetchManager.

type PriorityUrl

type PriorityUrl []*URL

PriorityUrl is a heap of URLs, where the next element popped off the heap is the oldest (as measured by LastCrawled) element in the heap. This type is designed to be used with the container/heap package and is currently only used in generateSegments.

func (PriorityUrl) Len

func (pq PriorityUrl) Len() int

func (PriorityUrl) Less

func (pq PriorityUrl) Less(i, j int) bool

func (*PriorityUrl) Pop

func (pq *PriorityUrl) Pop() interface{}

func (*PriorityUrl) Push

func (pq *PriorityUrl) Push(x interface{})

func (PriorityUrl) Swap

func (pq PriorityUrl) Swap(i, j int)

type SimpleWriterHandler

type SimpleWriterHandler struct{}

SimpleWriterHandler just writes returned pages as files locally, naming the file after the URL of the request.

func (*SimpleWriterHandler) HandleResponse

func (h *SimpleWriterHandler) HandleResponse(fr *FetchResults)

type URL

type URL struct {
	*url.URL

	// LastCrawled is the last time we crawled this URL, for example to use a
	// Last-Modified header.
	LastCrawled time.Time
}

URL is the walker URL object, which embeds *url.URL but has extra data and capabilities used by walker. Note that LastCrawled should not be set to its zero value; it should be set to NotYetCrawled.

func CreateURL

func CreateURL(domain, subdomain, path, protocol string, lastcrawled time.Time) (*URL, error)

CreateURL creates a walker URL from values usually pulled out of the datastore. subdomain may optionally include a trailing '.', and path may optionally include a prefixed '/'.

func ParseURL

func ParseURL(ref string) (*URL, error)

ParseURL is the walker.URL equivalent of url.Parse

func (*URL) MakeAbsolute

func (u *URL) MakeAbsolute(base *URL)

MakeAbsolute uses URL.ResolveReference to make this URL object an absolute reference (having Scheme and Host), if it is not one already. It is resolved using `base` as the base URL.

func (*URL) Subdomain

func (u *URL) Subdomain() (string, error)

Subdomain provides the remaining subdomain after removing the ToplevelDomainPlusOne. For example http://www.bbc.co.uk/ will return 'www' as the subdomain (note that there is no trailing period). If there is no subdomain it will return "".

func (*URL) TLDPlusOneAndSubdomain

func (u *URL) TLDPlusOneAndSubdomain() (string, string, error)

TLDPlusOneAndSubdomain is a convenience function that calls ToplevelDomainPlusOne and Subdomain, returning an error if we could not get either one. The first return value is the TLD+1 and the second is the subdomain.

func (*URL) ToplevelDomainPlusOne

func (u *URL) ToplevelDomainPlusOne() (string, error)

ToplevelDomainPlusOne returns the Effective Toplevel Domain of this host as defined by https://publicsuffix.org/, plus one extra domain component.

For example the TLD of http://www.bbc.co.uk/ is 'co.uk', plus one is 'bbc.co.uk'. Walker uses these TLD+1 domains as the primary unit of grouping.
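
Putting the URL helpers together (a sketch; the values in the comments follow the bbc.co.uk example above):

u, err := walker.ParseURL("http://www.bbc.co.uk/")
if err != nil {
	log.Fatal(err)
}
tld1, _ := u.ToplevelDomainPlusOne() // "bbc.co.uk"
sub, _ := u.Subdomain()              // "www"
fmt.Println(tld1, sub)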

type WalkerConfig

type WalkerConfig struct {
	AddNewDomains           bool     `yaml:"add_new_domains"`
	AddedDomainsCacheSize   int      `yaml:"added_domains_cache_size"`
	MaxDNSCacheEntries      int      `yaml:"max_dns_cache_entries"`
	UserAgent               string   `yaml:"user_agent"`
	AcceptFormats           []string `yaml:"accept_formats"`
	AcceptProtocols         []string `yaml:"accept_protocols"`
	DefaultCrawlDelay       int      `yaml:"default_crawl_delay"`
	MaxHTTPContentSizeBytes int64    `yaml:"max_http_content_size_bytes"`
	IgnoreTags              []string `yaml:"ignore_tags"`
	//TODO: allow -1 as a no max value
	MaxLinksPerPage         int  `yaml:"max_links_per_page"`
	NumSimultaneousFetchers int  `yaml:"num_simultaneous_fetchers"`
	BlacklistPrivateIPs     bool `yaml:"blacklist_private_ips"`

	Dispatcher struct {
		MaxLinksPerSegment   int     `yaml:"num_links_per_segment"`
		RefreshPercentage    float64 `yaml:"refresh_percentage"`
		NumConcurrentDomains int     `yaml:"num_concurrent_domains"`
	} `yaml:"dispatcher"`

	Cassandra struct {
		Hosts             []string `yaml:"hosts"`
		Keyspace          string   `yaml:"keyspace"`
		ReplicationFactor int      `yaml:"replication_factor"`
	} `yaml:"cassandra"`

	Console struct {
		Port              int    `yaml:"port"`
		TemplateDirectory string `yaml:"template_directory"`
		PublicFolder      string `yaml:"public_folder"`
	} `yaml:"console"`
}

WalkerConfig defines the available global configuration parameters for walker. It reads values straight from the config file (walker.yaml by default). See sample-walker.yaml for explanations and default values.

var Config WalkerConfig

Config is the configuration instance the rest of walker should access for global configuration values. See WalkerConfig for available config members.
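
For example, a program might load an alternate config file and read values back out of the global Config (a sketch; the file path is illustrative):

if err := walker.ReadConfigFile("my-walker.yaml"); err != nil {
	log.Fatal(err)
}
fmt.Println(walker.Config.UserAgent)
fmt.Println(walker.Config.Cassandra.Hosts)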

Directories

Path Synopsis
This file contains the web-facing handlers.
mimetools.Matcher allows matching against wildcard mediaTypes.
