Documentation ¶
Overview ¶
Package walker is an efficient, scalable, continuous crawler leveraging Go and Cassandra.
This package provides the core walker libraries. The development API is documented here. See http://github.com/iParadigms/walker or README.md for an overview of the project.
Index ¶
- Variables
- func CreateCassandraSchema() error
- func DNSCachingDial(dial func(network, addr string) (net.Conn, error), maxEntries int) func(network, addr string) (net.Conn, error)
- func GetCassandraConfig() *gocql.ClusterConfig
- func GetCassandraSchema() (string, error)
- func ReadConfigFile(path string) error
- func SetDefaultConfig()
- type CassandraDatastore
- func (ds *CassandraDatastore) ClaimNewHost() string
- func (ds *CassandraDatastore) LinksForHost(domain string) <-chan *URL
- func (ds *CassandraDatastore) StoreParsedURL(u *URL, fr *FetchResults)
- func (ds *CassandraDatastore) StoreURLFetchResults(fr *FetchResults)
- func (ds *CassandraDatastore) UnclaimHost(host string)
- type CassandraDispatcher
- type Datastore
- type Dispatcher
- type FetchManager
- type FetchResults
- type Handler
- type PriorityUrl
- type SimpleWriterHandler
- type URL
- type WalkerConfig
Constants ¶
This section is empty.
Variables ¶
var ConfigName string = "walker.yaml"
ConfigName is the path (can be relative or absolute) to the config file that should be read.
var NotYetCrawled time.Time
NotYetCrawled is a convenience for time.Unix(0, 0), used as a crawl time in Walker for links that have not yet been fetched.
Functions ¶
func CreateCassandraSchema ¶
func CreateCassandraSchema() error
CreateCassandraSchema creates the walker schema in the configured Cassandra database. It requires that the keyspace not already exist (so as to avoid losing non-test data), with the exception of the walker_test schema, which it will drop automatically.
func DNSCachingDial ¶
func DNSCachingDial(dial func(network, addr string) (net.Conn, error), maxEntries int) func(network, addr string) (net.Conn, error)
DNSCachingDial wraps the given dial function with caching of DNS resolutions. When a hostname is found in the cache it will call the provided dial with the IP address instead of the hostname, so no DNS lookup need be performed. It will also cache DNS failures.
func GetCassandraConfig ¶
func GetCassandraConfig() *gocql.ClusterConfig
func GetCassandraSchema ¶
func GetCassandraSchema() (string, error)
func ReadConfigFile ¶
func ReadConfigFile(path string) error
ReadConfigFile sets a new path to find the walker yaml config file and forces a reload of the config.
func SetDefaultConfig ¶
func SetDefaultConfig()
SetDefaultConfig resets the Config object to default values, regardless of what was set by any configuration file.
Types ¶
type CassandraDatastore ¶
type CassandraDatastore struct {
// contains filtered or unexported fields
}
CassandraDatastore is the primary Datastore implementation, using Apache Cassandra as a highly scalable backend.
func NewCassandraDatastore ¶
func NewCassandraDatastore() (*CassandraDatastore, error)
func (*CassandraDatastore) ClaimNewHost ¶
func (ds *CassandraDatastore) ClaimNewHost() string
func (*CassandraDatastore) LinksForHost ¶
func (ds *CassandraDatastore) LinksForHost(domain string) <-chan *URL
func (*CassandraDatastore) StoreParsedURL ¶
func (ds *CassandraDatastore) StoreParsedURL(u *URL, fr *FetchResults)
func (*CassandraDatastore) StoreURLFetchResults ¶
func (ds *CassandraDatastore) StoreURLFetchResults(fr *FetchResults)
func (*CassandraDatastore) UnclaimHost ¶
func (ds *CassandraDatastore) UnclaimHost(host string)
type CassandraDispatcher ¶
type CassandraDispatcher struct {
// contains filtered or unexported fields
}
CassandraDispatcher analyzes what we've crawled so far (generally on a per-domain basis) and updates the database. At minimum this means generating new segments to crawl in the `segments` table, but it can also mean updating domain_info if we find out new things about a domain.
This dispatcher has been designed to run simultaneously with the fetchmanager. Fetchers and dispatchers claim domains in Cassandra, so the dispatcher can operate on the domains not currently being crawled (and vice versa).
func (*CassandraDispatcher) StartDispatcher ¶
func (d *CassandraDispatcher) StartDispatcher() error
func (*CassandraDispatcher) StopDispatcher ¶
func (d *CassandraDispatcher) StopDispatcher() error
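The claim coordination described above can be sketched with an in-memory stand-in. Walker records claims in Cassandra so fetchers and the dispatcher never work on the same domain at once; this toy registry (a hypothetical helper, not part of walker) only illustrates the pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// claimRegistry tracks which domains are currently claimed by some component.
type claimRegistry struct {
	mu      sync.Mutex
	claimed map[string]bool
}

func newClaimRegistry() *claimRegistry {
	return &claimRegistry{claimed: map[string]bool{}}
}

// Claim returns true if the domain was free and is now claimed by the caller.
func (r *claimRegistry) Claim(domain string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.claimed[domain] {
		return false
	}
	r.claimed[domain] = true
	return true
}

// Unclaim releases the domain so the other component may work on it.
func (r *claimRegistry) Unclaim(domain string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.claimed, domain)
}

func main() {
	r := newClaimRegistry()
	fmt.Println(r.Claim("example.com")) // fetcher claims the domain: true
	fmt.Println(r.Claim("example.com")) // dispatcher must skip it: false
	r.Unclaim("example.com")
	fmt.Println(r.Claim("example.com")) // free again: true
}
```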
type Datastore ¶
type Datastore interface {
	// ClaimNewHost returns a hostname that is now claimed for this crawler to
	// crawl. A segment of links for this host is assumed to be available.
	// Returns the domain of the segment it claimed, or "" if there are none
	// available.
	ClaimNewHost() string

	// UnclaimHost indicates that all links from `LinksForHost` have been
	// processed, so other work may be done with this host. For example the
	// dispatcher will be free to analyze the links and generate a new segment.
	UnclaimHost(host string)

	// LinksForHost returns a channel that will feed URLs for a given host.
	LinksForHost(host string) <-chan *URL

	// StoreURLFetchResults takes the return data/metadata from a fetch and
	// stores the visit. Fetchers will call this once for each link in the
	// segment being crawled.
	StoreURLFetchResults(fr *FetchResults)

	// StoreParsedURL stores a URL parsed out of a page (i.e. a URL we may not
	// have crawled yet). `u` is the URL to store. `fr` is the FetchResults
	// object for the fetch from which we got the URL, for any context the
	// datastore may want. A datastore implementation should handle `fr` being
	// nil, so links can be seeded without a fetch having occurred.
	//
	// URLs passed to StoreParsedURL should be absolute.
	//
	// This layer should handle efficiently deduplicating links (i.e. a
	// fetcher should be safe feeding the same URL many times).
	StoreParsedURL(u *URL, fr *FetchResults)
}
Datastore defines the interface for an object to be used as walker's datastore.
Note that this is for link and metadata storage required to make walker function properly. It has nothing to do with storing fetched content (see `Handler` for that).
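The call sequence a fetcher follows against a Datastore can be sketched with a minimal in-memory stand-in. The types here are simplified placeholders (plain strings instead of walker's *URL and *FetchResults), so this only illustrates the claim/drain/store/unclaim cycle, not walker's actual interface:

```go
package main

import "fmt"

// memDatastore is a toy, in-memory analogue of the Datastore interface.
type memDatastore struct {
	segments map[string][]string // host -> segment of links to crawl
	stored   []string            // links whose fetch results were stored
}

// ClaimNewHost returns some host with an available segment, or "".
func (ds *memDatastore) ClaimNewHost() string {
	for host := range ds.segments {
		return host
	}
	return "" // nothing available
}

// LinksForHost feeds the claimed host's segment over a channel.
func (ds *memDatastore) LinksForHost(host string) <-chan string {
	ch := make(chan string)
	go func() {
		for _, link := range ds.segments[host] {
			ch <- link
		}
		close(ch)
	}()
	return ch
}

func (ds *memDatastore) StoreURLFetchResults(link string) { ds.stored = append(ds.stored, link) }
func (ds *memDatastore) UnclaimHost(host string)          { delete(ds.segments, host) }

func main() {
	ds := &memDatastore{segments: map[string][]string{
		"example.com": {"http://example.com/", "http://example.com/about"},
	}}
	// The fetcher loop: claim a host, drain its segment, store results, unclaim.
	for host := ds.ClaimNewHost(); host != ""; host = ds.ClaimNewHost() {
		for link := range ds.LinksForHost(host) {
			ds.StoreURLFetchResults(link) // a real fetcher would fetch first
		}
		ds.UnclaimHost(host)
	}
	fmt.Println(ds.stored)
}
```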
type Dispatcher ¶
type Dispatcher interface {
	// StartDispatcher should be a blocking call that starts the dispatcher.
	// It should return an error if it could not start or stop properly, and
	// nil when it has safely shut down and stopped all internal processing.
	StartDispatcher() error

	// StopDispatcher signals the dispatcher to stop. It should block until
	// all internal goroutines have stopped.
	StopDispatcher() error
}
Dispatcher defines the calls a dispatcher should respond to. A dispatcher would typically be paired with a particular Datastore, and not all Datastore implementations may need a Dispatcher.
A basic crawl will likely run the dispatcher in the same process as the fetchers, but higher-scale crawl setups may run dispatchers separately.
type FetchManager ¶
type FetchManager struct {
	// Handler must be set to handle fetch responses.
	Handler Handler

	// Datastore must be set to drive the fetching.
	Datastore Datastore

	// Transport can be set to override the default network transport the
	// FetchManager is going to use. Good for faking remote servers for
	// testing.
	Transport http.RoundTripper

	// contains filtered or unexported fields
}
FetchManager configures and runs the crawl.
The calling code must create a FetchManager, set its Datastore and Handler, then call `Start()`.
func (*FetchManager) Start ¶
func (fm *FetchManager) Start()
Start begins processing, assuming that the datastore and any handlers have been set. This is a blocking call; run it in a goroutine if you want to do other things.
You cannot change the datastore or handlers after starting.
func (*FetchManager) Stop ¶
func (fm *FetchManager) Stop()
Stop notifies the fetchers to finish their current requests. It blocks until all fetchers have finished.
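The blocking-Start/blocking-Stop lifecycle described above can be sketched with channels and a WaitGroup. FetchManager's actual fields and internals differ; this hypothetical `manager` type only illustrates the pattern of running Start in a goroutine and calling Stop to shut down:

```go
package main

import (
	"fmt"
	"sync"
)

type manager struct {
	quit chan struct{} // closed by Stop to signal workers
	done chan struct{} // closed by Start once all workers have exited
	wg   sync.WaitGroup
}

func newManager() *manager {
	return &manager{quit: make(chan struct{}), done: make(chan struct{})}
}

// Start launches the workers and blocks until they have all stopped.
func (m *manager) Start() {
	for i := 0; i < 3; i++ {
		m.wg.Add(1)
		go func() {
			defer m.wg.Done()
			<-m.quit // a real fetcher would loop fetching until signalled
		}()
	}
	m.wg.Wait()
	close(m.done)
}

// Stop signals the workers to finish and blocks until Start has returned.
func (m *manager) Stop() {
	close(m.quit)
	<-m.done
}

func main() {
	m := newManager()
	go m.Start() // Start blocks, so run it in a goroutine
	m.Stop()
	fmt.Println("all fetchers stopped")
}
```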
type FetchResults ¶
type FetchResults struct {
	// URL that was requested; will always be populated. If this URL
	// redirects, RedirectedFrom will contain a list of all requested URLs.
	URL *URL

	// RedirectedFrom is a list of redirects. During this request cycle, the
	// first request URL is stored in URL. The second request (first redirect)
	// is stored in RedirectedFrom[0], and the Nth request ((N-1)th redirect)
	// is stored in RedirectedFrom[N-2]; this last URL is the one that
	// furnished the http.Response.
	RedirectedFrom []*URL

	// Response object; nil if there was a FetchError or ExcludedByRobots is
	// true. Response.Body may not be the same object the HTTP request
	// actually returned; the fetcher may have read in the response to parse
	// out links, replacing Response.Body with an alternate reader.
	Response *http.Response

	// FetchError is set if the net/http request had an error (non-2XX HTTP
	// response codes are not considered errors).
	FetchError error

	// FetchTime is the time at the beginning of the request (if a request
	// was made).
	FetchTime time.Time

	// ExcludedByRobots is true if we did not request this link because it is
	// excluded by robots.txt rules.
	ExcludedByRobots bool

	// MimeType is the Content-Type of the fetched page.
	MimeType string
}
FetchResults contains all relevant context and return data from an individual fetch. Handlers receive this to process results.
type Handler ¶
type Handler interface {
	// HandleResponse will be called by fetchers as they make requests.
	// Handlers can do whatever they want with responses. HandleResponse will
	// be called as long as the request successfully reached the remote server
	// and got an HTTP code. This means there should never be a FetchError set
	// on the FetchResults.
	HandleResponse(res *FetchResults)
}
Handler defines the interface for objects that will be set as handlers on a FetchManager.
type PriorityUrl ¶
type PriorityUrl []*URL
PriorityUrl is a heap of URLs, where the next element popped off the heap is the oldest (as measured by LastCrawled) element in the heap. This type is designed to be used with the container/heap package. It is currently only used in generateSegments.
func (PriorityUrl) Len ¶
func (pq PriorityUrl) Len() int
func (PriorityUrl) Less ¶
func (pq PriorityUrl) Less(i, j int) bool
func (*PriorityUrl) Pop ¶
func (pq *PriorityUrl) Pop() interface{}
func (*PriorityUrl) Push ¶
func (pq *PriorityUrl) Push(x interface{})
func (PriorityUrl) Swap ¶
func (pq PriorityUrl) Swap(i, j int)
type SimpleWriterHandler ¶
type SimpleWriterHandler struct{}
SimpleWriterHandler just writes returned pages as files locally, naming the file after the URL of the request.
func (*SimpleWriterHandler) HandleResponse ¶
func (h *SimpleWriterHandler) HandleResponse(fr *FetchResults)
type URL ¶
type URL struct {
	*url.URL

	// LastCrawled is the last time we crawled this URL, for example to use
	// with a Last-Modified header.
	LastCrawled time.Time
}
URL is the walker URL object, which embeds *url.URL but has extra data and capabilities used by walker. Note that LastCrawled should not be set to its zero value; set it to NotYetCrawled instead.
func CreateURL ¶
CreateURL creates a walker URL from values usually pulled out of the datastore. subdomain may optionally include a trailing '.', and path may optionally include a prefixed '/'.
func (*URL) MakeAbsolute ¶
MakeAbsolute uses URL.ResolveReference to make this URL object an absolute reference (having Scheme and Host), if it is not one already. It is resolved using `base` as the base URL.
func (*URL) Subdomain ¶
Subdomain provides the remaining subdomain after removing the ToplevelDomainPlusOne. For example http://www.bbc.co.uk/ will return 'www' as the subdomain (note that there is no trailing period). If there is no subdomain it will return "".
func (*URL) TLDPlusOneAndSubdomain ¶
TLDPlusOneAndSubdomain is a convenience function that calls ToplevelDomainPlusOne and Subdomain, returning an error if we could not get either one. The first return value is the TLD+1 and the second is the subdomain.
func (*URL) ToplevelDomainPlusOne ¶
ToplevelDomainPlusOne returns the Effective Toplevel Domain of this host as defined by https://publicsuffix.org/, plus one extra domain component.
For example the TLD of http://www.bbc.co.uk/ is 'co.uk', plus one is 'bbc.co.uk'. Walker uses these TLD+1 domains as the primary unit of grouping.
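The relationship between the host, the TLD+1, and the subdomain can be shown with a small helper. Determining the TLD+1 itself requires the public suffix list (walker does this internally), so this hypothetical `subdomain` function takes it as an argument and just performs the trim that Subdomain's documentation describes:

```go
package main

import (
	"fmt"
	"strings"
)

// subdomain returns what is left of host after removing its TLD+1, without
// the trailing period; "" if the host is exactly the TLD+1.
func subdomain(host, tldPlusOne string) string {
	if host == tldPlusOne {
		return ""
	}
	return strings.TrimSuffix(host, "."+tldPlusOne)
}

func main() {
	fmt.Println(subdomain("www.bbc.co.uk", "bbc.co.uk")) // "www"
	fmt.Println(subdomain("bbc.co.uk", "bbc.co.uk"))     // ""
}
```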
type WalkerConfig ¶
type WalkerConfig struct {
	AddNewDomains           bool     `yaml:"add_new_domains"`
	AddedDomainsCacheSize   int      `yaml:"added_domains_cache_size"`
	MaxDNSCacheEntries      int      `yaml:"max_dns_cache_entries"`
	UserAgent               string   `yaml:"user_agent"`
	AcceptFormats           []string `yaml:"accept_formats"`
	AcceptProtocols         []string `yaml:"accept_protocols"`
	DefaultCrawlDelay       int      `yaml:"default_crawl_delay"`
	MaxHTTPContentSizeBytes int64    `yaml:"max_http_content_size_bytes"`
	IgnoreTags              []string `yaml:"ignore_tags"`
	//TODO: allow -1 as a no max value
	MaxLinksPerPage         int  `yaml:"max_links_per_page"`
	NumSimultaneousFetchers int  `yaml:"num_simultaneous_fetchers"`
	BlacklistPrivateIPs     bool `yaml:"blacklist_private_ips"`

	Dispatcher struct {
		MaxLinksPerSegment   int     `yaml:"num_links_per_segment"`
		RefreshPercentage    float64 `yaml:"refresh_percentage"`
		NumConcurrentDomains int     `yaml:"num_concurrent_domains"`
	} `yaml:"dispatcher"`

	Cassandra struct {
		Hosts             []string `yaml:"hosts"`
		Keyspace          string   `yaml:"keyspace"`
		ReplicationFactor int      `yaml:"replication_factor"`
	} `yaml:"cassandra"`

	Console struct {
		Port              int    `yaml:"port"`
		TemplateDirectory string `yaml:"template_directory"`
		PublicFolder      string `yaml:"public_folder"`
	} `yaml:"console"`
}
WalkerConfig defines the available global configuration parameters for walker. It reads values straight from the config file (walker.yaml by default). See sample-walker.yaml for explanations and default values.
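As a sketch of what such a file might contain (the keys come from the struct tags above, but the values here are purely illustrative, not walker's defaults; consult sample-walker.yaml for those):

```yaml
# Hypothetical walker.yaml fragment
add_new_domains: false
default_crawl_delay: 1
num_simultaneous_fetchers: 10
blacklist_private_ips: true

cassandra:
  hosts: ["localhost"]
  keyspace: "walker"
  replication_factor: 3

console:
  port: 3000
```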
var Config WalkerConfig
Config is the configuration instance the rest of walker should access for global configuration values. See WalkerConfig for available config members.