
README

Walker

An efficient, scalable, continuous crawler leveraging Go/Cassandra


Alpha Warning

This project is a work in progress and not ready for production release. Much of the design described below is pending development. Stay tuned for an Alpha release.

Overview

Walker is a web crawler on its feet. It has been built from the start to be horizontally scalable, smart about recrawling, lean on storage, flexible about what can be done with data, and easy to set up. Use it if you:

  • Want a broad or scalable focused crawl of the web
  • Want to prioritize what you (re)crawl, and how often
  • Want control over where you store crawled data and what you use it for (walker needs to store links and metadata; where to store and what to do with returned data is up to you)
  • Want a smart crawler that will avoid junk (e.g. crawler traps)
  • Want the performance of Cassandra and flexibility to do batch processing
  • Want to crawl non-html file types
  • Aren't interested in built-in page rank generation and search indexing (or want to do it yourself)

Architecture in brief

Walker takes advantage of Cassandra's distributed nature to store all links it has crawled and still needs to crawl. The database holds these links, all domains we've seen (with metadata), and new segments (groups of links) to crawl for a given domain.

The fetcher manager component claims domains (meaning fetchers can be distributed anywhere they can connect to Cassandra), reads in their segments, and crawls pages politely, respecting robots.txt rules. It parses pages for new links to feed into the system and outputs crawled content. You can add your own content processor or use a built-in one, such as the handler that writes pages to local files.

The dispatcher runs batch jobs looking for domains that don't yet have segments generated, reads the links we already have, and intelligently chooses a subset to crawl next.

Note: the fetchers use a pluggable datastore component to tell them what to crawl (see the Datastore interface). Though the Cassandra datastore is the primary supported implementation, the fetchers could be backed by alternative implementations (in-memory, classic SQL, etc.) that may not need a dispatcher to run at all.
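
To make the flow concrete, here is a minimal sketch (not walker's actual fetcher code; fetchLoop and the fetch callback are hypothetical) of the claim/fetch/unclaim cycle a fetcher runs against the Datastore interface:

package sketch

import "github.com/iParadigms/walker"

// fetchLoop sketches the cycle a single fetcher performs against a
// Datastore. Simplified: the real fetcher also honors robots.txt rules
// and crawl delays. fetch is a stand-in for the real HTTP logic.
func fetchLoop(ds walker.Datastore, h walker.Handler, fetch func(*walker.URL) *walker.FetchResults) {
	for {
		host := ds.ClaimNewHost() // claim a domain that has a segment ready
		if host == "" {
			return // no domains available right now
		}
		for link := range ds.LinksForHost(host) {
			fr := fetch(link)           // perform the HTTP request
			h.HandleResponse(fr)        // hand content to the handler
			ds.StoreURLFetchResults(fr) // record the visit
		}
		ds.UnclaimHost(host) // let the dispatcher build a new segment
	}
}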

Console

Walker comes with a friendly console accessible from the browser. It provides an easy way to add new links to your crawl and see information about what you have crawled so far.

(Screenshot: the walker console)

Getting started

Setup

Make sure you have Go installed and a GOPATH set:

go get github.com/iParadigms/walker/...

To get going quickly, you need to install Cassandra. A simple install of Cassandra on Centos 6 is demonstrated below. See the DataStax documentation for non-RHEL-based installs and recommended settings (Oracle Java is recommended but not required).

echo "[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0" | sudo tee /etc/yum.repos.d/datastax.repo

sudo yum install java-1.7.0-openjdk dsc20

sudo service cassandra start # it can take a few minutes for this to actually start up

In order to run walker and Cassandra on your local machine, you may need to make the following changes to cassandra.yaml (a snippet with the resulting lines follows the list):

  • Change listen_address to empty
  • Change rpc_address to 0.0.0.0
  • sudo service cassandra restart
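
For reference, the resulting lines in cassandra.yaml (its location varies by install; on the Centos setup above it is typically under /etc/cassandra/conf/) look like:

listen_address:        # left empty so Cassandra binds to the local hostname
rpc_address: 0.0.0.0   # accept client connections on any interface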

Once you have this step completed you can go ahead and build the walker binary:

go install ./walker

The next step is to build the cassandra schema:

walker schema -o schema.txt
cqlsh -f schema.txt # optionally change replication information for the keyspace in schema.txt

This will create the schema for you. At this point the console can be started by running walker console.

Basic crawl

Once you've built a walker binary, you can easily crawl with the default handler, which simply writes pages to a directory structure in $PWD.

# These assume walker is in your $PATH
walker crawl # start crawling; runs a fetch manager, dispatcher, and console all-in-one
walker seed -u http://<test_site>.com # give it a seed URL
# Visit http://<your_machine>:3000 in your browser to see the console

# See more help info and other commands:
walker help

Writing your own handler

In most cases you will want to use walker for some kind of processing. The easiest way is to create a new Go project that implements your own handler. You can still take advantage of walker's command-line interface. For example:

package main

import (
	"github.com/iParadigms/walker"
	"github.com/iParadigms/walker/cmd"
)

type MyHandler struct{}

func (h *MyHandler) HandleResponse(res *walker.FetchResults) {
	// Do something with the response...
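	// res.URL is the link that was requested; res.Response holds the
	// *http.Response (nil if FetchError is set or ExcludedByRobots is
	// true). See walker.FetchResults for all available fields.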
}

func main() {
	cmd.Handler(&MyHandler{})
	cmd.Execute()
}

You can then run walker using your own handler easily:

go run main.go # Has the same CLI as the walker binary

Advanced features and configuration

See walker.yaml for extensive descriptions of the various configuration parameters available for walker. This file is the primary way of configuring your crawl. It is not required to exist, but will be read if it is in the working directory of the walker process or specified with a command-line parameter.

A small sampling of common configuration items:

# Fetcher configuration
fetcher:
    # Configure the User-Agent header
    user_agent: Walker (http://github.com/iParadigms/walker)

    # Configure which formats this crawler Accepts
    accept_formats: ["text/html", "text/*"]

    # Which links to accept based on protocol (a.k.a. scheme)
    accept_protocols: ["http", "https"]

    # Maximum size of http content
    max_http_content_size_bytes: 20971520 # 20MB

    # Crawl delay duration to use when unspecified by robots.txt. 
    default_crawl_delay: 1s

# Dispatcher configuration
dispatcher:
    # maximum number of links added to segments table per dispatch (must be >0)
    num_links_per_segment: 500

    # refresh_percentage is the percentage of links added per dispatch that have already been crawled.
    # So refresh_percentage = 25 means that 25% of the links added to segments on the next dispatch
    # will be refreshed (i.e. already crawled) links. This value must be >= 0 and <= 100.
    refresh_percentage: 25

# Cassandra configuration for the datastore.
# Generally these are used to create a gocql.ClusterConfig object
# (https://godoc.org/github.com/gocql/gocql#ClusterConfig).
#
cassandra:
    hosts: ["localhost"]
    timeout: "2s"

    # replication_factor is used when defining the initial keyspace.
    # For production clusters we recommend 3 replicas.
    replication_factor: 1

    # Whether to dynamically add new-found domains (or their links) to the crawl (a
    # broad crawl) or discard them, assuming desired domains are manually seeded.
    add_new_domains: false

License

All code contributed to the Walker repository is open source software released under the BSD 3-clause license. See LICENSE.txt for details.

Documentation

Overview

Package walker is an efficient, scalable, continuous crawler leveraging Go and Cassandra

This package provides the core walker libraries. The development API is documented here. See http://github.com/iParadigms/walker or README.md for an overview of the project.

Index

Constants

This section is empty.

Variables

View Source
var ConfigName = "walker.yaml"

ConfigName is the path (can be relative or absolute) to the config file that should be read.

View Source
var NotYetCrawled time.Time

NotYetCrawled is a convenience for time.Unix(0, 0), used as a crawl time in Walker for links that have not yet been fetched.

Functions

func GetTestFileDir

func GetTestFileDir() string

GetTestFileDir returns the directory where shared test files are stored, for example test config files. It will panic if it could not get the path from the runtime.

func LoadTestConfig

func LoadTestConfig(filename string)

LoadTestConfig loads the given test config yaml file. The given path is assumed to be relative to the `walker/test/` directory, the location of this file. This will panic if it cannot read the requested config file. If you expect an error or are testing ReadConfigFile, use `GetTestFileDir()` instead.

func MustReadConfigFile

func MustReadConfigFile(path string)

MustReadConfigFile calls ReadConfigFile and panics on error.

func PostConfigHooks

func PostConfigHooks()

PostConfigHooks allows code to set up data structures that depend on the config. It is always called right after the config file is consumed, but it is also public: if you modify the config in a test, you may need to call this function yourself. It is idempotent, so you can call it as many times as you like.

func ReadConfigFile

func ReadConfigFile(path string) error

ReadConfigFile sets a new path to find the walker yaml config file and forces a reload of the config.

func SetDefaultConfig

func SetDefaultConfig()

SetDefaultConfig resets the Config object to default values, regardless of what was set by any configuration file.
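
Taken together, a test that needs a specific configuration might use these functions as in the following sketch (the yaml file name is hypothetical):

package sketch

import "github.com/iParadigms/walker"

func configForTest() {
	// Load a specific config file; this forces a reload, and the
	// post-config hooks run right after the file is consumed.
	walker.MustReadConfigFile("testdata/test-walker.yaml")

	// Mutating walker.Config directly afterwards requires re-running the
	// hooks by hand; PostConfigHooks is idempotent, so extra calls are safe.
	walker.Config.Dispatcher.RefreshPercentage = 50
	walker.PostConfigHooks()
}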

Types

type ConfigStruct

type ConfigStruct struct {
	Fetcher struct {
		MaxDNSCacheEntries       int      `yaml:"max_dns_cache_entries"`
		UserAgent                string   `yaml:"user_agent"`
		AcceptFormats            []string `yaml:"accept_formats"`
		AcceptProtocols          []string `yaml:"accept_protocols"`
		MaxHTTPContentSizeBytes  int64    `yaml:"max_http_content_size_bytes"`
		IgnoreTags               []string `yaml:"ignore_tags"`
		MaxLinksPerPage          int      `yaml:"max_links_per_page"`
		NumSimultaneousFetchers  int      `yaml:"num_simultaneous_fetchers"`
		BlacklistPrivateIPs      bool     `yaml:"blacklist_private_ips"`
		HTTPTimeout              string   `yaml:"http_timeout"`
		HonorMetaNoindex         bool     `yaml:"honor_meta_noindex"`
		HonorMetaNofollow        bool     `yaml:"honor_meta_nofollow"`
		ExcludeLinkPatterns      []string `yaml:"exclude_link_patterns"`
		IncludeLinkPatterns      []string `yaml:"include_link_patterns"`
		DefaultCrawlDelay        string   `yaml:"default_crawl_delay"`
		MaxCrawlDelay            string   `yaml:"max_crawl_delay"`
		PurgeSidList             []string `yaml:"purge_sid_list"`
		ActiveFetchersTTL        string   `yaml:"active_fetchers_ttl"`
		ActiveFetchersCacheratio float32  `yaml:"active_fetchers_cacheratio"`
		ActiveFetchersKeepratio  float32  `yaml:"active_fetchers_keepratio"`
		HTTPKeepAlive            string   `yaml:"http_keep_alive"`
		HTTPKeepAliveThreshold   string   `yaml:"http_keep_alive_threshold"`
		MaxPathLength            int      `yaml:"max_path_length"`
	} `yaml:"fetcher"`

	Dispatcher struct {
		MaxLinksPerSegment         int     `yaml:"num_links_per_segment"`
		RefreshPercentage          float64 `yaml:"refresh_percentage"`
		NumConcurrentDomains       int     `yaml:"num_concurrent_domains"`
		MinLinkRefreshTime         string  `yaml:"min_link_refresh_time"`
		DispatchInterval           string  `yaml:"dispatch_interval"`
		CorrectLinkNormalization   bool    `yaml:"correct_link_normalization"`
		EmptyDispatchRetryInterval string  `yaml:"empty_dispatch_retry_interval"`
	} `yaml:"dispatcher"`

	Cassandra struct {
		Hosts                 []string `yaml:"hosts"`
		Keyspace              string   `yaml:"keyspace"`
		ReplicationFactor     int      `yaml:"replication_factor"`
		Timeout               string   `yaml:"timeout"`
		CQLVersion            string   `yaml:"cql_version"`
		ProtoVersion          int      `yaml:"proto_version"`
		Port                  int      `yaml:"port"`
		NumConns              int      `yaml:"num_conns"`
		NumStreams            int      `yaml:"num_streams"`
		DiscoverHosts         bool     `yaml:"discover_hosts"`
		MaxPreparedStmts      int      `yaml:"max_prepared_stmts"`
		AddNewDomains         bool     `yaml:"add_new_domains"`
		AddedDomainsCacheSize int      `yaml:"added_domains_cache_size"`
		StoreResponseBody     bool     `yaml:"store_response_body"`
		StoreResponseHeaders  bool     `yaml:"store_response_headers"`
		NumQueryRetries       int      `yaml:"num_query_retries"`
		DefaultDomainPriority int      `yaml:"default_domain_priority"`
	} `yaml:"cassandra"`

	Console struct {
		Port                     int    `yaml:"port"`
		TemplateDirectory        string `yaml:"template_directory"`
		PublicFolder             string `yaml:"public_folder"`
		MaxAllowedDomainPriority int    `yaml:"max_allowed_domain_priority"`
	} `yaml:"console"`
}

ConfigStruct defines the available global configuration parameters for walker. It reads values straight from the config file (walker.yaml by default). See sample-walker.yaml for explanations and default values.

var Config ConfigStruct

Config is the configuration instance the rest of walker should access for global configuration values. See ConfigStruct for available config members.

type Datastore

type Datastore interface {
	// ClaimNewHost returns a hostname that is now claimed for this crawler to
	// crawl. A segment of links for this host is assumed to be available.
	// Returns the domain of the segment it claimed, or "" if there are none
	// available.
	ClaimNewHost() string

	// UnclaimHost indicates that all links from `LinksForHost` have been
	// processed, so other work may be done with this host. For example the
	// dispatcher will be free to analyze the links and generate a new segment.
	UnclaimHost(host string)

	// LinksForHost returns a channel that will feed URLs for a given host.
	LinksForHost(host string) <-chan *URL

	// StoreURLFetchResults takes the return data/metadata from a fetch and
	// stores the visit. Fetchers will call this once for each link in the
	// segment being crawled.
	StoreURLFetchResults(fr *FetchResults)

	// StoreParsedURL stores a URL parsed out of a page (i.e. a URL we may not
	// have crawled yet). `u` is the URL to store. `fr` is the FetchResults
	// object for the fetch from which we got the URL, for any context the
	// datastore may want. A datastore implementation should handle `fr` being
	// nil, so links can be seeded without a fetch having occurred.
	//
	// URLs passed to StoreParsedURL should be absolute.
	//
	// This layer should handle deduplicating links efficiently (i.e. a
	// fetcher should be safe feeding the same URL many times).
	StoreParsedURL(u *URL, fr *FetchResults)

	// KeepAlive will be called periodically in fetcher. This method should
	// notify the datastore that this fetcher is still alive.
	KeepAlive() error

	// Close will be called when no more Datastore calls will be made, allowing
	// any necessary cleanup to take place.
	Close()
}

Datastore defines the interface for an object to be used as walker's datastore.

Note that this is for link and metadata storage required to make walker function properly. It has nothing to do with storing fetched content (see `Handler` for that).
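
As an illustration of the contract only (not a supported implementation), a toy in-memory datastore might look like the sketch below. MemDatastore and its internals are hypothetical, and it ignores goroutine-safety entirely:

package sketch

import "github.com/iParadigms/walker"

// MemDatastore serves a fixed set of links from memory and discards
// everything it is asked to store. Not goroutine-safe.
type MemDatastore struct {
	links   map[string][]*walker.URL // host -> pending links
	claimed map[string]bool          // hosts currently claimed by a fetcher
}

var _ walker.Datastore = (*MemDatastore)(nil) // compile-time interface check

func (d *MemDatastore) ClaimNewHost() string {
	for host := range d.links {
		if !d.claimed[host] {
			d.claimed[host] = true
			return host
		}
	}
	return "" // nothing available to crawl
}

func (d *MemDatastore) UnclaimHost(host string) {
	d.claimed[host] = false
}

func (d *MemDatastore) LinksForHost(host string) <-chan *walker.URL {
	ch := make(chan *walker.URL)
	go func() {
		for _, u := range d.links[host] {
			ch <- u
		}
		close(ch)
	}()
	return ch
}

func (d *MemDatastore) StoreURLFetchResults(fr *walker.FetchResults) {
	// A real implementation records the visit; this sketch drops it.
}

func (d *MemDatastore) StoreParsedURL(u *walker.URL, fr *walker.FetchResults) {
	// Remember: fr may be nil (seeded links), u should be absolute, and
	// duplicates must be handled efficiently.
}

func (d *MemDatastore) KeepAlive() error { return nil }

func (d *MemDatastore) Close() {}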

type Dispatcher

type Dispatcher interface {
	// StartDispatcher should be a blocking call that starts the dispatcher. It
	// should return an error if it could not start or stop properly and nil
	// when it has safely shut down and stopped all internal processing.
	StartDispatcher() error

	// Stop signals the dispatcher to stop. It should block until all internal
	// goroutines have stopped.
	StopDispatcher() error
}

Dispatcher defines the calls a dispatcher should respond to. A dispatcher would typically be paired with a particular Datastore, and not all Datastore implementations may need a Dispatcher.

A basic crawl will likely run the dispatcher in the same process as the fetchers, but higher-scale crawl setups may run dispatchers separately.
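
The start-blocks/stop-unblocks contract can be seen in a hypothetical do-nothing implementation (NoopDispatcher is not part of walker):

package sketch

// NoopDispatcher does no dispatching; it only models the blocking contract.
type NoopDispatcher struct {
	quit chan struct{} // closed by StopDispatcher to request shutdown
	done chan struct{} // closed by StartDispatcher once it has wound down
}

func NewNoopDispatcher() *NoopDispatcher {
	return &NoopDispatcher{
		quit: make(chan struct{}),
		done: make(chan struct{}),
	}
}

// StartDispatcher blocks until StopDispatcher is called. A real dispatcher
// would loop here, generating segments for domains that need them.
func (d *NoopDispatcher) StartDispatcher() error {
	<-d.quit
	close(d.done)
	return nil
}

// StopDispatcher signals shutdown and blocks until StartDispatcher returns.
func (d *NoopDispatcher) StopDispatcher() error {
	close(d.quit)
	<-d.done
	return nil
}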

type FetchManager

type FetchManager struct {
	// Handler must be set to handle fetch responses.
	Handler Handler

	// Datastore must be set to drive the fetching.
	Datastore Datastore

	// Transport can be set to override the default network transport the
	// FetchManager is going to use. Good for faking remote servers for
	// testing.
	Transport http.RoundTripper

	// TransNoKeepAlive stores a RoundTripper with Keep-Alive set to 0 IF
	// http_keep_alive == "threshold". Otherwise it's nil.
	TransNoKeepAlive http.RoundTripper

	// Parsed duration of the string Config.Fetcher.HTTPKeepAliveThreshold
	KeepAliveThreshold time.Duration
	// contains filtered or unexported fields
}

FetchManager configures and runs the crawl.

The calling code must create a FetchManager, set a Datastore and handlers, then call `Start()`.
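
A minimal wiring sketch, assuming a Datastore and Handler are already in hand:

package sketch

import (
	"time"

	"github.com/iParadigms/walker"
)

func runCrawl(ds walker.Datastore, h walker.Handler) {
	fm := &walker.FetchManager{
		Datastore: ds, // e.g. the cassandra implementation
		Handler:   h,  // your FetchResults processor
	}
	go fm.Start() // Start blocks, so run it in a goroutine

	time.Sleep(time.Minute) // ...crawl for a while...

	fm.Stop() // waits for fetchers to finish their current requests
}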

func (*FetchManager) Start

func (fm *FetchManager) Start()

Start starts a FetchManager. Always pair `go Start()` with a call to `Stop()`.

func (*FetchManager) Stop

func (fm *FetchManager) Stop()

Stop notifies the fetchers to finish their current requests. It blocks until all fetchers have finished.

type FetchResults

type FetchResults struct {

	// URL that was requested; will always be populated. If this URL redirects,
	// RedirectedFrom will contain a list of all requested URLs.
	URL *URL

	// A list of redirects. During this request cycle, the first request URL
	// is stored in URL. The second request (the first redirect) is stored in
	// RedirectedFrom[0], and the Nth request (the (N-1)th redirect) is stored
	// in RedirectedFrom[N-2]; this last URL is the one that furnished the
	// http.Response.
	RedirectedFrom []*URL

	// Response object; nil if there was a FetchError or ExcludedByRobots is
	// true. Response.Body may not be the same object the HTTP request actually
	// returns; the fetcher may have read in the response to parse out links,
	// replacing Response.Body with an alternate reader.
	Response *http.Response

	// If the user has set cassandra.store_response_body to true in the config file,
	// then the content of the link will be stored in Body (and consequently stored in the
	// body column of the links table). Otherwise Body is the empty string.
	Body string

	// FetchError if the net/http request had an error (non-2XX HTTP response
	// codes are not considered errors)
	FetchError error

	// Time at the beginning of the request (if a request was made)
	FetchTime time.Time

	// True if we did not request this link because it is excluded by
	// robots.txt rules
	ExcludedByRobots bool

	// True if the page was marked as 'noindex' via a <meta> tag. Whether it
	// was crawled depends on the honor_meta_noindex configuration parameter
	MetaNoIndex bool

	// True if the page was marked as 'nofollow' via a <meta> tag. Whether it
	// was crawled depends on the honor_meta_nofollow configuration parameter
	MetaNoFollow bool

	// The Content-Type of the fetched page.
	MimeType string

	// Fingerprint of the response body, computed with the FNV algorithm (see
	// hash/fnv in the standard library)
	FnvFingerprint int64

	// Fingerprint of the text parsed out of the response body, also computed
	// with fnv
	FnvTextFingerprint int64
}

FetchResults contains all relevant context and return data from an individual fetch. Handlers receive this to process results.
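
Code consuming a FetchResults typically branches on the exclusion and error fields before touching Response; a hypothetical handler sketch:

package sketch

import (
	"log"

	"github.com/iParadigms/walker"
)

type LoggingHandler struct{}

func (h *LoggingHandler) HandleResponse(fr *walker.FetchResults) {
	if fr.ExcludedByRobots {
		return // no request was made, so fr.Response is nil
	}
	if fr.FetchError != nil {
		log.Printf("fetch of %v failed: %v", fr.URL, fr.FetchError)
		return
	}
	// Non-2XX codes are not FetchErrors, so check the status yourself.
	if fr.Response.StatusCode != 200 {
		return
	}
	log.Printf("fetched %v (%s) at %v", fr.URL, fr.MimeType, fr.FetchTime)
}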

type HTMLParser

type HTMLParser struct {
	// A concatenation of all text, excluding content from script/style tags
	Text []byte
	// A list of links found on the parsed page
	Links []*URL
	// true if <meta name="ROBOTS" content="noindex"> was found
	HasMetaNoIndex bool
	// true if <meta name="ROBOTS" content="nofollow"> was found
	HasMetaNoFollow bool
}

HTMLParser simply parses HTML passed from the fetcher. A new struct is intended to have Parse() called on it, which will populate its member variables for reading.

func (*HTMLParser) Parse

func (p *HTMLParser) Parse(body []byte)

Parse parses the given content body as HTML and populates instance variables as it is able. Parse errors will cause the parser to finish with whatever it has found so far. This method will reset its instance variables if run repeatedly.
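
A usage sketch:

package sketch

import (
	"fmt"

	"github.com/iParadigms/walker"
)

func parsePage(body []byte) {
	p := &walker.HTMLParser{}
	p.Parse(body) // best-effort: keeps whatever was found before any parse error

	fmt.Printf("found %d links, noindex=%v, nofollow=%v\n",
		len(p.Links), p.HasMetaNoIndex, p.HasMetaNoFollow)
}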

type Handler

type Handler interface {
	// HandleResponse will be called by fetchers as they make requests.
	// Handlers can do whatever they want with responses. HandleResponse will
	// be called as long as the request successfully reached the remote server
	// and got an HTTP code. This means there should never be a FetchError set
	// on the FetchResults.
	HandleResponse(res *FetchResults)
}

Handler defines the interface for objects that will be set as handlers on a FetchManager.

type MockDatastore

type MockDatastore struct {
	mock.Mock
}

MockDatastore implements walker's Datastore interface for testing.
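
Assuming mock.Mock is testify's mock type (github.com/stretchr/testify/mock), expectations follow the usual testify pattern; a sketch:

package sketch

import (
	"testing"

	"github.com/iParadigms/walker"
)

func TestWithMockDatastore(t *testing.T) {
	ds := &walker.MockDatastore{}

	ds.On("ClaimNewHost").Return("") // report that no hosts are available
	ds.On("KeepAlive").Return(nil)
	ds.On("Close").Return()

	// ...run the code under test against ds...

	ds.AssertExpectations(t)
}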

func (*MockDatastore) ClaimNewHost

func (ds *MockDatastore) ClaimNewHost() string

ClaimNewHost implements walker.Datastore interface

func (*MockDatastore) Close

func (ds *MockDatastore) Close()

func (*MockDatastore) KeepAlive

func (ds *MockDatastore) KeepAlive() error

KeepAlive implements walker.Datastore interface

func (*MockDatastore) LinksForHost

func (ds *MockDatastore) LinksForHost(domain string) <-chan *URL

func (*MockDatastore) StoreParsedURL

func (ds *MockDatastore) StoreParsedURL(u *URL, fr *FetchResults)

func (*MockDatastore) StoreURLFetchResults

func (ds *MockDatastore) StoreURLFetchResults(fr *FetchResults)

func (*MockDatastore) UnclaimAll

func (ds *MockDatastore) UnclaimAll() error

UnclaimAll implements method on cassandra.Datastore

func (*MockDatastore) UnclaimHost

func (ds *MockDatastore) UnclaimHost(host string)

UnclaimHost implements walker.Datastore interface

type MockDispatcher

type MockDispatcher struct {
	mock.Mock
}

MockDispatcher implements the walker.Dispatcher interface

func (*MockDispatcher) StartDispatcher

func (d *MockDispatcher) StartDispatcher() error

StartDispatcher implements the walker.Dispatcher interface

func (*MockDispatcher) StopDispatcher

func (d *MockDispatcher) StopDispatcher() error

StopDispatcher implements the walker.Dispatcher interface

type MockHTTPHandler

type MockHTTPHandler struct {
	// contains filtered or unexported fields
}

MockHTTPHandler implements http.Handler to serve mock requests.

It is not a mere mock.Mock object because using `.Return()` to return *http.Response objects is hard to do, and this provides conveniences in our tests.

It should be instantiated with `NewMockRemoteServer()`

func NewMockHTTPHandler

func NewMockHTTPHandler() *MockHTTPHandler

NewMockHTTPHandler creates a new MockHTTPHandler

func (*MockHTTPHandler) ServeHTTP

func (s *MockHTTPHandler) ServeHTTP(w http.ResponseWriter, r *http.Request)

ServeHTTP implements http.Handler interface

func (*MockHTTPHandler) SetResponse

func (s *MockHTTPHandler) SetResponse(link string, r *MockResponse)

SetResponse sets a mock response for the server to return when it sees an incoming request matching the given link. The link should have a scheme and host (ex. "http://test.com/stuff"). Empty fields on MockResponse will be filled in with default values (see MockResponse)
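
A typical test might look like the following sketch (the link and response values are arbitrary):

package sketch

import (
	"testing"

	"github.com/iParadigms/walker"
)

func TestCrawlAgainstMockServer(t *testing.T) {
	rs, err := walker.NewMockRemoteServer() // listens on port 80
	if err != nil {
		t.Fatal(err)
	}
	defer rs.Stop()

	rs.SetResponse("http://test.com/stuff", &walker.MockResponse{
		Status: 404, // unset fields are filled with MockResponse defaults
	})

	// ...point a fetcher at http://test.com/stuff...

	if !rs.Requested("GET", "http://test.com/stuff") {
		t.Error("expected the link to be fetched")
	}
}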

type MockHandler

type MockHandler struct {
	mock.Mock
}

MockHandler implements the walker.Handler interface

func (*MockHandler) HandleResponse

func (h *MockHandler) HandleResponse(fr *FetchResults)

type MockRemoteServer

type MockRemoteServer struct {
	*MockHTTPHandler
	// contains filtered or unexported fields
}

MockRemoteServer wraps MockHTTPHandler to start a fake server for the user. Use `NewMockRemoteServer()`

func NewMockRemoteServer

func NewMockRemoteServer() (*MockRemoteServer, error)

NewMockRemoteServer starts a server listening on port 80. It wraps MockHTTPHandler so mock return values can be set. Stop should be called at the end of the test to stop the server.

func (*MockRemoteServer) Headers

func (rs *MockRemoteServer) Headers(method string, url string, depth int) (http.Header, error)

Headers allows the user to inspect the headers included in a request sent to MockRemoteServer. The triple (method, url, depth) selects which header to return. Here:

(a) method is the HTTP method (GET, POST, etc.)
(b) url is the full URL of the page that received the request.
(c) depth is an integer specifying which (of possibly many) headers for the given (method, url) pair to return; use depth = -1 to get the latest header.

func (*MockRemoteServer) Requested

func (rs *MockRemoteServer) Requested(method string, url string) bool

Requested returns true if the url was requested, and false otherwise.

func (*MockRemoteServer) Stop

func (rs *MockRemoteServer) Stop()

Stop will stop the faux-server.

type MockResponse

type MockResponse struct {
	// Status defaults to 200
	Status int

	// Method defaults to "GET"
	Method string

	// Body defaults to nil (no response body)
	Body string

	// Headers of response
	Headers http.Header

	//ContentType defaults to "text/html"
	ContentType string

	// How long is the content
	ContentLength int
}

MockResponse is the source object used to build fake responses in MockHTTPHandler.

type URL

type URL struct {
	*url.URL

	// LastCrawled is the last time we crawled this URL, so that, for example,
	// a Last-Modified header can be used.
	LastCrawled time.Time
}

URL is the walker URL object, which embeds *url.URL but has extra data and capabilities used by walker. Note that LastCrawled should not be set to its zero value; it should be set to NotYetCrawled.

func CreateURL

func CreateURL(domain, subdomain, path, protocol string, lastcrawled time.Time) (*URL, error)

CreateURL creates a walker URL from values usually pulled out of the datastore. subdomain may optionally include a trailing '.', and path may optionally include a prefixed '/'.

func MustParse

func MustParse(ref string) *URL

MustParse is a helper for calling ParseURL when we know the string is a safe URL. It will panic if it fails.

func ParseAndNormalizeURL

func ParseAndNormalizeURL(ref string) (*URL, error)

ParseAndNormalizeURL will walker.ParseURL the argument string, and then Normalize the resulting URL.

func ParseURL

func ParseURL(ref string) (*URL, error)

ParseURL is the walker.URL equivalent of url.Parse. Note: all URLs should be passed through this function so that we get consistency.

func (*URL) Clone

func (u *URL) Clone() *URL

Clone will create a copy of this walker.URL

func (*URL) Equal

func (u *URL) Equal(other *URL) bool

Equal returns true if this link is identical to `other`.

func (*URL) EqualIgnoreLastCrawled

func (u *URL) EqualIgnoreLastCrawled(other *URL) bool

EqualIgnoreLastCrawled returns true if the URL portion of this link (excluding LastCrawled) is equal to `other`.

func (*URL) MakeAbsolute

func (u *URL) MakeAbsolute(base *URL)

MakeAbsolute uses URL.ResolveReference to make this URL object an absolute reference (having a Scheme and Host), if it is not one already. It is resolved using `base` as the base URL.

func (*URL) Normalize

func (u *URL) Normalize()

Normalize will process the URL according to the current set of normalizing rules.

func (*URL) NormalizedForm

func (u *URL) NormalizedForm() *URL

NormalizedForm returns nil if u is normalized. Otherwise, return the normalized version of u.

func (*URL) PrimaryKey

func (u *URL) PrimaryKey() (dom string, subdom string, path string, proto string, time time.Time, err error)

PrimaryKey returns the 5-tuple that is the primary key for this URL in the links table, plus any error that occurred. The return values are (with Cassandra keys in parens):

(a) Domain (dom)
(b) Subdomain (subdom)
(c) Path part of the URL (path)
(d) Scheme of the URL (proto)
(e) Last update time of the link (time)
(f) Any error that occurred

func (*URL) Subdomain

func (u *URL) Subdomain() (string, error)

Subdomain provides the remaining subdomain after removing the ToplevelDomainPlusOne. For example http://www.bbc.co.uk/ will return 'www' as the subdomain (note that there is no trailing period). If there is no subdomain it will return "".

func (*URL) TLDPlusOneAndSubdomain

func (u *URL) TLDPlusOneAndSubdomain() (string, string, error)

TLDPlusOneAndSubdomain is a convenience function that calls ToplevelDomainPlusOne and Subdomain, returning an error if we could not get either one. The first return is the TLD+1 and the second is the subdomain.

func (*URL) ToplevelDomainPlusOne

func (u *URL) ToplevelDomainPlusOne() (string, error)

ToplevelDomainPlusOne returns the Effective Toplevel Domain of this host as defined by https://publicsuffix.org/, plus one extra domain component.

For example the TLD of http://www.bbc.co.uk/ is 'co.uk', plus one is 'bbc.co.uk'. Walker uses these TLD+1 domains as the primary unit of grouping.
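
A short sketch using the methods above:

package sketch

import (
	"fmt"

	"github.com/iParadigms/walker"
)

func domainParts() {
	u, err := walker.ParseURL("http://www.bbc.co.uk/")
	if err != nil {
		panic(err)
	}

	tldPlusOne, _ := u.ToplevelDomainPlusOne() // "bbc.co.uk"
	subdomain, _ := u.Subdomain()              // "www"
	fmt.Println(tldPlusOne, subdomain)
}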

Directories

Path	Synopsis

cassandra	Package cassandra implements walker.Datastore with the Cassandra database.
cmd	Package cmd provides access to build on the walker CLI. This package makes it easy to create custom walker binaries that use their own Handler, Datastore, or Dispatcher.
console	Package console implements a web console for Walker in Go.
dnscache	Package dnscache implements a Dial function that will cache DNS resolutions.
mimetools	Package mimetools provides functions for matching against media types (also referred to as MIME types; see http://en.wikipedia.org/wiki/Internet_media_type). It primarily exports Matcher:

	mm := mimetools.NewMatcher([]string{"text/*", "application/json"})
	mm.Match("text/html")                // returns true, nil
	mm.Match("text/plain")               // returns true, nil
	mm.Match("application/json")         // returns true, nil
	mm.Match("application/vnd.ms-excel") // returns false, nil

simplehandler	Package simplehandler provides a basic walker handler implementation.
