cassandra

package
v0.0.0-...-3fff119 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 27, 2015 License: BSD-3-Clause Imports: 16 Imported by: 0

Documentation

Overview

Package cassandra implements walker.Datastore with the Cassandra database

Index

Constants

This section is empty.

Variables

View Source
var MaxPriorityPeriod time.Duration

Functions

func CreateSchema

func CreateSchema() error

CreateSchema creates the walker schema in the configured Cassandra database. It requires that the keyspace not already exist (so as to losing non-test data), with the exception of the walker_test schema, which it will drop automatically.

func GetConfig

func GetConfig() *gocql.ClusterConfig

GetConfig returns a fresh ClusterConfig, configured against walker.Config

func GetSchema

func GetSchema() string

GetSchema returns the CQL schema for this version of the cassandra datastore. Certain values, like keyspace and replication factor, are dynamically inserted.

func GetTestDB

func GetTestDB() *gocql.Session

GetTestDB ensures that a cassandra schema is loaded and all data is purged for testing purposes. It returns a gocql session or panics if anything failed. For safety's sake it may ONLY be used if the cassandra keyspace is `walker_test` and will panic if it isn't.

Types

type DQ

type DQ struct {
	// When listing domains, the seed should be the domain preceding the
	// queried set. When paginating, use the last domain of the previous set as
	// the seed.
	// Default: select from the beginning
	Seed string

	// Limit the returned results, used for pagination.
	// Default: no limit
	Limit int

	// Set to true to get only dispatched domains
	// default: get all domains
	Working bool
}

DQ is a domain query struct used for getting domains from cassandra. Zero-values mean use default behavior.

type Datastore

type Datastore struct {
	// contains filtered or unexported fields
}

Datastore is the primary walker Datastore implementation, using Apache Cassandra as a highly scalable backend. It provides extra access calls for the database for use in the console and other applications.

NewDatastore should be used to create one.

func NewDatastore

func NewDatastore() (*Datastore, error)

NewDatastore creates a Cassandra session and initializes a Datastore

func (*Datastore) ClaimNewHost

func (ds *Datastore) ClaimNewHost() string

ClaimNewHost is documented on the walker.Datastore interface.

func (*Datastore) Close

func (ds *Datastore) Close()

Close will close the Datastore

func (*Datastore) FindDomain

func (ds *Datastore) FindDomain(domain string) (*DomainInfo, error)
func (ds *Datastore) FindLink(u *walker.URL, collectContent bool) (*LinkInfo, error)
func (ds *Datastore) InsertLink(link string, excludeDomainReason string) error
func (ds *Datastore) InsertLinks(links []string, excludeDomainReason string) []error

func (*Datastore) KeepAlive

func (ds *Datastore) KeepAlive() error

KeepAlive is documented on the walker.Datastore interface.

func (*Datastore) LinksForHost

func (ds *Datastore) LinksForHost(domain string) <-chan *walker.URL

LinksForHost is documented on the walker.Datastore interface.

func (*Datastore) ListDomains

func (ds *Datastore) ListDomains(query DQ) ([]*DomainInfo, error)

func (*Datastore) ListLinkHistorical

func (ds *Datastore) ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error)
func (ds *Datastore) ListLinks(domain string, query LQ) ([]*LinkInfo, error)

func (*Datastore) MaxPriority

func (ds *Datastore) MaxPriority() int

func (*Datastore) StoreParsedURL

func (ds *Datastore) StoreParsedURL(u *walker.URL, fr *walker.FetchResults)

StoreParsedURL is documented on the walker.Datastore interface.

func (*Datastore) StoreURLFetchResults

func (ds *Datastore) StoreURLFetchResults(fr *walker.FetchResults)

StoreURLFetchResults is documented on the walker.Datastore interface.

func (*Datastore) UnclaimAll

func (ds *Datastore) UnclaimAll() error

UnclaimAll iterates domains to unclaim them. Crawlers will unclaim domains by themselves, but this is used in case crawlers crash or are killed and have left domains claimed.

func (*Datastore) UnclaimHost

func (ds *Datastore) UnclaimHost(host string)

UnclaimHost is documented on the walker.Datastore interface.

func (*Datastore) UpdateDomain

func (ds *Datastore) UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error

type Dispatcher

type Dispatcher struct {
	// contains filtered or unexported fields
}

Dispatcher analyzes what we've crawled so far (generally on a per-domain basis) and updates the database. At minimum this means generating new segments to crawl in the `segments` table, but it can also mean updating domain_info if we find out new things about a domain.

This dispatcher has been designed to run simultaneously with the fetchmanager. Fetchers and dispatchers claim domains in Cassandra, so the dispatcher can operate on the domains not currently being crawled (and vice versa).

Always create a Dispatcher using NewDispatcher()

func NewDispatcher

func NewDispatcher() (*Dispatcher, error)

func (*Dispatcher) StartDispatcher

func (d *Dispatcher) StartDispatcher() error

func (*Dispatcher) StopDispatcher

func (d *Dispatcher) StopDispatcher() error

StopDispatcher stops the dispatcher.

type DomainInfo

type DomainInfo struct {
	// TLD+1
	Domain string

	// Is this domain excluded from the crawl?
	Excluded bool

	// Why did this domain get excluded, or empty if not excluded
	ExcludeReason string

	// When did this domain last get queued to be crawled. Or TimeQueed.IsZero() if not crawled
	ClaimTime time.Time

	// What was the UUID of the crawler that last crawled the domain
	ClaimToken gocql.UUID

	// Number of (unique) links found in this domain
	NumberLinksTotal int

	// Number of (unique) links queued to be processed for this domain
	NumberLinksQueued int

	// Number of links not yet crawled
	NumberLinksUncrawled int

	// Priority of this domain
	Priority int
}

DomainInfo defines a row from the domain_info table

type DomainInfoUpdateConfig

type DomainInfoUpdateConfig struct {

	// Setting Exclude to true indicates that the ExcludeReason field of the
	// DomainInfo passed to UpdateDomain should be persisted to the database.
	Exclude bool

	// Setting Priority to true indicates that the Priority field of the
	// DomainInfo passed to UpdateDomain should be persisted to the database.
	Priority bool
}

DomainInfoUpdateConfig is used to configure the method Datastore.UpdateDomain

type LQ

type LQ struct {
	// When listing links, the seed should be the URL preceding the queried
	// set. When paginating, use the last URL of the previous set as the seed.
	// Default: select from the beginning
	Seed *walker.URL

	// Limit the returned results, used for pagination.
	// Default: no limit
	Limit int

	FilterRegex string
}

LQ is a link query struct used for gettings links from cassandra. Zero-values mean use default behavior.

type LinkInfo

type LinkInfo struct {
	// URL of the link
	URL *walker.URL

	// Status of the fetch
	Status int

	// When did this link get crawled
	CrawlTime time.Time

	// Any error reported when attempting to fetch the URL
	Error string

	// Was this excluded by robots
	RobotsExcluded bool

	// URL this link redirected to if it was a redirect
	RedirectedTo string

	// Whether this link was flagged for immediate fetching
	GetNow bool

	// Mime type (or Content-Type) of the returned data
	Mime string

	// FNV hash of the contents
	FnvFingerprint int64

	// FNV hash of the text extracted from the page
	FnvTextFingerprint int64

	// Body of request (if configured to be stored)
	Body string

	// Header of request (if configured to be stored)
	Headers http.Header
}

LinkInfo defines a row from the link or segment table

type LinkList []*LinkInfo

LinkList is a list of LinkInfos that implements sort.Interface, so we can easily sort and deduplicate it

func (LinkList) Len

func (l LinkList) Len() int

func (LinkList) Less

func (l LinkList) Less(i, j int) bool

func (LinkList) Swap

func (l LinkList) Swap(i, j int)

func (LinkList) Uniq

func (l LinkList) Uniq()

Uniq assumes this list of links is sorted and gets rid of identical links

type MockModelDatastore

type MockModelDatastore struct {
	walker.MockDatastore
}

MockModelDatastore implements walker/cassandra's ModelDatastore interface for testing.

func (*MockModelDatastore) FindDomain

func (ds *MockModelDatastore) FindDomain(domain string) (*DomainInfo, error)
func (ds *MockModelDatastore) FindLink(u *walker.URL, collectContent bool) (*LinkInfo, error)
func (ds *MockModelDatastore) InsertLink(link string, excludeDomainReason string) error
func (ds *MockModelDatastore) InsertLinks(links []string, excludeDomainReason string) []error

func (*MockModelDatastore) ListDomains

func (ds *MockModelDatastore) ListDomains(query DQ) ([]*DomainInfo, error)

func (*MockModelDatastore) ListLinkHistorical

func (ds *MockModelDatastore) ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error)
func (ds *MockModelDatastore) ListLinks(domain string, query LQ) ([]*LinkInfo, error)

func (*MockModelDatastore) UpdateDomain

func (ds *MockModelDatastore) UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error

type ModelDatastore

type ModelDatastore interface {
	walker.Datastore

	// FindDomain returns the DomainInfo for the specified domain
	FindDomain(domain string) (*DomainInfo, error)

	// ListDomains returns a slice of DomainInfo structs populated according to
	// the specified DQ (domain query)
	ListDomains(query DQ) ([]*DomainInfo, error)

	// UpdateDomain updates the given domain with fields from `info`. Which
	// fields will be persisted to the store from the argument DomainInfo is
	// configured from the DomainInfoUpdateConfig argument. For example, to
	// persist the Priority field in the info strut, one would pass
	// DomainInfoUpdateConfig{Priority: true} as the cfg argument to
	// UpdateDomain.
	UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error

	// FindLink returns a LinkInfo matching the given URL. Arguments to this
	// function are: (a) u is the url to find (b) collectContent, if true,
	// indicates that Body and Headers field of LinkInfo will be populated.
	FindLink(u *walker.URL, collectContent bool) (*LinkInfo, error)

	// ListLinks fetches links for the given domain according to the given LQ
	// (Link Query)
	ListLinks(domain string, query LQ) ([]*LinkInfo, error)

	// ListLinkHistorical gets the crawl history of a specific link
	ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error)

	// InsertLink inserts the given link into the database, adding it's domain
	// if it does not exist. If excludeDomainReason is not empty, this domain
	// will be excluded from crawling marked with the given reason.
	InsertLink(link string, excludeDomainReason string) error

	// InsertLinks does the same as InsertLink with many potential errors. It
	// will insert as many as it can (it won't stop once it hits a bad link)
	// and only return errors for problematic links or domains.
	InsertLinks(links []string, excludeDomainReason string) []error
}

ModelDatastore defines additional methods for querying and modifying domains and links in walker, and includes the walker.Datastore interface (which is intended only for the bare minimum that fetchers in the walker package need). This interface is good for use by the console and other tools that need CRUD-like capabilities.

type PriorityURL

type PriorityURL []*LinkInfo

PriorityURL is a heap of URLs, where the next element Pop'ed off the list points to the oldest (as measured by LastCrawled) element in the list. This class is designed to be used with the container/heap package. This type is currently only used in generateSegments

func (PriorityURL) Len

func (pq PriorityURL) Len() int

Returns the length of this PriorityURL

func (PriorityURL) Less

func (pq PriorityURL) Less(i, j int) bool

Return logical less-than between two items in this PriorityURL

func (*PriorityURL) Pop

func (pq *PriorityURL) Pop() interface{}

Pop an item onto this PriorityURL

func (*PriorityURL) Push

func (pq *PriorityURL) Push(x interface{})

Push an item onto this PriorityURL

func (PriorityURL) Swap

func (pq PriorityURL) Swap(i, j int)

Swap two items in this PriorityURL

type SegmentGenerator

type SegmentGenerator struct {
	// A DB handle for the generator to use. Should be provided when
	// constructing a SegmentGenerator
	DB *gocql.Session
	// contains filtered or unexported fields
}

SegmentGenerator is the dispatcher component for generating a segment of links for an individual domain. See the Generate() function.

func (*SegmentGenerator) Generate

func (sg *SegmentGenerator) Generate(domain string) error

Generate reads links in for this domain, generates a segment for it, and inserts the domain into domains_to_crawl (assuming a segment is ready to go)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL