Documentation ¶
Overview ¶
Package sitemapper provides a parallel site crawler for producing site maps.
Index ¶
- Constants
- func ValidateHosts(root, link *url.URL) bool
- type Config
- type DomainCrawler
- type DomainValidator
- type DomainValidatorFunc
- type LinkReader
- type Option
- func SetClient(client *http.Client) Option
- func SetCrawlTimeout(crawlTimeout time.Duration) Option
- func SetDomainValidator(validator DomainValidator) Option
- func SetKeepAlive(keepAlive time.Duration) Option
- func SetLogger(logger *zap.Logger) Option
- func SetMaxConcurrency(maxConcurrency int) Option
- func SetMaxPendingURLS(maxPendingURLS int) Option
- func SetTimeout(timeout time.Duration) Option
- type SiteMap
Constants ¶
const DefaultCrawlTimeout = time.Duration(0)
DefaultCrawlTimeout limits the total amount of time spent crawling. When 0 there is no limit.
const DefaultKeepAlive = time.Second * 30
DefaultKeepAlive is the default keepalive timeout for client connections.
const DefaultMaxConcurrency = 8
DefaultMaxConcurrency sets the number of goroutines to be used to crawl pages. This default is used to configure the transport of the default http client so that there are enough connections to support the number of goroutines used.
const DefaultMaxPendingURLS = 8192
DefaultMaxPendingURLS limits the size of the pending URL queue. This prevents the queue from growing faster than it can be drained. That wouldn't normally be expected to happen, but poorly designed URLs containing data that changes on every page load could cause it.
const DefaultTimeout = time.Second * 10
DefaultTimeout is the default timeout used by the http client if no other timeout is specified.
Variables ¶
This section is empty.
Functions ¶
func ValidateHosts ¶
ValidateHosts provides a default domain validation function that compares the host components of the provided URLs.
Types ¶
type Config ¶
type Config struct {
	MaxConcurrency  int
	MaxPendingURLS  int
	CrawlTimeout    time.Duration
	KeepAlive       time.Duration
	Timeout         time.Duration
	Client          *http.Client
	Logger          *zap.Logger
	DomainValidator DomainValidator
}
Config is a struct of crawler configuration options.
type DomainCrawler ¶
type DomainCrawler struct {
// contains filtered or unexported fields
}
DomainCrawler contains the state of a domain web crawler. The domain crawler exposes a Crawl method which produces a site map.
func NewDomainCrawler ¶
func NewDomainCrawler(root *url.URL, config *Config) (*DomainCrawler, error)
NewDomainCrawler creates a new DomainCrawler from the root URL and the given configuration.
func (*DomainCrawler) Crawl ¶
func (crawler *DomainCrawler) Crawl() (*SiteMap, error)
Crawl reads all links in the domain with the specified concurrency and returns a site map. Note that Crawl is not thread safe and each caller must create a separate DomainCrawler.
type DomainValidator ¶
A DomainValidator provides a Validate function for comparing two URLs for same-domain inclusion. This allows for custom behavior such as checking the scheme (http vs. https) or performing DNS lookups.
type DomainValidatorFunc ¶
DomainValidatorFunc acts as an adapter for allowing the use of ordinary functions as domain validators.
type LinkReader ¶
type LinkReader struct {
// contains filtered or unexported fields
}
LinkReader is an iterative structure that allows for reading all href tags in a given URL. The link reader makes the HTTP request to the specified URL and allows for reading through all links in the returned page. When there are no more links in the page, Read returns io.EOF. The consumer is responsible for closing the LinkReader when done to ensure any outstanding client HTTP request is cleaned up.
func NewLinkReader ¶
func NewLinkReader(pageURL *url.URL, client *http.Client) *LinkReader
NewLinkReader returns a LinkReader for the specified URL, fetching the content with the specified client.
func (*LinkReader) Close ¶
func (u *LinkReader) Close() error
Close cleans up any remaining client response. If all links are read from the link reader the body will be automatically closed, however if only the first N links are required, the body must be closed by the caller.
func (*LinkReader) Read ¶
func (u *LinkReader) Read() (string, error)
Read returns the next href in the HTML document.
func (*LinkReader) URL ¶
func (u *LinkReader) URL() string
URL returns the read-only URL string that was used to make the client request.
type Option ¶
type Option interface {
// contains filtered or unexported methods
}
Option configures an optional crawler setting.
func SetClient ¶
SetClient overrides the default client config. Note that if the client is set, the KeepAlive and Timeout options have no effect; the keep-alive and timeout settings of the provided client take precedence.
func SetCrawlTimeout ¶
SetCrawlTimeout sets the maximum time spent crawling URLs. When the timeout is zero or negative, no timeout is applied and the caller will wait for completion. If the timeout fires, the caller will receive the partial site map.
func SetDomainValidator ¶
func SetDomainValidator(validator DomainValidator) Option
SetDomainValidator overrides the default domain validator. The default validator is configured to compare the host component of the URLs only, not the scheme or any DNS lookups.
func SetKeepAlive ¶
SetKeepAlive sets the http client connection keep alive timeout when the default http client is used.
func SetLogger ¶
SetLogger overrides the default logger. The default logger is configured to write warning and error logs to stderr.
func SetMaxConcurrency ¶
SetMaxConcurrency sets the number of goroutines that will be used. This is also used to configure the default http client with enough open connections to support this number of goroutines.
func SetMaxPendingURLS ¶
SetMaxPendingURLS sets the maximum number of URLs that can persist in the queue for crawling. This will set the size of the channel of URLs being processed by the goroutines. This helps prevent cases where the number of URLs runs away indefinitely due to dynamic urls in page links.
func SetTimeout ¶
SetTimeout sets the http client request timeout when the default http client is used.
type SiteMap ¶
type SiteMap struct {
// contains filtered or unexported fields
}
SiteMap contains the state of a site map.
func CrawlDomain ¶
CrawlDomain crawls a domain provided as a string URL. It wraps a call to CrawlDomainWithURL.
func CrawlDomainWithURL ¶
CrawlDomainWithURL crawls a domain provided as a URL and returns the resulting sitemap.
func NewSiteMap ¶
func NewSiteMap(url *url.URL, validator DomainValidator) *SiteMap
NewSiteMap initializes a new SiteMap anchored at the specified URL, using the given validator to determine whether links belong to the same domain.