Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var ErrNotMatchOrigin = errors.New("redirection does not match origin host")
ErrNotMatchOrigin indicates that the final redirect location is on a different host than the one we were originally looking up.
var ErrTooManyRedirects = errors.New("too many redirects (10+)")
ErrTooManyRedirects indicates that the requested origin redirected more than 10 times, which suggests a redirect loop.
Functions ¶
func VerifyHostname ¶
func VerifyHostname(c *tls.ConnectionState, host string) error
VerifyHostname verifies that the certificate in the tls.ConnectionState matches the given hostname.
Types ¶
type Crawler ¶
type Crawler struct {
	Log     *log.Logger    // output log
	Results []*FetchResult // scan results; should only be accessed once the scan is complete
	Pool    sempool.Pool   // thread pool for fetching main resources
	ResPool sempool.Pool   // thread pool for fetching assets
	Cnf     CrawlerConfig
	// contains filtered or unexported fields
}
Crawler is the top-level struct that wraps the entire threaded crawl process.
func (*Crawler) Crawl ¶
func (c *Crawler) Crawl()
Crawl drives the scraper at the highest level: it concurrently requests the needed resources for a list of domains, bypassing DNS lookups where necessary.
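Given the sempool.Pool fields and the Threads setting, the fan-out presumably looks like a bounded worker pattern. A minimal sketch of that idea, using a buffered channel as the semaphore (crawlAll and the string-based fetch callback are stand-ins, not this package's API):

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll fetches every domain concurrently, but a buffered channel caps
// how many fetches run at once -- similar in spirit to sempool.Pool.
func crawlAll(domains []string, threads int, fetch func(string) string) []string {
	results := make([]string, len(domains))
	sem := make(chan struct{}, threads) // at most `threads` fetches in flight
	var wg sync.WaitGroup
	for i, d := range domains {
		wg.Add(1)
		go func(i int, d string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = fetch(d)
		}(i, d)
	}
	wg.Wait()
	return results
}

func main() {
	out := crawlAll([]string{"a.com", "b.com"}, 2, func(d string) string {
		return "fetched " + d
	})
	fmt.Println(out)
}
```

Each goroutine writes to its own index of the results slice, so no mutex is needed around results; the WaitGroup is what makes it safe to read after Wait returns.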
func (*Crawler) Fetch ¶
func (c *Crawler) Fetch(res *FetchResult)
Fetch manages the fetching of the main resource and all of its child resources, producing a FetchResult struct that contains all of the crawl data needed.
type CrawlerConfig ¶
type CrawlerConfig struct {
	Domains       []*Domain     // list of domains to scan
	Assets        bool          // if we want to pull the assets for the page too
	NoRemote      bool          // ignore all resources that match a remote IP
	AllowInsecure bool          // if SSL errors should be ignored
	Delay         time.Duration // delay before each resource is crawled
	HTTPTimeout   time.Duration // http timeout before a request has become stale
	Threads       int           // total number of threads to run crawls in
}
CrawlerConfig is the configuration that controls Crawler behavior.
type CustomClient ¶
type CustomClient struct {
	URL       string
	Host      string
	ResultURL url.URL  // represents the url for the resulting request, without modifications
	OriginURL *url.URL // represents the url from the original request, without modifications
	// contains filtered or unexported fields
}
CustomClient is the state for our custom HTTP wrapper, which holds the data needed to rewrite the outgoing request during redirects.
type CustomResponse ¶
CustomResponse is the wrapped response from http.Client.Do() which also includes a timer of how long the request took, and a few other minor extras.
type Domain ¶
Domain represents a URL we need to fetch, including the details needed to fetch it, e.g. host, port, IP, scheme, and path.
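The exact fields of Domain are not shown, but the components it describes fall out of the standard net/url parser. A small sketch of that decomposition (splitURL and its default-port fallback are illustrative assumptions):

```go
package main

import (
	"fmt"
	"net/url"
)

// splitURL extracts the pieces a Domain would likely hold:
// scheme, host, port, and path from a raw URL string.
func splitURL(raw string) (scheme, host, port, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return
	}
	scheme, host, port, path = u.Scheme, u.Hostname(), u.Port(), u.Path
	if port == "" {
		// Fall back to the scheme's well-known port when none is given.
		if scheme == "https" {
			port = "443"
		} else {
			port = "80"
		}
	}
	return
}

func main() {
	scheme, host, port, path, _ := splitURL("https://example.com/about")
	fmt.Println(scheme, host, port, path)
}
```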
type FetchResult ¶
type FetchResult struct {
	Resource                        // Inherit the Resource struct
	Assets       []*Resource        `json:"-"` // Assets containing the needed resources for the given URL
	ResourceTime *utils.TimerResult // ResourceTime is the time it took to fetch all resources
	TotalTime    *utils.TimerResult // TotalTime is the time it took to crawl the site
}
FetchResult is the struct returned by Crawl() to represent the entire crawl process.
func (*FetchResult) String ¶
func (r *FetchResult) String() string
type HostnameError ¶
type HostnameError struct { Certificate *x509.Certificate Host string }
HostnameError appears when an invalid SSL certificate is supplied
func (HostnameError) Error ¶
func (h HostnameError) Error() string
type Resource ¶
type Resource struct {
	URL      string             // the url -- this should exist regardless of failure
	Request  *Domain            // Request represents what we were provided before the request
	Response Response           // Response represents the end result/data/status/etc.
	Error    error              // Error represents an error of a completely failed request
	Time     *utils.TimerResult // Time is the time it took to complete the request
}
Resource represents a single entity of many within a given crawl. These should only be of type css, js, jpg, png, etc (static resources).
type Response ¶
type Response struct {
	Remote        bool         // Remote is true if the origin is remote (unknown ip)
	Code          int          // Code is the numeric HTTP based status code
	URL           *url.URL     `json:"-"` // URL is the resulting static URL derived from the original result page
	Body          string       // Body is the response body. Used for primary requests, ignored for Resource structs.
	Headers       http.Header  // Headers is a map[string][]string of headers
	ContentLength int64        // ContentLength is the number of bytes in the body of the response
	TLS           *TLSResponse // TLS is the SSL/TLS session if the resource was loaded over SSL/TLS
}
Response represents the data for the HTTP-based request, closely matching http.Response.
type ResponseCert ¶
type TLSResponse ¶
type TLSResponse struct {
	HandshakeComplete bool
	PeerCertificates  []*ResponseCert
	VerifiedChains    [][]*ResponseCert
}
TLSResponse is the TLS/SSL handshake response and certificate information.