Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var ErrNotMatchOrigin = errors.New("redirection does not match origin host")
ErrNotMatchOrigin indicates that the final redirect location is on a different host than the one we were originally looking up.
var ErrTooManyRedirects = errors.New("too many redirects (10+)")
ErrTooManyRedirects indicates that the requested origin redirected more than 10 times, which suggests a redirect loop.
Functions ¶
func VerifyHostname ¶
func VerifyHostname(c *tls.ConnectionState, host string) error
VerifyHostname verifies that the certificate in the tls.ConnectionState matches the given hostname.
Types ¶
type Crawler ¶
type Crawler struct {
	Log     *log.Logger    // output log
	Results []*FetchResult // scan results; should only be accessed once the scan is complete
	Pool    sempool.Pool   // thread pool for fetching main resources
	ResPool sempool.Pool   // thread pool for fetching assets
	Cnf     CrawlerConfig
	// contains filtered or unexported fields
}
Crawler is the top-level struct that wraps the entire threaded crawl process.
func (*Crawler) Crawl ¶
func (c *Crawler) Crawl()
Crawl drives the scraper at the highest level: it concurrently requests the needed resources for a list of domains, bypassing DNS lookups where necessary.
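Given the sempool.Pool fields and the Threads setting, the fan-out presumably looks like a bounded worker pattern. A minimal sketch of that idea, using a buffered channel as the semaphore (crawlAll and the string-based fetch callback are stand-ins, not this package's API):

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll fetches every domain concurrently, but a buffered channel caps
// how many fetches run at once -- similar in spirit to sempool.Pool.
func crawlAll(domains []string, threads int, fetch func(string) string) []string {
	results := make([]string, len(domains))
	sem := make(chan struct{}, threads) // at most `threads` fetches in flight
	var wg sync.WaitGroup
	for i, d := range domains {
		wg.Add(1)
		go func(i int, d string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = fetch(d)
		}(i, d)
	}
	wg.Wait()
	return results
}

func main() {
	out := crawlAll([]string{"a.com", "b.com"}, 2, func(d string) string {
		return "fetched " + d
	})
	fmt.Println(out)
}
```

Each goroutine writes to its own index of the results slice, so no mutex is needed around results; the WaitGroup is what makes it safe to read after Wait returns.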
func (*Crawler) Fetch ¶
func (c *Crawler) Fetch(res *FetchResult)
Fetch manages the fetching of the main resource and all of its child resources, producing a FetchResult struct that contains all of the crawl data needed.
type CrawlerConfig ¶
type CrawlerConfig struct {
	Domains       []*Domain     // list of domains to scan
	Assets        bool          // if we want to pull the assets for the page too
	NoRemote      bool          // ignore all resources that match a remote IP
	AllowInsecure bool          // if SSL errors should be ignored
	Delay         time.Duration // delay before each resource is crawled
	HTTPTimeout   time.Duration // http timeout before a request has become stale
	Threads       int           // total number of threads to run crawls in
}
CrawlerConfig is the configuration that controls Crawler behavior.
type CustomClient ¶
type CustomClient struct {
	URL       string
	Host      string
	ResultURL url.URL  // represents the url for the resulting request, without modifications
	OriginURL *url.URL // represents the url from the original request, without modifications
	// contains filtered or unexported fields
}
CustomClient is the state for our custom HTTP wrapper, which holds the data needed to rewrite the outgoing request during redirects.
type CustomResponse ¶
CustomResponse is the wrapped response from http.Client.Do() which also includes a timer of how long the request took, and a few other minor extras.
type Domain ¶
Domain represents a URL we need to fetch, including the details needed to fetch it, e.g. host, port, IP, scheme, and path.
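The exact fields of Domain are not shown, but the components it describes fall out of the standard net/url parser. A small sketch of that decomposition (splitURL and its default-port fallback are illustrative assumptions):

```go
package main

import (
	"fmt"
	"net/url"
)

// splitURL extracts the pieces a Domain would likely hold:
// scheme, host, port, and path from a raw URL string.
func splitURL(raw string) (scheme, host, port, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return
	}
	scheme, host, port, path = u.Scheme, u.Hostname(), u.Port(), u.Path
	if port == "" {
		// Fall back to the scheme's well-known port when none is given.
		if scheme == "https" {
			port = "443"
		} else {
			port = "80"
		}
	}
	return
}

func main() {
	scheme, host, port, path, _ := splitURL("https://example.com/about")
	fmt.Println(scheme, host, port, path)
}
```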
type FetchResult ¶
type FetchResult struct {
	Resource                        // Inherit the Resource struct
	Assets       []*Resource        `json:"-"` // Assets containing the needed resources for the given URL
	ResourceTime *utils.TimerResult // ResourceTime is the time it took to fetch all resources
	TotalTime    *utils.TimerResult // TotalTime is the time it took to crawl the site
}
FetchResult is the struct returned by Crawl() to represent the entire crawl process.
func (*FetchResult) String ¶
func (r *FetchResult) String() string
type HostnameError ¶
type HostnameError struct { Certificate *x509.Certificate Host string }
HostnameError appears when an invalid SSL certificate is supplied
func (HostnameError) Error ¶
func (h HostnameError) Error() string
type Resource ¶
type Resource struct {
	URL      string             // the url -- this should exist regardless of failure
	Request  *Domain            // Request represents what we were provided before the request
	Response Response           // Response represents the end result/data/status/etc.
	Error    error              // Error represents an error of a completely failed request
	Time     *utils.TimerResult // Time is the time it took to complete the request
}
Resource represents a single entity of many within a given crawl. These should only be of type css, js, jpg, png, etc (static resources).
type Response ¶
type Response struct {
	Remote        bool         // Remote is true if the origin is remote (unknown ip)
	Code          int          // Code is the numeric HTTP based status code
	URL           *url.URL     `json:"-"` // URL is the resulting static URL derived from the original result page
	Body          string       // Body is the response body. Used for primary requests, ignored for Resource structs.
	Headers       http.Header  // Headers is a map[string][]string of headers
	ContentLength int64        // ContentLength is the number of bytes in the body of the response
	TLS           *TLSResponse // TLS is the SSL/TLS session if the resource was loaded over SSL/TLS
}
Response represents the data for the HTTP-based request, closely matching http.Response.
type ResponseCert ¶
type TLSResponse ¶
type TLSResponse struct {
	HandshakeComplete bool
	PeerCertificates  []*ResponseCert
	VerifiedChains    [][]*ResponseCert
}
TLSResponse is the TLS/SSL handshake response and certificate information.