Documentation ¶
Index ¶
- func DefaultLinkExtractor(c *Client, currLink string, resp []byte) []string
- func IsClientErrorResponse(resp *http.Response) bool
- func IsHtmlContent(resp *http.Response) bool
- func IsNoopResponse(resp *http.Response) bool
- func IsOkResponse(resp *http.Response) bool
- func IsServerErrorResponse(resp *http.Response) bool
- type Client
- type Config
- type IPInfo
- type LinkExtractor
- type NetworkInfo
- type PageInfo
- type ResponseMatcher
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DefaultLinkExtractor ¶
DefaultLinkExtractor looks for <a href="..."> tags and extracts the link if the host is not blacklisted. This function assumes that if the href value is a relative path, it is relative to the current URL.
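The relative-path assumption above can be reproduced with the standard library's net/url; `resolveLink` is a hypothetical helper sketching the resolution step, not a function exported by this package.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves an href against the page it was found on,
// mirroring DefaultLinkExtractor's assumption that a relative path
// is relative to the current URL.
func resolveLink(currLink, href string) (string, error) {
	base, err := url.Parse(currLink)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	out, _ := resolveLink("https://example.com/docs/index.html", "../about")
	fmt.Println(out) // https://example.com/about
}
```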
func IsClientErrorResponse ¶
IsClientErrorResponse matches all responses that return a 4xx status code.
func IsHtmlContent ¶
IsHtmlContent matches all responses that return a 2xx status code and have a Content-Type header containing "text/html".
func IsOkResponse ¶
IsOkResponse matches all responses that return a 200 status code.
func IsServerErrorResponse ¶
IsServerErrorResponse matches all responses that return a 5xx status code.
Types ¶
type Client ¶
type Client struct {
	MaxDepth        int
	NetMutex        sync.RWMutex
	PageMutex       sync.RWMutex
	HostBlacklist   map[string]struct{}
	VisitedNetInfo  map[string][]NetworkInfo
	VisitedPageInfo map[string]PageInfo
	// contains filtered or unexported fields
}
func New ¶
func New(ctx context.Context, config *Config, rm []ResponseMatcher, le LinkExtractor) *Client
New creates a new crawler client from the context (used for cancellation), the crawler config, a list of response matchers to filter out responses, and a link extractor.
Note that the ordering of the response matchers matters: the first matcher to return false causes the link to be skipped.
type Config ¶
type Config struct {
	BlacklistHosts map[string]struct{} // hosts to blacklist
	MaxDepth       int                 // max depth from seed
	MaxRetries     int                 // max retries for HTTP requests
	MaxRPS         float64             // max requests per second
	ProxyURL       *url.URL            // proxy URL, if any; useful to avoid IP bans
	SeedURLs       []string            // where to start crawling from
	Timeout        time.Duration       // timeout for HTTP requests
}
type LinkExtractor ¶
LinkExtractor is a function that takes the client (whose HostBlacklist holds the blacklisted hosts), the current link, and the response body, and returns a slice of links to crawl.
type NetworkInfo ¶
type NetworkInfo struct {
	RemoteIPInfo  []IPInfo `json:"remote_ip_info"`
	AvgResponseMs int64    `json:"avg_response_ms"`
	PathCount     int      `json:"path_count"`
	VisitedPaths  []string `json:"visited_paths"`

	// These values are not exported to JSON
	TotalResponseTimeMs int64               `json:"-"`
	VisitedPathSet      map[string]struct{} `json:"-"`
}
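The unexported-to-JSON fields suggest AvgResponseMs is derived from TotalResponseTimeMs and the visit count. The bookkeeping might look like the sketch below; `netInfo` and `recordVisit` are hypothetical, and this is an assumption about how the fields relate, not the package's implementation.

```go
package main

import "fmt"

// netInfo mirrors the bookkeeping fields of NetworkInfo.
type netInfo struct {
	AvgResponseMs       int64
	PathCount           int
	VisitedPaths        []string
	TotalResponseTimeMs int64
	VisitedPathSet      map[string]struct{}
}

// recordVisit (hypothetical) keeps AvgResponseMs in sync with
// TotalResponseTimeMs, and uses VisitedPathSet to deduplicate
// paths while preserving first-seen order in VisitedPaths.
func (n *netInfo) recordVisit(path string, responseMs int64) {
	if _, seen := n.VisitedPathSet[path]; !seen {
		n.VisitedPathSet[path] = struct{}{}
		n.VisitedPaths = append(n.VisitedPaths, path)
		n.PathCount++
	}
	n.TotalResponseTimeMs += responseMs
	n.AvgResponseMs = n.TotalResponseTimeMs / int64(n.PathCount)
}

func main() {
	n := &netInfo{VisitedPathSet: map[string]struct{}{}}
	n.recordVisit("/a", 100)
	n.recordVisit("/b", 300)
	fmt.Println(n.AvgResponseMs) // 200
}
```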
type ResponseMatcher ¶
ResponseMatcher is a function that takes an http.Response and returns a boolean indicating whether the contents of the URL should be processed (e.g., to extract links).