fetch

v0.1.4 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 28, 2023 License: AGPL-3.0 Imports: 27 Imported by: 0

README

Fetch

Fetch is the cloudcat.Fetch implementation for fetching resources from the network.
It supports:

  • TLS fingerprinting resistance
  • HTTP/2 fingerprinting resistance

License

cloudcat is distributed under the AGPL-3.0 license.

Documentation

Overview

Package fetch fetches HTTP resources.


Constants

const (
	// DefaultMaxBodySize is the default maximum response body size in bytes (1 GiB).
	DefaultMaxBodySize int64 = 1024 * 1024 * 1024
	// DefaultRetryTimes is the default number of retries.
	DefaultRetryTimes = 3
	// DefaultTimeout is the default request timeout.
	DefaultTimeout = time.Minute
)

Variables

var (
	// DefaultRetryHTTPCodes lists the HTTP status codes that trigger a retry.
	DefaultRetryHTTPCodes = []int{http.StatusInternalServerError, http.StatusBadGateway, http.StatusServiceUnavailable,
		http.StatusGatewayTimeout, http.StatusRequestTimeout}
	// DefaultHeaders are the default HTTP request headers.
	DefaultHeaders = http.Header{
		"Accept":          {"*/*"},
		"Accept-Encoding": {"gzip, deflate, br"},
		"Accept-Language": {"en-US,en;"},
		"User-Agent":      {"cloudcat"},
	}
)
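
The retry defaults above suggest a status-code check combined with a bounded retry loop. The following is only a minimal sketch of that idea, not the package's implementation: `retryCodes`, `shouldRetry`, and `doWithRetry` are hypothetical names, and the assumption is that a request is attempted once and then retried at most `retryTimes` more times.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryCodes mirrors the DefaultRetryHTTPCodes list shown above.
var retryCodes = []int{
	http.StatusInternalServerError, http.StatusBadGateway, http.StatusServiceUnavailable,
	http.StatusGatewayTimeout, http.StatusRequestTimeout,
}

// shouldRetry is a hypothetical helper (not part of the package):
// it reports whether status appears in codes.
func shouldRetry(status int, codes []int) bool {
	for _, c := range codes {
		if c == status {
			return true
		}
	}
	return false
}

// doWithRetry is a hypothetical helper: it calls do once, then up to
// retryTimes more times while the returned status is retryable.
// It returns the last status and the number of calls made.
func doWithRetry(do func() int, retryTimes int) (status, calls int) {
	for calls <= retryTimes {
		status = do()
		calls++
		if !shouldRetry(status, retryCodes) {
			break
		}
	}
	return status, calls
}

func main() {
	// Simulated upstream: fails twice with 503, then succeeds.
	responses := []int{503, 503, 200}
	i := 0
	status, calls := doWithRetry(func() int {
		s := responses[i]
		i++
		return s
	}, 3)
	fmt.Println(status, calls)
}
```

A real client would typically also wait between attempts; the sketch omits backoff to keep the control flow visible.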
var ErrNoDateHeader = errors.New("no Date header")

ErrNoDateHeader indicates that the HTTP headers contained no Date header.

Functions

func CachedResponse

func CachedResponse(c cloudcat.Cache, req *http.Request) (resp *http.Response, err error)

CachedResponse returns the cached http.Response for req if present, and nil otherwise.

func Date

func Date(respHeaders http.Header) (date time.Time, err error)

Date parses and returns the value of the Date header.
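
For illustration, a function with the same shape as Date can be written with the standard library alone. This is a sketch under the assumption that a missing header maps to a sentinel error and that parsing follows `http.ParseTime`; `parseDate` and `errNoDate` are hypothetical stand-ins, not the package's own code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errNoDate is a hypothetical stand-in for ErrNoDateHeader.
var errNoDate = errors.New("no Date header")

// parseDate is a hypothetical stand-in for Date: it returns the parsed
// Date header, or errNoDate when the header is absent.
func parseDate(h http.Header) (time.Time, error) {
	v := h.Get("Date")
	if v == "" {
		return time.Time{}, errNoDate
	}
	// http.ParseTime accepts the three time formats allowed by HTTP/1.1.
	return http.ParseTime(v)
}

func main() {
	h := http.Header{}
	h.Set("Date", "Mon, 02 Jan 2006 15:04:05 GMT")

	date, err := parseDate(h)
	fmt.Println(date.UTC(), err)

	_, err = parseDate(http.Header{})
	fmt.Println(errors.Is(err, errNoDate))
}
```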

func DecodeResponse added in v0.1.3

func DecodeResponse(res *http.Response) (*http.Response, error)

DecodeResponse decodes the response body according to the Content-Encoding header (gzip, deflate, br).
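
To make the idea concrete, here is a minimal sketch that handles only the gzip case with the standard library. It is not the package's code: `decodeGzip`, `gzipBody`, and `gzipRoundTrip` are hypothetical names, and deflate and br are left out (br would need a third-party decoder).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
)

// gzipBody reads decompressed data and closes the original body on Close.
type gzipBody struct {
	zr   *gzip.Reader
	orig io.Closer
}

func (b *gzipBody) Read(p []byte) (int, error) { return b.zr.Read(p) }

func (b *gzipBody) Close() error {
	b.zr.Close()
	return b.orig.Close()
}

// decodeGzip is a hypothetical, gzip-only stand-in for DecodeResponse.
// Responses with any other Content-Encoding are returned unchanged.
func decodeGzip(res *http.Response) (*http.Response, error) {
	if res.Header.Get("Content-Encoding") != "gzip" {
		return res, nil
	}
	zr, err := gzip.NewReader(res.Body)
	if err != nil {
		return nil, err
	}
	res.Body = &gzipBody{zr: zr, orig: res.Body}
	// The decoded length is unknown, so drop the stale length information.
	res.Header.Del("Content-Encoding")
	res.Header.Del("Content-Length")
	res.ContentLength = -1
	res.Uncompressed = true
	return res, nil
}

// gzipRoundTrip compresses text, wraps it in a response, and decodes it again.
func gzipRoundTrip(text string) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(text)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	res := &http.Response{
		Header: http.Header{"Content-Encoding": {"gzip"}},
		Body:   io.NopCloser(&buf),
	}
	res, err := decodeGzip(res)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	data, err := io.ReadAll(res.Body)
	return string(data), err
}

func main() {
	fmt.Println(gzipRoundTrip("hello, cloudcat"))
}
```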

func DefaultRoundTripper

func DefaultRoundTripper() http.RoundTripper

DefaultRoundTripper returns the default RoundTripper used by fetch.

func DefaultTemplateFuncMap added in v0.1.3

func DefaultTemplateFuncMap(cache cloudcat.Cache) template.FuncMap

DefaultTemplateFuncMap returns the default template function map.

func DoByte

func DoByte(fetch cloudcat.Fetch, req *http.Request) ([]byte, error)

DoByte performs the request and reads the response body.
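
A helper of this shape could look roughly like the sketch below. It rests on assumptions not confirmed by this page: that the fetcher exposes a `Do(*http.Request) (*http.Response, error)` method (modelled here by the hypothetical `doer` interface), and that the body read is capped by a size limit in the spirit of DefaultMaxBodySize. `doBytes` and `staticDoer` are illustrative names only.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// doer is a hypothetical interface standing in for cloudcat.Fetch;
// the real interface may differ.
type doer interface {
	Do(req *http.Request) (*http.Response, error)
}

// staticDoer is a test double that always answers with a fixed body.
type staticDoer struct{ body string }

func (d staticDoer) Do(*http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader(d.body)),
	}, nil
}

// doBytes performs the request and reads at most maxSize bytes of the body.
func doBytes(f doer, req *http.Request, maxSize int64) ([]byte, error) {
	res, err := f.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return io.ReadAll(io.LimitReader(res.Body, maxSize))
}

func main() {
	// staticDoer ignores the request, so nil is acceptable in this sketch.
	data, err := doBytes(staticDoer{body: "hello world"}, nil, 5)
	fmt.Println(string(data), err)
}
```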

func DoString

func DoString(fetch cloudcat.Fetch, req *http.Request) (string, error)

DoString performs the request and reads the response body as a string.

func NewFetch added in v0.1.1

func NewFetch(opt Options) cloudcat.Fetch

NewFetch returns a new cloudcat.Fetch instance.

func NewRequest

func NewRequest(method, u string, body any, headers map[string]string) (*http.Request, error)

NewRequest returns a new *http.Request given a method, URL, optional body, and optional headers.

func NewTemplateRequest added in v0.1.3

func NewTemplateRequest(funcs template.FuncMap, tpl string, arg any) (*http.Request, error)

NewTemplateRequest returns a new *http.Request built from an HTTP template and its argument.

func ProxyFromRequest

func ProxyFromRequest(req *http.Request) (*url.URL, error)

ProxyFromRequest returns the proxy URL stored in the request's context.

func WithRoundRobinProxy

func WithRoundRobinProxy(ctx context.Context, proxy ...string) context.Context

WithRoundRobinProxy returns a copy of the parent context with the given proxies associated with it.
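
The pairing of WithRoundRobinProxy and ProxyFromRequest suggests a rotation stored in the context. The sketch below shows one way such a mechanism can work, assuming an atomic counter behind a private context key; `withProxies`, `proxyFromContext`, `roundRobin`, and `pickN` are hypothetical names and the package's actual implementation may differ.

```go
package main

import (
	"context"
	"fmt"
	"net/url"
	"sync/atomic"
)

// proxyKey is a private context key type, so other packages cannot collide with it.
type proxyKey struct{}

// roundRobin hands out its URLs in order, wrapping around at the end.
type roundRobin struct {
	urls []*url.URL
	n    uint32
}

func (r *roundRobin) pick() *url.URL {
	i := atomic.AddUint32(&r.n, 1) - 1
	return r.urls[i%uint32(len(r.urls))]
}

// withProxies is a hypothetical analogue of WithRoundRobinProxy.
func withProxies(ctx context.Context, proxies ...string) (context.Context, error) {
	if len(proxies) == 0 {
		return ctx, nil
	}
	rr := &roundRobin{}
	for _, p := range proxies {
		u, err := url.Parse(p)
		if err != nil {
			return nil, err
		}
		rr.urls = append(rr.urls, u)
	}
	return context.WithValue(ctx, proxyKey{}, rr), nil
}

// proxyFromContext returns the next proxy, or nil if none is configured.
func proxyFromContext(ctx context.Context) *url.URL {
	rr, ok := ctx.Value(proxyKey{}).(*roundRobin)
	if !ok {
		return nil
	}
	return rr.pick()
}

// pickN returns the proxies chosen for n consecutive lookups.
func pickN(n int, proxies ...string) ([]string, error) {
	ctx, err := withProxies(context.Background(), proxies...)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, proxyFromContext(ctx).String())
	}
	return out, nil
}

func main() {
	fmt.Println(pickN(3, "http://a.test:8080", "http://b.test:8080"))
}
```

Because the counter lives in the value stored in the context, every request that shares the context advances the same rotation.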

Types

type CacheTransport

type CacheTransport struct {
	Policy Policy
	// The RoundTripper interface actually used to make requests
	// If nil, http.DefaultTransport is used
	Transport http.RoundTripper
	Cache     cloudcat.Cache
	// If true, responses returned from the cache will be given an extra header, X-From-Cache
	MarkCachedResponses bool
}

CacheTransport is an implementation of http.RoundTripper that returns values from a cache where possible (avoiding a network request) and additionally adds validators (ETag/If-Modified-Since) to repeated requests, allowing servers to respond with 304 Not Modified.

func NewTransport

func NewTransport(c cloudcat.Cache) *CacheTransport

NewTransport returns a new CacheTransport with the provided Cache implementation and MarkCachedResponses set to true.

func (*CacheTransport) RoundTrip

func (t *CacheTransport) RoundTrip(req *http.Request) (resp *http.Response, err error)

RoundTrip is a wrapper for caching requests. If there is a fresh Response already in cache, then it will be returned without connecting to the server.

func (*CacheTransport) RoundTripDummy

func (t *CacheTransport) RoundTripDummy(req *http.Request) (resp *http.Response, err error)

RoundTripDummy has no awareness of any HTTP Cache-Control directives. Every request and its corresponding response are cached. When the same request is seen again, the response is returned without transferring anything from the Internet.
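
The behaviour described for RoundTripDummy can be illustrated with a small, standard-library-only round tripper. This is a sketch, not the package's implementation: `dummyCache`, `countingTransport`, and `fetchTimes` are hypothetical names, the cache is an in-memory map rather than a cloudcat.Cache, and the cache key (method plus URL) is an assumption.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"strings"
	"sync"
)

// dummyCache stores every response and replays it for identical requests,
// ignoring all Cache-Control directives.
type dummyCache struct {
	next  http.RoundTripper
	mu    sync.Mutex
	store map[string][]byte
}

func (c *dummyCache) RoundTrip(req *http.Request) (*http.Response, error) {
	key := req.Method + " " + req.URL.String()
	c.mu.Lock()
	raw, ok := c.store[key]
	c.mu.Unlock()
	if ok {
		// Rebuild the response from its serialized wire form.
		return http.ReadResponse(bufio.NewReader(bytes.NewReader(raw)), req)
	}
	res, err := c.next.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	// DumpResponse leaves res.Body readable again after dumping it.
	raw, err = httputil.DumpResponse(res, true)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.store[key] = raw
	c.mu.Unlock()
	return res, nil
}

// countingTransport is a fake upstream that counts how often it is hit.
type countingTransport struct{ calls int }

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.calls++
	return &http.Response{
		Status: "200 OK", StatusCode: http.StatusOK,
		Proto: "HTTP/1.1", ProtoMajor: 1, ProtoMinor: 1,
		Header:        http.Header{"Content-Type": {"text/plain"}},
		Body:          io.NopCloser(strings.NewReader("hello")),
		ContentLength: 5,
		Request:       req,
	}, nil
}

// fetchTimes issues n identical requests through the cache and reports
// the number of upstream calls and the last body read.
func fetchTimes(n int) (int, string, error) {
	upstream := &countingTransport{}
	cache := &dummyCache{next: upstream, store: map[string][]byte{}}
	last := ""
	for i := 0; i < n; i++ {
		req, err := http.NewRequest(http.MethodGet, "http://example.test/page", nil)
		if err != nil {
			return 0, "", err
		}
		res, err := cache.RoundTrip(req)
		if err != nil {
			return 0, "", err
		}
		body, err := io.ReadAll(res.Body)
		res.Body.Close()
		if err != nil {
			return 0, "", err
		}
		last = string(body)
	}
	return upstream.calls, last, nil
}

func main() {
	fmt.Println(fetchTimes(3))
}
```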

func (*CacheTransport) RoundTripRFC2616

func (t *CacheTransport) RoundTripRFC2616(req *http.Request) (resp *http.Response, err error)

RoundTripRFC2616 provides an RFC 2616-compliant HTTP cache, i.e. one with HTTP Cache-Control awareness, aimed at production and used in continuous runs to avoid downloading unmodified data (to save bandwidth and speed up crawls).

If there is a stale Response, then any validators it contains will be set on the new request to give the server a chance to respond with NotModified. If this happens, then the cached Response will be returned.
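
The validator step described above can be sketched as follows, assuming the stale response's ETag and Last-Modified headers are copied into If-None-Match and If-Modified-Since on the outgoing request. `addValidators` and `conditionalHeaders` are hypothetical helpers for illustration, not functions of this package.

```go
package main

import (
	"fmt"
	"net/http"
)

// addValidators returns a copy of req carrying conditional headers derived
// from the headers of a stale cached response.
func addValidators(req *http.Request, cached http.Header) *http.Request {
	out := req.Clone(req.Context())
	if etag := cached.Get("ETag"); etag != "" {
		out.Header.Set("If-None-Match", etag)
	}
	if lm := cached.Get("Last-Modified"); lm != "" {
		out.Header.Set("If-Modified-Since", lm)
	}
	return out
}

// conditionalHeaders builds a request and a cached header set, then reports
// the conditional headers that addValidators produced.
func conditionalHeaders(etag, lastModified string) (string, string, error) {
	req, err := http.NewRequest(http.MethodGet, "http://example.test/page", nil)
	if err != nil {
		return "", "", err
	}
	cached := http.Header{}
	if etag != "" {
		cached.Set("ETag", etag)
	}
	if lastModified != "" {
		cached.Set("Last-Modified", lastModified)
	}
	out := addValidators(req, cached)
	return out.Header.Get("If-None-Match"), out.Header.Get("If-Modified-Since"), nil
}

func main() {
	fmt.Println(conditionalHeaders(`"abc123"`, "Mon, 02 Jan 2006 15:04:05 GMT"))
}
```

If the server answers such a request with 304 Not Modified, the cached body can be reused; any other status replaces the cached entry.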

type Options

type Options struct {
	CharsetAutoDetect bool              `yaml:"charset-auto-detect"`
	MaxBodySize       int64             `yaml:"max-body-size"`
	RetryTimes        int               `yaml:"retry-times"` // must be greater than or equal to 0
	RetryHTTPCodes    []int             `yaml:"retry-http-codes"`
	Timeout           time.Duration     `yaml:"timeout"`
	Headers           http.Header       `yaml:"headers"`
	RoundTripper      http.RoundTripper `yaml:"-"`
	Jar               http.CookieJar    `yaml:"-"`
}

Options holds the configuration for the fetch implementation instance.

type Policy

type Policy string

Policy selects the caching policy used by CacheTransport.

const (

	// Dummy policy is useful for testing spiders faster (without having to wait for downloads every time)
	// and for trying your spider offline, when an Internet connection is not available.
	// The goal is to be able to “replay” a spider run exactly as it ran before.
	Dummy Policy = "dummy"

	// RFC2616 provides an RFC 2616-compliant HTTP cache, i.e. one with HTTP Cache-Control awareness,
	// aimed at production and used in continuous runs to avoid downloading unmodified data
	// (to save bandwidth and speed up crawls).
	RFC2616 Policy = "rfc2616"

	// XFromCache is the header added to responses that are returned from the cache
	XFromCache = "X-From-Cache"
)
