surfer

package
v0.0.0-...-2d91a95
Warning

This package is not in the latest version of its module.

Published: Sep 7, 2017 License: Apache-2.0 Imports: 25 Imported by: 0

Documentation

Overview

Package surfer is a high-concurrency web downloader written in Go. It supports the GET, POST, and HEAD methods over the http and https protocols, and offers two modes: a fixed UserAgent with cookies saved automatically, or a large pool of random UserAgents with cookies disabled. It closely simulates browser behavior, enabling simulated login and similar functions.

Index

Constants

const (
	SurfID             = 0               // Surf downloader identifier
	PhomtomJsID        = 1               // PhantomJS downloader identifier
	SplashID           = 2               // Splash downloader identifier
	DefaultMethod      = "GET"           // Default request method
	DefaultDialTimeout = 2 * time.Minute // Default timeout for dialing the server
	DefaultConnTimeout = 2 * time.Minute // Default download timeout
	DefaultTryTimes    = 3               // Default maximum number of download attempts
	DefaultRetryPause  = 2 * time.Second // Default pause before retrying a download
)

Variables

This section is empty.

Functions

func AutoToUTF8

func AutoToUTF8(resp *http.Response) error

AutoToUTF8 attempts to transcode the response to UTF-8 automatically when downloading with the surf kernel. With the phantomjs kernel no transcoding is performed, because the output is already UTF-8.

func BodyBytes

func BodyBytes(resp *http.Response) ([]byte, error)

BodyBytes reads the full response body.

func DestroyJsFiles

func DestroyJsFiles()

DestroyJsFiles destroys the temporary phantomjs js files.

func DestroyLuaScriptFiles

func DestroyLuaScriptFiles()

DestroyLuaScriptFiles destroys the temporary Lua script files.

func Download

func Download(req Request) (resp *http.Response, err error)

Download downloads the HTML from the target URL.

func GetWDPath

func GetWDPath() string

GetWDPath gets the work directory path.

func IsDirExists

func IsDirExists(path string) bool

IsDirExists reports whether path is a directory.

func IsFileExists

func IsFileExists(path string) bool

IsFileExists reports whether path is a file.

func URLEncode

func URLEncode(urlStr string) (*url.URL, error)

URLEncode returns the encoded url.URL pointer and any parse error.
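One plausible way to produce an encoded `*url.URL` with the standard library is to parse the string and then re-encode its query. The sketch below is an assumption about the general technique, not the package's actual implementation; `encodeURL` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeURL is a hypothetical helper: it parses raw and normalizes the
// query string by decoding and re-encoding it. Note that Values.Encode
// sorts parameters by key.
func encodeURL(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	u.RawQuery = u.Query().Encode()
	return u, nil
}

func main() {
	u, err := encodeURL("http://example.com/search?q=go%20lang&b=1")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(u.String())
}
```

Checking the error before touching the returned URL matters here, because `url.Parse` returns a nil pointer on failure.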

func WalkDir

func WalkDir(targpath string, suffixes ...string) (dirlist []string)

WalkDir traverses a directory; suffixes can be specified.

Types

type Body

type Body struct {
	io.ReadCloser
	io.Reader
}

Body wraps Response.Body.

func (*Body) Read

func (b *Body) Read(p []byte) (int, error)

type DefaultRequest

type DefaultRequest struct {
	// url (required)
	URL string
	// GET POST POST-M HEAD (The default is GET)
	Method string
	// http header
	Header http.Header
	// Whether to use cookies, set in Spider's EnableCookie
	EnableCookie bool
	// POST values
	PostData string
	// dial tcp: i/o timeout
	DialTimeout time.Duration
	// WSARecv tcp: i/o timeout
	ConnTimeout time.Duration
	// maximum number of download attempts
	TryTimes int
	// how long to pause before a retry
	RetryPause time.Duration
	// maximum number of redirects
	// when RedirectTimes equals 0, redirects are unlimited
	// when RedirectTimes is less than 0, no redirects are followed
	RedirectTimes int
	// the download ProxyHost
	Proxy string

	// Specify the downloader ID
	// 0 Surf downloader: high concurrency, full range of control functions
	// 1 PhantomJS downloader: strong anti-blocking features, slow, low concurrency
	DownloaderID int
	// contains filtered or unexported fields
}

DefaultRequest is the default implementation of Request.

func (*DefaultRequest) GetConnTimeout

func (defaultRequest *DefaultRequest) GetConnTimeout() time.Duration

GetConnTimeout returns the connection timeout (WSARecv tcp: i/o timeout).

func (*DefaultRequest) GetDialTimeout

func (defaultRequest *DefaultRequest) GetDialTimeout() time.Duration

GetDialTimeout returns the dial timeout (dial tcp: i/o timeout).

func (*DefaultRequest) GetDownloaderID

func (defaultRequest *DefaultRequest) GetDownloaderID() int

GetDownloaderID selects Surf or PhantomJS.

func (*DefaultRequest) GetEnableCookie

func (defaultRequest *DefaultRequest) GetEnableCookie() bool

GetEnableCookie reports whether HTTP cookies are enabled.

func (*DefaultRequest) GetHeader

func (defaultRequest *DefaultRequest) GetHeader() http.Header

GetHeader returns the HTTP header.

func (*DefaultRequest) GetMethod

func (defaultRequest *DefaultRequest) GetMethod() string

GetMethod returns the request method: GET, POST, POST-M, or HEAD.

func (*DefaultRequest) GetPostData

func (defaultRequest *DefaultRequest) GetPostData() string

GetPostData returns the POST values.

func (*DefaultRequest) GetProxy

func (defaultRequest *DefaultRequest) GetProxy() string

GetProxy returns the download ProxyHost.

func (*DefaultRequest) GetRedirectTimes

func (defaultRequest *DefaultRequest) GetRedirectTimes() int

GetRedirectTimes returns the maximum number of redirects.

func (*DefaultRequest) GetRetryPause

func (defaultRequest *DefaultRequest) GetRetryPause() time.Duration

GetRetryPause returns the pause time between retries.

func (*DefaultRequest) GetTryTimes

func (defaultRequest *DefaultRequest) GetTryTimes() int

GetTryTimes returns the maximum number of download attempts.

func (*DefaultRequest) GetURL

func (defaultRequest *DefaultRequest) GetURL() string

GetURL is a func ...

type Param

type Param struct {
	// contains filtered or unexported fields
}

func NewParam

func NewParam(req Request) (param *Param, err error)

NewParam creates a new Param.

type Phantom

type Phantom struct {
	PhantomjsFile string // Phantomjs full file name
	TempJsDir     string // Temporary js storage directory
	// contains filtered or unexported fields
}

Phantom is a downloader implementation based on Phantomjs, serving as a supplement to surfer. It is much slower than surfer, but because it simulates a browser, it is better at getting past blocking. It supports UserAgent / TryTimes / RetryPause / custom js.

func (*Phantom) DestroyJsFiles

func (phantom *Phantom) DestroyJsFiles()

DestroyJsFiles destroys the temporary js files.

func (*Phantom) Download

func (phantom *Phantom) Download(req Request) (resp *http.Response, err error)

Download implements the surfer downloader interface.

type Request

type Request interface {
	// url
	GetURL() string
	// GET POST POST-M HEAD
	GetMethod() string
	// POST values
	GetPostData() string
	// http header
	GetHeader() http.Header
	// enable http cookies
	GetEnableCookie() bool
	// dial tcp: i/o timeout
	GetDialTimeout() time.Duration
	// WSARecv tcp: i/o timeout
	GetConnTimeout() time.Duration
	// maximum number of download attempts
	GetTryTimes() int
	// the pause time of retry
	GetRetryPause() time.Duration
	// the download ProxyHost
	GetProxy() string
	// max redirect times
	GetRedirectTimes() int
	// select Surf or PhantomJS
	GetDownloaderID() int
}

type Response

type Response struct {
	Cookies []string
	Body    string
}


type Splash

type Splash struct {
	SplashServer     string // Splash Server host and port
	TempLuaScriptDir string // Temporary lua script storage directory
	// contains filtered or unexported fields
}

Splash is a struct that represents the splash API.

func (*Splash) DestroyLuaScriptFiles

func (splash *Splash) DestroyLuaScriptFiles()

DestroyLuaScriptFiles destroys the temporary Lua script files.

func (*Splash) Download

func (splash *Splash) Download(req Request) (resp *http.Response, err error)

Download implements the surfer downloader interface.

type Surf

type Surf struct {
	// contains filtered or unexported fields
}

Surf is the default Download implementation.

func (*Surf) Download

func (surf *Surf) Download(req Request) (resp *http.Response, err error)

type Surfer

type Surfer interface {
	// GET @param url string, header http.Header, cookies []*http.Cookie
	// HEAD @param url string, header http.Header, cookies []*http.Cookie
	// POST PostForm @param url, referer string, values url.Values, header http.Header, cookies []*http.Cookie
	// POST-M PostMultipart @param url, referer string, values url.Values, header http.Header, cookies []*http.Cookie
	Download(Request) (resp *http.Response, err error)
}

Surfer is a downloader that represents the core of an HTTP web browser for crawlers.

func New

func New() Surfer

func NewPhantom

func NewPhantom(phantomjsFile, tempJsDir string) Surfer

NewPhantom creates a phantomjs downloader.

func NewSplash

func NewSplash(splashServer, tempLuaScriptDir string) Surfer

NewSplash creates a downloader that uses the splash API.

Directories

Path Synopsis
Package agent generates user agents strings for well known browsers and for custom browsers.
