geziyor

Published: Nov 7, 2023 License: MPL-2.0 Imports: 19 Imported by: 0

README

Geziyor (forked)

Geziyor is a web crawling and web scraping framework.

About this fork

  • Updated default chromedp actions to wait for network requests to finish
  • Added context.Context support for easy cancellation
  • Lower timeout values

Features

  • JS Rendering
  • 5,000+ requests/second
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, JSONL, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8
  • Proxy management (Single, Round-Robin, Custom)

See scraper Options for all custom settings.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

package main

import (
    "context"

    "github.com/PuerkitoBio/goquery"
    "github.com/toqueteos/geziyor"
    "github.com/toqueteos/geziyor/client"
    "github.com/toqueteos/geziyor/export"
)

func main() {
    ctx := context.TODO()
    geziyor.NewGeziyor(ctx, &geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start(ctx)
}

func quotesParse(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(ctx, r.JoinURL(href), quotesParse)
    }
}

See tests for more usage examples.

Installation

go get -u github.com/toqueteos/geziyor

If you want to make JS rendered requests, a local Chrome is required.

Alternatively, you can use any Chromium-based headless Docker image, such as the one from the Ferret project:

docker run --rm -d -p 9222:9222 montferret/chromium

Don't forget to set Options.BrowserEndpoint!

NOTE: macOS limits the maximum number of open file descriptors. If you want to make more than 256 concurrent requests, you need to raise that limit (for example with ulimit -n).

Making Normal Requests

Initial requests start with the URLs in the StartURLs []string field of Options. Geziyor makes concurrent requests to those URLs. After the response is read, ParseFunc func(ctx context.Context, g *Geziyor, r *client.Response) is called.

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartURLs: []string{"https://httpbingo.org/ip"},
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start(ctx)

If you want to manually create first requests, set StartRequestsFunc. StartURLs won't be used if you create requests manually. You can make requests using Geziyor methods:

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartRequestsFunc: func(ctx context.Context, g *geziyor.Geziyor) {
        g.Get(ctx, "https://httpbingo.org/anything", g.Opt.ParseFunc)
        g.Head(ctx, "https://httpbingo.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start(ctx)

Making JS Rendered Requests

JS rendered requests can be made using the GetRendered method.

By default, Geziyor tries to launch a local Chrome instance if one is available.

You can set the BrowserEndpoint option to connect to a different Chrome instance.

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartRequestsFunc: func(ctx context.Context, g *geziyor.Geziyor) {
        g.GetRendered(ctx, "https://httpbingo.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    BrowserEndpoint: "ws://localhost:9222",
}).Start(ctx)

Extracting Data

We can extract HTML elements using response.HTMLDoc, which is a Goquery Document.

HTMLDoc is available on Response if the response is HTML and can be parsed using Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc is nil.

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start(ctx)

Exporting Data

You can export data automatically using exporters: just send data to the Geziyor.Exports chan. Available exporters: JSON, JSONL, CSV, or custom.

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start(ctx)

Custom Requests - Passing Metadata To Callbacks

You can create custom requests with client.NewRequest and pass them to Geziyor.Do(request, callback). Metadata set on Request.Meta is available in the callback.

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartRequestsFunc: func(ctx context.Context, g *geziyor.Geziyor) {
        req, _ := client.NewRequest(ctx, "GET", "https://httpbingo.org/anything", nil)
        req.Meta["key"] = "value"
        g.Do(req, g.Opt.ParseFunc)
    },
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        fmt.Println("This is our data from request: ", r.Request.Meta["key"])
    },
}).Start(ctx)

Proxy - Use proxy per request

If you have a single proxy, set the HTTP_PROXY and HTTPS_PROXY environment variables and Geziyor will use it.

You can also pick a proxy per request by setting the ProxyFunc option to client.RoundRobinProxy, or to any custom proxy selection function you want. See client/proxy.go for how to implement such a function.

Proxies can be HTTP, HTTPS, or SOCKS5.

Note: a proxy with the http scheme is only used for http requests, not for https requests.

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartURLs:         []string{"http://httpbingo.org/anything"},
    ParseFunc:         parseFunc,
    ProxyFunc:         client.RoundRobinProxy("http://some-http-proxy.com", "https://some-https-proxy.com", "socks5://some-socks5-proxy.com"),
}).Start(ctx)

Benchmark

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: linux
goarch: amd64
pkg: github.com/toqueteos/geziyor
cpu: AMD Ryzen 7 7700X 8-Core Processor
BenchmarkRequests-16              362724             38632 ns/op
PASS
ok      github.com/toqueteos/geziyor    14.352s

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ErrorFunc

type ErrorFunc func(ctx context.Context, g *Geziyor, r *client.Request, err error)
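
ErrorFunc can be set in Options to intercept failed requests instead of relying on the default logging. A sketch, assuming client.Request exposes the underlying request URL; the endpoint and handling are illustrative:

```go
geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartURLs: []string{"https://httpbingo.org/status/500"},
    ParseFunc: parseFunc, // placeholder for your own callback
    ErrorFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Request, err error) {
        // Handle or record the failure however you like.
        log.Printf("request to %s failed: %v", r.URL, err)
    },
}).Start(ctx)
```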

type Geziyor

type Geziyor struct {
	Opt     *Options
	Client  *client.Client
	Exports chan interface{}
	// contains filtered or unexported fields
}

Geziyor is our main scraper type

func NewGeziyor

func NewGeziyor(ctx context.Context, opt *Options) *Geziyor

NewGeziyor creates a new Geziyor with default values. If options are provided, they override the defaults.

func (*Geziyor) Do

func (g *Geziyor) Do(req *client.Request, callback ParseFunc)

Do sends an HTTP request

func (*Geziyor) Get

func (g *Geziyor) Get(ctx context.Context, url string, callback ParseFunc)

Get issues a GET to the specified URL.

func (*Geziyor) GetRendered

func (g *Geziyor) GetRendered(ctx context.Context, url string, callback ParseFunc)

GetRendered issues a GET request using a headless browser: it opens a new Chrome instance (or connects to BrowserEndpoint), makes the request, waits for the HTML DOM to render, and closes the instance. Rendering is only supported for GET requests.

func (*Geziyor) Head

func (g *Geziyor) Head(ctx context.Context, url string, callback ParseFunc)

Head issues a HEAD to the specified URL.

func (*Geziyor) Post

func (g *Geziyor) Post(ctx context.Context, url string, body io.Reader, callback ParseFunc)

Post issues a POST to the specified URL.
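
Post can be called from StartRequestsFunc just like Get; the body is any io.Reader. A sketch with a form-encoded payload (the endpoint is a placeholder):

```go
geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartRequestsFunc: func(ctx context.Context, g *geziyor.Geziyor) {
        body := strings.NewReader("search=geziyor")
        g.Post(ctx, "https://httpbingo.org/post", body, g.Opt.ParseFunc)
    },
    ParseFunc: func(ctx context.Context, g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start(ctx)
```

If you need to set headers such as Content-Type, build the request yourself with client.NewRequest and send it with Geziyor.Do instead.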

func (*Geziyor) Start

func (g *Geziyor) Start(ctx context.Context)

Start starts scraping

func (*Geziyor) Stop added in v1.0.1

func (g *Geziyor) Stop()

Stop stops new requests from being issued and signals all currently ongoing requests to finish.
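
Since this fork threads context.Context through the API, a crawl can also be ended early by cancelling the context passed to NewGeziyor and Start. A sketch, assuming Start honors context cancellation (which is this fork's stated goal); parseFunc is a placeholder for your own callback:

```go
// End the crawl after 30 seconds even if pages remain.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

geziyor.NewGeziyor(ctx, &geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: parseFunc,
}).Start(ctx) // returns when the crawl finishes or the deadline expires
```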

type Options

type Options struct {
	// AllowedDomains lists the domains that requests are allowed to go to.
	// If empty, any domain is allowed.
	AllowedDomains []string

	// Chrome headless browser WS endpoint.
	// If you want to run your own Chrome browser runner, provide its endpoint in here
	// For example: ws://localhost:3000
	BrowserEndpoint string

	// Cache storage backends.
	// - Memory
	// - Disk
	// - LevelDB
	Cache cache.Cache

	// Policies for caching.
	// - Dummy policy (default)
	// - RFC2616 policy
	CachePolicy cache.Policy

	// Response charset detection for decoding to UTF-8
	CharsetDetectDisabled bool

	// Concurrent requests limit
	ConcurrentRequests int

	// Concurrent requests per domain limit. Uses request.URL.Host.
	// Subdomains count as different domains.
	ConcurrentRequestsPerDomain int

	// If set to true, cookies won't be sent.
	CookiesDisabled bool

	// ErrorFunc is the callback for errors.
	// If not defined, all errors are logged.
	ErrorFunc ErrorFunc

	// For exporting extracted data
	Exporters []export.Exporter

	// Disable logging by setting this true
	LogDisabled bool

	// Max body reading size in bytes. Default: 1GB
	MaxBodySize int64

	// Maximum number of redirects to follow. Default: 10
	MaxRedirect int

	// Scraper metrics exporting type. See metrics.Type
	MetricsType metrics.Type

	// ParseFunc is callback of StartURLs response.
	ParseFunc ParseFunc

	// If true, HTML parsing is disabled to improve performance.
	ParseHTMLDisabled bool

	// ProxyFunc sets the proxy for each request
	ProxyFunc func(*http.Request) (*url.URL, error)

	// Rendered requests pre actions. Setting this will override the existing default.
	// And you'll need to handle all rendered actions, like navigation, waiting, response etc.
	// If you need to make custom actions in addition to the defaults, use Request.Actions instead of this.
	PreActions []chromedp.Action

	// Request delays
	RequestDelay time.Duration

	// RequestDelayRandomize uses random interval between 0.5 * RequestDelay and 1.5 * RequestDelay
	RequestDelayRandomize bool

	// Called before requests made to manipulate requests
	RequestMiddlewares []middleware.RequestProcessor

	// Called after response received
	ResponseMiddlewares []middleware.ResponseProcessor

	// RequestsPerSecond limits the number of requests made per second. Default: no limit
	RequestsPerSecond float64

	// Which HTTP response codes to retry.
	// Other errors (DNS lookup issues, connections lost, etc) are always retried.
	// Default: []int{500, 502, 503, 504, 522, 524, 408}
	RetryHTTPCodes []int

	// Maximum number of times to retry, in addition to the first download.
	// Set -1 to disable retrying
	// Default: 2
	RetryTimes int

	// If true, disable robots.txt checks
	RobotsTxtDisabled bool

	// StartRequestsFunc called on scraper start
	StartRequestsFunc StartRequestsFunc

	// First requests will be made to these URLs, concurrently.
	StartURLs []string

	// Timeout is global request timeout
	Timeout time.Duration

	// Revisiting same URLs is disabled by default
	URLRevisitEnabled bool

	// User Agent.
	// Default: "Geziyor 1.0"
	UserAgent string
}

Options is the custom options type for Geziyor.
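
A sketch of a more fully configured Options value, using only fields documented above; all values are illustrative and parseFunc is a placeholder for your own callback:

```go
opts := &geziyor.Options{
    StartURLs:                   []string{"http://quotes.toscrape.com/"},
    ParseFunc:                   parseFunc,
    ConcurrentRequests:          32,
    ConcurrentRequestsPerDomain: 8,
    RequestDelay:                200 * time.Millisecond,
    RequestDelayRandomize:       true, // actual delay between 100ms and 300ms
    RetryTimes:                  3,
    RetryHTTPCodes:              []int{500, 502, 503, 504},
    Timeout:                     15 * time.Second,
    UserAgent:                   "my-crawler/0.1",
}
geziyor.NewGeziyor(ctx, opts).Start(ctx)
```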

type ParseFunc

type ParseFunc func(ctx context.Context, g *Geziyor, r *client.Response)

type StartRequestsFunc

type StartRequestsFunc func(ctx context.Context, g *Geziyor)

Directories

Path Synopsis
Package cache provides a http.RoundTripper implementation that works as a mostly RFC-compliant cache for http responses.
diskcache
Package diskcache provides an implementation of cache.Cache that uses the diskv package to supplement an in-memory map with persistent storage
leveldbcache
Package leveldbcache provides an implementation of cache.Cache that uses github.com/syndtr/goleveldb/leveldb
