crawl

package v1.0.59
Warning: This package is not in the latest version of its module.
Published: Jan 10, 2024 License: Apache-2.0 Imports: 42 Imported by: 0

Documentation

Index

Constants

const (
	// B represents a byte
	B = 1
	// KB represents a kilobyte
	KB = 1024 * B
	// MB represents a megabyte
	MB = 1024 * KB
	// GB represents a gigabyte
	GB = 1024 * MB
)
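
For scale, these units compose into byte sizes. A hypothetical one-line usage (the variable name is illustrative):

// Hypothetical usage of the size units above.
maxBufferSize := 2 * GB // 2147483648 bytes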

Variables

This section is empty.

Functions

func ClosingPipedTeeReader

func ClosingPipedTeeReader(r io.Reader, pw *io.PipeWriter) io.Reader

ClosingPipedTeeReader is like a classic io.TeeReader, but it explicitly takes an *io.PipeWriter and makes sure to close it.
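
A minimal usage sketch, assuming the returned reader mirrors every read into pw the way io.TeeReader does, and that pw is closed once the source is drained; the import path is a guess for illustration:

package main

import (
	"fmt"
	"io"
	"strings"

	// Assumed import path; adjust to the module's real path.
	"github.com/internetarchive/Zeno/internal/pkg/crawl"
)

func main() {
	pr, pw := io.Pipe()

	// Drain the pipe's read end concurrently, as with any io.Pipe.
	mirrored := make(chan []byte)
	go func() {
		b, _ := io.ReadAll(pr)
		mirrored <- b
	}()

	// Reads from tee are mirrored into pw; pw should be closed for us
	// once the source is exhausted, which unblocks io.ReadAll(pr).
	tee := crawl.ClosingPipedTeeReader(strings.NewReader("hello"), pw)

	direct, _ := io.ReadAll(tee)
	fmt.Printf("direct: %s, mirrored: %s\n", direct, <-mirrored)
}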

Types

type Crawl

type Crawl struct {
	*sync.Mutex
	StartTime        time.Time
	SeedList         []frontier.Item
	Paused           *utils.TAtomBool
	Finished         *utils.TAtomBool
	LiveStats        bool
	ElasticSearchURL string

	// Frontier
	Frontier *frontier.Frontier

	// Crawl settings
	WorkerPool                     sizedwaitgroup.SizedWaitGroup
	MaxConcurrentAssets            int
	Client                         *warc.CustomHTTPClient
	ClientProxied                  *warc.CustomHTTPClient
	Logger                         logrus.Logger
	DisabledHTMLTags               []string
	ExcludedHosts                  []string
	ExcludedStrings                []string
	UserAgent                      string
	Job                            string
	JobPath                        string
	MaxHops                        uint8
	MaxRetry                       int
	MaxRedirect                    int
	HTTPTimeout                    int
	MaxConcurrentRequestsPerDomain int
	RateLimitDelay                 int
	DisableAssetsCapture           bool
	CaptureAlternatePages          bool
	DomainsCrawl                   bool
	Headless                       bool
	Seencheck                      bool
	Workers                        int

	// Cookie-related settings
	CookieFile  string
	KeepCookies bool
	CookieJar   http.CookieJar

	// Proxy settings
	Proxy       string
	BypassProxy []string

	// API settings
	API               bool
	APIPort           string
	Prometheus        bool
	PrometheusMetrics *PrometheusMetrics

	// Real-time statistics
	URIsPerSecond *ratecounter.RateCounter
	ActiveWorkers *ratecounter.Counter
	CrawledSeeds  *ratecounter.Counter
	CrawledAssets *ratecounter.Counter

	// WARC settings
	WARCPrefix         string
	WARCOperator       string
	WARCWriter         chan *warc.RecordBatch
	WARCWriterFinish   chan bool
	WARCTempDir        string
	CDXDedupeServer    string
	WARCFullOnDisk     bool
	WARCPoolSize       int
	WARCDedupSize      int
	DisableLocalDedupe bool
	CertValidation     bool

	// Crawl HQ settings
	UseHQ             bool
	HQAddress         string
	HQProject         string
	HQKey             string
	HQSecret          string
	HQStrategy        string
	HQBatchSize       int
	HQContinuousPull  bool
	HQClient          *gocrawlhq.Client
	HQFinishedChannel chan *frontier.Item
	HQProducerChannel chan *frontier.Item
	HQChannelsWg      *sync.WaitGroup
}

Crawl defines the parameters of a crawl process.

func (*Crawl) Capture

func (c *Crawl) Capture(item *frontier.Item)

Capture captures the URL and extracts its outlinks.

func (*Crawl) HQConsumer

func (c *Crawl) HQConsumer()

func (*Crawl) HQFinisher

func (c *Crawl) HQFinisher()

func (*Crawl) HQProducer

func (c *Crawl) HQProducer()

func (*Crawl) HQSeencheckURL

func (c *Crawl) HQSeencheckURL(URL *url.URL) (bool, error)

func (*Crawl) HQSeencheckURLs

func (c *Crawl) HQSeencheckURLs(URLs []*url.URL) (seencheckedBatch []*url.URL, err error)

func (*Crawl) HQWebsocket

func (c *Crawl) HQWebsocket()

HQWebsocket connects to HQ's websocket and listens for messages. It also sends an "identify" message to HQ to let it know that Zeno is connected. This "identify" message is sent every second and contains the crawler's stats and details.
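
The pattern described might look roughly like the sketch below. This is not Zeno's implementation: the endpoint, the payload fields, and the use of gorilla/websocket are all assumptions made for illustration.

package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	// Placeholder endpoint; the real address comes from HQAddress.
	conn, _, err := websocket.DefaultDialer.Dial("ws://hq.example.org/ws", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Send an "identify" message once per second; the payload fields
	// here are invented for illustration.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		identify := map[string]interface{}{
			"type":    "identify",
			"project": "example-project",
			"crawled": 1234,
			"workers": 8,
		}
		if err := conn.WriteJSON(identify); err != nil {
			log.Fatal(err)
		}
	}
}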

func (*Crawl) Start

func (c *Crawl) Start() (err error)

Start fires up the crawling process.
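
A minimal sketch, assuming the remaining plumbing (Frontier, WorkerPool, channels, WARC writer, and so on) is initialized elsewhere; a struct literal this sparse would not run a real crawl, and in Zeno the Crawl is normally assembled from the command-line configuration:

// The field values below are placeholders.
c := &crawl.Crawl{
	UserAgent: "Mozilla/5.0 (compatible; Zeno)",
	Workers:   8,
	MaxHops:   2,
	MaxRetry:  3,
}

if err := c.Start(); err != nil {
	log.Fatalf("crawl failed: %v", err)
}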

func (*Crawl) Worker

func (c *Crawl) Worker()

Worker is the key component of a crawl. It is a background process dispatched when the crawl starts; it listens on a channel for new URLs to archive and eventually pushes newly discovered URLs back into the frontier.
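
A schematic sketch of that loop; the item type, the capture stand-in, and the channel wiring below are invented for illustration and stand in for frontier.Item and the frontier's real queues:

package main

import "fmt"

// item stands in for frontier.Item; the real type carries more state
// (hops, parent, type, and so on).
type item struct{ url string }

// capture stands in for (*Crawl).Capture: archive the URL and report
// any outlinks discovered on the page.
func capture(it item) []item {
	fmt.Println("archiving", it.url)
	return []item{{url: it.url + "about"}}
}

// worker mirrors the loop described above: pull items from a channel,
// archive them, and push newly discovered URLs back toward the frontier.
func worker(pull <-chan item, push chan<- item) {
	for it := range pull {
		for _, out := range capture(it) {
			push <- out
		}
	}
	close(push)
}

func main() {
	pull := make(chan item, 1)
	push := make(chan item, 16)

	pull <- item{url: "https://example.com/"}
	close(pull)

	worker(pull, push)
	for out := range push {
		fmt.Println("discovered:", out.url)
	}
}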

type PrometheusMetrics

type PrometheusMetrics struct {
	Prefix        string
	DownloadedURI prometheus.Counter
}

PrometheusMetrics defines all the metrics exposed by the Prometheus exporter.
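
A hedged sketch of how such a metric is typically wired with the Prometheus client library; the metric name, help text, and registration flow are assumptions rather than Zeno's actual setup, and the crawl import path is a guess:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	// Assumed import path; adjust to the module's real path.
	"github.com/internetarchive/Zeno/internal/pkg/crawl"
)

func main() {
	// Metric name and help text are placeholders for illustration.
	m := &crawl.PrometheusMetrics{
		Prefix: "zeno_",
		DownloadedURI: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "zeno_downloaded_uri_total",
			Help: "Total number of URIs downloaded.",
		}),
	}
	prometheus.MustRegister(m.DownloadedURI)

	// Each successful capture would increment the counter.
	m.DownloadedURI.Inc()

	// Expose the metrics endpoint, as a Prometheus exporter typically does.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}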

Directories

Path Synopsis
sitespecific
	vk
