Documentation ¶
Index ¶
- Constants
- func ClosingPipedTeeReader(r io.Reader, pw *io.PipeWriter) io.Reader
- type Crawl
- func (c *Crawl) Capture(item *frontier.Item)
- func (c *Crawl) HQConsumer()
- func (c *Crawl) HQFinisher()
- func (c *Crawl) HQProducer()
- func (c *Crawl) HQSeencheckURL(URL *url.URL) (bool, error)
- func (c *Crawl) HQSeencheckURLs(URLs []*url.URL) (seencheckedBatch []*url.URL, err error)
- func (c *Crawl) HQWebsocket()
- func (c *Crawl) Start() (err error)
- func (c *Crawl) Worker()
- type PrometheusMetrics
Constants ¶
const (
	// B represents a Byte
	B = 1
	// KB represents a Kilobyte
	KB = 1024 * B
	// MB represents a Megabyte
	MB = 1024 * KB
	// GB represents a Gigabyte
	GB = 1024 * MB
)
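The constants form a simple geometric progression of binary size units. A minimal sketch of how they can be used (the constants are copied locally here so the snippet is self-contained; the 512 MiB figure is an arbitrary example, not a Zeno default):

```go
package main

import "fmt"

// Local copy of the package's byte-size constants, for illustration.
const (
	B  = 1
	KB = 1024 * B
	MB = 1024 * KB
	GB = 1024 * MB
)

func main() {
	// Express a hypothetical 512 MiB budget in bytes.
	fmt.Println(512 * MB) // 536870912
}
```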
Variables ¶
This section is empty.
Functions ¶
func ClosingPipedTeeReader ¶
func ClosingPipedTeeReader(r io.Reader, pw *io.PipeWriter) io.Reader
ClosingPipedTeeReader is like a classic io.TeeReader, but it explicitly takes an *io.PipeWriter and makes sure to close it.
Types ¶
type Crawl ¶
type Crawl struct {
	*sync.Mutex
	StartTime        time.Time
	SeedList         []frontier.Item
	Paused           *utils.TAtomBool
	Finished         *utils.TAtomBool
	LiveStats        bool
	ElasticSearchURL string

	// Frontier
	Frontier *frontier.Frontier

	// Crawl settings
	WorkerPool                     sizedwaitgroup.SizedWaitGroup
	MaxConcurrentAssets            int
	Client                         *warc.CustomHTTPClient
	ClientProxied                  *warc.CustomHTTPClient
	Logger                         logrus.Logger
	DisabledHTMLTags               []string
	ExcludedHosts                  []string
	ExcludedStrings                []string
	UserAgent                      string
	Job                            string
	JobPath                        string
	MaxHops                        uint8
	MaxRetry                       int
	MaxRedirect                    int
	HTTPTimeout                    int
	MaxConcurrentRequestsPerDomain int
	RateLimitDelay                 int
	DisableAssetsCapture           bool
	CaptureAlternatePages          bool
	DomainsCrawl                   bool
	Headless                       bool
	Seencheck                      bool
	Workers                        int

	// Cookie-related settings
	CookieFile  string
	KeepCookies bool
	CookieJar   http.CookieJar

	// proxy settings
	Proxy       string
	BypassProxy []string

	// API settings
	API               bool
	APIPort           string
	Prometheus        bool
	PrometheusMetrics *PrometheusMetrics

	// Real time statistics
	URIsPerSecond *ratecounter.RateCounter
	ActiveWorkers *ratecounter.Counter
	CrawledSeeds  *ratecounter.Counter
	CrawledAssets *ratecounter.Counter

	// WARC settings
	WARCPrefix         string
	WARCOperator       string
	WARCWriter         chan *warc.RecordBatch
	WARCWriterFinish   chan bool
	WARCTempDir        string
	CDXDedupeServer    string
	WARCFullOnDisk     bool
	WARCPoolSize       int
	WARCDedupSize      int
	DisableLocalDedupe bool
	CertValidation     bool

	// Crawl HQ settings
	UseHQ             bool
	HQAddress         string
	HQProject         string
	HQKey             string
	HQSecret          string
	HQStrategy        string
	HQBatchSize       int
	HQContinuousPull  bool
	HQClient          *gocrawlhq.Client
	HQFinishedChannel chan *frontier.Item
	HQProducerChannel chan *frontier.Item
	HQChannelsWg      *sync.WaitGroup
}
Crawl defines the parameters of a crawl process.
func (*Crawl) HQConsumer ¶
func (c *Crawl) HQConsumer()
func (*Crawl) HQFinisher ¶
func (c *Crawl) HQFinisher()
func (*Crawl) HQProducer ¶
func (c *Crawl) HQProducer()
func (*Crawl) HQSeencheckURL ¶
func (c *Crawl) HQSeencheckURL(URL *url.URL) (bool, error)
func (*Crawl) HQSeencheckURLs ¶
func (c *Crawl) HQSeencheckURLs(URLs []*url.URL) (seencheckedBatch []*url.URL, err error)
func (*Crawl) HQWebsocket ¶
func (c *Crawl) HQWebsocket()
This function connects to HQ's websocket and listens for messages. It also sends an "identify" message to HQ to let it know that Zeno is connected. This "identify" message is sent every second and contains the crawler's stats and details.
type PrometheusMetrics ¶
type PrometheusMetrics struct {
	Prefix        string
	DownloadedURI prometheus.Counter
}
PrometheusMetrics defines all the metrics exposed by the Prometheus exporter.
Source Files ¶