
Scraper

A CLI utility that cold-scrapes a website by traversing its tree.

It helps QA engineers, developers and DevOps estimate the time spent in each phase of every request, alongside response status codes, content length and content type.

With Scraper's JSON or YAML report you can easily assess the consistency of website content.

Features

  • Scrape a whole website based on its tree
  • Scan links, images, scripts, styles and emails
  • Render pages in Chrome when the browser option is enabled
  • Limit scraping to a list of domains
  • Track DNS, Connect, TLSHandshake, FirstByte and Download times (see the sketch after this list)
  • Skip certificate verification for invalid SSL certificates
  • Report response code, content length and content type
  • Follow redirects when needed
  • Output to a JSON or YAML file

Build

git clone https://github.com/devopsext/scraper.git
cd scraper/
go build
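
Alternatively, a modern Go toolchain (1.16+) can install the binary without cloning, assuming the module builds cleanly at its latest tag:

go install github.com/devopsext/scraper@latest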

Example

./scraper website --url https://devopsext.com --max-depth 3 --redirects \
                  --domains devopsext.com,www.devopsext.com \
                  --links --scripts --styles --images \
                  --output json --file scraper.json \
                  --log-format stdout --log-level debug --log-template '{{.msg}}' 

Get response codes

cat scraper.json | jq .StatusCodes
{
  "200": 17
}
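
Because StatusCodes is a map of status code to request count, the same report can gate a CI step. For example, this illustrative jq expression counts how many non-200 status codes were seen:

jq '.StatusCodes | to_entries | map(select(.key != "200")) | length' scraper.json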

Usage

Website command

Usage:
  scraper website [flags]

Flags:
      --browser string      Website browser: chrome
      --domains strings     Website domains
      --emails              Website scrape emails
      --file string         Website output file
  -h, --help                help for website
      --images              Website scrape images
      --insecure            Website insecure skip verify
      --links               Website scrape links
      --max-body-size int   Website max body size (default 10485760)
      --max-depth int       Website max depth (default 1)
      --output string       Website output: json, yaml
      --redirects           Website follow redirects
      --scripts             Website scrape scripts
      --styles              Website scrape styles
      --url string          Website url
      --user-agent string   Website user agent

Global Flags:
      --log-format string     Log format: json, text, stdout (default "text")
      --log-level string      Log level: info, warn, error, debug, panic (default "info")
      --log-template string   Log template (default "{{.func}} [{{.line}}]: {{.msg}}")
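
For JavaScript-heavy pages, the flags above can be combined to render through Chrome before scraping. A sketch, assuming Chrome is installed on the host:

./scraper website --url https://devopsext.com --browser chrome \
                  --links --images --max-depth 2 \
                  --output yaml --file scraper.yaml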

Environment variables

For containerization purposes, every command switch has an environment variable analog.

  • SCRAPER_WEBSITE_URL
  • SCRAPER_WEBSITE_REDIRECTS
  • SCRAPER_WEBSITE_LINKS
  • SCRAPER_WEBSITE_IMAGES
  • SCRAPER_WEBSITE_SCRIPTS
  • SCRAPER_WEBSITE_STYLES
  • SCRAPER_WEBSITE_EMAILS
  • SCRAPER_WEBSITE_DOMAINS
  • SCRAPER_WEBSITE_OUTPUT
  • SCRAPER_WEBSITE_MAX_DEPTH
  • SCRAPER_WEBSITE_USER_AGENT
  • SCRAPER_WEBSITE_INSECURE
  • SCRAPER_WEBSITE_MAX_BODY_SIZE
  • SCRAPER_WEBSITE_BROWSER
  • SCRAPER_WEBSITE_FILE
  • SCRAPER_LOG_FORMAT
  • SCRAPER_LOG_LEVEL
  • SCRAPER_LOG_TEMPLATE
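
As an illustration, the earlier example could run in a container driven entirely by environment variables. The image name devopsext/scraper and the boolean value format are assumptions here:

docker run --rm \
  -e SCRAPER_WEBSITE_URL=https://devopsext.com \
  -e SCRAPER_WEBSITE_MAX_DEPTH=3 \
  -e SCRAPER_WEBSITE_REDIRECTS=true \
  -e SCRAPER_WEBSITE_LINKS=true \
  -e SCRAPER_WEBSITE_OUTPUT=json \
  -e SCRAPER_WEBSITE_FILE=/reports/scraper.json \
  -v "$(pwd)/reports:/reports" \
  devopsext/scraper website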
