iterscraper

command module

Version: v0.0.0-...-cc38d19. Published: Aug 12, 2016. License: MIT. Imports: 10. Imported by: 0.

Repository

github.com/emperorearth/iterscraper

README

iterscraper

A basic package for scraping information from a website whose URLs contain an incrementing integer. Information is retrieved from HTML5 elements and written to a CSV file.

Thanks to Francesc for featuring this repo in episode #1 of Just For Func. Watch the video or review Francesc's pull request.

Flags

Flags are all optional, and are set with a single dash on the command line, e.g.

iterscraper \
-url            "http://foo.com/%d" \
-from           1                   \
-to             10                  \
-concurrency    10                  \
-output         foo.csv             \
-nameQuery      ".name"             \
-addressQuery   ".address"          \
-phoneQuery     ".phone"            \
-emailQuery     ".email"

For an explanation of the options, type iterscraper -help

General usage of iterscraper:

  -addressQuery string
        JQuery-style query for the address element (default ".address")
  -concurrency int
        How many scrapers to run in parallel. (More scrapers are faster, but more prone to rate limiting or bandwidth issues) (default 1)
  -emailQuery string
        JQuery-style query for the email element (default ".email")
  -from int
        The first ID that should be searched in the URL - inclusive.
  -nameQuery string
        JQuery-style query for the name element (default ".name")
  -output string
        Filename to export the CSV results (default "output.csv")
  -phoneQuery string
        JQuery-style query for the phone element (default ".phone")
  -to int
        The last ID that should be searched in the URL - exclusive (default 1)
  -url string
        The URL you wish to scrape, containing "%d" where the id should be substituted (default "http://example.com/v/%d")
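
The help text above has the shape produced by Go's standard flag package. As a minimal sketch (not the project's actual source), the same options could be declared as follows. The config type and the parseFlags function are hypothetical names, and the default of 0 for -from is an assumption based on the help text showing no default for it.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config is a hypothetical container for the parsed options.
type config struct {
	url, output                                     string
	from, to, concurrency                           int
	nameQuery, addressQuery, phoneQuery, emailQuery string
}

// parseFlags registers the flags listed in the help text and parses args.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("iterscraper", flag.ContinueOnError)
	fs.StringVar(&c.url, "url", "http://example.com/v/%d",
		`The URL you wish to scrape, containing "%d" where the id should be substituted`)
	// Assumption: -from defaults to 0, since the help text prints no default.
	fs.IntVar(&c.from, "from", 0, "The first ID that should be searched in the URL - inclusive.")
	fs.IntVar(&c.to, "to", 1, "The last ID that should be searched in the URL - exclusive")
	fs.IntVar(&c.concurrency, "concurrency", 1, "How many scrapers to run in parallel.")
	fs.StringVar(&c.output, "output", "output.csv", "Filename to export the CSV results")
	fs.StringVar(&c.nameQuery, "nameQuery", ".name", "JQuery-style query for the name element")
	fs.StringVar(&c.addressQuery, "addressQuery", ".address", "JQuery-style query for the address element")
	fs.StringVar(&c.phoneQuery, "phoneQuery", ".phone", "JQuery-style query for the phone element")
	fs.StringVar(&c.emailQuery, "emailQuery", ".email", "JQuery-style query for the email element")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	return c, nil
}

func main() {
	c, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("scraping %s for IDs [%d, %d) with %d worker(s) into %s\n",
		c.url, c.from, c.to, c.concurrency, c.output)
}
```

Because every flag has a default, running the sketch with no arguments is valid, which matches the statement that all flags are optional.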

URL Structure

Successive pages must look like:

http://example.com/foo/1/bar
http://example.com/foo/2/bar
http://example.com/foo/3/bar

iterscraper then accepts the URL as a Printf-style pattern, with %d marking where each ID is substituted:

http://example.com/foo/%d/bar
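
To illustrate the substitution, here is a small sketch that expands such a pattern with fmt.Sprintf over a half-open ID range, following the -from (inclusive) and -to (exclusive) semantics from the help text. The expandURLs helper is a hypothetical name used only for this example.

```go
package main

import "fmt"

// expandURLs substitutes every ID in the half-open range [from, to)
// into the %d verb of pattern.
func expandURLs(pattern string, from, to int) []string {
	var urls []string
	for id := from; id < to; id++ {
		urls = append(urls, fmt.Sprintf(pattern, id))
	}
	return urls
}

func main() {
	// IDs 1, 2 and 3 are generated; 4 is excluded.
	for _, u := range expandURLs("http://example.com/foo/%d/bar", 1, 4) {
		fmt.Println(u)
	}
}
```

Note that with -to exclusive, scraping pages 1 through 3 requires -from 1 -to 4.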

Installation

Building the source requires the Go programming language and the Glide package manager.

# Dependency is GoQuery
go get github.com/PuerkitoBio/goquery
# Get and build source
go get github.com/philipithomas/iterscraper
# If your $PATH is configured correctly, you can call it directly
iterscraper [flags]

Errata

This is purpose-built for some internal scraping. It's not meant to be the scraping tool for every use case, but you're welcome to modify it for your purposes.
On a 429 (too many requests) error, the app logs the error and continues, skipping that request.
The package follows up to 10 redirects.
On a 404 (not found) error, the app logs the miss and continues. The missing page is not exported to the CSV.

Documentation

Overview

iterscraper scrapes information from a website whose URLs contain an incrementing integer. Information is retrieved from HTML5 elements and written to a CSV file.

Source Files

main.go
