gocrawsan

v0.1.1
Published: May 7, 2017 License: MIT Imports: 17 Imported by: 0

Gocrawsan

A simple web crawler written in Go

Install

For Linux or macOS users:

$ curl -sL http://install.freedom-man.com/goc.sh | bash

To install zsh completion as well, add the --zsh-completion option:

$ curl -sL http://install.freedom-man.com/goc.sh | bash -s -- --zsh-completion

Or, to get the latest version, run the following command:

$ go get github.com/tzmfreedom/gocrawsan

Usage

NAME:
   gocrawsan

USAGE:
   goc [global options] command [command options] [arguments...]

VERSION:
   0.1.0

COMMANDS:
     help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --useragent value, -U value
   --config value, -C value
   --no-redirect
   --selector value, -S value
   --extract-type value, -P value
   --attribute value, -A value
   --no-error
   --timeout value              (default: 10)
   --depth value, -D value      (default: 1)
   --help, -h                   show help
   --version, -v                print the version

Create a config file first. By default, gocrawsan reads ~/.config/gocrawsan/config.toml as its config file.

urls = [
  "https://www.google.co.jp",
  "https://www.example.com",
]

Then run the following command:

$ goc

Crawling Depth

By default, gocrawsan crawls only the URLs listed in the config file (depth = 1). To crawl recursively, set the depth option to an integer greater than 1.

For example, the following command crawls the URLs listed in the config file and the links found in their contents:

$ goc --depth 2

Extract Element By Selector

By default, gocrawsan crawls each URL and prints its HTTP status code alongside the URL. Additionally, gocrawsan can extract elements from the HTML document using a CSS selector.

This command extracts the "href" attribute of every "a" tag:

$ goc --selector a --extract-type attr --attribute href

To extract the text content instead, set the extract-type option to text:

$ goc --selector a --extract-type text

Other Options

You can set a timeout for HTTP requests with the timeout option.

$ goc --timeout 10 # time out after 10 seconds

The useragent option lets you set the User-Agent header for HTTP requests.

$ goc --useragent "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.1 (KHTML, like Gecko) Ubuntu/11.04 Chromium/14.0.825.0 Chrome/14.0.825.0 Safari/535.1"
