gocrawsan

command module

Version: v0.1.1 | Published: May 7, 2017 | License: MIT | Imports: 17 | Imported by: 0

Repository

github.com/tzmfreedom/gocrawsan

Gocrawsan

A simple web crawler written in Go.

Install

For Linux or macOS users:

$ curl -sL http://install.freedom-man.com/goc.sh | bash

To also install zsh completion, add the --zsh-completion option:

$ curl -sL http://install.freedom-man.com/goc.sh | bash -s -- --zsh-completion

Or, to get the latest version, run the following command:

$ go get github.com/tzmfreedom/gocrawsan

Usage

NAME:
   gocrawsan

USAGE:
   goc [global options] command [command options] [arguments...]

VERSION:
   0.1.0

COMMANDS:
     help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --useragent value, -U value
   --config value, -C value
   --no-redirect
   --selector value, -S value
   --extract-type value, -P value
   --attribute value, -A value
   --no-error
   --timeout value              (default: 10)
   --depth value, -D value      (default: 1)
   --help, -h                   show help
   --version, -v                print the version

Create a config file before running gocrawsan. By default, gocrawsan reads ~/.config/gocrawsan/config.toml as its config file.

urls = [
  "https://www.google.co.jp",
  "https://www.example.com",
]

Then run the following command:

$ goc

Crawling Depth

By default, gocrawsan crawls only the URLs listed in the config file (depth = 1). To crawl recursively, set the depth option to an integer greater than 1.

For example, the following command crawls the URLs listed in the config file and the links found in their content:

$ goc --depth 2
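
The depth rule above can be sketched as a level-by-level traversal. This is a minimal illustration, not gocrawsan's actual implementation: the `crawl` function, the `fetchLinks` type, and the in-memory link graph are hypothetical, and skipping already-visited URLs is an assumption the README does not confirm.

```go
package main

import "fmt"

// fetchLinks is a hypothetical stand-in for "request the page and return
// the links found in it". A real crawler would issue an HTTP request here.
type fetchLinks func(url string) []string

// crawl visits the seed URLs at depth 1 and follows links level by level
// until maxDepth is reached. It returns the URLs in the order visited.
// Skipping already-visited URLs is an assumption made for this sketch.
func crawl(seeds []string, maxDepth int, fetch fetchLinks) []string {
	visited := make(map[string]bool)
	var order []string
	frontier := seeds
	for depth := 1; depth <= maxDepth && len(frontier) > 0; depth++ {
		var next []string
		for _, u := range frontier {
			if visited[u] {
				continue
			}
			visited[u] = true
			order = append(order, u)
			// Links found on this page form the next level.
			next = append(next, fetch(u)...)
		}
		frontier = next
	}
	return order
}

func main() {
	// In-memory link graph used in place of real pages (illustrative data).
	site := map[string][]string{
		"https://www.example.com":   {"https://www.example.com/a", "https://www.example.com/b"},
		"https://www.example.com/a": {"https://www.example.com/c"},
	}
	fetch := func(u string) []string { return site[u] }

	seeds := []string{"https://www.example.com"}
	fmt.Println(crawl(seeds, 1, fetch)) // seed only
	fmt.Println(crawl(seeds, 2, fetch)) // seed plus the links it contains
}
```

With depth 1 only the configured URL is visited; with depth 2 the links found on that page are visited as well, but not the links those pages contain.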

Extract Element By Selector

By default, gocrawsan crawls each URL and prints its HTTP status code together with the URL. Additionally, gocrawsan can extract elements from the HTML document with a CSS selector.

The following command extracts the "href" attribute of every "a" tag:

$ goc --selector a --extract-type attr --attribute href

To extract the text content instead, set the extract-type option to text:

$ goc --selector a --extract-type text
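
To make the two extract types concrete, here is a rough sketch that uses only the Go standard library. It is not how gocrawsan itself extracts elements: the `extract` function is hypothetical, it matches a bare tag name such as `a` rather than a full CSS selector, and it relies on `encoding/xml` in non-strict mode, which copes only with reasonably well-formed markup.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// extract returns, for every element named tag, either the value of the
// attribute attr (mode "attr") or the element's text content (mode "text").
// It is a simplified stand-in for selector-based extraction.
func extract(doc, tag, mode, attr string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	// Settings documented by encoding/xml for reading HTML-like input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var out []string
	var text strings.Builder
	depth := 0 // greater than zero while inside a matching element
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == tag {
				if mode == "attr" {
					for _, a := range t.Attr {
						if a.Name.Local == attr {
							out = append(out, a.Value)
						}
					}
				}
				depth++
			}
		case xml.CharData:
			if depth > 0 && mode == "text" {
				text.Write(t)
			}
		case xml.EndElement:
			if t.Name.Local == tag && depth > 0 {
				depth--
				if depth == 0 && mode == "text" {
					out = append(out, strings.TrimSpace(text.String()))
					text.Reset()
				}
			}
		}
	}
}

func main() {
	doc := `<html><body><a href="/one">First</a><a href="/two">Second <b>link</b></a></body></html>`
	hrefs, _ := extract(doc, "a", "attr", "href")
	texts, _ := extract(doc, "a", "text", "")
	fmt.Println(hrefs)
	fmt.Println(texts)
}
```

The "attr" mode corresponds to `--extract-type attr --attribute href`, and the "text" mode to `--extract-type text`. Text inside nested tags is included in the element's text.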

Other Options

Set the timeout for HTTP requests with the timeout option.

$ goc --timeout 10 # time out after 10 seconds

The useragent option sets the User-Agent header for HTTP requests.

$ goc --useragent "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.1 (KHTML, like Gecko) Ubuntu/11.04 Chromium/14.0.825.0 Chrome/14.0.825.0 Safari/535.1"
