# GoCrawler

A web crawler based on gospider and rebuilt on the colly framework, with a number of feature improvements and refinements that make it better suited to information gathering.
## Installation

```sh
GO111MODULE=on go get -u github.com/zerokeeper/gocrawler
```
## Features

- Fast crawling
- Brute-forces and parses sitemap.xml
- Parses robots.txt
- Generates and verifies links found in JavaScript files
- Link Finder
- Crawls multiple sites in parallel
- Random user-agent
- Caps the number of requests per site to avoid endless crawling
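The Link Finder feature typically works by scanning JavaScript source with a permissive regular expression for endpoint-like strings. A minimal sketch of that idea (the regex and the `extractLinks` helper are illustrative, not gocrawler's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRe matches quoted full URLs and absolute paths inside JS source.
// The pattern is an illustrative simplification, not gocrawler's own.
var linkRe = regexp.MustCompile(`["'](https?://[^"']+|/[a-zA-Z0-9_./-]+)["']`)

// extractLinks pulls endpoint-like strings out of JavaScript source.
func extractLinks(js string) []string {
	var links []string
	for _, m := range linkRe.FindAllStringSubmatch(js, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	js := `fetch("https://api.example.com/v1/users");var p='/admin/login.php';`
	for _, l := range extractLinks(js) {
		fmt.Println(l) // prints the URL and the path found in the snippet
	}
}
```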
## Usage

```
web crawler written in Go - v1.0.0

Usage:
  gocrawler [flags]

Flags:
  -s, --site string                 Site to crawl
  -S, --sites string                Site list to crawl
  -p, --proxy string                Proxy (Ex: http://127.0.0.1:8080)
  -o, --output string               Output folder
  -u, --user-agent string           User agent to use
                                      web: random web user-agent
                                      mobi: random mobile user-agent
                                      or set your own user-agent (default "web")
      --cookie string               Cookie to use (testA=a; testB=b)
  -H, --header stringArray          Header to use (use the flag multiple times to set multiple headers)
      --blacklist string            Blacklist URL regex
      --whitelist string            Whitelist URL regex
      --whitelist-domain string     Whitelist domain
  -L, --filter-length string        Turn on length filter
  -t, --threads int                 Number of threads (run sites in parallel) (default 10)
  -c, --concurrent int              Maximum number of concurrent requests for the matching domains (default 5)
  -d, --depth int                   MaxDepth limits the recursion depth of visited URLs (set it to 0 for infinite recursion) (default 5)
  -k, --delay int                   Delay is the duration to wait before creating a new request to the matching domains (seconds)
  -K, --random-delay int            RandomDelay is the extra randomized duration to wait, added to Delay, before creating a new request (seconds)
  -m, --timeout int                 Request timeout (seconds) (default 10)
  -n, --request-limit-numbers int   Maximum number of requests to send per site (default 1000)
  -B, --base                        Disable all other sources and only use HTML content
      --js                          Enable Link Finder in JavaScript files (default true)
      --sitemap                     Try to crawl sitemap.xml
      --robots                      Try to crawl robots.txt (default true)
  -a, --other-source                Find URLs from 3rd-party sources (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)
  -w, --include-subs                Include subdomains crawled from 3rd-party sources (default: main domain only)
  -r, --include-other-source        Also include 3rd-party-source URLs (still crawl and request them)
      --subs                        Include subdomains
      --debug                       Turn on debug mode
      --json                        Enable JSON output
  -v, --verbose                     Turn on verbose output
  -q, --quiet                       Suppress all output and only show URLs
      --no-redirect                 Disable redirects
      --version                     Check version
  -l, --length                      Show response length
  -R, --raw                         Enable raw output
  -h, --help                        Help for gocrawler
```
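With `-u web` or `-u mobi`, a random user-agent is picked from a built-in list; any other value is used literally. A minimal sketch of such a picker (the `randomUA` helper and the tiny UA samples are illustrative, not the lists gocrawler ships):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Small illustrative samples; the real tool ships much larger lists.
var webUAs = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
	"Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
}

var mobiUAs = []string{
	"Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
	"Mozilla/5.0 (Linux; Android 14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Mobile Safari/537.36",
}

// randomUA mirrors the flag's behavior: "web" and "mobi" select a
// random entry, anything else is treated as a literal user-agent.
func randomUA(mode string) string {
	switch mode {
	case "web":
		return webUAs[rand.Intn(len(webUAs))]
	case "mobi":
		return mobiUAs[rand.Intn(len(mobiUAs))]
	default:
		return mode
	}
}

func main() {
	fmt.Println(randomUA("web"))
	fmt.Println(randomUA("mobi"))
	fmt.Println(randomUA("my-custom-agent/1.0"))
}
```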
## Example commands
Quiet output

```sh
gocrawler -q -s "https://google.com/"
```
Run with a single site

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1
```
Run with a site list

```sh
gocrawler -S sites.txt -o output -c 10 -d 1
```
Run 20 sites at the same time with 10 bots per site

```sh
gocrawler -S sites.txt -o output -c 10 -d 1 -t 20
```
Also get URLs from 3rd-party sources (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source
```

Also get URLs from 3rd-party sources and include subdomains crawled from them

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source --include-subs
```
Crawl with custom headers and cookies

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source -H "Accept: */*" -H "Test: test" --cookie "testA=a; testB=b"
```
Load headers and cookies from a Burp Suite raw HTTP request

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source --burp burp_req.txt
```
Blacklist URLs/file extensions.

P/s: gocrawler blacklists `.(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico)` by default.

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --blacklist ".(woff|pdf)"
```
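Under the hood, blacklist filtering amounts to compiling the pattern and testing each discovered URL before it is queued. A minimal sketch using the default extension blacklist quoted above (the `allowed` helper is illustrative; the dot is escaped here, where the README's raw pattern would let `.` match any character):

```go
package main

import (
	"fmt"
	"regexp"
)

// Default extension blacklist from the README, with the leading dot
// escaped so it matches a literal ".".
var blacklist = regexp.MustCompile(`\.(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico)`)

// allowed reports whether a discovered URL should be queued for crawling.
func allowed(url string) bool {
	return !blacklist.MatchString(url)
}

func main() {
	for _, u := range []string{
		"https://example.com/app.js",
		"https://example.com/logo.png",
		"https://example.com/fonts/a.woff2",
	} {
		fmt.Printf("%-40s allowed=%v\n", u, allowed(u))
	}
}
```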
Show response lengths and blacklist specific lengths.

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --length --filter-length "6871,24432"
```