# GoCrawler

A web crawler based on gospider and rebuilt on the colly framework, with a number of feature improvements and refinements that make it better suited to information gathering.
## Installation

```sh
GO111MODULE=on go get -u github.com/zerokeeper/gocrawler
```
## Features

- Fast crawling
- Brute-forces and parses sitemap.xml
- Parses robots.txt
- Generates and verifies links found in JavaScript files
- Link Finder
- Crawls multiple sites in parallel
- Random user-agent
- Caps the number of requests per site to avoid endless crawling
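The Link Finder feature typically works by scanning JavaScript source with a permissive regular expression for endpoint-like strings. A minimal sketch of that idea (the regex and the `extractLinks` helper are illustrative, not gocrawler's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRe matches quoted full URLs and absolute paths inside JS source.
// The pattern is an illustrative simplification, not gocrawler's own.
var linkRe = regexp.MustCompile(`["'](https?://[^"']+|/[a-zA-Z0-9_./-]+)["']`)

// extractLinks pulls endpoint-like strings out of JavaScript source.
func extractLinks(js string) []string {
	var links []string
	for _, m := range linkRe.FindAllStringSubmatch(js, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	js := `fetch("https://api.example.com/v1/users");var p='/admin/login.php';`
	for _, l := range extractLinks(js) {
		fmt.Println(l) // prints the URL and the path found in the snippet
	}
}
```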
## Usage

```
web crawler written in Go - v1.0.0

Usage:
  gocrawler [flags]

Flags:
  -s, --site string                 Site to crawl
  -S, --sites string                Site list to crawl
  -p, --proxy string                Proxy (Ex: http://127.0.0.1:8080)
  -o, --output string               Output folder
  -u, --user-agent string           User agent to use
                                      web: random web user-agent
                                      mobi: random mobile user-agent
                                      or set your own user-agent (default "web")
      --cookie string               Cookie to use (testA=a; testB=b)
  -H, --header stringArray          Header to use (use the flag multiple times to set multiple headers)
      --blacklist string            Blacklist URL regex
      --whitelist string            Whitelist URL regex
      --whitelist-domain string     Whitelist domain
  -L, --filter-length string        Turn on length filter
  -t, --threads int                 Number of threads (run sites in parallel) (default 10)
  -c, --concurrent int              Maximum number of concurrent requests for the matching domains (default 5)
  -d, --depth int                   MaxDepth limits the recursion depth of visited URLs (set it to 0 for infinite recursion) (default 5)
  -k, --delay int                   Delay is the duration to wait before creating a new request to the matching domains (seconds)
  -K, --random-delay int            RandomDelay is the extra randomized duration to wait, added to Delay, before creating a new request (seconds)
  -m, --timeout int                 Request timeout (seconds) (default 10)
  -n, --request-limit-numbers int   Maximum number of requests to send per site (default 1000)
  -B, --base                        Disable all other sources and only use HTML content
      --js                          Enable Link Finder in JavaScript files (default true)
      --sitemap                     Try to crawl sitemap.xml
      --robots                      Try to crawl robots.txt (default true)
  -a, --other-source                Find URLs from 3rd-party sources (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)
  -w, --include-subs                Include subdomains crawled from 3rd-party sources (default: main domain only)
  -r, --include-other-source        Also include 3rd-party-source URLs (still crawl and request them)
      --subs                        Include subdomains
      --debug                       Turn on debug mode
      --json                        Enable JSON output
  -v, --verbose                     Turn on verbose output
  -q, --quiet                       Suppress all output and only show URLs
      --no-redirect                 Disable redirects
      --version                     Check version
  -l, --length                      Show response length
  -R, --raw                         Enable raw output
  -h, --help                        Help for gocrawler
```
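With `-u web` or `-u mobi`, a random user-agent is picked from a built-in list; any other value is used literally. A minimal sketch of such a picker (the `randomUA` helper and the tiny UA samples are illustrative, not the lists gocrawler ships):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Small illustrative samples; the real tool ships much larger lists.
var webUAs = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
	"Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
}

var mobiUAs = []string{
	"Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
	"Mozilla/5.0 (Linux; Android 14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Mobile Safari/537.36",
}

// randomUA mirrors the flag's behavior: "web" and "mobi" select a
// random entry, anything else is treated as a literal user-agent.
func randomUA(mode string) string {
	switch mode {
	case "web":
		return webUAs[rand.Intn(len(webUAs))]
	case "mobi":
		return mobiUAs[rand.Intn(len(mobiUAs))]
	default:
		return mode
	}
}

func main() {
	fmt.Println(randomUA("web"))
	fmt.Println(randomUA("mobi"))
	fmt.Println(randomUA("my-custom-agent/1.0"))
}
```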
## Example commands
Quiet output

```sh
gocrawler -q -s "https://google.com/"
```
Run with a single site

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1
```
Run with a site list

```sh
gocrawler -S sites.txt -o output -c 10 -d 1
```
Run 20 sites at the same time with 10 bots per site

```sh
gocrawler -S sites.txt -o output -c 10 -d 1 -t 20
```
Also get URLs from 3rd-party sources (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source
```

Also get URLs from 3rd-party sources and include subdomains crawled from them

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source --include-subs
```
Crawl with custom headers and cookies

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source -H "Accept: */*" -H "Test: test" --cookie "testA=a; testB=b"
```
Load headers and cookies from a Burp Suite raw HTTP request

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source --burp burp_req.txt
```
Blacklist URLs/file extensions.

P/s: gocrawler blacklists `.(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico)` by default.

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --blacklist ".(woff|pdf)"
```
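Under the hood, blacklist filtering amounts to compiling the pattern and testing each discovered URL before it is queued. A minimal sketch using the default extension blacklist quoted above (the `allowed` helper is illustrative; the dot is escaped here, where the README's raw pattern would let `.` match any character):

```go
package main

import (
	"fmt"
	"regexp"
)

// Default extension blacklist from the README, with the leading dot
// escaped so it matches a literal ".".
var blacklist = regexp.MustCompile(`\.(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico)`)

// allowed reports whether a discovered URL should be queued for crawling.
func allowed(url string) bool {
	return !blacklist.MatchString(url)
}

func main() {
	for _, u := range []string{
		"https://example.com/app.js",
		"https://example.com/logo.png",
		"https://example.com/fonts/a.woff2",
	} {
		fmt.Printf("%-40s allowed=%v\n", u, allowed(u))
	}
}
```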
Show response lengths and blacklist specific lengths.

```sh
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --length --filter-length "6871,24432"
```