GoCrawler

A crawler built on the colly framework as a secondary development of gospider, with a number of feature improvements and refinements that make it better suited for information gathering.

Installation

GO111MODULE=on go get -u github.com/zerokeeper/gocrawler
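With Go 1.17 and newer, go get no longer installs binaries; assuming the module still builds as a standalone command, the equivalent install would be:

go install github.com/zerokeeper/gocrawler@latest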

Features

  • Fast crawling
  • Brute-forces and parses sitemap.xml (see the example after this list)
  • Parses robots.txt
  • Generates and verifies links from JavaScript files
  • Link Finder
  • Crawls multiple sites in parallel
  • Random user-agent
  • Limits the number of requests per site to avoid crawling indefinitely
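
These features map onto the flags listed under Usage below. As a rough illustration (flag names and defaults are taken from the help output and may differ between versions), enabling sitemap crawling while disabling the JavaScript link finder and lowering the per-site request limit could look like:

gocrawler -s "https://google.com/" --sitemap --js=false -n 500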

Usage

web crawler written in Go - v1.0.0

Usage:
  gocrawler [flags]

Flags:
  -s, --site string                 Site to crawl
  -S, --sites string                Site list to crawl
  -p, --proxy string                Proxy (Ex: http://127.0.0.1:8080)
  -o, --output string               Output folder
  -u, --user-agent string           User Agent to use
                                    	web: random web user-agent
                                    	mobi: random mobile user-agent
                                    	or you can set your special user-agent (default "web")
      --cookie string               Cookie to use (testA=a; testB=b)
  -H, --header stringArray          Header to use (Use multiple flag to set multiple header)
      --blacklist string            Blacklist URL Regex
      --whitelist string            Whitelist URL Regex
      --whitelist-domain string     Whitelist Domain
  -L, --filter-length string        Turn on length filter
  -t, --threads int                 Number of threads (Run sites in parallel) (default 10)
  -c, --concurrent int              The number of the maximum allowed concurrent requests of the matching domains (default 5)
  -d, --depth int                   MaxDepth limits the recursion depth of visited URLs. (Set it to 0 for infinite recursion) (default 5)
  -k, --delay int                   Delay is the duration to wait before creating a new request to the matching domains (second)
  -K, --random-delay int            RandomDelay is the extra randomized duration to wait added to Delay before creating a new request (second)
  -m, --timeout int                 Request timeout (second) (default 10)
  -n, --request-limit-numbers int   the limit numbers to request (default 1000)
  -B, --base                        Disable all and only use HTML content
      --js                          Enable linkfinder in javascript file (default true)
      --sitemap                     Try to crawl sitemap.xml
      --robots                      Try to crawl robots.txt (default true)
  -a, --other-source                Find URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)
  -w, --include-subs                Include subdomains crawled from 3rd party. Default is main domain
  -r, --include-other-source        Also include other-source's urls (still crawl and request)
      --subs                        Include subdomains
      --debug                       Turn on debug mode
      --json                        Enable JSON output
  -v, --verbose                     Turn on verbose
  -q, --quiet                       Suppress all the output and only show URL
      --no-redirect                 Disable redirect
      --version                     Check version
  -l, --length                      Turn on length
  -R, --raw                         Enable raw output
  -h, --help                        help for gocrawler

Example commands

Quiet output
gocrawler -q -s "https://google.com/"
Run with a single site
gocrawler -s "https://google.com/" -o output -c 10 -d 1
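Run through a local HTTP proxy (a hedged example; the proxy address is the placeholder shown in the help output)
gocrawler -s "https://google.com/" -o output -c 10 -d 1 -p "http://127.0.0.1:8080"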
Run with a site list
gocrawler -S sites.txt -o output -c 10 -d 1
Run 20 sites at the same time with 10 bots per site
gocrawler -S sites.txt -o output -c 10 -d 1 -t 20
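Throttle requests with a fixed plus randomized per-request delay, in seconds (a hedged combination of the -k/-K flags above)
gocrawler -S sites.txt -o output -c 5 -d 1 -k 1 -K 2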
Also get URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source
Also get URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com) and include subdomains
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source --include-subs
Use custom header/cookies
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source -H "Accept: */*" -H "Test: test" --cookie "testA=a; testB=b"

gocrawler -s "https://google.com/" -o output -c 10 -d 1 --other-source --burp burp_req.txt
Blacklist URLs or file extensions by regex.

Note: gocrawler blacklists .(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico) by default

gocrawler -s "https://google.com/" -o output -c 10 -d 1 --blacklist ".(woff|pdf)"
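Whitelist URLs by regex and/or restrict crawling to a domain (a hedged example; the pattern and domain are only illustrative)
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --whitelist "\.php" --whitelist-domain "google.com"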
Show response lengths and filter out specific lengths.
gocrawler -s "https://google.com/" -o output -c 10 -d 1 --length --filter-length "6871,24432"   
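
Print only machine-readable results for piping into other tools (a hedged combination of the --json and -q flags above)
gocrawler -q --json -s "https://google.com/"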
