linkcrawler

Persistent and distributed web crawler

Persistent and distributed web crawler that can either crawl a website and create a list of all links OR download all the websites in a list as gzipped files. linkcrawler is threaded and uses connection pools, so it is fast. It is persistent because it periodically dumps its state to JSON files, which it uses to re-initialize if interrupted. It is distributed because it connects to a database to store its state, so you can start as many crawlers as you want on separate machines to speed up the process.

Crawl responsibly.

Getting Started

Install

go get github.com/schollz/linkcrawler/...

Also you will need the database server,

go get github.com/schollz/boltdb-server/...

Run

Setup server

First run the database server:

$ $GOPATH/bin/boltdb-server

which will start a server listening on localhost:8080 by default.
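You can sanity-check that the server is up with a plain HTTP request (the exact response body depends on your boltdb-server version):

$ curl http://localhost:8080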

Crawl

To capture all the links on a website:

$ linkcrawler --server 'http://localhost:8080' crawl http://rpiai.com
http://rpiai.com
Setting up crawler...
Starting crawl using DB NB2HI4B2F4XXE4DJMFUS4Y3PNU======
2017/03/11 08:38:02 32 parsed (5/s), 0 todo, 32 done, 3 trashed
Got links downloaded from 'http://rpiai.com'
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt

Make sure to replace the server address if your database server is running somewhere other than localhost:8080.

The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will restart from the last state.

You can run the same command on a different machine pointed at the same database server; each crawler pulls links from and adds links to the same shared queue, so they cooperate to finish the crawl faster.
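For example, assuming the database server started earlier is reachable from a second machine at 192.168.1.2:8080 (an address used here purely for illustration), that machine joins the crawl by running the identical command against it:

$ linkcrawler --server 'http://192.168.1.2:8080' crawl http://rpiai.com

Both crawlers then work off the same queue of pending links.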

Download

To download gzipped webpages from a list of websites:

$ linkcrawler --server 'http://localhost:8080' download links.txt
2017/03/11 08:41:20 32 parsed (31/s), 0 todo, 32 done, 0 trashed
Finished downloading
$ ls downloaded | head -n 2
NB2HI4B2F4XXE4DJMFUS4Y3PNU======.html.gz
NB2HI4B2F4XXE4DJMFUS4Y3PNUXQ====.html.gz

Downloads are saved into a folder named downloaded, with each file named after the Base32 encoding of the link's URL.
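Because the filenames are standard Base32, you can recover the original URL of any download. The commands below assume GNU coreutils (for base32 with its -d decode flag) and gzip (for zcat):

$ echo 'NB2HI4B2F4XXE4DJMFUS4Y3PNU======' | base32 -d
http://rpiai.com

$ zcat downloaded/NB2HI4B2F4XXE4DJMFUS4Y3PNU======.html.gz | head -n 2

The first command decodes a filename back to its URL; the second peeks at the start of the corresponding gzipped page.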

Dump

To dump the links currently stored in the database for a site, just use

$ linkcrawler --server 'http://localhost:8080' dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt

License

MIT
