Persistent and distributed web crawler
A persistent, distributed web crawler that can either crawl a website and create a list of all its links, or download every page in a list of links to gzipped files. linkcrawler is threaded and uses connection pools, so it is fast. It is persistent because it periodically dumps its state to JSON files, which it uses to re-initialize if interrupted. It is distributed because it stores its state in a database, so you can start as many crawlers as you want on separate machines to speed up the process.
Crawl responsibly.
Getting Started
Install
go get github.com/schollz/linkcrawler/...
You will also need the database server:
go get github.com/schollz/boltdb-server/...
Run
Setup server
First run the database server:
$ $GOPATH/bin/boltdb-server
which will create a server listening on localhost:8080 by default.
Crawl
To capture all the links on a website:
$ linkcrawler --server 'http://localhost:8080' crawl http://rpiai.com
http://rpiai.com
Setting up crawler...
Starting crawl using DB NB2HI4B2F4XXE4DJMFUS4Y3PNU======
2017/03/11 08:38:02 32 parsed (5/s), 0 todo, 32 done, 3 trashed
Got links downloaded from 'http://rpiai.com'
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt
Make sure to replace the server address if your database server is running somewhere other than localhost:8080.
The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will restart from the last state.
You can run the same command on a different machine; each crawler pulls links from, and adds links to, the same shared queue in the database, speeding up the crawl.
Download
To download gzipped webpages from a list of websites:
$ linkcrawler --server 'http://localhost:8080' download links.txt
2017/03/11 08:41:20 32 parsed (31/s), 0 todo, 32 done, 0 trashed
Finished downloading
$ ls downloaded | head -n 2
NB2HI4B2F4XXE4DJMFUS4Y3PNU======.html.gz
NB2HI4B2F4XXE4DJMFUS4Y3PNUXQ====.html.gz
Downloads are saved into a folder, downloaded, with the URL of each link encoded in Base32 as the filename.
Dump
To dump the current database, use:
$ linkcrawler --server 'http://localhost:8080' dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt
License
MIT