Crawler

This is a distributed image crawler.

Architecture

Crawler uses a distributed queue to publish messages with the URLs to crawl. Nodes subscribed to the queue pull the messages and crawl the URLs. New URLs found while crawling a page are sent back to the queue for other nodes to process. Crawler has a hard limit of two layers of depth when crawling URLs: it crawls the URLs found in the first pages it receives, but it stops there (I don't really want to download the whole internet).

Crawler uses Gnatsd as a clustered queue. New Nats servers can be added to the cluster via configuration. New nodes can subscribe to the queue by pointing them to the cluster hosts. Nats doesn't offer any durability guarantee, so messages can be lost. If you need durability and delivery guarantees, you might want to take a look at Vega, a distributed mailbox system that offers better delivery guarantees. See Engines for how to swap queue implementations.

Crawler uses Riak as the storage engine. Riak offers strong consistency guarantees using CRDTs, and Crawler uses them to store counters and the sets of images processed. The Riak engine can also be swapped if you want an engine with a different set of guarantees. See Engines for how to swap storage implementations.

To bring new Crawler nodes up, you just need to boot the application and point it at the Gnatsd and Riak endpoints. See Configuration below.
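
As a rough sketch of the flow described above (all of the names here are illustrative, not Crawler's actual code), a node's processing step might look like this:

package crawlsketch

// Queue and Storage are minimal stand-ins for the engines described in the
// Engines section below.
type Queue interface {
  Publish(job, url string, depth uint) error
}

type Storage interface {
  Save(job, image string) error
}

const maxDepth = 2 // Crawler's hard two-layer limit

// processURL is illustrative, not Crawler's actual code: it saves the images
// found on a page and republishes newly found links for other nodes, unless
// the depth limit has been reached.
func processURL(q Queue, s Storage, job, url string, depth uint, images, links []string) error {
  for _, img := range images {
    if err := s.Save(job, img); err != nil {
      return err
    }
  }
  if depth+1 >= maxDepth {
    // Links found on the second layer are not followed any further.
    return nil
  }
  for _, link := range links {
    if err := q.Publish(job, link, depth+1); err != nil {
      return err
    }
  }
  return nil
}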

Configuration

Crawler is configured via environment variables:

  • CRAWLER_PORT: The port where the API is exposed, 3819 by default. See API for more details.
  • CRAWLER_RIAK_URL: The host and port of one of the nodes in your Riak cluster, for instance 192.168.1.11:8087. This is the port where Riak exposes its protocol buffers API.
  • CRAWLER_GNATSD_NODES: The list of configured Gnatsd nodes, separated by commas, for instance nats://192.168.1.12:4222,nats://192.168.1.13:4222.
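
As a rough sketch of how these variables might be read at startup (the variable names and the 3819 default come from the list above; the Config type and FromEnv function are illustrative, not Crawler's actual code):

package config

import (
  "os"
  "strings"
)

// Config holds the three documented settings.
type Config struct {
  Port      string   // CRAWLER_PORT, "3819" when unset
  RiakURL   string   // CRAWLER_RIAK_URL, host:port of the Riak protocol buffers API
  NatsNodes []string // CRAWLER_GNATSD_NODES, split on commas
}

// FromEnv reads the environment variables described above.
func FromEnv() Config {
  port := os.Getenv("CRAWLER_PORT")
  if port == "" {
    port = "3819" // documented default
  }
  var nodes []string
  if raw := os.Getenv("CRAWLER_GNATSD_NODES"); raw != "" {
    nodes = strings.Split(raw, ",")
  }
  return Config{
    Port:      port,
    RiakURL:   os.Getenv("CRAWLER_RIAK_URL"),
    NatsNodes: nodes,
  }
}
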
Docker configuration

Crawler can be installed via Docker. You can pull the image calavera/crawler from the Docker registry.

Crawler can be linked with Riak and Gnatsd containers in your Docker cluster. They need to be aliased as riak and gnatsd when you start the crawler container, for instance:

$ docker run --rm --link riak01:riak --link gnatsd:gnatsd -t calavera/crawler

API

Crawler exposes an HTTP API on the port specified by CRAWLER_PORT. These are the endpoints that the API exposes:

  • /: The root of the API can be reached via GET and displays a short howto about Crawler.
  • /crawl: This endpoint can be reached via POST to enqueue URLs to crawl. The URLs must be sent in the body of the request, separated by whitespace, for instance:
$ curl -X POST -d@- http://localhost:3819/crawl << EOF
https://google.com https://cnn.com
EOF

When the messages are enqueued, this endpoint returns a job identifier in the body. It also sets the Location header to the status endpoint where you can check the status of the process.

  • /status/job_uuid: This endpoint can be reached via GET. It displays the URLs currently being processed, the ones that have already been processed, and the number of times each URL has been found during the crawl.
  • /results/job_uuid: This endpoint can be reached via GET. It displays the images that have been collected by the job.
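
For example, a small Go client could drive these endpoints like this (the host, port, and content type are assumptions; the endpoints and the Location header behavior are as described above):

package main

import (
  "fmt"
  "io"
  "net/http"
  "strings"
)

func main() {
  base := "http://localhost:3819" // default CRAWLER_PORT

  // Enqueue two URLs; the body is a whitespace-separated list.
  // The content type here is a guess; the URLs are read from the request body.
  resp, err := http.Post(base+"/crawl", "text/plain",
    strings.NewReader("https://google.com https://cnn.com"))
  if err != nil {
    panic(err)
  }
  defer resp.Body.Close()

  jobID, _ := io.ReadAll(resp.Body)         // job identifier
  statusPath := resp.Header.Get("Location") // assumed to be a path like /status/<job_uuid>
  fmt.Printf("job %s enqueued, status at %s\n", jobID, statusPath)

  // Check the crawl status for the job.
  status, err := http.Get(base + statusPath)
  if err != nil {
    panic(err)
  }
  defer status.Body.Close()
  body, _ := io.ReadAll(status.Body)
  fmt.Println(string(body))
}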

Engines

Crawler has been designed so that the messaging and storage engines can be swapped. In fact, you can see this working if you start it without pointing it at the Gnatsd and Riak endpoints: by default, Crawler starts in development mode with two very unsafe engines, a queue engine built on channels and an in-memory storage engine that is not thread safe.

There are two interfaces that you need to implement if you want to design new storage and messaging engines:

Storage engines

A storage engine must implement this interface:

// Connection is an interface that defines how data is saved and retrieved from a storage.
type Connection interface {
  // Create the job in the database.
  CreateJob(string) error
  // Processing increments the counter of currently processing urls for a given job.
  Processing(string) error
  // Done increments the counter of done urls
  // and decrements the counter of processing urls for a given job.
  Done(string) error
  // Save adds an image source to the set of images for a given job.
  Save(string, string) error
  // Status returns the processing and done counters of a given job.
  Status(string) (*Info, error)
  // Results returns the processed images for a given job.
  Results(string) ([][]byte, error)
  // ViewPage decides whether a page needs to be crawled or not.
  // One url should only be crawled once by a given job,
  // but it depends on the guarantees that the storage provides.
  ViewPage(string, string) (bool, error)
}
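
As an illustration, a minimal in-memory implementation in the spirit of the development-mode engine could look like the sketch below. The Info struct stands in for Crawler's own Info type, whose exact fields are assumed here, and the ViewPage semantics (returning true when a URL still needs crawling) are also an assumption.

package memstore

import "errors"

// Info stands in for the crawler's own Info type; these fields are assumed
// from the Status documentation above.
type Info struct {
  Processing int
  Done       int
}

type jobState struct {
  info   Info
  images [][]byte
  seen   map[string]bool
}

// MemStore keeps all job data in process memory. Like the default
// development engine, it is not safe for concurrent use.
type MemStore struct {
  jobs map[string]*jobState
}

func New() *MemStore {
  return &MemStore{jobs: make(map[string]*jobState)}
}

var errUnknownJob = errors.New("unknown job")

func (m *MemStore) CreateJob(job string) error {
  m.jobs[job] = &jobState{seen: make(map[string]bool)}
  return nil
}

func (m *MemStore) Processing(job string) error {
  s, ok := m.jobs[job]
  if !ok {
    return errUnknownJob
  }
  s.info.Processing++
  return nil
}

func (m *MemStore) Done(job string) error {
  s, ok := m.jobs[job]
  if !ok {
    return errUnknownJob
  }
  s.info.Processing--
  s.info.Done++
  return nil
}

func (m *MemStore) Save(job, image string) error {
  s, ok := m.jobs[job]
  if !ok {
    return errUnknownJob
  }
  s.images = append(s.images, []byte(image))
  return nil
}

func (m *MemStore) Status(job string) (*Info, error) {
  s, ok := m.jobs[job]
  if !ok {
    return nil, errUnknownJob
  }
  info := s.info
  return &info, nil
}

func (m *MemStore) Results(job string) ([][]byte, error) {
  s, ok := m.jobs[job]
  if !ok {
    return nil, errUnknownJob
  }
  return s.images, nil
}

// ViewPage marks the URL as seen and reports whether it still needs to be
// crawled (true is assumed to mean "crawl it").
func (m *MemStore) ViewPage(job, url string) (bool, error) {
  s, ok := m.jobs[job]
  if !ok {
    return false, errUnknownJob
  }
  if s.seen[url] {
    return false, nil
  }
  s.seen[url] = true
  return true, nil
}
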
Messaging engines

A messaging engine must implement this interface:

// Connection is an interface that defines how messages are published to and received from a queue.
type Connection interface {
  // Publish pushes new messages to the queue.
  Publish(string, string, uint) error
  // Subscribe pulls messages from the queue and processes them using the processor function.
  Subscribe(Processor)
}
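
Likewise, a channel-backed sketch in the spirit of the development-mode queue engine might look like this. The Processor signature is an assumption that mirrors Publish's arguments; the real package defines its own Processor type.

package chanqueue

// Processor is assumed to receive the same fields that Publish accepts.
type Processor func(job, url string, depth uint)

type message struct {
  job   string
  url   string
  depth uint
}

// ChanQueue delivers messages in-process over a buffered channel. It offers
// no durability: messages are lost when the process exits.
type ChanQueue struct {
  messages chan message
}

func New(buffer int) *ChanQueue {
  return &ChanQueue{messages: make(chan message, buffer)}
}

// Publish pushes a new message onto the channel.
func (q *ChanQueue) Publish(job, url string, depth uint) error {
  q.messages <- message{job: job, url: url, depth: depth}
  return nil
}

// Subscribe consumes messages in the background and hands each one to the
// processor function.
func (q *ChanQueue) Subscribe(p Processor) {
  go func() {
    for m := range q.messages {
      p(m.job, m.url, m.depth)
    }
  }()
}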

Building

The build system assumes you're on a Linux host with Go and Docker installed. Run make build to generate a Docker container.

TL;DR

Run make tldr if you want to start testing right away. This task assumes that you have curl and a Docker client installed and connected to a Docker host.

  • It pulls apcera/gnatsd from the Docker registry and starts a node.
  • It pulls hectcastro/riak from the Docker registry and starts a cluster of 5 nodes.
  • It pulls calavera/crawler from the Docker registry and starts two nodes.
  • It sends an initial request to crawl https://google.com and https://docker.com.

License

MIT

Directories

  • Godeps/_workspace/src/code.google.com/p/cascadia: The cascadia package is an implementation of CSS selectors.
  • Godeps/_workspace/src/code.google.com/p/goprotobuf/proto: Package proto converts data structures to and from the wire format of protocol buffers.
  • Godeps/_workspace/src/github.com/PuerkitoBio/fetchbot: Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • Godeps/_workspace/src/github.com/PuerkitoBio/goquery: Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document.
  • Godeps/_workspace/src/github.com/apcera/nats: A Go client for the NATS messaging system (https://nats.io).
  • Godeps/_workspace/src/github.com/julienschmidt/httprouter: Package httprouter is a trie based high performance HTTP request router.
  • Godeps/_workspace/src/github.com/temoto/robotstxt.go: The robots.txt Exclusion Protocol is implemented as specified in http://www.robotstxt.org/wc/robots.html with various extensions.
  • Godeps/_workspace/src/github.com/tpjg/goriakpbc: Package riak is a riak-client, inspired by the Ruby riak-client gem and the riakpbc go package from mrb.
  • Godeps/_workspace/src/github.com/tpjg/goriakpbc/pb: Package riak is a generated protocol buffer package.
  • Godeps/_workspace/src/golang.org/x/net/html: Package html implements an HTML5-compliant tokenizer and parser.
  • Godeps/_workspace/src/golang.org/x/net/html/atom: Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id".
  • Godeps/_workspace/src/golang.org/x/net/html/charset: Package charset provides common text encodings for HTML documents.
