crawler

package module
v0.0.0-...-8c614bb
Published: May 12, 2014 License: MIT Imports: 11 Imported by: 0

README

Crawler


A simple domain crawler.

  • Respects robots.txt.
  • Doesn't leave the domain it's given.
  • Doesn't visit sub-domains.

Crawl things!

Install the crawler:

go get github.com/aybabtme/crawler/cmd/crawl

Use it:

crawl -h http://antoine.im -f antoineim_map.json

It should print something like:

2014/05/11 02:28:04 starting crawl on http://antoine.im
2014/05/11 02:28:05 [crawler] root has 1 elements
2014/05/11 02:28:05 [crawler] fringe=10 found=12 (new=10, rejected=2) source="http://antoine.im"
2014/05/11 02:28:06 [crawler] fringe=10 found=27 (new=1, rejected=20) source="http://antoine.im/posts/someone_was_right_on_the_internet"
...
2014/05/11 02:28:07 [crawler] fringe=0  found=0 (new=0, rejected=0) source="http://antoine.im/assets/data/to_buffer_or_not_to_buffer/t1_micro_bench_1.0MB.svg"
2014/05/11 02:28:07 [crawler] done crawling, 15 resources, 45 links
2014/05/11 02:28:07 preparing sitemap
2014/05/11 02:28:07 saving to "antoineim_map.json"
2014/05/11 02:28:07 done in 3.006155429s

You can then use the output file, for instance to count how many resources came back with a 404 (requires jq):

jq < mysite.com.json '.resources | map(select(.status_code == 404)) | length'

Or find out which pages link to those 404s:

jq < mysite.com.json '.resources | map(select(.status_code == 404)) | [.[].refered_by[]] | unique'

Use the lib!

If you want to use the library, install it:

go get github.com/aybabtme/crawler

The godocs are on godoc (lol).
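To get a feel for the API documented below, here is a minimal sketch of what a program using the library might look like. The domain and user agent strings are placeholders; swap in your own.

package main

import (
	"log"
	"net/url"

	"github.com/aybabtme/crawler"
)

func main() {
	// The domain to crawl; the crawler won't leave it.
	domain, err := url.Parse("http://antoine.im")
	if err != nil {
		log.Fatal(err)
	}

	// NewCrawler takes the domain and a user agent string.
	c, err := crawler.NewCrawler(domain, "examplebot/0.1")
	if err != nil {
		log.Fatal(err)
	}

	// Crawl blocks until the domain has been visited, then
	// returns the resulting ResourceGraph.
	graph, err := c.Crawl()
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("%d resources, %d links", graph.ResourceCount(), graph.LinkCount())
}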

Test it!

go get -t github.com/aybabtme/crawler
make test

To view the coverage report:

make cover

The output

The output of a crawl is a list of resources, along with:

  • What they refer to (resources they point to).
  • What they are referred by (resources that point to them).
  • The status code received when fetching the resource.

The status code is interesting: it might show that you have dead links (404), for instance.

Here's a snippet of crawling my blog. The full map can be found here if you want to see it.

{
    "resource_count": 15,
    "link_count": 45,
    "resources": [
        {
            "url": "http://antoine.im/posts/dynamic_programming_for_the_lazy",
            "refered_by": [
                "http://antoine.im"
            ],
            "refers_to": [
                "http://antoine.im",
                "http://antoine.im/assets/css/brog.css",
                "http://antoine.im/assets/css/font-awesome.min.css",
                "http://antoine.im/assets/css/styles/github.css",
                "http://antoine.im/assets/js/algo_convenience_hacks.js",
                "http://antoine.im/assets/js/brog.js"
            ],
            "status_code": 200
        },
        {
            "url": "http://antoine.im/assets/css/brog.css",
            "refered_by": [
                "http://antoine.im",
                "http://antoine.im/posts/someone_was_right_on_the_internet",
                "http://antoine.im/posts/someone_is_wrong_on_the_internet",
                "http://antoine.im/posts/the_story_of_select_and_the_goroutines",
                "http://antoine.im/posts/dynamic_programming_for_the_lazy",
                "http://antoine.im/posts/to_buffer_or_not_to_buffer",
                "http://antoine.im/posts/correction_hacks"
            ],
            "refers_to": [],
            "status_code": 200
        },
        // ...
    ]
}
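If jq isn't your thing, the same file can be read from Go. This is only a sketch: the struct fields simply mirror the keys shown in the snippet above.

package main

import (
	"encoding/json"
	"log"
	"os"
)

// resource and sitemap mirror the JSON keys shown above.
type resource struct {
	URL        string   `json:"url"`
	ReferedBy  []string `json:"refered_by"`
	RefersTo   []string `json:"refers_to"`
	StatusCode int      `json:"status_code"`
}

type sitemap struct {
	ResourceCount int        `json:"resource_count"`
	LinkCount     int        `json:"link_count"`
	Resources     []resource `json:"resources"`
}

func main() {
	f, err := os.Open("antoineim_map.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var m sitemap
	if err := json.NewDecoder(f).Decode(&m); err != nil {
		log.Fatal(err)
	}

	// Count resources that came back as 404, like the jq one-liner above.
	dead := 0
	for _, r := range m.Resources {
		if r.StatusCode == 404 {
			dead++
		}
	}
	log.Printf("%d of %d resources returned a 404", dead, m.ResourceCount)
}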

Documentation


Types

type Crawler

type Crawler interface {
	Crawl() (ResourceGraph, error)
}

Crawler produces a ResourceGraph from within a single domain.

func NewCrawler

func NewCrawler(domain *url.URL, agent string) (Crawler, error)

NewCrawler creates a single-threaded Crawler that respects robots.txt. It starts crawling from domain, plus any links listed in the sitemap that robots.txt reports.

Crawler will write to standard log as it progresses.

type ResourceGraph

type ResourceGraph interface {
	// Contains tells whether a URL is reachable on the website.
	Contains(string) bool
	// ResourceCount is the number of resources in the graph.
	ResourceCount() int
	// LinkCount is the number of links interconnecting the resources.
	LinkCount() int
	// Walk walks over the graph for as long as the func returns true.
	Walk(func(link string, status int, refersTo, referedBy []string) bool)
	// A ResourceGraph can be marshalled to JSON.
	json.Marshaler
}

ResourceGraph represents a website as a directed graph where resources are connected by URLs.
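For illustration, a sketch of how Walk could be used to collect dead links and the pages that refer to them; graph is assumed to be a ResourceGraph returned by Crawl:

// Collect every resource that returned a 404, keyed by URL,
// along with the pages that refer to it.
broken := map[string][]string{}
graph.Walk(func(link string, status int, refersTo, referedBy []string) bool {
	if status == 404 {
		broken[link] = referedBy
	}
	return true // keep walking the whole graph
})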

Directories

Path Synopsis
Godeps
_workspace/src/code.google.com/p/cascadia
The cascadia package is an implementation of CSS selectors.
_workspace/src/code.google.com/p/go.net/html
Package html implements an HTML5-compliant tokenizer and parser.
_workspace/src/code.google.com/p/go.net/html/atom
Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id".
_workspace/src/code.google.com/p/go.net/html/charset
Package charset provides common text encodings for HTML documents.
_workspace/src/github.com/PuerkitoBio/goquery
Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document (the modification functions of jQuery are not included).
_workspace/src/github.com/PuerkitoBio/purell
Package purell offers URL normalization as described on the wikipedia page: http://en.wikipedia.org/wiki/URL_normalization
_workspace/src/github.com/gorilla/context
Package gorilla/context stores values shared during a request lifetime.
_workspace/src/github.com/gorilla/mux
Package gorilla/mux implements a request router and dispatcher.
_workspace/src/github.com/temoto/robotstxt-go
The robots.txt Exclusion Protocol is implemented as specified in http://www.robotstxt.org/wc/robots.html with various extensions.
cmd
