crawler

package module
v0.0.0-...-8c614bb
Published: May 12, 2014 License: MIT Imports: 11 Imported by: 0

README

Crawler


A simple domain crawler.

  • Respects robots.txt.
  • Doesn't leave the domain it's given.
  • Doesn't visit sub-domains.

Crawl things!

Install the crawler:

go get github.com/aybabtme/crawler/cmd/crawl

Use it:

crawl -h http://antoine.im -f antoineim_map.json

It should print something like:

2014/05/11 02:28:04 starting crawl on http://antoine.im
2014/05/11 02:28:05 [crawler] root has 1 elements
2014/05/11 02:28:05 [crawler] fringe=10 found=12 (new=10, rejected=2) source="http://antoine.im"
2014/05/11 02:28:06 [crawler] fringe=10 found=27 (new=1, rejected=20) source="http://antoine.im/posts/someone_was_right_on_the_internet"
...
2014/05/11 02:28:07 [crawler] fringe=0  found=0 (new=0, rejected=0) source="http://antoine.im/assets/data/to_buffer_or_not_to_buffer/t1_micro_bench_1.0MB.svg"
2014/05/11 02:28:07 [crawler] done crawling, 15 resources, 45 links
2014/05/11 02:28:07 preparing sitemap
2014/05/11 02:28:07 saving to "antoineim_map.json"
2014/05/11 02:28:07 done in 3.006155429s

You can then use the output file, for instance to count how many resources came back with a 404 (requires jq):

jq < mysite.com.json '.resources | map(select(.status_code == 404)) | length'

Or find out which pages link to those 404s:

jq < mysite.com.json '.resources | map(select(.status_code == 404)) | [.[].refered_by[]] | unique'

Use the lib!

If you want to use the library, install it:

go get github.com/aybabtme/crawler

The godocs are on godoc (lol).
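To get a feel for the API documented below, here is a minimal sketch of what a program using the library might look like. The domain and user agent strings are placeholders; swap in your own.

package main

import (
	"log"
	"net/url"

	"github.com/aybabtme/crawler"
)

func main() {
	// The domain to crawl; the crawler won't leave it.
	domain, err := url.Parse("http://antoine.im")
	if err != nil {
		log.Fatal(err)
	}

	// NewCrawler takes the domain and a user agent string.
	c, err := crawler.NewCrawler(domain, "examplebot/0.1")
	if err != nil {
		log.Fatal(err)
	}

	// Crawl blocks until the domain has been visited, then
	// returns the resulting ResourceGraph.
	graph, err := c.Crawl()
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("%d resources, %d links", graph.ResourceCount(), graph.LinkCount())
}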

Test it!

go get -t github.com/aybabtme/crawler
make test

To view the coverage report:

make cover

The output

The output of a crawl is a list of resources, along with:

  • What they refer to (resources they point to).
  • What they are referred by (resources that point to them).
  • The status code received when fetching the resource.

The status code is interesting: it might show that you have dead links (404), for instance.

Here's a snippet of crawling my blog. The full map can be found here if you want to see it.

{
    "resource_count": 15,
    "link_count": 45,
    "resources": [
        {
            "url": "http://antoine.im/posts/dynamic_programming_for_the_lazy",
            "refered_by": [
                "http://antoine.im"
            ],
            "refers_to": [
                "http://antoine.im",
                "http://antoine.im/assets/css/brog.css",
                "http://antoine.im/assets/css/font-awesome.min.css",
                "http://antoine.im/assets/css/styles/github.css",
                "http://antoine.im/assets/js/algo_convenience_hacks.js",
                "http://antoine.im/assets/js/brog.js"
            ],
            "status_code": 200
        },
        {
            "url": "http://antoine.im/assets/css/brog.css",
            "refered_by": [
                "http://antoine.im",
                "http://antoine.im/posts/someone_was_right_on_the_internet",
                "http://antoine.im/posts/someone_is_wrong_on_the_internet",
                "http://antoine.im/posts/the_story_of_select_and_the_goroutines",
                "http://antoine.im/posts/dynamic_programming_for_the_lazy",
                "http://antoine.im/posts/to_buffer_or_not_to_buffer",
                "http://antoine.im/posts/correction_hacks"
            ],
            "refers_to": [],
            "status_code": 200
        },
        // ...
    ]
}
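If jq isn't your thing, the same file can be read from Go. This is only a sketch: the struct fields simply mirror the keys shown in the snippet above.

package main

import (
	"encoding/json"
	"log"
	"os"
)

// resource and sitemap mirror the JSON keys shown above.
type resource struct {
	URL        string   `json:"url"`
	ReferedBy  []string `json:"refered_by"`
	RefersTo   []string `json:"refers_to"`
	StatusCode int      `json:"status_code"`
}

type sitemap struct {
	ResourceCount int        `json:"resource_count"`
	LinkCount     int        `json:"link_count"`
	Resources     []resource `json:"resources"`
}

func main() {
	f, err := os.Open("antoineim_map.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var m sitemap
	if err := json.NewDecoder(f).Decode(&m); err != nil {
		log.Fatal(err)
	}

	// Count resources that came back as 404, like the jq one-liner above.
	dead := 0
	for _, r := range m.Resources {
		if r.StatusCode == 404 {
			dead++
		}
	}
	log.Printf("%d of %d resources returned a 404", dead, m.ResourceCount)
}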

Documentation


Types

type Crawler

type Crawler interface {
	Crawl() (ResourceGraph, error)
}

Crawler produces a ResourceGraph from within a single domain.

func NewCrawler

func NewCrawler(domain *url.URL, agent string) (Crawler, error)

NewCrawler creates a single-threaded Crawler that respects robots.txt. It starts crawling from domain, plus any links listed in the sitemap that robots.txt reports.

Crawler will write to standard log as it progresses.

type ResourceGraph

type ResourceGraph interface {
	// Contains tells whether a URL is reachable on the website.
	Contains(string) bool
	// ResourceCount is the number of resources in the graph.
	ResourceCount() int
	// LinkCount is the number of links interconnecting the resources.
	LinkCount() int
	// Walk walks over the graph for as long as the func returns true.
	Walk(func(link string, status int, refersTo, referedBy []string) bool)
	// A ResourceGraph can be marshalled to JSON.
	json.Marshaler
}

ResourceGraph represents a website as a directed graph where resources are connected by URLs.
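For illustration, a sketch of how Walk could be used to collect dead links and the pages that refer to them; graph is assumed to be a ResourceGraph returned by Crawl:

// Collect every resource that returned a 404, keyed by URL,
// along with the pages that refer to it.
broken := map[string][]string{}
graph.Walk(func(link string, status int, refersTo, referedBy []string) bool {
	if status == 404 {
		broken[link] = referedBy
	}
	return true // keep walking the whole graph
})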

Directories

Path Synopsis
Godeps
_workspace/src/code.google.com/p/cascadia
The cascadia package is an implementation of CSS selectors.
_workspace/src/code.google.com/p/go.net/html
Package html implements an HTML5-compliant tokenizer and parser.
_workspace/src/code.google.com/p/go.net/html/atom
Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id".
_workspace/src/code.google.com/p/go.net/html/charset
Package charset provides common text encodings for HTML documents.
_workspace/src/github.com/PuerkitoBio/goquery
Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document (the modification functions of jQuery are not included).
_workspace/src/github.com/PuerkitoBio/purell
Package purell offers URL normalization as described on the wikipedia page: http://en.wikipedia.org/wiki/URL_normalization
_workspace/src/github.com/gorilla/context
Package gorilla/context stores values shared during a request lifetime.
_workspace/src/github.com/gorilla/mux
Package gorilla/mux implements a request router and dispatcher.
_workspace/src/github.com/temoto/robotstxt-go
The robots.txt Exclusion Protocol is implemented as specified in http://www.robotstxt.org/wc/robots.html with various extensions.
cmd
