crawler

command module
v0.0.0-...-76d871f Latest
Published: Nov 24, 2014 License: MIT Imports: 7 Imported by: 0

Crawler

A simple web crawler

Setup

You will need Go 1.3.

  1. Set up a workspace as per: http://golang.org/doc/code.html#Workspaces.
  2. Download the code into the workspace
  3. Get the required libraries
  4. Test and build
  5. Run
# build go workspace
$ cd ~
$ mkdir go
$ export GOPATH=$HOME/go

# download code
$ mkdir -p go/src/github.com/jkamenik
$ cd go/src/github.com/jkamenik
$ git clone http://github.com/jkamenik/crawler

# get the libraries
$ cd ~/go/src/github.com/jkamenik/crawler
$ go get .

# test and build
$ go test .
$ go build

# run
$ ~/go/bin/crawler <args>

Challenge

The goal is to provide a tool that takes a URL as its single command-line argument, crawls it, and reports the content found at that URL.

The following requirements apply to this challenge:

  1. The tool must download the HTML
  2. The tool must parse and print all the links found in that HTML
  3. The tool must allow for an optional depth argument (default 2) which will control how many pages it will crawl for links.
  4. The output should be the link's text followed by the link's URL (see below).
  5. A reasonable exit code needs to be provided if the main URL is not accessible; 2nd level URL errors can be ignored.
$ crawler http://somedomain.com
Home -> /
About Us -> /about_us.php
Careers -> http://otherdomain.com/somedomain.com
  Home -> http://somedomain.com
  Careers -> /somedomain.com
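
The depth argument and the output format above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this repository: the `-depth` flag name, the `Link` type, the helper names, and the exit code `2` are all hypothetical, and the downloading and HTML parsing steps are replaced by sample data.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"strings"
)

// Link is a hypothetical crawl result: the anchor text, the href as found
// in the page, and the links found on the page it points to.
type Link struct {
	Text     string
	URL      string
	Children []Link
}

// parseArgs reads an optional -depth flag (default 2, name assumed) and
// requires exactly one positional argument, the URL to crawl.
func parseArgs(args []string) (target string, depth int, err error) {
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the flag package from printing usage itself
	fs.IntVar(&depth, "depth", 2, "how many levels of pages to crawl")
	if err = fs.Parse(args); err != nil {
		return "", 0, err
	}
	if fs.NArg() != 1 {
		return "", 0, fmt.Errorf("expected exactly one URL, got %d arguments", fs.NArg())
	}
	return fs.Arg(0), depth, nil
}

// formatLinks renders one "text -> url" line per link, indenting each level
// by two spaces and descending at most depth levels.
func formatLinks(links []Link, depth int) string {
	var b strings.Builder
	writeLinks(&b, links, depth, 0)
	return b.String()
}

func writeLinks(b *strings.Builder, links []Link, depth, level int) {
	if level >= depth {
		return
	}
	for _, l := range links {
		fmt.Fprintf(b, "%s%s -> %s\n", strings.Repeat("  ", level), l.Text, l.URL)
		writeLinks(b, l.Children, depth, level+1)
	}
}

func main() {
	// Fixed arguments keep the demo self-contained; a real tool would pass os.Args[1:].
	target, depth, err := parseArgs([]string{"-depth", "2", "http://somedomain.com"})
	if err != nil {
		fmt.Fprintln(os.Stderr, "crawler:", err)
		os.Exit(2) // hypothetical exit code for bad usage
	}
	fmt.Printf("$ crawler %s\n", target)

	// Sample data standing in for links that a real crawl would collect.
	sample := []Link{
		{Text: "Home", URL: "/"},
		{Text: "About Us", URL: "/about_us.php"},
		{Text: "Careers", URL: "http://otherdomain.com/somedomain.com", Children: []Link{
			{Text: "Home", URL: "http://somedomain.com"},
			{Text: "Careers", URL: "/somedomain.com"},
		}},
	}
	fmt.Print(formatLinks(sample, depth))
}
```

With a depth of 1 the same data would print only the three top-level lines, since `writeLinks` stops before descending into `Children`.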

Extra credit (optional)

  • Parallelize the downloading, parsing, and collecting of links
  • Follow redirects of any page
  • Add debugging which is off by default and can be enabled with "-v"
    • Control the level of debugging by repeating "-v" (i.e., "-vvvv")
  • Save the HTML in a folder matching the link title
  • Save any resources used by the page: CSS, JS, and images.
    • Rewrite the links and references in the HTML to be relative file paths
  • Enable JavaScript, using Selenium WebDriver or similar
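
One possible way to handle the repeated "-v" debugging switch is sketched below. Go's standard `flag` package does not cluster short flags such as "-vvvv", so this sketch counts them by hand before any other argument parsing. The function names `verbosity` and `debugf` are hypothetical and not taken from this repository.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// verbosity scans args for flags made up only of the letter v ("-v", "-vvvv"),
// returns the total number of v's, and passes every other argument through.
func verbosity(args []string) (level int, rest []string) {
	for _, a := range args {
		if len(a) > 1 && a[0] == '-' && strings.Trim(a[1:], "v") == "" {
			level += len(a) - 1
			continue
		}
		rest = append(rest, a)
	}
	return level, rest
}

// debugf writes to stderr only when the configured level reaches threshold,
// so debugging stays off unless at least one "-v" was given.
func debugf(level, threshold int, format string, a ...interface{}) {
	if level >= threshold {
		fmt.Fprintf(os.Stderr, format+"\n", a...)
	}
}

func main() {
	level, rest := verbosity(os.Args[1:])
	debugf(level, 1, "verbosity=%d remaining args=%v", level, rest)
	debugf(level, 4, "very detailed output is enabled")
}
```

The remaining arguments can then be handed to whatever parser handles the URL and depth, which keeps the verbosity handling independent of the rest of the command line.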
