Crawl

The aim of this project was to write a simple web crawler. The crawler is limited to one domain: when you start with https://benmatselby.dev/, it crawls all pages within benmatselby.dev, but does not follow external links, for example to Facebook or Twitter accounts. Given a URL, it prints a simple site map showing the links between pages.
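
To illustrate the one-domain rule, here is a minimal Go sketch of the kind of same-host check involved, using net/url. This illustrates the idea only; it is not the project's actual implementation.

package main

import (
    "fmt"
    "net/url"
)

// sameDomain reports whether link stays on the host we started crawling.
// Relative links (no host, e.g. "/post/") count as staying on the domain.
func sameDomain(start, link string) bool {
    s, err := url.Parse(start)
    if err != nil {
        return false
    }
    l, err := url.Parse(link)
    if err != nil {
        return false
    }
    return l.Host == "" || l.Host == s.Host
}

func main() {
    fmt.Println(sameDomain("https://benmatselby.dev/", "/post/"))               // true
    fmt.Println(sameDomain("https://benmatselby.dev/", "https://twitter.com/example")) // false
}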

Usage

Usage of crawl:
  -depth int
      how deep into the crawl before we stop (default 2)
  -verbose
      whether we want to output all the crawling information

crawl https://benmatselby.dev
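
Both flags documented above can be combined. The usage output appears to come from Go's standard flag package, so the flags go before the URL, for example:

crawl -depth 3 -verbose https://benmatselby.dev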

The crawler returns a JSON string as output. You can then pipe that into something like jq for nicer formatting, or to interrogate the data:

URLs crawled:

./crawl https://benmatselby.dev | jq '.pages[].URL'
"https://benmatselby.dev"
"https://benmatselby.dev/"
"https://benmatselby.dev/post/sugarcrm-deployment-process/"
"https://benmatselby.dev/post/"
"https://benmatselby.dev/post/joining-a-new-engineering-team/"
"https://benmatselby.dev/post/software-engineering-team-structure/"
"https://benmatselby.dev/post/pa11y-accessibility-ci/"
"https://benmatselby.dev/post/development-environments/"
"https://benmatselby.dev/post/pipelines/"
"https://benmatselby.dev/post/communication/"
"https://benmatselby.dev/post/feature-toggling/"
"https://benmatselby.dev/post/onboarding/"
"https://benmatselby.dev/post/technology-radar/"
"https://benmatselby.dev/post/communication-tools/"
"https://benmatselby.dev/post/squashing-commits/"
"https://benmatselby.dev/post/why-teams-important/"

Count of URLs crawled:

./crawl https://benmatselby.dev | jq '.pages | length'
16
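
The jq queries above imply an output shape roughly like the following Go types. This is an inferred sketch: any field beyond URL and pages is a guess, and the real Page struct may carry more data (such as the links between pages).

// Inferred from the jq queries above; not the project's actual types.
type Page struct {
    URL string `json:"URL"`
}

type SiteMap struct {
    Pages []Page `json:"pages"`
}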

Requirements

  • The Go runtime, if you are building from source.
  • Docker, if you would rather run the application in a container.

Installation via Git

The main way to install this application is to clone the git repo and build it. This requires you to have the Go runtime installed on your machine.

git clone git@github.com:benmatselby/crawl.git
cd crawl
make all
./crawl

Once built, you can also install the binary into your $GOPATH/bin with go install. This makes crawl globally available on your system, assuming $GOPATH/bin is on your $PATH.
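
For example, from inside the cloned repository (assuming a standard GOPATH setup, which was the default when this was published):

go install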

Installation via Docker

Whilst this is not the recommended way to run the application, as there is a slight performance overhead when running in a container, it is possible:

git clone git@github.com:benmatselby/crawl.git
cd crawl
make build-docker
docker run benmatselby/crawl https://benmatselby.dev
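
Assuming the image's entrypoint is the crawl binary itself, as the run command above suggests, the flags pass through as arguments:

docker run benmatselby/crawl -depth 3 https://benmatselby.dev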

Future

  • Define site map writers using the strategy pattern (e.g. an XML writer and a JSON writer); see the sketch after this list.
  • Add throttling to the system.
  • Observe robots.txt.
  • Provide a mechanism to highlight broken links.
  • Provide a mechanism to crawl assets such as JavaScript and CSS.
  • Cater for a site that may use relative paths such as "/", so the URL works on any domain.
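
As a sketch of the writer idea in the first bullet: a strategy interface with one implementation per format. The SiteMapWriter interface and both writers below are assumptions for illustration, not part of the current codebase; the Page and SiteMap types repeat the inferred shape from the Usage section.

package crawl

import (
    "encoding/json"
    "encoding/xml"
    "io"
)

// Page and SiteMap repeat the output shape inferred earlier.
type Page struct {
    URL string `json:"URL" xml:"url"`
}

type SiteMap struct {
    Pages []Page `json:"pages" xml:"page"`
}

// SiteMapWriter is a hypothetical strategy interface: each implementation
// renders the same site map in a different format.
type SiteMapWriter interface {
    Write(w io.Writer, site SiteMap) error
}

// JSONWriter renders the site map as JSON.
type JSONWriter struct{}

func (JSONWriter) Write(w io.Writer, site SiteMap) error {
    return json.NewEncoder(w).Encode(site)
}

// XMLWriter renders the same site map as XML.
type XMLWriter struct{}

func (XMLWriter) Write(w io.Writer, site SiteMap) error {
    return xml.NewEncoder(w).Encode(site)
}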
