
= Exercise 5.13
// Refs:
:url-base: https://github.com/fenegroni/TGPL-exercise-solutions
:workflow: workflows/Exercise%205.13
:action: actions/workflows/ch5ex13.yml
:url-workflow: {url-base}/{workflow}
:url-action: {url-base}/{action}
:badge-exercise: image:{url-workflow}/badge.svg?branch=main[link={url-action}]

{badge-exercise}

Modify `crawl` to make local copies of the pages it finds,
creating directories as necessary.
Don't make copies of pages that come from a different domain.
For example, if the original page comes from `golang.org`,
save all files from there,
but exclude ones from `vimeo.com`.
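
A minimal sketch of that same-domain save logic. The `save` helper,
the `host/path` layout on disk, and the `index.html` fallback are
assumptions for illustration, not necessarily this package's approach;
`DownloadDir` is the package variable documented below.

[source,go]
----
package ch5ex13

import (
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

// save fetches address and stores the body under DownloadDir,
// but only when the link's host matches the original page's host.
func save(origHost, address string) error {
	u, err := url.Parse(address)
	if err != nil {
		return err
	}
	if u.Host != origHost {
		return nil // different domain: save nothing
	}
	resp, err := http.Get(address)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Mirror the URL structure on disk, e.g. golang.org/doc/index.html,
	// creating directories as necessary.
	local := filepath.Join(DownloadDir, u.Host, filepath.FromSlash(u.Path))
	if filepath.Ext(local) == "" {
		local = filepath.Join(local, "index.html")
	}
	if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
		return err
	}
	f, err := os.Create(local)
	if err != nil {
		return err
	}
	_, err = io.Copy(f, resp.Body)
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	return err
}
----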

== Test

We will use package `net/httptest` to set up
an in-process test server and provide the
`links` package under test with its HTTP client.

The first test checks the logic of saving only
those files that come from the same domain.
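
A sketch of such a test. The import path, the wiring through
`DownloadDir`, and the on-disk layout asserted at the end are
assumptions for illustration:

[source,go]
----
package ch5ex13_test

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
	"path/filepath"
	"testing"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func TestCrawlSavesSameDomainPage(t *testing.T) {
	// An in-process server standing in for the original domain.
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, `<a href="/a.html">a</a>`)
		}))
	defer srv.Close()

	ch5ex13.DownloadDir = t.TempDir()
	ch5ex13.Crawl(srv.URL) // should save the page it visits

	u, err := url.Parse(srv.URL)
	if err != nil {
		t.Fatal(err)
	}
	// Assumed layout: DownloadDir/host/..., as in the save sketch above.
	saved := filepath.Join(ch5ex13.DownloadDir, u.Host, "index.html")
	if _, err := os.Stat(saved); err != nil {
		t.Errorf("page was not saved locally: %v", err)
	}
}
----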

== Further tests

We must write more tests:

. Validate that the saved file content is as expected.
. Handle downloads of executable content.
. Handle relative paths and empty or malformed URLs.

== Improvements

The first test leaves files behind that need to be deleted.
A function to delete all files in a known directory is trivial
to implement, but it requires changing the save logic to write
into a known temporary folder, and it still would not catch bugs
that cause the program to save files elsewhere.
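
Go's testing package already covers the known-temporary-folder part:
`t.TempDir` creates a per-test directory and removes it automatically
when the test finishes. A minimal sketch, with the import path assumed:

[source,go]
----
package ch5ex13_test

import (
	"testing"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func TestCrawlLeavesNothingBehind(t *testing.T) {
	// The directory is deleted automatically after the test,
	// so no explicit cleanup function is needed.
	ch5ex13.DownloadDir = t.TempDir()
	// ... run the crawl and assert on the files it creates ...
}
----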

A much better approach is to isolate the file system logic.
We could use Afero, a filesystem abstraction library for Go,
to observe exactly which file operations the web crawler performs.

This would allow a further improvement: with an in-memory
filesystem we could check that files are written with the
expected content, without having to read them back from disk
in the test to validate them, or delete them afterwards.
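
A sketch of that approach using spf13/afero's in-memory backend.
The package-level `Fs` seam that the crawler would write through,
and the path and content asserted here, are assumptions about the
refactoring, not part of the current package:

[source,go]
----
package ch5ex13_test

import (
	"testing"

	"github.com/spf13/afero"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func TestSavedFileContent(t *testing.T) {
	// An in-memory filesystem: nothing touches the disk,
	// so there is nothing to clean up afterwards.
	fs := afero.NewMemMapFs()
	ch5ex13.Fs = fs // assumed seam: the crawler writes through this

	// ... run the crawler against an httptest server here ...

	got, err := afero.ReadFile(fs, "/example.com/index.html")
	if err != nil {
		t.Fatal(err)
	}
	if want := "<p>expected page</p>"; string(got) != want {
		t.Errorf("saved content = %q, want %q", got, want)
	}
}
----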

== Documentation

=== Variables

[source,go]
----
var DownloadDir string
----
=== Functions

==== func BreadthFirst

[source,go]
----
func BreadthFirst(f func(string) []string, worklist []string)
----

BreadthFirst calls f for each item in the worklist. Any items returned by f are added to the worklist. f is called at most once for each item.
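
A typical driver, as in the book's chapter 5 crawler; the import
path is assumed:

[source,go]
----
package main

import (
	"os"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func main() {
	// Crawl the web breadth-first, starting
	// from the command-line arguments.
	ch5ex13.BreadthFirst(ch5ex13.Crawl, os.Args[1:])
}
----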

==== func Crawl

[source,go]
----
func Crawl(address string) []string
----

Crawl wraps calls to Extract, logging any errors, allowing BreadthFirst to continue crawling through all the hyperlinks.
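
A sketch of that wrapping, following the book's version; the
copy-saving step required by this exercise is left as a comment:

[source,go]
----
package ch5ex13

import "log"

func Crawl(address string) []string {
	// Saving the local copy (Exercise 5.13) would happen here.
	list, err := Extract(address)
	if err != nil {
		log.Print(err) // log and carry on so BreadthFirst keeps going
	}
	return list
}
----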

==== func Extract

[source,go]
----
func Extract(address string) ([]string, error)
----

Extract makes an HTTP GET request to the specified URL address, parses the response as HTML, and returns the links found in the HTML document.
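
A sketch of such an extractor, closely following the book's
gopl.io/ch5/links package:

[source,go]
----
package ch5ex13

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// Extract fetches address, parses it as HTML, and returns the
// href targets of its <a> elements, resolved to absolute URLs.
func Extract(address string) ([]string, error) {
	resp, err := http.Get(address)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("getting %s: %s", address, resp.Status)
	}
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("parsing %s as HTML: %v", address, err)
	}
	var links []string
	var visit func(n *html.Node)
	visit = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key != "href" {
					continue
				}
				link, err := resp.Request.URL.Parse(a.Val)
				if err != nil {
					continue // ignore malformed URLs
				}
				links = append(links, link.String())
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			visit(c)
		}
	}
	visit(doc)
	return links, nil
}
----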
