
= Exercise 5.13
// Refs:
:url-base: https://github.com/fenegroni/TGPL-exercise-solutions
:workflow: workflows/Exercise%205.13
:action: actions/workflows/ch5ex13.yml
:url-workflow: {url-base}/{workflow}
:url-action: {url-base}/{action}
:badge-exercise: image:{url-workflow}/badge.svg?branch=main[link={url-action}]

{badge-exercise}

Modify `crawl` to make local copies of the pages it finds,
creating directories as necessary.
Don't make copies of pages that come from a different domain.
For example, if the original page comes from `golang.org`,
save all files from there,
but exclude ones from `vimeo.com`.
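
A minimal sketch of that same-domain save logic. The `save` helper,
the `host/path` layout on disk, and the `index.html` fallback are
assumptions for illustration, not necessarily this package's approach;
`DownloadDir` is the package variable documented below.

[source,go]
----
package ch5ex13

import (
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

// save fetches address and stores the body under DownloadDir,
// but only when the link's host matches the original page's host.
func save(origHost, address string) error {
	u, err := url.Parse(address)
	if err != nil {
		return err
	}
	if u.Host != origHost {
		return nil // different domain: save nothing
	}
	resp, err := http.Get(address)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Mirror the URL structure on disk, e.g. golang.org/doc/index.html,
	// creating directories as necessary.
	local := filepath.Join(DownloadDir, u.Host, filepath.FromSlash(u.Path))
	if filepath.Ext(local) == "" {
		local = filepath.Join(local, "index.html")
	}
	if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
		return err
	}
	f, err := os.Create(local)
	if err != nil {
		return err
	}
	_, err = io.Copy(f, resp.Body)
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	return err
}
----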

== Test

We will use package `net/httptest` to set up
an in-process test server and provide the
`links` package under test with its HTTP client.

The first test checks the logic of saving only
those files that come from the same domain.
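
A sketch of such a test. The import path, the wiring through
`DownloadDir`, and the on-disk layout asserted at the end are
assumptions for illustration:

[source,go]
----
package ch5ex13_test

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
	"path/filepath"
	"testing"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func TestCrawlSavesSameDomainPage(t *testing.T) {
	// An in-process server standing in for the original domain.
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, `<a href="/a.html">a</a>`)
		}))
	defer srv.Close()

	ch5ex13.DownloadDir = t.TempDir()
	ch5ex13.Crawl(srv.URL) // should save the page it visits

	u, err := url.Parse(srv.URL)
	if err != nil {
		t.Fatal(err)
	}
	// Assumed layout: DownloadDir/host/..., as in the save sketch above.
	saved := filepath.Join(ch5ex13.DownloadDir, u.Host, "index.html")
	if _, err := os.Stat(saved); err != nil {
		t.Errorf("page was not saved locally: %v", err)
	}
}
----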

== Further tests

We must write more tests:

. Validate that the saved file content is as expected.
. Handle downloads of executable content.
. Handle relative paths and empty or malformed URLs.

== Improvements

The first test leaves files behind that need to be deleted.
A function to delete all files in a known directory is trivial
to implement, but it requires changing the save logic to write
into a known temporary folder, and it still would not catch bugs
that cause the program to save files elsewhere.
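
Go's testing package already covers the known-temporary-folder part:
`t.TempDir` creates a per-test directory and removes it automatically
when the test finishes. A minimal sketch, with the import path assumed:

[source,go]
----
package ch5ex13_test

import (
	"testing"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func TestCrawlLeavesNothingBehind(t *testing.T) {
	// The directory is deleted automatically after the test,
	// so no explicit cleanup function is needed.
	ch5ex13.DownloadDir = t.TempDir()
	// ... run the crawl and assert on the files it creates ...
}
----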

A much better approach is to isolate the file system logic.
We could use Afero, a filesystem abstraction library for Go,
to observe exactly which file operations the web crawler performs.

This would allow a further improvement: with an in-memory
filesystem we could check that files are written with the
expected content, without having to read them back from disk
in the test to validate them, or delete them afterwards.
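
A sketch of that approach using spf13/afero's in-memory backend.
The package-level `Fs` seam that the crawler would write through,
and the path and content asserted here, are assumptions about the
refactoring, not part of the current package:

[source,go]
----
package ch5ex13_test

import (
	"testing"

	"github.com/spf13/afero"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func TestSavedFileContent(t *testing.T) {
	// An in-memory filesystem: nothing touches the disk,
	// so there is nothing to clean up afterwards.
	fs := afero.NewMemMapFs()
	ch5ex13.Fs = fs // assumed seam: the crawler writes through this

	// ... run the crawler against an httptest server here ...

	got, err := afero.ReadFile(fs, "/example.com/index.html")
	if err != nil {
		t.Fatal(err)
	}
	if want := "<p>expected page</p>"; string(got) != want {
		t.Errorf("saved content = %q, want %q", got, want)
	}
}
----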

== Documentation

=== Variables

[source,go]
----
var DownloadDir string
----
=== Functions

==== func BreadthFirst

[source,go]
----
func BreadthFirst(f func(string) []string, worklist []string)
----

BreadthFirst calls f for each item in the worklist. Any items returned by f are added to the worklist. f is called at most once for each item.
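
A typical driver, as in the book's chapter 5 crawler; the import
path is assumed:

[source,go]
----
package main

import (
	"os"

	ch5ex13 "github.com/fenegroni/TGPL-exercise-solutions/ch5ex13"
)

func main() {
	// Crawl the web breadth-first, starting
	// from the command-line arguments.
	ch5ex13.BreadthFirst(ch5ex13.Crawl, os.Args[1:])
}
----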

==== func Crawl

[source,go]
----
func Crawl(address string) []string
----

Crawl wraps calls to Extract, logging any errors, allowing BreadthFirst to continue crawling through all the hyperlinks.
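
A sketch of that wrapping, following the book's version; the
copy-saving step required by this exercise is left as a comment:

[source,go]
----
package ch5ex13

import "log"

func Crawl(address string) []string {
	// Saving the local copy (Exercise 5.13) would happen here.
	list, err := Extract(address)
	if err != nil {
		log.Print(err) // log and carry on so BreadthFirst keeps going
	}
	return list
}
----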

==== func Extract

[source,go]
----
func Extract(address string) ([]string, error)
----

Extract makes an HTTP GET request to the specified URL address, parses the response as HTML, and returns the links found in the HTML document.
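
A sketch of such an extractor, closely following the book's
gopl.io/ch5/links package:

[source,go]
----
package ch5ex13

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// Extract fetches address, parses it as HTML, and returns the
// href targets of its <a> elements, resolved to absolute URLs.
func Extract(address string) ([]string, error) {
	resp, err := http.Get(address)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("getting %s: %s", address, resp.Status)
	}
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("parsing %s as HTML: %v", address, err)
	}
	var links []string
	var visit func(n *html.Node)
	visit = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key != "href" {
					continue
				}
				link, err := resp.Request.URL.Parse(a.Val)
				if err != nil {
					continue // ignore malformed URLs
				}
				links = append(links, link.String())
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			visit(c)
		}
	}
	visit(doc)
	return links, nil
}
----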
