checklinks

package module
Published: Apr 19, 2022 License: MIT Imports: 9 Imported by: 0

README

The checklinks utility takes a single website address and crawls that page for links (i.e., the href attributes of <a> tags). TLS issues are ignored.

Run It

$ go run cmd/checklinks.go [url]

If the URL does not start with an http:// or https:// prefix, http:// is automatically assumed.
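This defaulting can be sketched with a small helper; qualifyAddress is hypothetical and the actual implementation in cmd/checklinks.go may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// qualifyAddress illustrates the defaulting described above: addresses
// without an http:// or https:// prefix get "http://" prepended.
func qualifyAddress(addr string) string {
	if !strings.HasPrefix(addr, "http://") && !strings.HasPrefix(addr, "https://") {
		return "http://" + addr
	}
	return addr
}

func main() {
	fmt.Println(qualifyAddress("example.com"))         // http://example.com
	fmt.Println(qualifyAddress("https://example.com")) // https://example.com
}
```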

Build It, Then Run It

$ go build cmd/checklinks.go
$ ./checklinks [url]

Install It

Pick a tag (e.g. v0.0.7) and use go install to install that particular version:

$ go install github.com/patrickbucher/checklinks/cmd@v0.0.7
go: downloading github.com/patrickbucher/checklinks v0.0.7

Flags

The success or failure of each individual link is reported to the terminal. Use the flags to control the output and the request timeout:

$ ./checklinks -help
Usage of ./checklinks:
  -ignored
        report ignored links (e.g. mailto:...)
  -nofailed
        do NOT report failed links (e.g. 404)
  -success
        report succeeded links (OK)
  -timeout int
        request timeout (in seconds) (default 10)
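The same flag set can be declared with the standard flag package. The parseFlags helper below is a hypothetical sketch reproducing the defaults listed above, not the package's actual code:

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags mirrors the flags listed in the help output above,
// using Go's standard flag package.
func parseFlags(args []string) (ignored, nofailed, success bool, timeout int) {
	fs := flag.NewFlagSet("checklinks", flag.ContinueOnError)
	fs.BoolVar(&ignored, "ignored", false, "report ignored links (e.g. mailto:...)")
	fs.BoolVar(&nofailed, "nofailed", false, "do NOT report failed links (e.g. 404)")
	fs.BoolVar(&success, "success", false, "report succeeded links (OK)")
	fs.IntVar(&timeout, "timeout", 10, "request timeout (in seconds)")
	fs.Parse(args)
	return
}

func main() {
	ignored, nofailed, success, timeout := parseFlags([]string{"-success", "-timeout", "5"})
	fmt.Println(ignored, nofailed, success, timeout) // false false true 5
}
```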

TODO

  • introduce command line flags
    • user agent (optional)
    • level of parallelism (optional)
    • allow insecure SSL/TLS
  • refactor code
    • introduce Config struct for handing over the entire configuration from the command line to the crawler function
    • introduce Channels struct for handing over channels to Process functions

Documentation

Constants

const (
	// Parallelism is the max. amount of HTTP requests open at any given time.
	Parallelism = 64

	// UserAgent defines a value used for the "User-Agent" header to avoid being blocked.
	UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0"
)
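The package does not document how it enforces the Parallelism cap. A common Go pattern for such a limit is a buffered channel used as a counting semaphore; this sketch demonstrates the pattern with dummy jobs instead of real HTTP requests:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Parallelism mirrors the package constant: the max. amount of
// HTTP requests open at any given time.
const Parallelism = 64

// runThrottled pushes n dummy jobs through a counting semaphore of the
// given size and returns the peak number of jobs running concurrently.
func runThrottled(n, limit int) int {
	tokens := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tokens <- struct{}{}        // acquire a slot (blocks once limit is reached)
			defer func() { <-tokens }() // release the slot
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			time.Sleep(time.Millisecond) // a real worker would perform the HTTP request here
			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runThrottled(200, Parallelism)
	fmt.Println(peak <= Parallelism) // true: the cap is never exceeded
}
```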

Variables

This section is empty.

Functions

func CrawlPage

func CrawlPage(site *url.URL, timeout int, ok, ignore, fail bool)

CrawlPage crawls the given site's URL and reports successfully checked, ignored, and failed links (according to the flags ok, ignore, and fail, respectively). The given timeout limits how long the HTTP client waits for each request.

func ExtractTagAttribute

func ExtractTagAttribute(node *html.Node, tagName, attrName string) []string

ExtractTagAttribute traverses the given node's tree, searches it for nodes with the given tag name, and extracts the given attribute's value from each of them.

func FetchDocument

func FetchDocument(url string, c *http.Client) (*html.Node, error)

FetchDocument gets the document indicated by the given url using the given client, and returns its root (document) node. An error is returned if the document cannot be fetched or parsed as HTML.

func ProcessLeaf

func ProcessLeaf(c *http.Client, l *Link, res resSink, done doneSink, t chan struct{})

ProcessLeaf uses the given http.Client to fetch the given link using a GET request, and reports the result of that request. A message is sent to the given done channel when the node has been processed.

func ProcessNode

func ProcessNode(c *http.Client, l *Link, links linkSink, res resSink, done doneSink, t chan struct{})

ProcessNode uses the given http.Client to fetch the given link, and reports the extracted links on the page (indicated by <a href="...">). Links unsuitable for further crawling and malformed links are reported. A message is sent to the given done channel when the node has been processed.

func QualifyInternalURL

func QualifyInternalURL(page, link *url.URL) *url.URL

QualifyInternalURL creates a new URL by merging scheme and host information from the page URL with the rest of the URL indication from the link URL.

Types

type Link

type Link struct {
	URL  *url.URL
	Orig *url.URL
}

Link represents a link (URL) in the context of a web site (Site).

func NewLink

func NewLink(address string, site *url.URL) (*Link, error)

NewLink creates a Link from the given address. An error is returned if the address cannot be parsed.

func (*Link) IsCrawlable

func (l *Link) IsCrawlable() bool

IsCrawlable returns true if the URL of the link has http(s) as the protocol, or no protocol at all (which indicates an internal link), and false otherwise.

func (*Link) IsInternal

func (l *Link) IsInternal() bool

IsInternal returns true if the link's URL points to the same domain as its site, and false otherwise.

type Result

type Result struct {
	Err  error
	Link *Link
}

Result describes the result of processing a Link.

func (Result) String

func (c Result) String() string

String returns a string prefixed with FAIL in case of an error, and prefixed with OK if no error is present. The URL and the error (if any) are included in the string.

