checklinks

package module
Published: Apr 19, 2022 License: MIT Imports: 9 Imported by: 0

README

The checklinks utility takes a single website address and crawls that page for links (i.e., the href attributes of <a> tags). TLS issues are ignored.

Run It

$ go run cmd/checklinks.go [url]

If the URL does not start with an http:// or https:// prefix, http:// is automatically assumed.
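This defaulting can be sketched with a small helper; qualifyAddress is hypothetical and the actual implementation in cmd/checklinks.go may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// qualifyAddress illustrates the defaulting described above: addresses
// without an http:// or https:// prefix get "http://" prepended.
func qualifyAddress(addr string) string {
	if !strings.HasPrefix(addr, "http://") && !strings.HasPrefix(addr, "https://") {
		return "http://" + addr
	}
	return addr
}

func main() {
	fmt.Println(qualifyAddress("example.com"))         // http://example.com
	fmt.Println(qualifyAddress("https://example.com")) // https://example.com
}
```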

Build It, Then Run It

$ go build cmd/checklinks.go
$ ./checklinks [url]

Install It

Pick a tag (e.g. v0.0.7) and use go install to install that particular version:

$ go install github.com/patrickbucher/checklinks/cmd@v0.0.7
go: downloading github.com/patrickbucher/checklinks v0.0.7

Flags

The success or failure of each individual link is reported to the terminal. Use the flags to control the output and the request timeout:

$ ./checklinks -help
Usage of ./checklinks:
  -ignored
        report ignored links (e.g. mailto:...)
  -nofailed
        do NOT report failed links (e.g. 404)
  -success
        report succeeded links (OK)
  -timeout int
        request timeout (in seconds) (default 10)
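The same flag set can be declared with the standard flag package. The parseFlags helper below is a hypothetical sketch reproducing the defaults listed above, not the package's actual code:

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags mirrors the flags listed in the help output above,
// using Go's standard flag package.
func parseFlags(args []string) (ignored, nofailed, success bool, timeout int) {
	fs := flag.NewFlagSet("checklinks", flag.ContinueOnError)
	fs.BoolVar(&ignored, "ignored", false, "report ignored links (e.g. mailto:...)")
	fs.BoolVar(&nofailed, "nofailed", false, "do NOT report failed links (e.g. 404)")
	fs.BoolVar(&success, "success", false, "report succeeded links (OK)")
	fs.IntVar(&timeout, "timeout", 10, "request timeout (in seconds)")
	fs.Parse(args)
	return
}

func main() {
	ignored, nofailed, success, timeout := parseFlags([]string{"-success", "-timeout", "5"})
	fmt.Println(ignored, nofailed, success, timeout) // false false true 5
}
```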

TODO

  • introduce command line flags
    • user agent (optional)
    • level of parallelism (optional)
    • allow insecure SSL/TLS
  • refactor code
    • introduce Config struct for handing over the entire configuration from the command line to the crawler function
    • introduce Channels struct for handing over channels to Process functions

Documentation

Constants

const (
	// Parallelism is the max. amount of HTTP requests open at any given time.
	Parallelism = 64

	// UserAgent defines a value used for the "User-Agent" header to avoid being blocked.
	UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0"
)
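The package does not document how it enforces the Parallelism cap. A common Go pattern for such a limit is a buffered channel used as a counting semaphore; this sketch demonstrates the pattern with dummy jobs instead of real HTTP requests:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Parallelism mirrors the package constant: the max. amount of
// HTTP requests open at any given time.
const Parallelism = 64

// runThrottled pushes n dummy jobs through a counting semaphore of the
// given size and returns the peak number of jobs running concurrently.
func runThrottled(n, limit int) int {
	tokens := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tokens <- struct{}{}        // acquire a slot (blocks once limit is reached)
			defer func() { <-tokens }() // release the slot
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			time.Sleep(time.Millisecond) // a real worker would perform the HTTP request here
			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runThrottled(200, Parallelism)
	fmt.Println(peak <= Parallelism) // true: the cap is never exceeded
}
```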

Variables

This section is empty.

Functions

func CrawlPage

func CrawlPage(site *url.URL, timeout int, ok, ignore, fail bool)

CrawlPage crawls the given site's URL and reports successfully checked, ignored, and failed links (according to the flags ok, ignore, and fail, respectively). The given timeout limits how long the HTTP client waits for each request.

func ExtractTagAttribute

func ExtractTagAttribute(node *html.Node, tagName, attrName string) []string

ExtractTagAttribute traverses the given node's tree, searches it for nodes with the given tag name, and extracts the given attribute's value from each of them.

func FetchDocument

func FetchDocument(url string, c *http.Client) (*html.Node, error)

FetchDocument gets the document indicated by the given url using the given client, and returns its root (document) node. An error is returned if the document cannot be fetched or parsed as HTML.

func ProcessLeaf

func ProcessLeaf(c *http.Client, l *Link, res resSink, done doneSink, t chan struct{})

ProcessLeaf uses the given http.Client to fetch the given link using a GET request, and reports the result of that request. A message is sent to the given done channel when the node has been processed.

func ProcessNode

func ProcessNode(c *http.Client, l *Link, links linkSink, res resSink, done doneSink, t chan struct{})

ProcessNode uses the given http.Client to fetch the given link, and reports the extracted links on the page (indicated by <a href="...">). Links unsuitable for further crawling and malformed links are reported. A message is sent to the given done channel when the node has been processed.

func QualifyInternalURL

func QualifyInternalURL(page, link *url.URL) *url.URL

QualifyInternalURL creates a new URL by merging scheme and host information from the page URL with the rest of the URL indication from the link URL.

Types

type Link

type Link struct {
	URL  *url.URL
	Orig *url.URL
}

Link represents a link (URL) in the context of a web site (Site).

func NewLink

func NewLink(address string, site *url.URL) (*Link, error)

NewLink creates a Link from the given address. An error is returned if the address cannot be parsed.

func (*Link) IsCrawlable

func (l *Link) IsCrawlable() bool

IsCrawlable returns true if the URL of the link has http(s) as the protocol, or no protocol at all (which indicates an internal link), and false otherwise.

func (*Link) IsInternal

func (l *Link) IsInternal() bool

IsInternal returns true if the link's URL points to the same domain as its site, and false otherwise.

type Result

type Result struct {
	Err  error
	Link *Link
}

Result describes the result of processing a Link.

func (Result) String

func (c Result) String() string

String returns a string prefixed with FAIL in case of an error, and prefixed with OK if no error is present. The URL and the error (if any) are included in the string.

