httpsyet package

v0.2.2

Published: Jun 30, 2018 License: MIT Imports: 10 Imported by: 0

README

v1 - The separation of concerns

A refactoring

Inspired by Jorin's "qvl.io/httpsyet/httpsyet", which I stumbled upon (via GoLangWeekly).

So, as a "real life" example, the original was refactored with a focus solely on the concurrency aspects; intentionally and respectfully, all code related to the actual crawl functionality was left as untouched as possible.

Please feel free to compare the refactored crawler.go with the untouched original crawler.go.ori.304 (304 LoC), or see crawler.go.mini.224, where the parts that became obsolete are removed entirely (and not just commented out).

tl;dr

Using functions generated from pipe/s (see genny.go below), the concurrent network that processes the site traffic becomes:

	sites, seen := siteForkSeenAttr(c.sites, site.Attr)
	for _, inp := range siteStrew(sites, size) {
		siteDoneFunc(inp, c.crawl) // sites leave inside crawler's crawl
	}
	siteDoneLeave(seen, c) // seen leave without further processing

Simple, is it not? ;-)

Please note: unlike the original, no sync.WaitGroup is needed around the parallel processes, and the Done...'s results may be safely discarded. (The traffic congestion is monitored another way.)
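
For orientation, here is a minimal, self-contained sketch (assumed names and signatures, not the generated code) of what a Done...-style helper does: it drains its input, applies a function to every item, and closes a signal channel when finished. That is why its result may be safely discarded once completion is tracked elsewhere.

	// Placeholder for the crawler's site type (the real one lives in crawler.go).
	type site struct{ URL string }

	// siteDoneFuncSketch drains inp, applies act to every item, and closes the
	// returned signal channel once inp is exhausted. Callers that track
	// completion elsewhere - as this refactoring does - may discard the result.
	func siteDoneFuncSketch(inp <-chan site, act func(site)) <-chan struct{} {
		done := make(chan struct{})
		go func() {
			defer close(done)
			for s := range inp {
				act(s)
			}
		}()
		return done
	}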

So, how to get there?

Overview

The original Crawler "is used as configuration for Run." only, and this limited, focused purpose deserves respect.

crawling

In order to give a home to the data structures relevant during crawling, a new type crawling struct (in the new crawling.go - see below) represents a crawling Crawler.

Crawler (the config) and traffic are embedded anonymously; thus crawling inherits their respective methods.
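
A minimal sketch of that shape (the field names and element types of traffic are assumptions here; the real definitions live in crawling.go, and site is the type already used in crawler.go):

	import "sync"

	// traffic holds what flows during a crawl (field types assumed for this sketch).
	type traffic struct {
		sites          chan site   // urls travelling through the circular net
		results        chan string // findings, written to Crawler.Out
		sync.WaitGroup             // counts the urls still in flight
	}

	// crawling represents a Crawler while it crawls: the configuration and the
	// traffic are embedded anonymously, so crawling inherits their methods.
	type crawling struct {
		Crawler // the configuration
		traffic // the channels and the in-flight counter
	}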

Note: The original implementation uses four hand-made channels and very cleverly orchestrates their handling. Too clever, maybe.

Two channels become obsolete:

  • queue becomes obsolete, as feed-back is sent directly into c.sites.
  • wait becomes a *sync.WaitGroup (to keep track of the traffic inside the circular net).

The remaining two channels, sites and results, got a new home in crawling (a sketch of the counting discipline follows the list):

  • crawling.add registers entering urls (synchronously, and in parallel!)
  • crawling.goWaitAndClose patiently awaits its Wait()
  • crawling.crawl decrements every crawled site
  • sitePipeLeave decrements the "I've seen your url before"-sites
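
The counting discipline, sketched with assumed method bodies (the real ones differ): every url entering the net counts up, every url leaving it - crawled or skipped as already seen - counts down again.

	// Sketch only - not the real method bodies.
	func (c *crawling) add(s site) {
		c.Add(1)     // traffic embeds a sync.WaitGroup: one more url in flight
		c.sites <- s // the url enters the circular net
	}

	func (c *crawling) crawl(s site) {
		defer c.Done() // the url has been crawled and leaves the net
		// ... the untouched crawl logic, now a method, goes here ...
	}

	// Urls seen before leave without being crawled: siteDoneLeave(seen, c) /
	// sitePipeLeave count them down in the same way.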

Some remarks regarding source files follow:

crawling.go

  • defines type crawling to represent a crawling Crawler.

  • Crawler.crawling instantiates a new crawling and calls its crawling.crawling (please forgive the pun), which

    • builds the process network (see above)
    • feeds the initial urls (using the original func queueURLs)
    • launches the closer (which simply does a crawling.Wait() before it closes the channels owned by crawling) - see the sketch below
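
A sketch of that closer (body assumed):

	// goWaitAndClose blocks until no url is in flight any more, then closes the
	// channels owned by crawling, so downstream range loops terminate.
	func (c *crawling) goWaitAndClose() {
		go func() {
			c.Wait()         // traffic's WaitGroup: all urls have left the net
			close(c.sites)   // no further feed-back possible
			close(c.results) // reporting is finished
		}()
	}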

crawler_test.go

As we feed sites back into the crawling in parallel (which did not happen originally, due to the use of the channel queue), the visited map needs to become a guarded map (defined at the end of the source file) in order to pass the tests.
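
A minimal sketch of such a guarded map, assuming string keys as in the original test (names are illustrative; the real definition sits at the end of crawler_test.go):

	import "sync"

	// visited is a map guarded by a mutex, so parallel feed-back cannot race.
	type visited struct {
		sync.Mutex
		m map[string]bool
	}

	func (v *visited) set(key string) {
		v.Lock()
		defer v.Unlock()
		if v.m == nil {
			v.m = make(map[string]bool)
		}
		v.m[key] = true
	}

	func (v *visited) has(key string) bool {
		v.Lock()
		defer v.Unlock()
		return v.m[key]
	}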

Feel free to compare with crawler_test.go.ori.

genny.go

Just contains the go:generate directives for genny to generate what we need from the generic pipe/s function library.
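
Illustrative only - the template file names and the placeholder type below are assumptions, not the actual directives; the general form of a genny directive is genny -in=<template> -out=<generated file> gen "<Placeholder>=<Type>":

	//go:generate genny -in=pipe_template.go  -out=site_pipe.go  gen "Type=site"
	//go:generate genny -in=fork_template.go  -out=site_fork.go  gen "Type=site"
	//go:generate genny -in=strew_template.go -out=site_strew.go gen "Type=site"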

Changes to crawler.go

  • func (c Crawler) Run() error

    • typo corrected: "Run the cralwer." => "Run the crawler."
    • ca 30 LoC after initial validation removed,
    • finish with <-c.crawling(urls) instead - wait for crawling to finish (see the sketch after this list)
  • func makeQueue()

    • completely removed - no need
    • ca 35 LoC
  • func (c Crawler) worker

    • remove the for s := range sites loop
    • the loop body becomes a method of crawling: func (c *crawling) crawlSite(s site) (urls []*url.URL)
    • send results (now typed) into c.results (instead of old argument results)
  • func queueURLs

    • is now launched in func (c *crawling) add
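
A sketch of the resulting shape of Run (the kept validation is condensed into a hypothetical parseURLs helper here; in crawler.go it stays inline, exactly as in the original):

	import "net/url"

	func (c Crawler) Run() error {
		urls, err := parseURLs(c.Sites) // hypothetical stand-in for the kept validation
		if err != nil {
			return err
		}
		<-c.crawling(urls) // build the net, feed the initial urls, wait for crawling to finish
		return nil
	}

	// parseURLs exists only in this sketch.
	func parseURLs(sites []string) ([]*url.URL, error) {
		urls := make([]*url.URL, 0, len(sites))
		for _, s := range sites {
			u, err := url.Parse(s)
			if err != nil {
				return nil, err
			}
			urls = append(urls, u)
		}
		return urls, nil
	}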

Thus, ca 80 LoC are removed / deactivated, and:

  • no channel is created
  • no goroutine is launched
  • only two sends remain:
    • c.results <- ... from crawlSite(s site)
    • queue <- site from queueURLs, now called with c.sites as argument queue (from func (c *crawling) add).

Back to Overview

Documentation

Overview

Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	Sites    []string                             // At least one URL.
	Out      io.Writer                            // Required. Writes one detected site per line.
	Log      *log.Logger                          // Required. Errors are reported here.
	Depth    int                                  // Optional. Limit depth. Set to >= 1.
	Parallel int                                  // Optional. Set how many sites to crawl in parallel.
	Delay    time.Duration                        // Optional. Set delay between crawls.
	Get      func(string) (*http.Response, error) // Optional. Defaults to http.Get.
	Verbose  bool                                 // Optional. If set, status updates are written to logger.
}

Crawler is used as configuration for Run. It is validated in Run().

func (Crawler) Run

func (c Crawler) Run() error

Run the crawler. Can return validation errors. All crawling errors are reported via logger. Output is written to writer. Crawls sites recursively and reports all external links that can be changed to HTTPS. Also reports broken links via error logger.
