httpsyet

package
v0.2.2
Published: Jun 30, 2018 License: MIT Imports: 12 Imported by: 0

README

v3 - Improve the traffic network

Overview

It's not easy to see at first, but there is still an interesting issue hiding in the original implementation.

It may become more obvious when we imagine crawling huge numbers - e.g. finding a thousand new URLs on each new page, and going down a couple of levels.

Look: each site's feedback spawns a new goroutine running queueURLs in order to queue the (e.g. 1000) URLs found.

And most likely these will block!

Almost every one of these many goroutines will block most of the time, as URLs discovered earlier are still waiting to be fetched and crawled.

And each queueURLs holds on to the full slice of URLs - no matter how many have already been sent for processing. (The implementation uses a straightforward range loop and does not attempt to shrink the slice. Idiomatic Go.)
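To make the issue concrete, here is a minimal sketch of that pattern. Only the names queueURLs, site, traffic and the t.Add(len(urls)) call come from the text; the signatures and bodies are guessed for illustration, not the actual source:

```go
package sketch

// site is a stand-in for the crawler's site type.
type site struct{ URL string }

// traffic is a stand-in for the accounting type around the network.
type traffic struct{ in chan site }

// Add registers n more pending sites; body elided in this sketch.
func (t *traffic) Add(n int) {}

// Feed registers new traffic up front, then spawns a goroutine per call.
func (t *traffic) Feed(urls []site) {
	t.Add(len(urls))
	go queueURLs(urls, t.in) // one goroutine per feedback; most will block
}

// queueURLs keeps the entire urls slice alive until its last send succeeds.
func queueURLs(urls []site, out chan<- site) {
	for _, u := range urls {
		out <- u // blocks while earlier URLs are still being crawled
	}
}
```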

We can do better. No need to waste many huge slices across plenty of blocked goroutines.

A battery called adjust provides a flexibly buffered pipe. So we use SitePipeAdjust in our network and no longer need Feed to spawn the queueURLs function. We may feed synchronously now!
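Here is a hand-written sketch of what such a flexibly buffered pipe does. The real adjust battery is generated code and more general; this only illustrates the principle that the sender never waits on a full fixed-size buffer:

```go
package sketch

// site is a stand-in for the crawler's site type, as above.
type site struct{ URL string }

// pipeAdjust forwards inp to the returned channel without making the
// sender wait on a full fixed buffer: items pile up in an internal
// queue that grows as needed and drains as the reader catches up.
func pipeAdjust(inp <-chan site) <-chan site {
	out := make(chan site)
	go func() {
		defer close(out)
		var queue []site
		for inp != nil || len(queue) > 0 {
			var send chan<- site // nil channel disables the send case
			var next site
			if len(queue) > 0 {
				send, next = out, queue[0]
			}
			select {
			case s, ok := <-inp:
				if !ok {
					inp = nil // input closed: just drain the queue
				} else {
					queue = append(queue, s)
				}
			case send <- next:
				queue = queue[1:]
			}
		}
	}()
	return out
}
```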

But now we also do not need to bother Feed with registering new traffic up front (via t.Add(len(urls))). Instead we use SitePipeEnter (a companion of SiteDoneLeave) at the entrance of our network processor.
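Sketched in the same illustrative spirit (the real SitePipeEnter and SiteDoneLeave are generated and differ in detail), the entrance counts traffic as it passes by, and the exit marks it done:

```go
package sketch

import "sync"

// site is a stand-in for the crawler's site type, as above.
type site struct{ URL string }

// pipeEnter registers each site with the WaitGroup as it enters the network.
func pipeEnter(inp <-chan site, wg *sync.WaitGroup) <-chan site {
	out := make(chan site)
	go func() {
		defer close(out)
		for s := range inp {
			wg.Add(1) // traffic enters
			out <- s
		}
	}()
	return out
}

// doneLeave marks each site leaving the network as done.
func doneLeave(inp <-chan site, wg *sync.WaitGroup) {
	go func() {
		for range inp {
			wg.Done() // traffic leaves
		}
	}()
}
```

The usual WaitGroup caveat applies to such a sketch: whoever calls Wait must hold a token of its own (Add(1) before feeding the first sites, Done() afterwards), or the counter could hit zero while sites are still on their way to the entrance.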

Thus the network becomes more flexible and more self-contained, and places less burden on its surroundings.

Pushing the type site and its related sites, traffic, and result into separate sub-packages is just a little more tidying - respecting the original Crawler and its living space.


Some remarks regarding changes to source files compared with the previous version:

traffic.go

Simplify Feed as explained (see the sketch below), and add two processes (SitePipeEnter and SitePipeAdjust) to the network.
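For contrast with the first sketch, the simplified synchronous Feed might now look like this (stand-in signature again, not the actual source):

```go
package sketch

// site is a stand-in for the crawler's site type, as above.
type site struct{ URL string }

// Feed now simply sends: no goroutine, no up-front Add, no slice kept
// alive by a blocked sender - the adjusting pipe absorbs the burst.
func Feed(urls []site, in chan<- site) {
	for _, u := range urls {
		in <- u
	}
}
```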

genny.go in traffic/

Just add a line to use adjust.go

site.go

Only a changed package name.

crawling.go

Just import the new sub-packages, and adjust where needed.

crawler_test.go

Just the import path.

crawler.go

No need to touch.


Back to Overview

Documentation

Overview

Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	Sites    []string                             // At least one URL.
	Out      io.Writer                            // Required. Writes one detected site per line.
	Log      *log.Logger                          // Required. Errors are reported here.
	Depth    int                                  // Optional. Limit depth. Set to >= 1.
	Parallel int                                  // Optional. Set how many sites to crawl in parallel.
	Delay    time.Duration                        // Optional. Set delay between crawls.
	Get      func(string) (*http.Response, error) // Optional. Defaults to http.Get.
	Verbose  bool                                 // Optional. If set, status updates are written to logger.
}

Crawler is used as configuration for Run. It is validated in Run().

func (Crawler) Run

func (c Crawler) Run() error

Run the crawler. Can return validation errors. All crawling errors are reported via logger. Output is written to writer. Crawls sites recursively and reports all external links that can be changed to HTTPS. Also reports broken links via error logger.
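
A minimal usage sketch based on the fields documented above; the import path is a placeholder for the actual module path:

```go
package main

import (
	"log"
	"os"

	"example.com/httpsyet" // placeholder; use the real module path
)

func main() {
	c := httpsyet.Crawler{
		Sites:    []string{"http://example.com/"}, // at least one URL
		Out:      os.Stdout,                       // one detected site per line
		Log:      log.New(os.Stderr, "", log.LstdFlags),
		Depth:    2, // optional: limit recursion depth
		Parallel: 4, // optional: crawl four sites in parallel
	}
	if err := c.Run(); err != nil { // validation errors surface here
		log.Fatal(err)
	}
}
```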

