scrape

Published: Oct 16, 2017 License: MIT Imports: 15 Imported by: 0

README

Scrape

Scrape is a minimalistic, depth-controlled web scraping project. It can be used as a command-line tool or integrated into your own project. Scrape also supports sitemap generation as an output.

Scrape Response

Once the scraping of the given URL is done, the API returns the following structure.

package scrape

import (
	"net/url"
	"regexp"
)

// Response holds the scraped response
type Response struct {
	BaseURL      *url.URL            // starting URL at maxDepth 0
	UniqueURLs   map[string]int      // unique URLs crawled, mapped to the number of times each was seen
	URLsPerDepth map[int][]*url.URL  // URLs found at each depth
	SkippedURLs  map[string][]string // URLs extracted from source URLs that failed the domainRegex (if given) or are invalid
	ErrorURLs    map[string]error    // reason why a URL was not crawled
	DomainRegex  *regexp.Regexp      // restricts crawling to URLs matching the given domain regex
	MaxDepth     int                 // maximum crawl depth; -1 means no limit
	Interrupted  bool                // true if the scraping was interrupted
}
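
For illustration, here is a minimal sketch of obtaining and inspecting a Response, assuming the package import path github.com/vedhavyas/scrape and the Start function described under "As a Package" below; the starting URL is a placeholder:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/vedhavyas/scrape"
)

func main() {
	// crawl with no depth limit, restricted to the base URL's domain
	resp, err := scrape.Start(context.Background(), "https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("crawled %d unique URLs starting from %s\n", len(resp.UniqueURLs), resp.BaseURL)
	for depth, urls := range resp.URLsPerDepth {
		fmt.Printf("depth %d: %d URL(s)\n", depth, len(urls))
	}
	for u, crawlErr := range resp.ErrorURLs {
		fmt.Printf("could not crawl %s: %v\n", u, crawlErr)
	}
	if resp.Interrupted {
		fmt.Println("scraping was interrupted before completion")
	}
}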

Command line:

Installation:

go get github.com/vedhavyas/scrape/cmd/scrape/

Available command line options:
Usage of ./scrape:
 -domain-regex string(optional)
        Domain regex to limit crawls to. Defaults to base url domain
 -max-depth int(optional)
        Max depth to Crawl (default -1)
 -sitemap string(optional)
        File location to write sitemap to
 -url string(required)
        Starting URL (default "https://vedhavyas.com")
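
For example, a crawl limited to depth 2 that also writes a sitemap could be invoked as follows (the URL and file name are placeholders):

./scrape -url https://example.com -max-depth 2 -sitemap sitemap.xml
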
Output

Scrape supports two types of output.

  1. Printing all of the collected data described above from the Response to stdout.
  2. Generating a sitemap XML file (if a file location is passed) from the Response.

As a Package

Scrape can be integrated into any Go project through the APIs below. As a package, you will have access to the above-mentioned Response and all the data in it. Currently, the following APIs are available.

Start
func Start(ctx context.Context, url string) (resp *Response, err error)

Start starts scraping with no depth limit (-1), restricted to the base URL's domain

StartWithDepth
func StartWithDepth(ctx context.Context, url string, maxDepth int) (resp *Response, err error)

StartWithDepth starts scraping with the given max depth, restricted to the base URL's domain

StartWithDepthAndDomainRegex
func StartWithDepthAndDomainRegex(ctx context.Context, url string, maxDepth int, domainRegex string) (resp *Response, err error) 

StartWithDepthAndDomainRegex starts scraping with the given max depth and domain regex

StartWithDomainRegex
func StartWithDomainRegex(ctx context.Context, url, domainRegex string) (resp *Response, err error)

StartWithDomainRegex starts scraping with no depth limit (-1) and the given domain regex

Sitemap
func Sitemap(resp *Response, file string) error 

Sitemap generates a sitemap from the given response
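
Putting these APIs together, here is a minimal sketch of a depth- and domain-limited crawl that also writes a sitemap; the URL, regex, depth, and output file name are placeholders:

package main

import (
	"context"
	"log"

	"github.com/vedhavyas/scrape"
)

func main() {
	// crawl up to depth 2, restricted to URLs matching the given domain regex
	resp, err := scrape.StartWithDepthAndDomainRegex(
		context.Background(), "https://example.com", 2, `example\.com`)
	if err != nil {
		log.Fatal(err)
	}

	// Response implements String(), so it can be printed directly
	log.Println(resp)

	// generate a sitemap XML file from the crawled URLs
	if err := scrape.Sitemap(resp, "sitemap.xml"); err != nil {
		log.Fatal(err)
	}
}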

Feedback and Contributions

  1. If you think something is missing, please feel free to raise an issue.
  2. If you would like to work on an open issue, feel free to announce yourself in the issue's comments.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Sitemap

func Sitemap(resp *Response, file string) error

Sitemap generates a sitemap from the given response

Types

type Response

type Response struct {
	BaseURL      *url.URL            // starting URL at maxDepth 0
	UniqueURLs   map[string]int      // unique URLs crawled, mapped to the number of times each was seen
	URLsPerDepth map[int][]*url.URL  // URLs found at each depth
	SkippedURLs  map[string][]string // URLs from other domains (if domainRegex is given) and invalid URLs
	ErrorURLs    map[string]error    // reason why a URL was not crawled
	DomainRegex  *regexp.Regexp      // restricts crawling to URLs matching the given domain regex
	MaxDepth     int                 // maximum crawl depth; -1 means no limit
	Interrupted  bool                // true if the scraping was interrupted
}

Response holds the scraped response

func Start

func Start(ctx context.Context, url string) (resp *Response, err error)

Start starts scraping with no depth limit (-1), restricted to the base URL's domain

func StartWithDepth

func StartWithDepth(ctx context.Context, url string, maxDepth int) (resp *Response, err error)

StartWithDepth starts scraping with the given max depth, restricted to the base URL's domain

func StartWithDepthAndDomainRegex

func StartWithDepthAndDomainRegex(ctx context.Context, url string, maxDepth int, domainRegex string) (resp *Response, err error)

StartWithDepthAndDomainRegex starts scraping with the given max depth and domain regex

func StartWithDomainRegex

func StartWithDomainRegex(ctx context.Context, url, domainRegex string) (resp *Response, err error)

StartWithDomainRegex starts scraping with no depth limit (-1) and the given domain regex

func (Response) String

func (r Response) String() string

String returns a human-readable representation of the response

Directories

Path Synopsis
cmd
