crawler

package module
v0.0.0-...-c80b5f8 Latest
Warning: This package is not in the latest version of its module.
Published: Jul 26, 2020 License: MIT Imports: 8 Imported by: 0

README

OCrawl - A simple crawler to map site relations


Installation

In your terminal, type: $ go get github.com/lmedson/ocrawl

Set up a site to crawl

First, create a Go file and set a URL to be crawled. To visualize the relations graphically, you can plot a graph to an HTML file: call the Plot function, passing the crawl result and the desired name for the HTML file.

Plot example:

    package main

    import crawler "github.com/lmedson/ocrawl" // module path from the install step; package name is crawler

    func main() {
        crawledData := crawler.Crawl("https://clojure.org/")
        crawler.Plot(crawledData, "index") // writes index.html with the graph
    }

Crawling assets example:

    func main() {
        crawler.CrawlAssets("https://clojure.org/") // you can plot the result or parse it to JSON
    }

Formatting output data example:

    func main() {
        crawledData := crawler.Crawl("https://clojure.org/")
        crawler.JsonParse(crawledData, "data") // writes the crawled data to data.json
    }

Running

Make sure you have installed all the dependencies, then run the file you created with the code above:

$ go run <your-file-name>.go

Result

After running the crawl and plotting the graph to an .html file, you can open it in your favorite browser and inspect the relation info by hovering over the nodes and the line connections.

Package folder structure

.
├── .gitignore                  # Files ignored by git
├── LICENSE                     # Our kind of license
├── ocrawl.go                   # The main crawl implementation
├── README.md                   # How to use the crawler
├── types.go                    # Types used by the crawler
├── Gopkg.lock                  # Locks dependency versions
├── Gopkg.toml                  # Dependency constraints, used to generate the lock file
├── <filename>.html             # Output file produced by the graph plot
├── <filename>.json             # Output file produced by parsing crawled data to JSON
└── utils.py                    # Some helper functions

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Contains

func Contains(links []string, linkToFind string) bool
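Going by the signature alone, Contains reports whether linkToFind already appears in links. A minimal usage sketch, with made-up sample data:

    package main

    import (
        "fmt"

        crawler "github.com/lmedson/ocrawl"
    )

    func main() {
        visited := []string{"https://clojure.org/", "https://clojure.org/about"}
        // Check whether a link has already been collected before crawling it again.
        fmt.Println(crawler.Contains(visited, "https://clojure.org/about")) // expected: true
    }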

func JsonParse

func JsonParse(crawledData CrawlerResult, fileName string)

func Plot

func Plot(res CrawlerResult, fileName string)

func Remove

func Remove(urlList []string, urlToRemove string) []string
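The signature suggests Remove returns urlList with urlToRemove filtered out. A sketch under that assumption, with illustrative values:

    package main

    import (
        "fmt"

        crawler "github.com/lmedson/ocrawl"
    )

    func main() {
        pending := []string{"https://clojure.org/", "https://clojure.org/news"}
        // Drop a URL from the pending list once it has been handled.
        pending = crawler.Remove(pending, "https://clojure.org/news")
        fmt.Println(pending) // expected: [https://clojure.org/]
    }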

func ResolveUrls

func ResolveUrls(link string, baseURL string) string
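ResolveUrls looks like it resolves a (possibly relative) link against baseURL into an absolute URL. A sketch under that assumption; the printed result is a guess:

    package main

    import (
        "fmt"

        crawler "github.com/lmedson/ocrawl"
    )

    func main() {
        // Turn a relative href found on a page into an absolute URL.
        abs := crawler.ResolveUrls("/about", "https://clojure.org")
        fmt.Println(abs) // presumably "https://clojure.org/about"
    }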

Types

type AssetsMap

type AssetsMap struct {
	Page   string   `json:"page"`
	Js     []string `json:"js"`
	Css    []string `json:"css"`
	Images []Img    `json:"images"`
}
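Putting AssetsMap together with the CrawlAssets example from the README, here is a hedged sketch of walking a page's assets. It assumes CrawlAssets populates AssetsMapList, which the field name suggests but the docs do not state:

    package main

    import (
        "fmt"

        crawler "github.com/lmedson/ocrawl"
    )

    func main() {
        res := crawler.CrawlAssets("https://clojure.org/")
        for _, assets := range res.AssetsMapList {
            fmt.Printf("%s: %d js, %d css\n", assets.Page, len(assets.Js), len(assets.Css))
            for _, img := range assets.Images {
                fmt.Println("  image:", img.ImageName, img.ImageLink)
            }
        }
    }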

type CrawlerResult

type CrawlerResult struct {
	AssetsMapList []AssetsMap `json:"assetsMapList"`
	RelationLinks []Relations `json:"relationLinks"`
	Crawled       []string    `json:"crawled"`
	// contains filtered or unexported fields
}

func Crawl

func Crawl(url string) CrawlerResult

func CrawlAssets

func CrawlAssets(url string) CrawlerResult

type Img

type Img struct {
	ImageName string `json:"imageName"`
	ImageLink string `json:"imageLink"`
}

type Relations

type Relations struct {
	Page         string   `json:"page"`
	RelatedLinks []string `json:"relatedLinks"`
}
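Since Relations pairs a page with the links found on it, the RelationLinks field of a CrawlerResult can be walked to print the mapped site relations. A minimal sketch:

    package main

    import (
        "fmt"

        crawler "github.com/lmedson/ocrawl"
    )

    func main() {
        res := crawler.Crawl("https://clojure.org/")
        // Print each crawled page and how many links were found on it.
        for _, rel := range res.RelationLinks {
            fmt.Printf("%s -> %d related links\n", rel.Page, len(rel.RelatedLinks))
        }
    }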
