crawler

package
v0.0.0-...-8261f08
Warning

This package is not in the latest version of its module.

Published: Oct 20, 2014 License: GPL-3.0 Imports: 7 Imported by: 1

Documentation

Index

Constants

const UserAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Default user agent used for requests

Variables

var (
	ErrMimeType    = errors.New("MIME type not supported")
	ErrInvalidChar = errors.New("Crawler result contains invalid characters")
)

Errors

var (
	// Valid URL
	ValidUrl = regexp.MustCompile(`https?://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*`)

	// Valid Page Title
	ValidPageTitle = regexp.MustCompile(`^(.)+$`)
)

Regular expressions used to validate URLs and page titles

var AllowedMimeTypes = map[string]bool{
	"text/html; charset=utf-8": true,
}

Map of allowed MIME types

Functions

func ExtractUrl

func ExtractUrl(s string) string

func IsURL

func IsURL(s string) bool

Helper functions for extracting and validating URLs

Types

type Client

type Client struct {
	// Dialer used for requests
	Dial func(network, addr string) (net.Conn, error)

	// The useragent used in requests
	UserAgent string

	// UserName and PassWord for authentication with the WebServer
	UserName string
	PassWord string
}

Client structure

func (*Client) Crawl

func (c *Client) Crawl(urlF string) (r *CrawlResult, err error)

type CrawlResult

type CrawlResult struct {
	Title string `xml:"head>title"` // Title of the page
	// Desc string `xml:"head>meta"` // Description of the page
	Size int // Size of webpage
}

CrawlResult stores information about a crawled page
