crawler

package
v0.0.0-...-8261f08
Warning

This package is not in the latest version of its module.

Published: Oct 20, 2014 License: GPL-3.0 Imports: 7 Imported by: 1

Documentation

Index

Constants

const UserAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Default user agent used for requests

Variables

var (
	ErrMimeType    = errors.New("MIME type not supported")
	ErrInvalidChar = errors.New("Crawler result contains invalid characters")
)

Errors

var (
	// Valid URL
	ValidUrl = regexp.MustCompile(`https?://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*`)

	// Valid Page Title
	ValidPageTitle = regexp.MustCompile(`^(.)+$`)
)

Regular expressions used to validate URLs and page titles

var AllowedMimeTypes = map[string]bool{
	"text/html; charset=utf-8": true,
}

Map of allowed MIME types

Functions

func ExtractUrl

func ExtractUrl(s string) string

func IsURL

func IsURL(s string) bool

Helper functions for extracting and validating URLs

Types

type Client

type Client struct {
	// Dialer used for requests
	Dial func(network, addr string) (net.Conn, error)

	// The useragent used in requests
	UserAgent string

	// UserName and PassWord for authentication with the WebServer
	UserName string
	PassWord string
}

Client structure

func (*Client) Crawl

func (c *Client) Crawl(urlF string) (r *CrawlResult, err error)

type CrawlResult

type CrawlResult struct {
	Title string `xml:"head>title"` // Title of the page
	// Desc string `xml:"head>meta"` // Description of the page
	Size int // Size of webpage
}

CrawlResult stores information about a crawled page
