dspiders

package
v0.0.0-...-2e4edee Latest
Published: Feb 28, 2022 License: LGPL-3.0 Imports: 20 Imported by: 0

Documentation

Overview

dspiders stands for Distributed Spider.

This is a web page crawling and full-text indexing package.

dspiders has two parts: the Crawl Machine and the Process Machine. They are distributed.

Crawl Machine:

This part is a crawl node and can be deployed anywhere you want. It connects to the Process Machine using the nst package. It gets URL information from the Process Machine and crawls the page. If the page has changed, it sends the page content (only the content inside the <body> tag, with all HTML tags removed) to the Process Machine (as the type PageData), analyzes the links in the page (HTML <a> tags), and pushes all the links to the Process Machine (as the type UrlBasic). At present, the Crawl Machine does not crawl media file data (css, js, jpeg, png, etc.).
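
A minimal usage sketch for a crawl node, assuming an nst2.Client that is already connected to the Process Machine; the machine name and identity code below are placeholders, not values required by the package.

// Sketch only: how the nst2.Client is constructed depends on the nst
// package and is not shown here.
var client *nst2.Client

cm := dspiders.NewCrawlMachine(client, "spider-01", "identity-code") // placeholder name and code
cm.Start()       // begin fetching urls from the Process Machine and crawling them
defer cm.Close() // stop the crawl machine when done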

Process Machine:

This part processes all the data coming from the Crawl Machine. It has several smaller parts, described below; a small wiring sketch follows them.

NetTransport. It manages the communication between the Crawl Machine and the Process Machine, using the nst package.

UrlCrawlQueue. It manages a queue which stores the URLs waiting to be crawled. When the Crawl Machine asks for a new URL, one is provided from this queue.

PagesProcess. It processes the page content data and URL data coming from the Crawl Machine. URLs that need to be crawled are sent to the UrlCrawlQueue. The page content is reprocessed and stored. Finally, the reprocessed page content is sent to WordsProcess.

WordsProcess. It manages the full-text index for the page content. It splits the page content into slices by sentence, word, and character, recording each index position. It then stores the index information for each page and merges/updates the word context relationship information.
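
A hedged wiring sketch for the Process Machine side, built only from the constructors documented below. The cpool.Section, DRule operators, and database names are placeholders whose construction is outside this package.

// Sketch only: configSection (*cpool.Section), drule, sentenceDB and
// wordDB (*operator.Operator) are assumed to exist already.
crawlQueue := dspiders.NewUrlCrawlQueue()
wordsIndex := dspiders.NewWordsIndexProcess(sentenceDB, "sentences", wordDB, "words")
wordsIndex.Start()

pages, err := dspiders.NewPagesProcess("example-site", configSection, crawlQueue, wordsIndex, drule)
if err != nil {
	// handle the error
}
pages.Start()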

Words Context Relationship Information:

The word index stored in dspiders uses the Roles context status. It records the position of a word in the page content, and which words come before and after it in the sentence.

For example, the page URL is http://for.example.com/a and the page content is: Today is sunday, We have a funny holiday.

The page content is first reduced to plain text: "today is sunday \n we have a funny holiday". It has no punctuation, only line breaks, and every line is one sentence. The text is then turned into a map of word slices, map[uint64][]string, which looks like this: [0]{today, is, sunday}, [16]{we, have, a, funny, holiday}.
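
The sketch below reproduces that transformation; the helper function and the offset rule (the byte offset of each line's first character) are illustrative assumptions, not the package's actual implementation.

package main

import (
	"fmt"
	"strings"
)

// splitSentences maps each line's starting byte offset to that line's
// words, mirroring the map[uint64][]string structure described above.
func splitSentences(text string) map[uint64][]string {
	sentences := make(map[uint64][]string)
	var offset uint64
	for _, line := range strings.Split(text, "\n") {
		if words := strings.Fields(line); len(words) > 0 {
			sentences[offset] = words
		}
		offset += uint64(len(line)) + 1 // +1 for the line break
	}
	return sentences
}

func main() {
	fmt.Println(splitSentences("today is sunday\nwe have a funny holiday"))
	// Prints: map[0:[today is sunday] 16:[we have a funny holiday]]
}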

In the next step, dspiders stores the relationships. For example, take the "is" Role: "sunday" is found in the down (following) context relationship, so the link and the position 2 of the word "sunday" are stored; "today" is found in the up (preceding) context relationship, so the link and the position 0 of the word "today" are stored.

We have now recorded which word appears on which link, the word's position in the content, and which words come before and after it.
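
A hypothetical data shape for that relationship; the type and field names are illustrative only and do not reflect the package's actual storage format.

// wordOccurrence records where a neighbouring word appears.
type wordOccurrence struct {
	Url      string // the page link
	Position int    // the word's position in its sentence
}

// wordContext is an illustrative view of one word's Role: Up holds the
// words that appear before it, Down the words that appear after it.
// For the example above, the "is" entry would hold
// Up["today"] = {url, 0} and Down["sunday"] = {url, 2}.
type wordContext struct {
	Word string
	Up   map[string][]wordOccurrence
	Down map[string][]wordOccurrence
}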

Index

Constants

const (
	URL_CRAWL_QUEUE_CAP          uint = 1000  // The url crawl queue's capacity.
	CRAWL_MACHINE_CRAWL_INTERVAL      = 3     // The interval between the crawl machine's crawls, in seconds.
	WORDS_PROCESS_QUEUE_CAP      uint = 10000 // The words process queue's capacity.

	UP_INTERVAL_DEFAULT = 86400   // The default page recrawl (update) interval, in seconds (1 day).
	UP_INTERVAL_MIN     = 600     // The minimum page recrawl (update) interval, in seconds (10 minutes).
	UP_INTERVAL_MAX     = 8640000 // The maximum page recrawl (update) interval, in seconds (100 days).
)
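
A small sketch of how a caller might keep a requested recrawl interval inside these bounds; this clamping helper is an illustration, not part of the package.

// clampUpInterval is a hypothetical helper that keeps an interval, in
// seconds, inside [UP_INTERVAL_MIN, UP_INTERVAL_MAX] and falls back to
// UP_INTERVAL_DEFAULT when no interval is given.
func clampUpInterval(seconds int64) int64 {
	switch {
	case seconds <= 0:
		return dspiders.UP_INTERVAL_DEFAULT
	case seconds < dspiders.UP_INTERVAL_MIN:
		return dspiders.UP_INTERVAL_MIN
	case seconds > dspiders.UP_INTERVAL_MAX:
		return dspiders.UP_INTERVAL_MAX
	default:
		return seconds
	}
}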

Variables

var AcceptLanguage = []string{
	"en-US,en;q=0.5",
	"zh-CN,zh;q=0.8,en;q=0.6",
}
var UserAgent = []string{

	"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0",
	"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36",

	"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0",
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
}
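
A hedged sketch of how these lists might be used to vary request headers; the crawler presumably does something similar internally, but this exact code is illustrative only (imports: math/rand, net/http).

// Pick a random User-Agent and Accept-Language for an outgoing request.
req, err := http.NewRequest("GET", "http://for.example.com/a", nil)
if err != nil {
	// handle the error
}
req.Header.Set("User-Agent", dspiders.UserAgent[rand.Intn(len(dspiders.UserAgent))])
req.Header.Set("Accept-Language", dspiders.AcceptLanguage[rand.Intn(len(dspiders.AcceptLanguage))])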

Functions

This section is empty.

Types

type AroundLink struct {
	roles.Role
	Url  string // the link address
	Text string // the link text
}

A link which is not in the domain.

type CrawlMachine

type CrawlMachine struct {
	// contains filtered or unexported fields
}

func NewCrawlMachine

func NewCrawlMachine(tcp *nst2.Client, name, code string) (c *CrawlMachine)

func (*CrawlMachine) Close

func (c *CrawlMachine) Close()

func (*CrawlMachine) Start

func (c *CrawlMachine) Start()

type MediaData

type MediaData struct {
	roles.Role
	Url        string    // The complete link address
	Ver        uint64    // The page version
	UpTime     time.Time // The update time
	UpInterval int64     // The update interval (wait UpInterval seconds until the next update).
	Domain     string    // The domain name.
	Spider     string    // The spider machine's name.
	MediaType  int       // The media's type
	MediaName  string    // The Media's name
	DataSaved  bool      // Whether the data has already been saved.
	DataBody   []byte    // The media's data body.
	Hash       string    // sha1 hash signature
}

One media item's data, for example css, an image, etc.

type NetDataStatus

type NetDataStatus uint

The data status for network transport.

const (
	NET_DATA_STATUS_NO              NetDataStatus = iota // No status
	NET_DATA_STATUS_OK                                   // All ok
	NET_DATA_STATUS_PAGE_UPDATE                          // The page's content has changed; update it.
	NET_DATA_STATUS_PAGE_NOT_UPDATE                      // The page's content has not changed; no update is needed.
	NET_DATA_ERROR                                       // Some error
)

type NetTransportData

type NetTransportData struct {
	Name     string              // The sender's name
	Code     string              // The sender's identity code
	Operate  NetTransportOperate // The operation code
	Status   NetDataStatus       // The data status
	Domain   string              // The domain, if needed
	SiteName string              // The site name, if needed
	Data     []byte              // The data body; it can be PageData, UrlBasic, and so on.
}

The data which is transported over the network.
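
An illustrative envelope for sending crawled page data; how the Data bytes are encoded and which Name/Code values a server expects are assumptions not specified here.

// Sketch only: encodedPage stands for a PageData serialized to bytes;
// the encoding the package uses is not shown here.
ntd := dspiders.NetTransportData{
	Name:     "spider-01",     // placeholder sender name
	Code:     "identity-code", // placeholder identity code
	Operate:  dspiders.NET_TRANSPORT_OPERATE_SEND_PAGE_DATA,
	Status:   dspiders.NET_DATA_STATUS_PAGE_UPDATE,
	Domain:   "for.example.com",
	SiteName: "example-site",
	Data:     encodedPage,
}
_ = ntd // the envelope is then sent to the Process Machine over nst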

type NetTransportDataRe

type NetTransportDataRe struct {
	Status NetDataStatus // The data status
	Data   []byte        // The data body, it can be PageData, UrlBasic and so on.
}

The data which is transported over the network - the data sent back by the net transport.

type NetTransportHandle

type NetTransportHandle struct {
	// contains filtered or unexported fields
}

The network transport handle.

Implements the nst.ConnExecer interface.

It handles URL crawl queue requests, page storage, the words index, and other operations.

func (*NetTransportHandle) Close

func (n *NetTransportHandle) Close()

func (*NetTransportHandle) NSTexec

func (n *NetTransportHandle) NSTexec(ce *nst2.ConnExec) (stat nst2.SendStat, err error)

func (*NetTransportHandle) Start

func (n *NetTransportHandle) Start()

type NetTransportOperate

type NetTransportOperate uint

The network transport operation code

const (
	NET_TRANSPORT_OPERATE_NO                  NetTransportOperate = iota // The code is null
	NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_ADD                            // Add basic URL information to the url crawl queue
	NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_GET                            // Get one URL's basic information from the url crawl queue
	NET_TRANSPORT_OPERATE_SEND_PAGE_DATA                                 // Send crawled page data to the pages process
	NET_TRANSPORT_OPERATE_SEND_MEDIA_DATA                                // Send crawled media data to the pages process
)

type PageData

type PageData struct {
	roles.Role
	Url         string    // The complete link address
	Ver         uint64    // The page version
	UpTime      time.Time // The update time
	UpInterval  int64     // The update interval (wait UpInterval seconds until the next update).
	Domain      string    // The domain name.
	Spider      string    // The spider machine's name.
	KeyWords    []string  // The key words, from the html's header meta name=keywords
	HeaderTitle string    // The page's title, from <header><title></title></header>
	BodyContent string    // The page's body content, from <body></body>, and is all text
	Hash        string    // The page body content's (the field BodyContent) sha1 hash signature
}

One page's data.

type PageSentences

type PageSentences struct {
	roles.Role
	Url       string              // The complete link address
	Ver       uint64              // the page version
	Sentences map[uint64][]string // the sentences index

}

One page's index of all sentence and word locations

type PagesProcess

type PagesProcess struct {
	// contains filtered or unexported fields
}

Handles storage of pages retrieved by the crawler.

func NewPagesProcess

func NewPagesProcess(sitename string, config *cpool.Section, crawlQueue *UrlCrawlQueue, indexQueue *WordsIndexProcess, drule *operator.Operator) (p *PagesProcess, err error)

func (*PagesProcess) AddMedia

func (p *PagesProcess) AddMedia(media *MediaData) (err error)

Add a media data item, retrieved by the crawler, to the store.

func (*PagesProcess) AddPage

func (p *PagesProcess) AddPage(page *PageData, status NetDataStatus) (err error)

Add a page's data, retrieved by the crawler, to the store.

func (*PagesProcess) AddUrls

func (p *PagesProcess) AddUrls(urls []UrlBasic) (err error)

Add URLs, retrieved by the crawler, to the crawl queue.

func (*PagesProcess) Close

func (p *PagesProcess) Close()

func (*PagesProcess) Start

func (p *PagesProcess) Start()

type SentencesIndex

type SentencesIndex struct {
	Text  string // the text
	Index uint64 // the index location
}

The sentence location index in one page

type UrlBasic

type UrlBasic struct {
	SiteName string // the site name
	Domain   string // the domain this url belongs to
	Url      string // the url self
	Text     string // the url's name
	Hash     string // the url page last version hash
	Ver      uint64 // the url page last version
	Filter   bool   // True if this is a list url: only the urls on the page are gathered, the page data is not stored.
}

The url's basic information for the url queue channel

type UrlChannel

type UrlChannel chan UrlBasic

The Url's queue channel

type UrlCrawlQueue

type UrlCrawlQueue struct {
	// contains filtered or unexported fields
}

The url crawl queue

func NewUrlCrawlQueue

func NewUrlCrawlQueue() (u *UrlCrawlQueue)

Initialize the url crawl queue; the channel's length is the constant URL_CRAWL_QUEUE_CAP

func (*UrlCrawlQueue) Add

func (u *UrlCrawlQueue) Add(ub UrlBasic) (err error)

Add one url's basic information to the url crawl queue

func (*UrlCrawlQueue) Count

func (u *UrlCrawlQueue) Count() (count uint)

Get the queue's length

func (*UrlCrawlQueue) Get

func (u *UrlCrawlQueue) Get() (ub UrlBasic, err error)

Get one url's basic information from the url crawl queue

func (*UrlCrawlQueue) List

func (u *UrlCrawlQueue) List() (list []string)

List all urls in the queue
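
A minimal sketch of using the queue directly; the UrlBasic field values are placeholders.

queue := dspiders.NewUrlCrawlQueue()

// Queue one url to be crawled (placeholder values).
err := queue.Add(dspiders.UrlBasic{
	SiteName: "example-site",
	Domain:   "for.example.com",
	Url:      "http://for.example.com/a",
	Text:     "Example page",
})
if err != nil {
	// handle the error
}

waiting := queue.Count() // number of urls currently waiting in the queue
ub, err := queue.Get()   // take the next url off the queue
if err != nil {
	// handle the error
}
_, _ = waiting, ub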

type WordIndex

type WordIndex struct {
	roles.Role
}

The word index Role

type WordsIndexProcess

type WordsIndexProcess struct {
	// contains filtered or unexported fields
}

The words index process

func NewWordsIndexProcess

func NewWordsIndexProcess(sentencedb *operator.Operator, sentencedbname string, worddb *operator.Operator, worddbname string) (w *WordsIndexProcess)

func (*WordsIndexProcess) Add

func (w *WordsIndexProcess) Add(req *WordsIndexRequest) (err error)

Add a words index request to the queue

func (*WordsIndexProcess) Close

func (w *WordsIndexProcess) Close()

Stop the processor

func (*WordsIndexProcess) ReturnQueue

func (w *WordsIndexProcess) ReturnQueue() chan *WordsIndexRequest

Return the index wait queue

func (*WordsIndexProcess) Start

func (w *WordsIndexProcess) Start()

Start the processor

type WordsIndexRequest

type WordsIndexRequest struct {
	Url        string         // The url to be indexed
	Domain     string         // The url's domain
	Type       WordsIndexType // The type of this index request
	PageData   *PageData      // The page data
	AroundLink *AroundLink    // The around link
}

One words index request
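
A sketch of submitting one crawled page for indexing, assuming a *WordsIndexProcess named wordsIndex and a filled *PageData named page already exist; both are placeholders.

// Sketch only: wordsIndex (*WordsIndexProcess) and page (*PageData)
// are assumed to exist already.
req := &dspiders.WordsIndexRequest{
	Url:      page.Url,
	Domain:   page.Domain,
	Type:     dspiders.WORDS_INDEX_TYPE_PAGE,
	PageData: page,
}
if err := wordsIndex.Add(req); err != nil {
	// handle the error
}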

type WordsIndexType

type WordsIndexType uint

The type of item waiting to be indexed

const (
	WORDS_INDEX_TYPE_NO     WordsIndexType = iota // No Type
	WORDS_INDEX_TYPE_PAGE                         // The type is page
	WORDS_INDEX_TYPE_AROUND                       // The type is around link
)
