Documentation ¶
Overview ¶
dspiders means Distributed Spider.
It is a web page crawling and full text indexing package.
dspiders has two parts, the Crawl Machine and the Process Machine, and they are distributed.
Crawl Machine:
This part is a crawl node and can be deployed anywhere you want. It connects to the Process Machine using the nst package. It gets url information from the Process Machine and crawls the page. If the page has changed, it sends the page content (only the content inside the <body> tag, with all html tags removed) to the Process Machine (as the PageData type), analyzes the links in the page (html <a> tags), and pushes all links to the Process Machine (as the UrlBasic type). At present, the Crawl Machine does not crawl media file data (css, js, jpeg, png, etc.).
Process Machine:
This part processes all data coming from the Crawl Machine. It is made up of several small parts.
NetTransport. It manages the communication between the Crawl Machine and the Process Machine, using the nst package.
UrlCrawlQueue. It manages a queue that stores the urls waiting to be crawled. When a Crawl Machine asks for a new url, one is provided from this queue.
PagesProcess. It processes the page content data and url data coming from the Crawl Machine. The urls that need to be crawled are sent to the UrlCrawlQueue. The page content is reprocessed and stored. Finally, the reprocessed page content is sent to WordsProcess.
WordsProcess. It manages the full text index for page content. It splits the page content into slices by sentence, word, and character, recording each index position. It then stores the index information for each page and merges/changes the words context relationship information.
Words Context Relationship Information:
The words index stored in dspiders uses the Roles context status. It records the position of a word in the page content, and which words come before and after that word in the sentence.
For example, the page url is http://for.example.com/a and the page content is: Today is sunday, We have a funny holiday.
The page content is first reduced to plain text: "today is sunday \n we have a funny holiday". It has no punctuation, only line breaks, and every line is one sentence. The text content is then processed into a map of slices, map[uint64][]string, which looks like this: [0]{today, is, sunday}, [16]{we, have, a, funny, holiday}.
In the next step, dspiders stores the relationships. For example, take the "is" Role: it finds "sunday" in the down (following) context relationship and stores the link and position 2 of the word "sunday"; it finds "today" in the up (preceding) context relationship and stores the link and position 0 of the word "today".
Now we have recorded, for each word, which links it appears in, its position in the content, and which words come before and after it.
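To make that shape concrete, here is a minimal sketch in plain Go (not part of the dspiders API; the helper buildIndex is hypothetical) that reduces the example text to the map[uint64][]string form described above:

package main

import (
	"fmt"
	"strings"
)

// buildIndex is a hypothetical helper: it splits plain page text into
// sentences (one per line) and keys each sentence's words by the
// sentence's byte offset in the text, producing the map[uint64][]string
// shape described above.
func buildIndex(text string) map[uint64][]string {
	index := make(map[uint64][]string)
	var offset uint64
	for _, line := range strings.Split(text, "\n") {
		words := strings.Fields(line)
		if len(words) > 0 {
			index[offset] = words
		}
		offset += uint64(len(line)) + 1 // +1 for the line break
	}
	return index
}

func main() {
	fmt.Println(buildIndex("today is sunday\nwe have a funny holiday"))
	// map[0:[today is sunday] 16:[we have a funny holiday]]
}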
Index ¶
- Constants
- Variables
- type AroundLink
- type CrawlMachine
- type MediaData
- type NetDataStatus
- type NetTransportData
- type NetTransportDataRe
- type NetTransportHandle
- type NetTransportOperate
- type PageData
- type PageSentences
- type PagesProcess
- type SentencesIndex
- type UrlBasic
- type UrlChannel
- type UrlCrawlQueue
- type WordIndex
- type WordsIndexProcess
- type WordsIndexRequest
- type WordsIndexType
Constants ¶
const (
	URL_CRAWL_QUEUE_CAP          uint = 1000  // The url crawl queue's capacity.
	CRAWL_MACHINE_CRAWL_INTERVAL      = 3     // The crawl machine's crawl interval, in seconds.
	WORDS_PROCESS_QUEUE_CAP      uint = 10000 // The words process queue's capacity.
	UP_INTERVAL_DEFAULT = 86400   // The default page recrawl (update) interval, in seconds (1 day).
	UP_INTERVAL_MIN     = 600     // The minimum page recrawl (update) interval, in seconds (10 minutes).
	UP_INTERVAL_MAX     = 8640000 // The maximum page recrawl (update) interval, in seconds (100 days).
)
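For example, the UP_INTERVAL_* bounds keep a page's recrawl interval between ten minutes and one hundred days. A hypothetical helper, written as if inside the package, might apply them like this:

// clampUpInterval is a hypothetical helper (not part of the package
// API) that clamps a recrawl interval, in seconds, to the documented
// UP_INTERVAL_MIN..UP_INTERVAL_MAX range.
func clampUpInterval(sec int64) int64 {
	if sec < UP_INTERVAL_MIN {
		return UP_INTERVAL_MIN
	}
	if sec > UP_INTERVAL_MAX {
		return UP_INTERVAL_MAX
	}
	return sec
}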
Variables ¶
var AcceptLanguage = []string{
"en-US,en;q=0.5",
"zh-CN,zh;q=0.8,en;q=0.6",
}
var UserAgent = []string{
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
}
Functions ¶
This section is empty.
Types ¶
type AroundLink ¶
A link which is not in the domain.
type CrawlMachine ¶
type CrawlMachine struct {
// contains filtered or unexported fields
}
func NewCrawlMachine ¶
func NewCrawlMachine(tcp *nst2.Client, name, code string) (c *CrawlMachine)
func (*CrawlMachine) Close ¶
func (c *CrawlMachine) Close()
func (*CrawlMachine) Start ¶
func (c *CrawlMachine) Start()
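A minimal usage sketch, written as if inside the package. The *nst2.Client must come from your own nst setup (its construction is not covered here), the name and code values are placeholders, and whether Start blocks or returns immediately is not stated in these docs:

// runCrawler is a hypothetical sketch: tcp is a client assumed to be
// already connected to the Process Machine.
func runCrawler(tcp *nst2.Client) {
	c := NewCrawlMachine(tcp, "spider-1", "identity-code") // placeholder name and code
	defer c.Close()                                        // stop the crawl machine on shutdown
	c.Start()                                              // request urls from the Process Machine and crawl them
}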
type MediaData ¶
type MediaData struct {
	roles.Role
	Url        string    // The complete link address
	Ver        uint64    // The page version
	UpTime     time.Time // The update time
	UpInterval int64     // The update interval (waits UpInterval seconds until the next update)
	Domain     string    // The domain name
	Spider     string    // The spider machine's name
	MediaType  int       // The media's type
	MediaName  string    // The media's name
	DataSaved  bool      // Whether the data has already been saved
	DataBody   []byte    // The media's data body
	Hash       string    // The sha1 hash signature
}
One media item's data, for example css, images, etc.
type NetDataStatus ¶
type NetDataStatus uint
The data status for network transport.
const (
	NET_DATA_STATUS_NO              NetDataStatus = iota // No status
	NET_DATA_STATUS_OK                                   // All OK
	NET_DATA_STATUS_PAGE_UPDATE                          // The page's content was changed; update it
	NET_DATA_STATUS_PAGE_NOT_UPDATE                      // The page's content was not changed; no update needed
	NET_DATA_ERROR                                       // Some error occurred
)
type NetTransportData ¶
type NetTransportData struct {
	Name     string              // The sender's name
	Code     string              // The sender's identity code
	Operate  NetTransportOperate // The operate code
	Status   NetDataStatus       // The data status
	Domain   string              // The domain, if needed
	SiteName string              // The site name, if needed
	Data     []byte              // The data body; it can be a PageData, a UrlBasic, and so on
}
The data which is transported over the network
type NetTransportDataRe ¶
type NetTransportDataRe struct {
	Status NetDataStatus // The data status
	Data   []byte        // The data body; it can be a PageData, a UrlBasic, and so on
}
The data which is transported over the network - what the net transport sends back
type NetTransportHandle ¶
type NetTransportHandle struct {
// contains filtered or unexported fields
}
The network transport handle.
It implements the nst.ConnExecer interface.
It handles the url crawl queue's requests, the pages' storage, the words index, and other operations.
func NewNetTransportHandle ¶
func NewNetTransportHandle(u *UrlCrawlQueue, p map[string]*PagesProcess, w *WordsIndexProcess, i *cpool.Section) (n *NetTransportHandle)
func (*NetTransportHandle) Close ¶
func (n *NetTransportHandle) Close()
func (*NetTransportHandle) Start ¶
func (n *NetTransportHandle) Start()
type NetTransportOperate ¶
type NetTransportOperate uint
The network transport operate code
const (
	NET_TRANSPORT_OPERATE_NO                   NetTransportOperate = iota // The code is null
	NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_ADD                             // Add some urls' basic information to the url crawl queue
	NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_GET                             // Get one url's basic information from the url crawl queue
	NET_TRANSPORT_OPERATE_SEND_PAGE_DATA                                  // Send crawled page data to the pages process
	NET_TRANSPORT_OPERATE_SEND_MEDIA_DATA                                 // Send crawled media data to the pages process
)
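As a rough illustration of the protocol these codes drive, here is a hypothetical dispatch sketch over an incoming NetTransportData (the package's real routing lives inside NetTransportHandle and is unexported, and the wire encoding of Data is not shown in these docs):

// dispatch is a hypothetical sketch of operate code handling; the
// decode steps are left as comments because the wire encoding is not
// documented here.
func dispatch(d NetTransportData) NetTransportDataRe {
	switch d.Operate {
	case NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_ADD:
		// d.Data would decode to []UrlBasic and be pushed onto the url crawl queue.
	case NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_GET:
		// One UrlBasic would be popped from the queue and encoded into the reply's Data.
	case NET_TRANSPORT_OPERATE_SEND_PAGE_DATA:
		// d.Data would decode to a PageData and go to the pages process.
	case NET_TRANSPORT_OPERATE_SEND_MEDIA_DATA:
		// d.Data would decode to a MediaData and go to the pages process.
	}
	return NetTransportDataRe{Status: NET_DATA_STATUS_OK}
}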
type PageData ¶
type PageData struct {
	roles.Role
	Url         string    // The complete link address
	Ver         uint64    // The page version
	UpTime      time.Time // The update time
	UpInterval  int64     // The update interval (waits UpInterval seconds until the next update)
	Domain      string    // The domain name
	Spider      string    // The spider machine's name
	KeyWords    []string  // The key words, from the html header's meta name=keywords
	HeaderTitle string    // The page's title, from <header><title></title></header>
	BodyContent string    // The page's body content, from <body></body>, as plain text
	Hash        string    // The sha1 hash signature of the page's body content (the BodyContent field)
}
One page's data.
type PageSentences ¶
type PageSentences struct {
	roles.Role
	Url       string              // The complete link address
	Ver       uint64              // The page version
	Sentences map[uint64][]string // The sentences index
}
One page's sentences and word location index
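Continuing the overview's example, a PageSentences value for http://for.example.com/a would carry the sentence map sketched there. A hypothetical literal, written as if inside the package (the embedded roles.Role is left at its zero value, and the version number is assumed):

// A hypothetical literal matching the overview's example.
ps := PageSentences{
	Url: "http://for.example.com/a",
	Ver: 1, // version assumed for illustration
	Sentences: map[uint64][]string{
		0:  {"today", "is", "sunday"},
		16: {"we", "have", "a", "funny", "holiday"},
	},
}
_ = ps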
type PagesProcess ¶
type PagesProcess struct {
// contains filtered or unexported fields
}
Handles the storage of pages that the crawler gets.
func NewPagesProcess ¶
func NewPagesProcess(sitename string, config *cpool.Section, crawlQueue *UrlCrawlQueue, indexQueue *WordsIndexProcess, drule *operator.Operator) (p *PagesProcess, err error)
func (*PagesProcess) AddMedia ¶
func (p *PagesProcess) AddMedia(media *MediaData) (err error)
Add a media data item that the crawler got to the store.
func (*PagesProcess) AddPage ¶
func (p *PagesProcess) AddPage(page *PageData, status NetDataStatus) (err error)
Add a page data item that the crawler got to the store.
func (*PagesProcess) AddUrls ¶
func (p *PagesProcess) AddUrls(urls []UrlBasic) (err error)
Add urls that the crawler got to the crawl queue.
func (*PagesProcess) Close ¶
func (p *PagesProcess) Close()
func (*PagesProcess) Start ¶
func (p *PagesProcess) Start()
type SentencesIndex ¶
The sentences location index in one page
type UrlBasic ¶
type UrlBasic struct {
	SiteName string // The site name
	Domain   string // The domain this url belongs to
	Url      string // The url itself
	Text     string // The url's name
	Hash     string // The hash of the url page's last version
	Ver      uint64 // The url page's last version
	Filter   bool   // If this is a list url, this is true: only the urls are extracted, and the page data is not stored
}
The url's basic information for the url queue channel
type UrlCrawlQueue ¶
type UrlCrawlQueue struct {
// contains filtered or unexported fields
}
The url crawl queue
func NewUrlCrawlQueue ¶
func NewUrlCrawlQueue() (u *UrlCrawlQueue)
Initialize the url crawl queue; the channel's length is the constant URL_CRAWL_QUEUE_CAP
func (*UrlCrawlQueue) Add ¶
func (u *UrlCrawlQueue) Add(ub UrlBasic) (err error)
Add one url's basic information to the url crawl queue
func (*UrlCrawlQueue) Get ¶
func (u *UrlCrawlQueue) Get() (ub UrlBasic, err error)
Get one url's basic information from the url crawl queue
func (*UrlCrawlQueue) List ¶
func (u *UrlCrawlQueue) List() (list []string)
List all urls in the queue
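A minimal usage sketch. The import path and the UrlBasic field values are assumptions for illustration, and the exact conditions under which Add or Get return an error are not documented here:

package main

import (
	"fmt"

	"github.com/idcsource/dspiders" // import path assumed; adjust to where the package lives
)

func main() {
	q := dspiders.NewUrlCrawlQueue()
	if err := q.Add(dspiders.UrlBasic{
		SiteName: "example",
		Domain:   "for.example.com",
		Url:      "http://for.example.com/a",
		Text:     "Example Page",
	}); err != nil {
		// Add can fail; the exact conditions are not documented here.
		fmt.Println("add:", err)
		return
	}
	ub, err := q.Get()
	if err != nil {
		fmt.Println("get:", err)
		return
	}
	fmt.Println(ub.Url) // http://for.example.com/a
}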
type WordsIndexProcess ¶
type WordsIndexProcess struct {
// contains filtered or unexported fields
}
The words index process
func NewWordsIndexProcess ¶
func (*WordsIndexProcess) Add ¶
func (w *WordsIndexProcess) Add(req *WordsIndexRequest) (err error)
Add a words index request to the queue
func (*WordsIndexProcess) ReturnQueue ¶
func (w *WordsIndexProcess) ReturnQueue() chan *WordsIndexRequest
Return the index wait queue
type WordsIndexRequest ¶
type WordsIndexRequest struct {
	Url        string         // The index queue url
	Domain     string         // The url's domain
	Type       WordsIndexType // The index type
	PageData   *PageData      // The page data
	AroundLink *AroundLink    // The around link
}
One words index request
type WordsIndexType ¶
type WordsIndexType uint
The type of item which is waiting to be indexed
const (
	WORDS_INDEX_TYPE_NO     WordsIndexType = iota // No type
	WORDS_INDEX_TYPE_PAGE                         // The type is page
	WORDS_INDEX_TYPE_AROUND                       // The type is around link
)
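Tying the last two types together, a page-type index request might be built like this. This is a hypothetical sketch written as if inside the package; the PageData values are placeholders, and the WordsIndexProcess construction is not shown in these docs:

// w is assumed to be constructed via NewWordsIndexProcess
// (whose signature is not shown above).
var w *WordsIndexProcess

pd := &PageData{
	Url:    "http://for.example.com/a", // placeholder page
	Domain: "for.example.com",
}
req := &WordsIndexRequest{
	Url:      pd.Url,
	Domain:   pd.Domain,
	Type:     WORDS_INDEX_TYPE_PAGE,
	PageData: pd,
}
if err := w.Add(req); err != nil {
	// the request could not be queued; handle the error
}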