Documentation ¶
Overview ¶
dspiders means Distributed Spider.
It is a web page crawling and full text indexing package.
dspiders has two parts, the Crawl Machine and the Process Machine, and they are distributed.
Crawl Machine:
This part is a crawl node and can be deployed anywhere you want. It connects to the Process Machine using the nst package. It gets url information from the Process Machine and crawls the page. If the page has changed, it sends the page content (only the content inside the <body> tag, with all html tags removed) to the Process Machine (as the PageData type), analyzes the links in the page (html <a> tags), and pushes all links to the Process Machine (as the UrlBasic type). At present, the Crawl Machine does not crawl media file data (css, js, jpeg, png, etc.).
Process Machine:
This part processes all data coming from the Crawl Machine. It is made up of several small parts.
NetTransport. It manages the communication between the Crawl Machine and the Process Machine, using the nst package.
UrlCrawlQueue. It manages a queue that stores the urls waiting to be crawled. When a Crawl Machine asks for a new url, one is provided from this queue.
PagesProcess. It processes the page content data and url data coming from the Crawl Machine. The urls that need to be crawled are sent to the UrlCrawlQueue. The page content is reprocessed and stored. Finally, the reprocessed page content is sent to WordsProcess.
WordsProcess. It manages the full text index for page content. It splits the page content into slices by sentence, word, and character, recording each index position. It then stores the index information for each page and merges/changes the words context relationship information.
Words Context Relationship Information:
The words index stored in dspiders uses the Roles context status. It records the position of a word in the page content, and which words come before and after that word in the sentence.
For example, the page url is http://for.example.com/a and the page content is: Today is sunday, We have a funny holiday.
The page content is first reduced to plain text: "today is sunday \n we have a funny holiday". It has no punctuation, only line breaks, and every line is one sentence. The text content is then processed into a map of slices, map[uint64][]string, which looks like this: [0]{today, is, sunday}, [16]{we, have, a, funny, holiday}.
In the next step, dspiders stores the relationships. For example, take the "is" Role: it finds "sunday" in the down (following) context relationship and stores the link and position 2 of the word "sunday"; it finds "today" in the up (preceding) context relationship and stores the link and position 0 of the word "today".
Now we have recorded, for each word, which links it appears in, its position in the content, and which words come before and after it.
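To make that shape concrete, here is a minimal sketch in plain Go (not part of the dspiders API; the helper buildIndex is hypothetical) that reduces the example text to the map[uint64][]string form described above:

package main

import (
	"fmt"
	"strings"
)

// buildIndex is a hypothetical helper: it splits plain page text into
// sentences (one per line) and keys each sentence's words by the
// sentence's byte offset in the text, producing the map[uint64][]string
// shape described above.
func buildIndex(text string) map[uint64][]string {
	index := make(map[uint64][]string)
	var offset uint64
	for _, line := range strings.Split(text, "\n") {
		words := strings.Fields(line)
		if len(words) > 0 {
			index[offset] = words
		}
		offset += uint64(len(line)) + 1 // +1 for the line break
	}
	return index
}

func main() {
	fmt.Println(buildIndex("today is sunday\nwe have a funny holiday"))
	// map[0:[today is sunday] 16:[we have a funny holiday]]
}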
Index ¶
- Constants
- Variables
- type AroundLink
- type CrawlMachine
- type MediaData
- type NetDataStatus
- type NetTransportData
- type NetTransportDataRe
- type NetTransportHandle
- type NetTransportOperate
- type PageData
- type PageSentences
- type PagesProcess
- type SentencesIndex
- type UrlBasic
- type UrlChannel
- type UrlCrawlQueue
- type WordIndex
- type WordsIndexProcess
- type WordsIndexRequest
- type WordsIndexType
Constants ¶
const (
	URL_CRAWL_QUEUE_CAP          uint = 1000  // The url crawl queue's capacity.
	CRAWL_MACHINE_CRAWL_INTERVAL      = 3     // The crawl machine's crawl interval, in seconds.
	WORDS_PROCESS_QUEUE_CAP      uint = 10000 // The words process queue's capacity.
	UP_INTERVAL_DEFAULT = 86400   // The default page recrawl (update) interval, in seconds (1 day).
	UP_INTERVAL_MIN     = 600     // The minimum page recrawl (update) interval, in seconds (10 minutes).
	UP_INTERVAL_MAX     = 8640000 // The maximum page recrawl (update) interval, in seconds (100 days).
)
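For example, the UP_INTERVAL_* bounds keep a page's recrawl interval between ten minutes and one hundred days. A hypothetical helper, written as if inside the package, might apply them like this:

// clampUpInterval is a hypothetical helper (not part of the package
// API) that clamps a recrawl interval, in seconds, to the documented
// UP_INTERVAL_MIN..UP_INTERVAL_MAX range.
func clampUpInterval(sec int64) int64 {
	if sec < UP_INTERVAL_MIN {
		return UP_INTERVAL_MIN
	}
	if sec > UP_INTERVAL_MAX {
		return UP_INTERVAL_MAX
	}
	return sec
}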
Variables ¶
var AcceptLanguage = []string{
"en-US,en;q=0.5",
"zh-CN,zh;q=0.8,en;q=0.6",
}
var UserAgent = []string{
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
}
Functions ¶
This section is empty.
Types ¶
type AroundLink ¶
A link which is not in the domain.
type CrawlMachine ¶
type CrawlMachine struct {
// contains filtered or unexported fields
}
func NewCrawlMachine ¶
func NewCrawlMachine(tcp *nst2.Client, name, code string) (c *CrawlMachine)
func (*CrawlMachine) Close ¶
func (c *CrawlMachine) Close()
func (*CrawlMachine) Start ¶
func (c *CrawlMachine) Start()
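A minimal usage sketch, written as if inside the package. The *nst2.Client must come from your own nst setup (its construction is not covered here), the name and code values are placeholders, and whether Start blocks or returns immediately is not stated in these docs:

// runCrawler is a hypothetical sketch: tcp is a client assumed to be
// already connected to the Process Machine.
func runCrawler(tcp *nst2.Client) {
	c := NewCrawlMachine(tcp, "spider-1", "identity-code") // placeholder name and code
	defer c.Close()                                        // stop the crawl machine on shutdown
	c.Start()                                              // request urls from the Process Machine and crawl them
}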
type MediaData ¶
type MediaData struct {
	roles.Role
	Url        string    // The complete link address
	Ver        uint64    // The page version
	UpTime     time.Time // The update time
	UpInterval int64     // The update interval (waits UpInterval seconds until the next update)
	Domain     string    // The domain name
	Spider     string    // The spider machine's name
	MediaType  int       // The media's type
	MediaName  string    // The media's name
	DataSaved  bool      // Whether the data has already been saved
	DataBody   []byte    // The media's data body
	Hash       string    // The sha1 hash signature
}
One media item's data, for example css, images, etc.
type NetDataStatus ¶
type NetDataStatus uint
The data status for network transport.
const (
	NET_DATA_STATUS_NO              NetDataStatus = iota // No status
	NET_DATA_STATUS_OK                                   // All OK
	NET_DATA_STATUS_PAGE_UPDATE                          // The page's content was changed; update it
	NET_DATA_STATUS_PAGE_NOT_UPDATE                      // The page's content was not changed; no update needed
	NET_DATA_ERROR                                       // Some error occurred
)
type NetTransportData ¶
type NetTransportData struct {
	Name     string              // The sender's name
	Code     string              // The sender's identity code
	Operate  NetTransportOperate // The operate code
	Status   NetDataStatus       // The data status
	Domain   string              // The domain, if needed
	SiteName string              // The site name, if needed
	Data     []byte              // The data body; it can be a PageData, a UrlBasic, and so on
}
The data which is transported over the network
type NetTransportDataRe ¶
type NetTransportDataRe struct {
	Status NetDataStatus // The data status
	Data   []byte        // The data body; it can be a PageData, a UrlBasic, and so on
}
The data which is transported over the network - what the net transport sends back
type NetTransportHandle ¶
type NetTransportHandle struct {
// contains filtered or unexported fields
}
The network transport handle.
It implements the nst.ConnExecer interface.
It handles the url crawl queue's requests, the pages' storage, the words index, and other operations.
func NewNetTransportHandle ¶
func NewNetTransportHandle(u *UrlCrawlQueue, p map[string]*PagesProcess, w *WordsIndexProcess, i *cpool.Section) (n *NetTransportHandle)
func (*NetTransportHandle) Close ¶
func (n *NetTransportHandle) Close()
func (*NetTransportHandle) Start ¶
func (n *NetTransportHandle) Start()
type NetTransportOperate ¶
type NetTransportOperate uint
The network transport operate code
const (
	NET_TRANSPORT_OPERATE_NO                   NetTransportOperate = iota // The code is null
	NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_ADD                             // Add some urls' basic information to the url crawl queue
	NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_GET                             // Get one url's basic information from the url crawl queue
	NET_TRANSPORT_OPERATE_SEND_PAGE_DATA                                  // Send crawled page data to the pages process
	NET_TRANSPORT_OPERATE_SEND_MEDIA_DATA                                 // Send crawled media data to the pages process
)
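As a rough illustration of the protocol these codes drive, here is a hypothetical dispatch sketch over an incoming NetTransportData (the package's real routing lives inside NetTransportHandle and is unexported, and the wire encoding of Data is not shown in these docs):

// dispatch is a hypothetical sketch of operate code handling; the
// decode steps are left as comments because the wire encoding is not
// documented here.
func dispatch(d NetTransportData) NetTransportDataRe {
	switch d.Operate {
	case NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_ADD:
		// d.Data would decode to []UrlBasic and be pushed onto the url crawl queue.
	case NET_TRANSPORT_OPERATE_URL_CRAWL_QUEUE_GET:
		// One UrlBasic would be popped from the queue and encoded into the reply's Data.
	case NET_TRANSPORT_OPERATE_SEND_PAGE_DATA:
		// d.Data would decode to a PageData and go to the pages process.
	case NET_TRANSPORT_OPERATE_SEND_MEDIA_DATA:
		// d.Data would decode to a MediaData and go to the pages process.
	}
	return NetTransportDataRe{Status: NET_DATA_STATUS_OK}
}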
type PageData ¶
type PageData struct {
	roles.Role
	Url         string    // The complete link address
	Ver         uint64    // The page version
	UpTime      time.Time // The update time
	UpInterval  int64     // The update interval (waits UpInterval seconds until the next update)
	Domain      string    // The domain name
	Spider      string    // The spider machine's name
	KeyWords    []string  // The key words, from the html header's meta name=keywords
	HeaderTitle string    // The page's title, from <header><title></title></header>
	BodyContent string    // The page's body content, from <body></body>, as plain text
	Hash        string    // The sha1 hash signature of the page's body content (the BodyContent field)
}
One page's data.
type PageSentences ¶
type PageSentences struct {
	roles.Role
	Url       string              // The complete link address
	Ver       uint64              // The page version
	Sentences map[uint64][]string // The sentences index
}
One page's sentences and word location index
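Continuing the overview's example, a PageSentences value for http://for.example.com/a would carry the sentence map sketched there. A hypothetical literal, written as if inside the package (the embedded roles.Role is left at its zero value, and the version number is assumed):

// A hypothetical literal matching the overview's example.
ps := PageSentences{
	Url: "http://for.example.com/a",
	Ver: 1, // version assumed for illustration
	Sentences: map[uint64][]string{
		0:  {"today", "is", "sunday"},
		16: {"we", "have", "a", "funny", "holiday"},
	},
}
_ = ps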
type PagesProcess ¶
type PagesProcess struct {
// contains filtered or unexported fields
}
Handles the storage of pages that the crawler gets.
func NewPagesProcess ¶
func NewPagesProcess(sitename string, config *cpool.Section, crawlQueue *UrlCrawlQueue, indexQueue *WordsIndexProcess, drule *operator.Operator) (p *PagesProcess, err error)
func (*PagesProcess) AddMedia ¶
func (p *PagesProcess) AddMedia(media *MediaData) (err error)
Add a media data item that the crawler got to the store.
func (*PagesProcess) AddPage ¶
func (p *PagesProcess) AddPage(page *PageData, status NetDataStatus) (err error)
Add a page data item that the crawler got to the store.
func (*PagesProcess) AddUrls ¶
func (p *PagesProcess) AddUrls(urls []UrlBasic) (err error)
Add urls that the crawler got to the crawl queue.
func (*PagesProcess) Close ¶
func (p *PagesProcess) Close()
func (*PagesProcess) Start ¶
func (p *PagesProcess) Start()
type SentencesIndex ¶
The sentences location index in one page
type UrlBasic ¶
type UrlBasic struct {
	SiteName string // The site name
	Domain   string // The domain this url belongs to
	Url      string // The url itself
	Text     string // The url's name
	Hash     string // The hash of the url page's last version
	Ver      uint64 // The url page's last version
	Filter   bool   // If this is a list url, this is true: only the urls are extracted, and the page data is not stored
}
The url's basic information for the url queue channel
type UrlCrawlQueue ¶
type UrlCrawlQueue struct {
// contains filtered or unexported fields
}
The url crawl queue
func NewUrlCrawlQueue ¶
func NewUrlCrawlQueue() (u *UrlCrawlQueue)
Initialize the url crawl queue; the channel's length is the constant URL_CRAWL_QUEUE_CAP
func (*UrlCrawlQueue) Add ¶
func (u *UrlCrawlQueue) Add(ub UrlBasic) (err error)
Add one url's basic information to the url crawl queue
func (*UrlCrawlQueue) Get ¶
func (u *UrlCrawlQueue) Get() (ub UrlBasic, err error)
Get one url's basic information from the url crawl queue
func (*UrlCrawlQueue) List ¶
func (u *UrlCrawlQueue) List() (list []string)
List all urls in the queue
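A minimal usage sketch. The import path and the UrlBasic field values are assumptions for illustration, and the exact conditions under which Add or Get return an error are not documented here:

package main

import (
	"fmt"

	"github.com/idcsource/dspiders" // import path assumed; adjust to where the package lives
)

func main() {
	q := dspiders.NewUrlCrawlQueue()
	if err := q.Add(dspiders.UrlBasic{
		SiteName: "example",
		Domain:   "for.example.com",
		Url:      "http://for.example.com/a",
		Text:     "Example Page",
	}); err != nil {
		// Add can fail; the exact conditions are not documented here.
		fmt.Println("add:", err)
		return
	}
	ub, err := q.Get()
	if err != nil {
		fmt.Println("get:", err)
		return
	}
	fmt.Println(ub.Url) // http://for.example.com/a
}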
type WordsIndexProcess ¶
type WordsIndexProcess struct {
// contains filtered or unexported fields
}
The words index process
func NewWordsIndexProcess ¶
func (*WordsIndexProcess) Add ¶
func (w *WordsIndexProcess) Add(req *WordsIndexRequest) (err error)
Add a words index request to the queue
func (*WordsIndexProcess) ReturnQueue ¶
func (w *WordsIndexProcess) ReturnQueue() chan *WordsIndexRequest
Return the index wait queue
type WordsIndexRequest ¶
type WordsIndexRequest struct {
	Url        string         // The index queue url
	Domain     string         // The url's domain
	Type       WordsIndexType // The index type
	PageData   *PageData      // The page data
	AroundLink *AroundLink    // The around link
}
One words index request
type WordsIndexType ¶
type WordsIndexType uint
The type of item which is waiting to be indexed
const (
	WORDS_INDEX_TYPE_NO     WordsIndexType = iota // No type
	WORDS_INDEX_TYPE_PAGE                         // The type is page
	WORDS_INDEX_TYPE_AROUND                       // The type is around link
)
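Tying the last two types together, a page-type index request might be built like this. This is a hypothetical sketch written as if inside the package; the PageData values are placeholders, and the WordsIndexProcess construction is not shown in these docs:

// w is assumed to be constructed via NewWordsIndexProcess
// (whose signature is not shown above).
var w *WordsIndexProcess

pd := &PageData{
	Url:    "http://for.example.com/a", // placeholder page
	Domain: "for.example.com",
}
req := &WordsIndexRequest{
	Url:      pd.Url,
	Domain:   pd.Domain,
	Type:     WORDS_INDEX_TYPE_PAGE,
	PageData: pd,
}
if err := w.Add(req); err != nil {
	// the request could not be queued; handle the error
}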