Documentation ¶
Overview ¶
Package bslc provides an IP-bound crawler with channel-based delivery of crawled content.
Example ¶
package main

import (
	"fmt"
	"github.com/ragnar-johannsson/bslc"
	"github.com/ragnar-johannsson/bslc/mimetypes"
	"sync"
)

func main() {
	// Initialize URL container with IPnets filter and add seed URLs
	allowedNetworks := bslc.NewIPNetContainer([]string{"127.0.0.0/8"})
	seedUrls := []string{"http://127.0.0.1/"}
	urls := bslc.NewLocalURLContainer(allowedNetworks, seedUrls)

	// Initialize crawler
	crawler := bslc.Crawler{
		URLs:                     urls,
		MaxConcurrentConnections: 5,
	}

	// Register mimetype handler channel with crawler
	ch := make(chan *bslc.Content)
	crawler.AddMimeTypes(mimetypes.Audio, ch)

	// Start content handlers
	wg := sync.WaitGroup{}
	for i := 0; i < crawler.MaxConcurrentConnections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for content := range ch {
				// Process received content
				fmt.Println("Received content from URL: ", content.URL.String())

				// Signal when done with content
				content.Done <- true
			}
		}()
	}

	// Start crawling and wait until done
	crawler.StartCrawling()
	wg.Wait()
}
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Content ¶
type Content struct {
	URL         url.URL
	ContentType string
	Filename    string
	Done        chan bool
	// contains filtered or unexported fields
}
Content represents the returned content from a crawled URL. The content's origin URL is saved in URL. ContentType holds the MIME type as specified by the remote server. Filename is the content's filename, or an empty string if one cannot be determined. Done is a channel that must be signalled after all processing of the content is done.
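The Done handshake described above can be illustrated with a small stand-alone sketch. It does not use bslc itself: the `content` type, `deliver`, `handle`, and `run` are hypothetical names, and the assumption that the sending side blocks until Done is signalled is a guess about the crawler's behaviour, not something the documentation states.

```go
package main

import (
	"fmt"
	"net/url"
)

// content is a hypothetical stand-in for bslc.Content, reduced to the
// two fields this sketch needs.
type content struct {
	URL  url.URL
	Done chan bool
}

// deliver sends c to the handler channel and then waits for the handler
// to signal Done. Waiting here is an assumption made for illustration.
func deliver(ch chan<- *content, c *content) {
	ch <- c
	<-c.Done
}

// handle records the URL of each received item and then signals Done,
// mirroring the handler loop in the package example.
func handle(ch <-chan *content, seen *[]string) {
	for c := range ch {
		*seen = append(*seen, c.URL.String())
		c.Done <- true
	}
}

// run pushes the given raw URLs through a single handler and returns
// the URLs the handler saw, in order. Unparsable URLs are skipped.
func run(rawURLs []string) []string {
	ch := make(chan *content)
	var seen []string
	finished := make(chan struct{})
	go func() {
		handle(ch, &seen)
		close(finished)
	}()
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			continue
		}
		deliver(ch, &content{URL: *u, Done: make(chan bool)})
	}
	close(ch)
	<-finished
	return seen
}

func main() {
	fmt.Println(run([]string{"http://127.0.0.1/a.mp3", "http://127.0.0.1/b.mp3"}))
}
```

If a handler never signals Done in this sketch, `deliver` blocks forever, which is one plausible reason the documentation insists on the signal.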
type Crawler ¶
type Crawler struct {
	URLs                     URLContainer
	MaxConcurrentConnections int
	// contains filtered or unexported fields
}
Crawler does the heavy lifting of the actual crawling. URLs is the URLContainer used for bookkeeping and must be initialized. MaxConcurrentConnections is optional; the default is 5 concurrent transfers.
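As a rough illustration of what a cap on concurrent transfers means, the sketch below limits running jobs with a buffered channel used as a semaphore. It is not the package's implementation: `runLimited` and `defaultMaxConnections` are hypothetical names, and treating a non-positive limit as "use the default of 5" is an assumption about how the optional field might be interpreted.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultMaxConnections mirrors the documented default of 5.
const defaultMaxConnections = 5

// runLimited runs all jobs with at most limit of them active at once and
// returns the highest concurrency observed. A non-positive limit falls
// back to the default (an assumption made for this sketch).
func runLimited(limit int, jobs []func()) int {
	if limit <= 0 {
		limit = defaultMaxConnections
	}
	sem := make(chan struct{}, limit) // one slot per allowed transfer
	var (
		mu     sync.Mutex
		active int
		peak   int
		wg     sync.WaitGroup
	)
	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit jobs are already running
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the job ends

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			job()

			mu.Lock()
			active--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	jobs := make([]func(), 20)
	for i := range jobs {
		jobs[i] = func() {}
	}
	fmt.Println("peak concurrency:", runLimited(0, jobs))
}
```

The `active` counter in this sketch plays a role similar to what ActiveTransfers reports, although how the package tracks it internally is not documented here.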
func (*Crawler) ActiveTransfers ¶
func (c *Crawler) ActiveTransfers() int
ActiveTransfers returns the current number of active transfers.
func (*Crawler) AddMimeType ¶
AddMimeType registers channel ch to receive content of the specified mimeType.
func (*Crawler) AddMimeTypes ¶
AddMimeTypes registers channel ch to receive content of the specified mimeTypes.
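The signatures of AddMimeType and AddMimeTypes are not shown on this page, so the following is only a sketch of how such a registration could work in principle. The `registry` type, its methods, and the use of `chan<- string` as the channel type are all hypothetical; the real methods take a channel of `*bslc.Content`, as the package example shows. Stripping parameters from the Content-Type header with `mime.ParseMediaType` is likewise an assumption.

```go
package main

import (
	"fmt"
	"mime"
)

// registry is a hypothetical mapping from MIME type to handler channel.
type registry struct {
	handlers map[string]chan<- string
}

func newRegistry() *registry {
	return &registry{handlers: make(map[string]chan<- string)}
}

// addMimeType registers ch for a single MIME type.
func (r *registry) addMimeType(mimeType string, ch chan<- string) {
	r.handlers[mimeType] = ch
}

// addMimeTypes registers ch for every MIME type in the slice.
func (r *registry) addMimeTypes(mimeTypes []string, ch chan<- string) {
	for _, t := range mimeTypes {
		r.addMimeType(t, ch)
	}
}

// lookup finds the channel for a Content-Type header value, ignoring
// any parameters such as charset or codecs.
func (r *registry) lookup(contentType string) (chan<- string, bool) {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return nil, false
	}
	ch, ok := r.handlers[mediaType]
	return ch, ok
}

func main() {
	ch := make(chan string, 1)
	r := newRegistry()
	r.addMimeTypes([]string{"audio/mpeg", "audio/ogg"}, ch)

	if out, ok := r.lookup("audio/ogg; codecs=vorbis"); ok {
		out <- "http://127.0.0.1/song.ogg"
		fmt.Println("delivered:", <-ch)
	}
	_, ok := r.lookup("text/html")
	fmt.Println("text/html registered:", ok)
}
```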
func (*Crawler) StartCrawling ¶
func (c *Crawler) StartCrawling()
StartCrawling starts the crawling process.
func (*Crawler) StopCrawling ¶
func (c *Crawler) StopCrawling()
StopCrawling stops the crawling process. Transfers in progress will be completed.
func (*Crawler) TotalTransfers ¶
func (c *Crawler) TotalTransfers() int
TotalTransfers returns the total number of transfers initiated.
type IPNetContainer ¶
type IPNetContainer struct {
// contains filtered or unexported fields
}
IPNetContainer is a container for IP networks.
func NewIPNetContainer ¶
func NewIPNetContainer(networks []string) IPNetContainer
NewIPNetContainer returns a new IPNetContainer encompassing the networks specified. Networks are given in CIDR notation.
func (*IPNetContainer) Contains ¶
func (i *IPNetContainer) Contains(host string) bool
Contains returns true if the container includes the given host and false otherwise. The specified host can be a DNS name or an IP address.
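A membership check of this kind can be sketched with the standard `net` package. The `ipNets` type and its functions below are hypothetical stand-ins, not the package's code. Two behaviours are assumptions: CIDR strings that fail to parse are silently skipped, and a DNS name counts as contained if any of its resolved addresses falls inside an allowed network.

```go
package main

import (
	"fmt"
	"net"
)

// ipNets is a hypothetical stand-in for IPNetContainer.
type ipNets struct {
	nets []*net.IPNet
}

// newIPNets parses networks given in CIDR notation. Entries that do not
// parse are skipped (an assumption made for this sketch).
func newIPNets(cidrs []string) ipNets {
	var c ipNets
	for _, s := range cidrs {
		if _, n, err := net.ParseCIDR(s); err == nil {
			c.nets = append(c.nets, n)
		}
	}
	return c
}

// contains reports whether host, an IP literal or a DNS name, resolves
// to at least one address inside any of the stored networks.
func (c *ipNets) contains(host string) bool {
	var ips []net.IP
	if ip := net.ParseIP(host); ip != nil {
		ips = []net.IP{ip}
	} else if resolved, err := net.LookupIP(host); err == nil {
		ips = resolved
	}
	for _, ip := range ips {
		for _, n := range c.nets {
			if n.Contains(ip) {
				return true
			}
		}
	}
	return false
}

func main() {
	c := newIPNets([]string{"127.0.0.0/8"})
	fmt.Println(c.contains("127.0.0.1"), c.contains("10.1.2.3"))
}
```

Only IP literals are used in `main` so that the sketch runs without network access; passing a DNS name triggers a resolver lookup.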
type URLContainer ¶
URLContainer is a container for URLs encountered during crawling. Call AddURL to add new URLs to the container. If a URL has ever been added to the container before, adding it again has no effect. NextURL returns the next URL in line and removes it from the container, or an empty string and an error if the container is empty. Len returns the number of URLs in the container.
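The behaviour described above can be sketched as a small in-memory queue with a "seen" set. The `memURLs` type is hypothetical, and the method signatures are assumptions, since the interface definition is not shown on this page. The sketch also leaves out IP filtering and is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// memURLs is a hypothetical in-memory container: a FIFO queue plus a
// set of every URL ever added.
type memURLs struct {
	queue []string
	seen  map[string]bool
}

func newMemURLs(seed []string) *memURLs {
	m := &memURLs{seen: make(map[string]bool)}
	for _, u := range seed {
		m.AddURL(u)
	}
	return m
}

// AddURL queues u unless it has been added at any earlier point, even
// if it has since been handed out by NextURL.
func (m *memURLs) AddURL(u string) {
	if m.seen[u] {
		return
	}
	m.seen[u] = true
	m.queue = append(m.queue, u)
}

// NextURL removes and returns the oldest queued URL, or an empty string
// and an error when nothing is queued.
func (m *memURLs) NextURL() (string, error) {
	if len(m.queue) == 0 {
		return "", errors.New("container is empty")
	}
	u := m.queue[0]
	m.queue = m.queue[1:]
	return u, nil
}

// Len returns the number of URLs still queued.
func (m *memURLs) Len() int {
	return len(m.queue)
}

func main() {
	m := newMemURLs([]string{"http://127.0.0.1/"})
	m.AddURL("http://127.0.0.1/a")
	m.AddURL("http://127.0.0.1/") // duplicate, ignored
	for m.Len() > 0 {
		u, _ := m.NextURL()
		fmt.Println(u)
	}
}
```

Keeping the seen set separate from the queue is what lets a URL stay "already added" after NextURL has removed it, so a crawler does not revisit pages it has already fetched.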
func NewLocalURLContainer ¶
func NewLocalURLContainer(allowedNetworks IPNetContainer, seedUrls []string) URLContainer
NewLocalURLContainer returns a new URLContainer, stored locally in memory without any persistence between sessions and without the ability to share state between remote crawler instances. The allowedNetworks IP container handles the IP filtering, and seedUrls are the URLs to bootstrap with.
Directories ¶
Path | Synopsis |
---|---|
examples | |
mimetypes | Package mimetypes defines MIME type string slices for common media types, to be used for convenience. |