bslc

README

BSLC

Package bslc provides an IP-bound crawler with channel-based delivery of crawled content.

Usage

package main

import (
    "github.com/ragnar-johannsson/bslc"
    "github.com/ragnar-johannsson/bslc/mimetypes"
    "log"
    "sync"
)

func main() {
    // Initialize the URL container with an IP network filter and seed URLs
    allowedNetworks := bslc.NewIPNetContainer([]string{ "127.0.0.0/8" })
    seedUrls := []string{ "http://127.0.0.1/" }
    urls := bslc.NewLocalURLContainer(allowedNetworks, seedUrls)

    // Initialize crawler
    crawler := bslc.Crawler{
        URLs: urls,
        MaxConcurrentConnections: 5,
    }

    // Register mimetype handler channel with crawler
    ch := make(chan *bslc.Content)
    crawler.AddMimeTypes(mimetypes.Audio, ch)

    // Start content handlers
    wg := sync.WaitGroup{}
    for i := 0; i < crawler.MaxConcurrentConnections; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for content := range ch {
                // Process received content
                log.Println("Received content from URL: ", content.URL.String())

                // Signal when done with content
                content.Done <- true
            }
        }()
    }

    // Start crawling and wait until done
    crawler.StartCrawling()
    wg.Wait()
}

Further examples

See the examples/ directory for further usage examples.

License

BSD 2-Clause. See the LICENSE file for details.

Documentation

Overview

Package bslc provides an IP-bound crawler with channel-based delivery of crawled content.

Example
package main

import (
	"fmt"
	"github.com/ragnar-johannsson/bslc"
	"github.com/ragnar-johannsson/bslc/mimetypes"
	"sync"
)

func main() {
	// Initialize the URL container with an IP network filter and seed URLs
	allowedNetworks := bslc.NewIPNetContainer([]string{"127.0.0.0/8"})
	seedUrls := []string{"http://127.0.0.1/"}
	urls := bslc.NewLocalURLContainer(allowedNetworks, seedUrls)

	// Initialize crawler
	crawler := bslc.Crawler{
		URLs:                     urls,
		MaxConcurrentConnections: 5,
	}

	// Register mimetype handler channel with crawler
	ch := make(chan *bslc.Content)
	crawler.AddMimeTypes(mimetypes.Audio, ch)

	// Start content handlers
	wg := sync.WaitGroup{}
	for i := 0; i < crawler.MaxConcurrentConnections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for content := range ch {
				// Process received content
				fmt.Println("Received content from URL: ", content.URL.String())

				// Signal when done with content
				content.Done <- true
			}
		}()
	}

	// Start crawling and wait until done
	crawler.StartCrawling()
	wg.Wait()
}

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Content

type Content struct {
	URL         url.URL
	ContentType string
	Filename    string
	Done        chan bool
	// contains filtered or unexported fields
}

Content represents the content returned from a crawled URL. The content's origin URL is saved in URL. ContentType holds the MIME type as specified by the remote server. Filename is the content's filename, or an empty string if one cannot be determined. Done is a channel that must be signalled after all processing of the content is finished.

func (*Content) Reader

func (c *Content) Reader() io.Reader

Reader returns a new io.Reader for the crawled content.
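
A minimal handler sketch, assuming content arrives on a channel registered with the crawler as in the Usage example; the Handle function, its package name, and the logging are illustrative, not part of bslc:

package handler

import (
	"io"
	"log"

	"github.com/ragnar-johannsson/bslc"
)

// Handle drains each crawled body via Reader and signals Done when
// finished. Copying to io.Discard stands in for real processing, such
// as writing the bytes to disk under content.Filename.
func Handle(ch chan *bslc.Content) {
	for content := range ch {
		n, err := io.Copy(io.Discard, content.Reader())
		if err != nil {
			log.Printf("read %s: %v", content.URL.String(), err)
		} else {
			log.Printf("%d bytes of %s from %s", n, content.ContentType, content.URL.String())
		}

		// Signal that processing is finished so the crawler can move on.
		content.Done <- true
	}
}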

type Crawler

type Crawler struct {
	URLs                     URLContainer
	MaxConcurrentConnections int
	// contains filtered or unexported fields
}

Crawler does the heavy lifting of the actual crawling. URLs is the URLContainer used for bookkeeping and must be initialized. MaxConcurrentConnections is optional; the default is 5 concurrent transfers.

func (*Crawler) ActiveTransfers

func (c *Crawler) ActiveTransfers() int

ActiveTransfers returns the current number of active transfers.

func (*Crawler) AddMimeType

func (c *Crawler) AddMimeType(mimeType string, ch chan *Content)

AddMimeType registers channel ch to receive content of the specified mimeType.

func (*Crawler) AddMimeTypes

func (c *Crawler) AddMimeTypes(mimeTypes []string, ch chan *Content)

AddMimeTypes registers channel ch to receive content of the specified mimeTypes.
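
Both registration forms take the same channel type, so a channel can serve a single MIME type or a whole slice of them. A short sketch (the "text/html" type string and the channel names are illustrative):

package main

import (
	"github.com/ragnar-johannsson/bslc"
	"github.com/ragnar-johannsson/bslc/mimetypes"
)

func main() {
	crawler := bslc.Crawler{
		URLs: bslc.NewLocalURLContainer(
			bslc.NewIPNetContainer([]string{"127.0.0.0/8"}),
			[]string{"http://127.0.0.1/"},
		),
	}

	// Register one channel for a single MIME type...
	htmlCh := make(chan *bslc.Content)
	crawler.AddMimeType("text/html", htmlCh)

	// ...and another for a whole slice of types at once.
	audioCh := make(chan *bslc.Content)
	crawler.AddMimeTypes(mimetypes.Audio, audioCh)
}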

func (*Crawler) StartCrawling

func (c *Crawler) StartCrawling()

StartCrawling starts the crawling process.

func (*Crawler) StopCrawling

func (c *Crawler) StopCrawling()

StopCrawling stops the crawling process. Transfers in progress will be completed.
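
A sketch of a time-bounded crawl, assuming StartCrawling blocks until the crawl finishes, as the Usage example suggests; the five-minute limit is arbitrary:

package main

import (
	"time"

	"github.com/ragnar-johannsson/bslc"
)

func main() {
	crawler := bslc.Crawler{
		URLs: bslc.NewLocalURLContainer(
			bslc.NewIPNetContainer([]string{"127.0.0.0/8"}),
			[]string{"http://127.0.0.1/"},
		),
	}

	// Ask the crawler to wind down after five minutes; transfers
	// already in progress are allowed to complete.
	timer := time.AfterFunc(5*time.Minute, crawler.StopCrawling)
	defer timer.Stop()

	crawler.StartCrawling()
}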

func (*Crawler) TotalTransfers

func (c *Crawler) TotalTransfers() int

TotalTransfers returns the total number of transfers initiated.
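
Together with ActiveTransfers, this supports a simple progress reporter. A minimal sketch (the one-second interval and the log format are illustrative):

package main

import (
	"log"
	"time"

	"github.com/ragnar-johannsson/bslc"
)

func main() {
	crawler := bslc.Crawler{
		URLs: bslc.NewLocalURLContainer(
			bslc.NewIPNetContainer([]string{"127.0.0.0/8"}),
			[]string{"http://127.0.0.1/"},
		),
	}

	// Report progress once a second while the crawl runs.
	go func() {
		for range time.Tick(time.Second) {
			log.Printf("transfers: %d active, %d total",
				crawler.ActiveTransfers(), crawler.TotalTransfers())
		}
	}()

	crawler.StartCrawling()
}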

type IPNetContainer

type IPNetContainer struct {
	// contains filtered or unexported fields
}

IPNetContainer is a container for IP networks.

func NewIPNetContainer

func NewIPNetContainer(networks []string) IPNetContainer

NewIPNetContainer returns a new IPNetContainer encompassing the specified networks, given in CIDR notation.

func (*IPNetContainer) Contains

func (i *IPNetContainer) Contains(host string) bool

Contains returns true if the container includes the given host, and false otherwise. The specified host can be a DNS name or an IP address.
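
A short sketch using the private IPv4 ranges (the addresses are illustrative):

package main

import (
	"fmt"

	"github.com/ragnar-johannsson/bslc"
)

func main() {
	// RFC 1918 private ranges in CIDR notation.
	private := bslc.NewIPNetContainer([]string{
		"10.0.0.0/8",
		"172.16.0.0/12",
		"192.168.0.0/16",
	})

	fmt.Println(private.Contains("192.168.1.10")) // true
	fmt.Println(private.Contains("8.8.8.8"))      // false

	// DNS names are resolved before matching.
	fmt.Println(private.Contains("localhost"))
}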

type URLContainer

type URLContainer interface {
	AddURL(u string)
	NextURL() (u string, err error)
	Len() int
}

URLContainer is a container for URLs encountered during crawling. Call AddURL to add new URLs to the container; if the same URL has ever been added before, nothing is added. NextURL returns the next URL in line and removes it from the container, or an empty string and an error if the container is empty. Len returns the number of URLs in the container.
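
Custom implementations can add persistence or share state between crawler instances. A minimal, non-concurrent sketch of a FIFO container with duplicate tracking (all names here are hypothetical, and it omits the locking and IP filtering a production container would need):

package urlqueue

import (
	"errors"

	"github.com/ragnar-johannsson/bslc"
)

// memContainer queues URLs first-in-first-out and remembers every URL
// it has ever seen, so re-added URLs are silently ignored.
type memContainer struct {
	queue []string
	seen  map[string]bool
}

// Compile-time check that memContainer satisfies bslc.URLContainer.
var _ bslc.URLContainer = (*memContainer)(nil)

func (m *memContainer) AddURL(u string) {
	if m.seen == nil {
		m.seen = make(map[string]bool)
	}
	if m.seen[u] {
		return // added before at some point; nothing to do
	}
	m.seen[u] = true
	m.queue = append(m.queue, u)
}

func (m *memContainer) NextURL() (string, error) {
	if len(m.queue) == 0 {
		return "", errors.New("container is empty")
	}
	u := m.queue[0]
	m.queue = m.queue[1:]
	return u, nil
}

func (m *memContainer) Len() int {
	return len(m.queue)
}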

func NewLocalURLContainer

func NewLocalURLContainer(allowedNetworks IPNetContainer, seedUrls []string) URLContainer

NewLocalURLContainer returns a new URLContainer, stored locally in memory, with no persistence between sessions and no ability to share state between remote crawler instances. The allowedNetworks container handles the IP filtering, and seedUrls are the URLs to bootstrap the crawl with.

Directories

Path        Synopsis
examples
mimetypes   Package mimetypes defines MIME type string slices for common media types, to be used for convenience.
