miru

package module
v0.0.0-...-7af09af Latest
Warning

This package is not in the latest version of its module.

Published: Apr 16, 2015 License: Unlicense Imports: 20 Imported by: 0

README

miru

API

Queues

/api/queues/

Returns a list of queues.

Queue

/api/queue/bbc.co.uk

Returns an individual queue.

Sites

/api/sites

Returns a list of sites.

Crawl

/api/crawl?url=http%3A%2F%2Fbbc.co.uk%2F

Crawls a given URL, then recursively crawls each link it finds until the queue is exhausted.

Search

/api/search?q=news

Searches the datastore for any pages with an index matching the keywords.
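
The crawl and search endpoints both take a URL-encoded query parameter. The sketch below shows one way a client might build those URLs. The base address is an assumption (localhost combined with the default API port from DefaultConfig), and crawlURL and searchURL are hypothetical helper names, not part of the package.

```go
package main

import (
	"fmt"
	"net/url"
)

// baseURL is an assumption: localhost plus the default API port.
const baseURL = "http://localhost:8036"

// crawlURL builds the crawl endpoint for a target page, escaping the target
// so it can travel safely as the "url" query parameter.
func crawlURL(target string) string {
	return baseURL + "/api/crawl?url=" + url.QueryEscape(target)
}

// searchURL builds the search endpoint for a free-text query.
func searchURL(query string) string {
	return baseURL + "/api/search?q=" + url.QueryEscape(query)
}

func main() {
	fmt.Println(crawlURL("http://bbc.co.uk/"))
	fmt.Println(searchURL("world news"))
}
```

Escaping matters for the crawl endpoint in particular, because the target URL itself contains characters such as ":" and "/" that are reserved in a query string.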

Documentation

Index

Constants

This section is empty.

Variables

var (
	// UserAgent is passed on each HTTP request to identify the crawler.
	UserAgent = "Miru/1.0 (+http://www.miru.nylar.io)"
	// UnwantedTags are stripped from all HTML documents.
	UnwantedTags = "style, script, link, iframe, frame, embed"
	// ErrUnreachableURL is returned when the URL doesn't return 200 OK.
	ErrUnreachableURL = errors.New("Url did not return a 200 OK response.")
	// ErrInvalidURL for when not a valid URL.
	ErrInvalidURL = errors.New("Url was invalid.")
	// Delay is time in between each crawl
	Delay int64 = 5
)
var DefaultConfig = `
[database]
host = "localhost:28015"
name = "miru"

[tables]
index = "indexes"
document = "documents"

[api]
port = "8036"
`

DefaultConfig is used if no config.toml file is found; it sets the config to acceptable defaults.

Functions

func APICrawlHandler

func APICrawlHandler(c *Context) http.Handler

APICrawlHandler (GET) allows one to provide a URL to be crawled. Will recursively crawl in the background.

func APIQueueHandler

func APIQueueHandler(c *Context) http.Handler

APIQueueHandler (GET) returns a single queue.

func APIQueuesHandler

func APIQueuesHandler(c *Context) http.Handler

APIQueuesHandler (GET) returns a list of active queues.

func APIRoutes

func APIRoutes(m *mux.Router, c *Context)

APIRoutes configures the routes for the API. Cross-origin resource sharing is applied to each route so that it can be reached by external requests.
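
As a rough illustration of applying CORS to each route, here is a minimal sketch that uses only the standard library. The package itself depends on gorilla/mux and rs/cors, so this is a stand-in for the idea rather than its actual wiring; withCORS, routes and corsHeader are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withCORS wraps a handler and adds a permissive CORS header to every response.
func withCORS(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		next.ServeHTTP(w, r)
	})
}

// routes registers a placeholder handler on a standard-library mux.
func routes() http.Handler {
	mux := http.NewServeMux()
	mux.Handle("/api/sites", withCORS(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "[]")
	})))
	return mux
}

// corsHeader performs an in-memory request and reports the CORS header returned.
func corsHeader(path string) string {
	rec := httptest.NewRecorder()
	routes().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Result().Header.Get("Access-Control-Allow-Origin")
}

func main() {
	fmt.Println(corsHeader("/api/sites"))
}
```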

func APISearchHandler

func APISearchHandler(c *Context) http.Handler

APISearchHandler (GET) allows one to search the datastore. Accepts one parameter: 'q', which is a URL encoded string.

func APISitesHandler

func APISitesHandler(c *Context) http.Handler

APISitesHandler (GET) returns a list of sites.

func Contents

func Contents(resp *http.Response) []byte

Contents reads data from a response into a byte slice, limited to 4 MB.
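
One common way to cap how much of a body is read is io.LimitReader. The sketch below assumes the limit means 4 MiB (4 << 20 bytes); the package may define it differently. readLimited and bodyOf are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// maxBytes is the cap applied to each body; 4 MiB is an assumption.
const maxBytes = 4 << 20

// readLimited reads at most limit bytes from r; anything past the limit is
// left unread.
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	return io.ReadAll(io.LimitReader(r, limit))
}

// bodyOf stands in for a response body so the sketch runs without a network.
func bodyOf(s string) io.Reader {
	return strings.NewReader(s)
}

func main() {
	oversized := bodyOf(strings.Repeat("a", maxBytes+10))
	data, err := readLimited(oversized, maxBytes)
	fmt.Println(len(data) == maxBytes, err)
}
```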

func Crawl

func Crawl(url string, c *Context, q *Queue) error

Crawl processes pages concurrently

func ExtractLinks

func ExtractLinks(doc *goquery.Document) []string

ExtractLinks returns all internal links from a page.

func ExtractText

func ExtractText(doc *goquery.Document) string

ExtractText returns the text of all p tags in a page.

func ExtractTitle

func ExtractTitle(doc *goquery.Document) string

ExtractTitle looks for either a title tag or h1 tag and sets that as the title

func Get

func Get(req *http.Request) (*http.Response, error)

Get uses a custom request and returns a response.

func IndexPage

func IndexPage(c *Context, q *Queue, url, site string) error

IndexPage is called by ProcessPages and handles an individual page.

func Links

func Links(doc *goquery.Document, q *Queue, site string)

Links extracts all internal links from a page and enqueues them.

func MustGet

func MustGet(req *http.Request) (*http.Response, error)

MustGet is a strict version of Get
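
The documentation does not say what "strict" means here. Given that ErrUnreachableURL exists for responses that are not 200 OK, one plausible reading is that any other status is treated as an error. The sketch below illustrates that reading only; requireOK, respond and errUnreachable are hypothetical names, and responses are built in memory rather than fetched.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errUnreachable mirrors the role of ErrUnreachableURL (assumed behaviour).
var errUnreachable = errors.New("url did not return a 200 OK response")

// requireOK returns errUnreachable unless the response status is 200 OK.
func requireOK(resp *http.Response) error {
	if resp.StatusCode != http.StatusOK {
		return errUnreachable
	}
	return nil
}

// respond builds an in-memory response with the given status code.
func respond(code int) *http.Response {
	rec := httptest.NewRecorder()
	rec.WriteHeader(code)
	return rec.Result()
}

func main() {
	fmt.Println(requireOK(respond(http.StatusOK)))
	fmt.Println(requireOK(respond(http.StatusNotFound)))
}
```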

func Normalise

func Normalise(word string) string

Normalise transforms a word by lowercasing it and applying stemming.

func ProcessPages

func ProcessPages(c *Context, q *Queue, site string, delay int64)

ProcessPages processes all queue items and indexes them.
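
The overall shape of such a loop (take a page, handle it, enqueue newly found links, pause, repeat until nothing is left) can be sketched as follows. This is an illustration under assumptions, not the package's implementation: crawlAll is a hypothetical name, the links callback stands in for fetching and parsing a page, and the visit order is assumed to be first-in, first-out.

```go
package main

import (
	"fmt"
	"time"
)

// crawlAll visits start and every page reachable from it exactly once,
// sleeping for delay between pages. links returns the pages found on a page.
func crawlAll(start string, delay time.Duration, links func(string) []string) []string {
	seen := map[string]bool{start: true}
	pending := []string{start}
	var visited []string
	for len(pending) > 0 {
		page := pending[0]
		pending = pending[1:]
		visited = append(visited, page)
		for _, next := range links(page) {
			if !seen[next] {
				seen[next] = true
				pending = append(pending, next)
			}
		}
		if len(pending) > 0 {
			time.Sleep(delay) // pause between pages to avoid hammering the site
		}
	}
	return visited
}

func main() {
	// A tiny in-memory link graph replaces real HTTP fetches.
	graph := map[string][]string{"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
	order := crawlAll("a", 10*time.Millisecond, func(p string) []string { return graph[p] })
	fmt.Println(order)
}
```

Tracking seen pages is what lets the loop terminate on sites whose pages link back to each other.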

func ProcessURL

func ProcessURL(link, site string) (string, error)

ProcessURL determines whether a URL is to be enqueued or not.
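
The criteria ProcessURL applies are not documented. Since the crawler follows internal links, a reasonable guess is that a link is resolved against the site and kept only if it stays on the same host. The sketch below shows that approach with net/url; resolveInternal and errExternal are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errExternal is a hypothetical error for links that leave the site.
var errExternal = errors.New("link is external")

// resolveInternal resolves link against site and returns the absolute URL
// when it stays on the same host, or an error otherwise.
func resolveInternal(link, site string) (string, error) {
	base, err := url.Parse(site)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	abs := base.ResolveReference(ref)
	if abs.Host != base.Host {
		return "", errExternal
	}
	abs.Fragment = "" // fragments point into the same page, so drop them
	return abs.String(), nil
}

func main() {
	fmt.Println(resolveInternal("/news#top", "http://bbc.co.uk/"))
	fmt.Println(resolveInternal("http://example.com/", "http://bbc.co.uk/"))
}
```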

func Request

func Request(url string) *http.Request

Request builds a new request using UserAgent as a header
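
Setting a User-Agent on an outgoing request looks roughly like this with net/http. Unlike Request, the hypothetical newRequest below also returns an error, since how the package handles a malformed URL is not documented.

```go
package main

import (
	"fmt"
	"net/http"
)

// userAgent copies the value of the package's UserAgent variable.
const userAgent = "Miru/1.0 (+http://www.miru.nylar.io)"

// newRequest builds a GET request that identifies the crawler.
func newRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	req, err := newRequest("http://bbc.co.uk/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Header.Get("User-Agent"))
}
```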

func RootURL

func RootURL(link string) (string, error)

RootURL returns the domain for a given link
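
A minimal sketch of extracting the domain from a link with net/url follows. Whether RootURL keeps the scheme or the port is not stated, so this version (a hypothetical rootOf) returns the bare host name and rejects links that have none.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// rootOf returns the host name of link, without scheme, port or path.
func rootOf(link string) (string, error) {
	u, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", errors.New("link has no host")
	}
	return u.Hostname(), nil
}

func main() {
	fmt.Println(rootOf("http://bbc.co.uk/news/world"))
}
```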

func Stopper

func Stopper(word string) bool

Stopper returns true if the word is in stopWords
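
A stop-word check is typically a set lookup. The sketch below uses a small illustrative word list, since the contents of stopWords are not shown here; isStopWord is a hypothetical name, and the lowercasing step is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is an illustrative sample, not the package's actual list.
var stopWords = map[string]struct{}{
	"a": {}, "and": {}, "of": {}, "the": {},
}

// isStopWord reports whether word, ignoring case, is in stopWords.
func isStopWord(word string) bool {
	_, ok := stopWords[strings.ToLower(word)]
	return ok
}

func main() {
	fmt.Println(isStopWord("The"), isStopWord("news"))
}
```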

Types

type Config

type Config struct {
	Database database
	Tables   tables
	Api      api
}

Config holds configuration information regarding the database and the port on which to serve.

func LoadConfig

func LoadConfig(data string) (*Config, error)

LoadConfig loads configuration data into the Config struct.

type Context

type Context struct {
	Db     *rdb.Session
	Config *Config
	Queues *Queues
}

Context holds database, configuration and queue data.

func NewContext

func NewContext() *Context

NewContext instantiates a new context and initialises a queue.

func (*Context) Connect

func (c *Context) Connect(host string) error

Connect creates a connection to the database.

func (*Context) InitQueues

func (c *Context) InitQueues()

InitQueues initialises a new queue list.

func (*Context) LoadConfig

func (c *Context) LoadConfig(f string) error

LoadConfig reads a given file from the filesystem; if the file is not found, it uses the default config.
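
The fall-back-to-default behaviour can be sketched with the standard library as below. TOML parsing is left out, the default text is shortened, and configText is a hypothetical name. Treating only "file does not exist" as a reason to fall back is a design choice of this sketch: other read errors are returned so that a permissions problem is not silently masked.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// defaultConfig is a shortened stand-in for the package's DefaultConfig.
const defaultConfig = `[api]
port = "8036"
`

// configText returns the contents of path, or defaultConfig when the file
// does not exist. Any other read error is passed back to the caller.
func configText(path string) (string, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return defaultConfig, nil
	}
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	text, err := configText("config.toml")
	fmt.Println(len(text) > 0, err)
}
```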

type Document

type Document struct {
	DocID   string `gorethink:"id" json:"document_id"`
	Url     string `gorethink:"url" json:"url"`
	Site    string `gorethink:"site" json:"site"`
	Title   string `gorethink:"title" json:"title"`
	Content string `gorethink:"content" json:"content"`
}

Document stores data about a page.

func NewDoc

func NewDoc(doc *goquery.Document, url, site string) *Document

NewDoc extracts data from a page and creates a new document.

func NewDocument

func NewDocument(url, site, title, content string) *Document

NewDocument creates a new document instance

func (*Document) Put

func (d *Document) Put(c *Context) error

Put writes a document to the datastore.

type Index

type Index struct {
	IndexID string `gorethink:"id" json:"index_id"`
	DocID   string `gorethink:"doc_id" json:"document_id"`
	Word    string `gorethink:"word" json:"word"`
	Count   int64  `gorethink:"count" json:"count"`
}

Index stores data on a given word in a document.

func NewIndex

func NewIndex(docID, word string, count int64) *Index

NewIndex creates a new index instance

func (*Index) Put

func (i *Index) Put(c *Context) error

Put writes an index to the datastore

type Indexes

type Indexes []*Index

Indexes is a slice of Index values that holds all the words in a document.

func Indexer

func Indexer(text, docID string) Indexes

Indexer tokenises and counts occurrences of words in a document.
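
Tokenising and counting can be sketched as follows. The splitting rule (anything that is not a letter or digit separates words) and the lowercasing are assumptions; stemming and stop-word removal are omitted. countWords is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// countWords lowercases text, splits it on non-alphanumeric characters and
// returns how often each word occurs.
func countWords(text string) map[string]int64 {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	counts := make(map[string]int64)
	for _, w := range words {
		counts[w]++
	}
	return counts
}

func main() {
	counts := countWords("News, news and more NEWS!")
	fmt.Println(counts["news"], counts["and"], counts["more"])
}
```

Each entry in the resulting map corresponds to what one Index value records: a word and its count within a single document.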

func RemoveDuplicates

func RemoveDuplicates(i Indexes) Indexes

RemoveDuplicates counts the number of duplicates and then keeps only the unique values.

func (*Indexes) Put

func (ixs *Indexes) Put(c *Context) error

Put writes a slice of index to the datastore.

type Queue

type Queue struct {
	Manager map[string]bool `json:"manager"`
	Items   []string        `json:"items"`
	Name    string          `json:"name"`
	Status  string          `json:"status"`
	sync.Mutex
}

Queue holds data regarding a queue

func NewQueue

func NewQueue() *Queue

NewQueue creates a new queue and sets its status to active.

func (*Queue) Dequeue

func (q *Queue) Dequeue() (string, error)

Dequeue pops an item and returns it

func (*Queue) Enqueue

func (q *Queue) Enqueue(item string)

Enqueue pushes a new item onto the queue.

func (*Queue) Len

func (q *Queue) Len() int

Len returns the number of items in the queue.
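
A mutex-guarded queue along these lines might look like the sketch below. Two points are assumptions: that the Manager map records items already seen so a URL is not enqueued twice, and that Dequeue removes from the front (first in, first out). All names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errEmpty is returned when dequeueing from an empty queue.
var errEmpty = errors.New("queue is empty")

// queue is a FIFO of strings that ignores items it has already seen.
type queue struct {
	mu    sync.Mutex
	seen  map[string]bool
	items []string
}

func newQueue() *queue {
	return &queue{seen: make(map[string]bool)}
}

// enqueue appends item unless it was enqueued before.
func (q *queue) enqueue(item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.seen[item] {
		return
	}
	q.seen[item] = true
	q.items = append(q.items, item)
}

// dequeue removes and returns the oldest item.
func (q *queue) dequeue() (string, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", errEmpty
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, nil
}

// size returns the number of items waiting in the queue.
func (q *queue) size() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

func main() {
	q := newQueue()
	q.enqueue("http://bbc.co.uk/")
	q.enqueue("http://bbc.co.uk/news")
	q.enqueue("http://bbc.co.uk/") // duplicate, ignored
	fmt.Println(q.size())
}
```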

type QueueList

type QueueList []queueList

QueueList is a sortable interface for keeping queue items in order.

func (QueueList) Len

func (ql QueueList) Len() int

func (QueueList) Less

func (ql QueueList) Less(i, j int) bool

func (QueueList) Swap

func (ql QueueList) Swap(i, j int)
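
Len, Less and Swap together satisfy sort.Interface, which is what makes a QueueList sortable with sort.Sort. The element type queueList is unexported and its fields are not shown, so the sketch below uses a hypothetical entry type ordered by name.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for the unexported queueList element type.
type entry struct {
	name string
	size int
}

// byName implements sort.Interface, ordering entries alphabetically by name.
type byName []entry

func (b byName) Len() int           { return len(b) }
func (b byName) Less(i, j int) bool { return b[i].name < b[j].name }
func (b byName) Swap(i, j int)      { b[i], b[j] = b[j], b[i] }

// sortedNames sorts entries in place and returns their names in order.
func sortedNames(entries []entry) []string {
	sort.Sort(byName(entries))
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.name)
	}
	return names
}

func main() {
	fmt.Println(sortedNames([]entry{{"bbc.co.uk", 3}, {"abc.net", 2}, {"example.com", 1}}))
}
```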

type Queues

type Queues struct {
	Queues map[string]*Queue `json:"queues"`
}

Queues is a map of queues.

func NewQueues

func NewQueues() *Queues

NewQueues returns a new queue list.

func (*Queues) Add

func (qs *Queues) Add(q *Queue)

Add pushes a new queue onto the queue list

type Response

type Response struct {
	Status  int    `json:"status"`
	Message string `json:"message"`
}

Response provides a generic interface for writing API messages back to the client.
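
Writing a status-and-message payload back to a client could look like the sketch below. The struct copies the JSON tags from Response; writeJSON and render are hypothetical helpers, and reusing the status as the HTTP status code is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// response mirrors the fields and JSON tags of Response.
type response struct {
	Status  int    `json:"status"`
	Message string `json:"message"`
}

// writeJSON sends status and message as a JSON body with a matching HTTP code.
func writeJSON(w http.ResponseWriter, status int, message string) error {
	body, err := json.Marshal(response{Status: status, Message: message})
	if err != nil {
		return err
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_, err = w.Write(body)
	return err
}

// render runs writeJSON against an in-memory recorder and returns the result.
func render(status int, message string) (int, string) {
	rec := httptest.NewRecorder()
	if err := writeJSON(rec, status, message); err != nil {
		return 0, ""
	}
	return rec.Code, rec.Body.String()
}

func main() {
	fmt.Println(render(http.StatusNotFound, "queue not found"))
}
```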

type Result

type Result struct {
	Document
	Index
}

Result holds data for a result's document and index

type Results

type Results struct {
	Speed   float64  `json:"speed"`
	Count   int64    `json:"count"`
	Results []Result `json:"results"`
}

Results holds all of the results, the time taken to perform the query and the number of results.

func (*Results) ParseQuery

func (rxs *Results) ParseQuery(query string) []string

ParseQuery splits a query into a list of individual words.

func (*Results) RenderCount

func (rxs *Results) RenderCount() string

RenderCount formats the number of results

func (*Results) RenderCountHTML

func (rxs *Results) RenderCountHTML() template.HTML

RenderCountHTML formats the number of results and escapes for use in templates.

func (*Results) RenderSpeed

func (rxs *Results) RenderSpeed() string

RenderSpeed formats the speed of the query

func (*Results) RenderSpeedHTML

func (rxs *Results) RenderSpeedHTML() template.HTML

RenderSpeedHTML formats the speed of the query and escapes for use in templates.

func (*Results) Search

func (rxs *Results) Search(query string, c *Context) error

Search returns a list of Results for a given query.
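
How Search ranks results is not documented. To illustrate the general idea of keyword search over per-document word counts, the sketch below uses an in-memory inverted index instead of the datastore and scores each document by summing the counts of the matching query words. Every name and the scoring rule are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// posting records how often a word occurs in one document.
type posting struct {
	docID string
	count int64
}

// hit is one search result with its accumulated score.
type hit struct {
	DocID string
	Score int64
}

// search splits query into lowercase words, sums matching counts per
// document and returns hits ordered by score, highest first.
func search(index map[string][]posting, query string) []hit {
	scores := make(map[string]int64)
	for _, word := range strings.Fields(strings.ToLower(query)) {
		for _, p := range index[word] {
			scores[p.docID] += p.count
		}
	}
	hits := make([]hit, 0, len(scores))
	for id, score := range scores {
		hits = append(hits, hit{DocID: id, Score: score})
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Score != hits[j].Score {
			return hits[i].Score > hits[j].Score
		}
		return hits[i].DocID < hits[j].DocID // stable order for equal scores
	})
	return hits
}

func main() {
	index := map[string][]posting{
		"news":  {{"d1", 3}, {"d2", 1}},
		"world": {{"d2", 5}},
	}
	fmt.Println(search(index, "World news"))
}
```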

Directories

Path Synopsis
Godeps
_workspace/src/code.google.com/p/goprotobuf/proto
Package proto converts data structures to and from the wire format of protocol buffers.
_workspace/src/github.com/PuerkitoBio/goquery
Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document.
_workspace/src/github.com/andybalholm/cascadia
The cascadia package is an implementation of CSS selectors.
_workspace/src/github.com/dancannon/gorethink
Go driver for RethinkDB Current version: v0.6.2 (RethinkDB v1.16) For more in depth information on how to use RethinkDB check out the API docs at http://rethinkdb.com/api
_workspace/src/github.com/gorilla/context
Package context stores values shared during a request lifetime.
_workspace/src/github.com/gorilla/mux
Package gorilla/mux implements a request router and dispatcher.
_workspace/src/github.com/rs/cors
Package cors is net/http handler to handle CORS related requests as defined by http://www.w3.org/TR/cors/ You can configure it by passing an option struct to cors.New: c := cors.New(cors.Options{ AllowedOrigins: []string{"foo.com"}, AllowedMethods: []string{"GET", "POST", "DELETE"}, AllowCredentials: true, }) Then insert the handler in the chain: handler = c.Handler(handler) See Options documentation for more options.
_workspace/src/github.com/satori/go.uuid
Package uuid provides implementation of Universally Unique Identifier (UUID).
_workspace/src/github.com/stretchr/testify/assert
A set of comprehensive testing tools for use with the normal Go testing system.
_workspace/src/golang.org/x/net/html
Package html implements an HTML5-compliant tokenizer and parser.
_workspace/src/golang.org/x/net/html/atom
Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id".
_workspace/src/golang.org/x/net/html/charset
Package charset provides common text encodings for HTML documents.
cmd
