prscrape

package
v0.0.0-...-d4dc811
Published: Feb 3, 2014 License: AGPL-3.0 Imports: 26 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CompressSpace

func CompressSpace(s string) string

CompressSpace reduces all whitespace sequences (spaces, tabs, newlines, etc.) in a string to a single space. Leading/trailing space is trimmed. This has the effect of converting multiline strings to a single line.
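The behaviour described above can be sketched with the standard library (a plausible implementation, not necessarily the package's own):

```go
package main

import (
	"fmt"
	"strings"
)

// compressSpace collapses every run of whitespace (spaces, tabs,
// newlines) to a single space and trims leading/trailing space,
// mirroring the documented behaviour of CompressSpace.
func compressSpace(s string) string {
	// strings.Fields splits on any run of Unicode whitespace,
	// discarding the separators; Join re-glues with single spaces.
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(compressSpace("  hello\t\n  world  ")) // hello world
}
```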

func Contains

func Contains(container *html.Node, n *html.Node) bool

Contains returns true if n is a descendant of container.
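A descendant check like this is typically a walk up the parent chain. A minimal sketch with a stand-in node type (the real function operates on *html.Node from golang.org/x/net/html):

```go
package main

import "fmt"

// node is a stand-in for html.Node, keeping only the Parent link
// that a descendant check needs.
type node struct {
	Parent *node
}

// contains reports whether n is a descendant of container,
// mirroring the documented behaviour of Contains.
func contains(container, n *node) bool {
	for p := n.Parent; p != nil; p = p.Parent {
		if p == container {
			return true
		}
	}
	return false
}

func main() {
	root := &node{}
	child := &node{Parent: root}
	grandchild := &node{Parent: child}
	fmt.Println(contains(root, grandchild)) // true
	fmt.Println(contains(grandchild, root)) // false
}
```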

func DescribeNode

func DescribeNode(n *html.Node) string

DescribeNode generates a debug string describing the node. Returns a string of the form "<element#id.class>" (i.e. like a CSS selector).
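Building such a selector-like string is straightforward; a sketch that takes the tag, id, and classes as plain strings (the real function reads them from an *html.Node):

```go
package main

import "fmt"

// describe builds a CSS-selector-like debug string of the form
// "<element#id.class>", as DescribeNode is documented to produce.
func describe(tag, id string, classes []string) string {
	s := "<" + tag
	if id != "" {
		s += "#" + id
	}
	for _, c := range classes {
		s += "." + c
	}
	return s + ">"
}

func main() {
	fmt.Println(describe("div", "main", []string{"article", "wide"}))
	// <div#main.article.wide>
}
```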

func DumpTree

func DumpTree(n *html.Node, depth int)

DumpTree is a debug helper to display a tree of nodes.

func GetAttr

func GetAttr(n *html.Node, attr string) string

GetAttr retrieves the value of an attribute on a node. Returns an empty string if the attribute doesn't exist.

func GetTextContent

func GetTextContent(n *html.Node) string

GetTextContent recursively fetches the text for a node.

func ParseTime

func ParseTime(s string) (time.Time, error)

func RenderText

func RenderText(n *html.Node) string

RenderText returns the text of a node, using whitespace and line breaks to make it look nice.

func ServerMain

func ServerMain(dbFile string, configfunc ConfigureFunc)

ServerMain is the entry point for running the server. It handles command-line flags and related setup. The idea is that you can easily write a new server with a different set of scrapers: the real main() would just be a small stub which instantiates the scrapers, then passes control over to here. See ukpr/main.go for an example.

func StripComments

func StripComments(n *html.Node)

Types

type ConfigureFunc

type ConfigureFunc func(historical bool) []*Scraper

type DBStore

type DBStore struct {
	// contains filtered or unexported fields
}

DBStore manages an archive of recent press releases in an SQLite database. It also implements eventsource.Repository to allow the press releases to be streamed out as server-sent events. It can stash away press releases for multiple sources.

func NewDBStore

func NewDBStore(dbfile string) *DBStore

func (*DBStore) Replay

func (store *DBStore) Replay(channel, lastEventId string) (out chan eventsource.Event)

Replay handles last-event-id catchups. Note: channel contains the source (e.g. 'tesco'...).

func (*DBStore) Stash

func (store *DBStore) Stash(pr *PressRelease) (*pressReleaseEvent, error)

Stash adds a press release to the store.

func (*DBStore) WhichAreNew

func (store *DBStore) WhichAreNew(incoming []*PressRelease) []*PressRelease

WhichAreNew returns the incoming press releases with the ones already in the store culled out.

type DiscoverFunc

type DiscoverFunc func() ([]*PressRelease, error)

DiscoverFunc is for fetching a list of 'current' press releases (via RSS feed, by scraping an index page, or whatever). The results are passed back as PressRelease structs. At the very least, the Permalink field must be set to the URL of the press release, but there's no reason Discover() can't fill out all the fields if the data is available (e.g. some RSS feeds have everything required). For incomplete PressReleases, the framework will fetch the HTML from the Permalink URL and invoke Scrape() to complete the data.
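Given that contract, a hand-written DiscoverFunc can return minimally-populated PressRelease structs and leave the rest to the framework. A sketch using local copies of the package's declared types; the URLs are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// Local copies of the package types, as declared in the docs above.
type PressRelease struct {
	Title     string    `json:"title"`
	Source    string    `json:"source"`
	Permalink string    `json:"permalink"`
	PubDate   time.Time `json:"published"`
	Content   string    `json:"text"`
	Type      string    `json:"type"`
}

type DiscoverFunc func() ([]*PressRelease, error)

func main() {
	// A minimal DiscoverFunc: only Permalink is set, so the framework
	// would fetch each URL and run Scrape() to complete the data.
	var discover DiscoverFunc = func() ([]*PressRelease, error) {
		return []*PressRelease{
			{Permalink: "http://example.com/press/1"},
			{Permalink: "http://example.com/press/2"},
		}, nil
	}

	prs, err := discover()
	fmt.Println(len(prs), prs[0].Permalink, err)
	// 2 http://example.com/press/1 <nil>
}
```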

func BuildGenericDiscover

func BuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) (DiscoverFunc, error)

BuildGenericDiscover returns a DiscoverFunc which fetches a page and extracts matching links. TODO: pageUrl should be an array

func BuildPaginatedGenericDiscover

func BuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) (DiscoverFunc, error)

BuildPaginatedGenericDiscover returns a DiscoverFunc which fetches links and steps through multiple pages.

func BuildRSSDiscover

func BuildRSSDiscover(scraperName string, feeds []string) (DiscoverFunc, error)

BuildRSSDiscover returns a discover function which grabs links from RSS feeds.

func MustBuildGenericDiscover

func MustBuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) DiscoverFunc

TODO: kill this once a proper config parser is in place

func MustBuildPaginatedGenericDiscover

func MustBuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) DiscoverFunc

TODO: kill this once a proper config parser is in place

func MustBuildRSSDiscover

func MustBuildRSSDiscover(scraperName string, feeds []string) DiscoverFunc

TODO: kill this once a proper config parser is in place

type PressRelease

type PressRelease struct {
	Title     string    `json:"title"`
	Source    string    `json:"source"`
	Permalink string    `json:"permalink"`
	PubDate   time.Time `json:"published"`
	Content   string    `json:"text"`
	Type      string    `json:"type"`
}

PressRelease is the data we're scraping and storing. TODO: support multiple URLs

type ScrapeFunc

type ScrapeFunc func(pr *PressRelease, doc *html.Node) error

ScrapeFunc is for scraping a single press release from HTML.

func BuildGenericScrape

func BuildGenericScrape(source, title, content, cruft, pubDate string) (ScrapeFunc, error)

BuildGenericScrape builds a function which scrapes a press release from raw HTML based on a set of CSS selector strings.

func MustBuildGenericScrape

func MustBuildGenericScrape(source, title, content, cruft, pubDate string) ScrapeFunc

TODO: kill this once a proper config parser is in place

type Scraper

type Scraper struct {
	Name     string
	Discover DiscoverFunc
	Scrape   ScrapeFunc
}

Scraper lets you pick and mix various discover and scrape functions.

type Store

type Store interface {
	WhichAreNew(incoming []*PressRelease) []*PressRelease
	Stash(pr *PressRelease) (*pressReleaseEvent, error)
	Replay(channel, lastEventId string) chan eventsource.Event
}

type TestStore

type TestStore struct {
	// contains filtered or unexported fields
}

func NewTestStore

func NewTestStore(brief bool) *TestStore

func (*TestStore) Replay

func (store *TestStore) Replay(channel, lastEventId string) chan eventsource.Event

func (*TestStore) Stash

func (store *TestStore) Stash(pr *PressRelease) (*pressReleaseEvent, error)

func (*TestStore) WhichAreNew

func (store *TestStore) WhichAreNew(incoming []*PressRelease) []*PressRelease
