Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var Debug = struct {
	// HeadlineLogger is where debug output from the headline extraction will be sent
	HeadlineLogger *log.Logger
	// AuthorsLogger is where debug output from the author extraction will be sent
	AuthorsLogger *log.Logger
	// ContentLogger is where debug output from the content extraction will be sent
	ContentLogger *log.Logger
	// DatesLogger is where debug output from the pubdate/lastupdated extraction will be sent
	DatesLogger *log.Logger
	// URLLogger is where debug output from URL extraction will be sent (rel-canonical etc)
	URLLogger *log.Logger
	// CruftLogger is where debug output from cruft classification will be sent (adverts/social/sidebars etc)
	CruftLogger *log.Logger
}{
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
}
Debug is the global debug control for the scraper. Set up any loggers you want before calling Extract(). By default, all logging is suppressed.
Functions ¶
func ContainedCandidates ¶
Get any candidates within the container (including the container itself).
Types ¶
type Article ¶
type Article struct {
	CanonicalURL string `json:"canonical_url,omitempty"`
	// all known URLs for article (including canonical)
	// TODO: first url should be considered "preferred" if no canonical?
	URLs     []string `json:"urls,omitempty"`
	Headline string   `json:"headline,omitempty"`
	Authors  []Author `json:"authors,omitempty"`
	Content  string   `json:"content,omitempty"`
	// Published contains date of publication.
	// An ISO8601 string is used instead of time.Time, so that
	// less-precise representations can be held (eg YYYY-MM)
	Published   string      `json:"published,omitempty"`
	Updated     string      `json:"updated,omitempty"`
	Publication Publication `json:"publication,omitempty"`
	Keywords    []Keyword   `json:"keywords,omitempty"`
	Section     string      `json:"section,omitempty"`
}
type Publication ¶
Source Files ¶