Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var Debug = struct {
	// HeadlineLogger is where debug output from the headline extraction will be sent
	HeadlineLogger *log.Logger
	// AuthorsLogger is where debug output from the author extraction will be sent
	AuthorsLogger *log.Logger
	// ContentLogger is where debug output from the content extraction will be sent
	ContentLogger *log.Logger
	// DatesLogger is where debug output from the pubdate/lastupdated extraction will be sent
	DatesLogger *log.Logger
	// URLLogger is where debug output from URL extraction will be sent (rel-canonical etc)
	URLLogger *log.Logger
	// CruftLogger is where debug output from cruft classification will be sent (adverts/social/sidebars etc)
	CruftLogger *log.Logger
}{
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
}
Debug is the global debug control for the scraper. Set up any loggers you want before calling Extract(). By default, all logging is suppressed.
Functions ¶
func ContainedCandidates ¶
Get any candidates within the container (including the container itself).
Types ¶
type Article ¶
type Article struct {
	CanonicalURL string `json:"canonical_url,omitempty"`
	// all known URLs for article (including canonical)
	// TODO: first url should be considered "preferred" if no canonical?
	URLs     []string `json:"urls,omitempty"`
	Headline string   `json:"headline,omitempty"`
	Authors  []Author `json:"authors,omitempty"`
	Content  string   `json:"content,omitempty"`
	// Published contains date of publication.
	// An ISO8601 string is used instead of time.Time, so that
	// less-precise representations can be held (eg YYYY-MM)
	Published   string      `json:"published,omitempty"`
	Updated     string      `json:"updated,omitempty"`
	Publication Publication `json:"publication,omitempty"`
	Keywords    []Keyword   `json:"keywords,omitempty"`
	Section     string      `json:"section,omitempty"`
}
type Publication ¶
Source Files ¶