press

package module
v0.11.0
Published: Apr 9, 2024 License: BSD-3-Clause Imports: 23 Imported by: 0

README

press-archiver

press-archiver is a set of tools to crawl and archive press articles that sit behind paywalls.

cmd/press-archiver

press-archiver is a simple command to retrieve press articles as PDFs.

$> go install sbinet.org/x/press-archiver/cmd/press-archiver
$> press-archiver -h
press-archiver archives press articles.

Usage: press-archiver [OPTIONS] URL

Example:

 $> press-archiver https://example.org/interesting-article
 $> press-archiver -o out.pdf https://example.org/interesting-article
 $> press-archiver -cfg credentials.cfg -o out.pdf https://example.org/interesting-article

Options:
  -cfg string
    	path to configuration credentials
  -o string
    	path to output PDF to produce (default "out.pdf")

$> press-archiver -o out.pdf -cfg auth.cfg https://example.org/interesting-article
$> open out.pdf

The auth.cfg file contains credentials in the form:

example.org {
    user my-account
    pass s3cr3t
}

example.com {
    user my-other-account
    pass still-s3cr3t
}
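The package reads this block-based format with the scfg configuration library. As a rough illustration of the data it extracts, here is a minimal, self-contained parser for exactly the shape shown above (domain blocks with user/pass directives) — a sketch only, not the package's actual scfg-based code:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cred holds the user/pass pair for one domain block.
type cred struct {
	User, Pass string
}

// parseCreds parses the "domain { user … / pass … }" shape shown above.
// Illustrative only: the real package delegates parsing to scfg.
func parseCreds(src string) (map[string]cred, error) {
	creds := make(map[string]cred)
	var domain string
	var cur cred
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			continue
		case strings.HasSuffix(line, "{"):
			domain = strings.TrimSpace(strings.TrimSuffix(line, "{"))
			cur = cred{}
		case line == "}":
			creds[domain] = cur
		case strings.HasPrefix(line, "user "):
			cur.User = strings.TrimSpace(strings.TrimPrefix(line, "user "))
		case strings.HasPrefix(line, "pass "):
			cur.Pass = strings.TrimSpace(strings.TrimPrefix(line, "pass "))
		default:
			return nil, fmt.Errorf("unexpected line: %q", line)
		}
	}
	return creds, sc.Err()
}

func main() {
	cfg := "example.org {\n    user my-account\n    pass s3cr3t\n}"
	creds, err := parseCreds(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(creds["example.org"].User) // my-account
}
```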

cmd/press-archiver-srv

$> go install sbinet.org/x/press-archiver/cmd/press-archiver-srv
$> press-archiver-srv -h
press-archiver-srv serves PDFs from archived press articles.

Usage: press-archiver-srv [OPTIONS]

Example:

 $> press-archiver-srv
 $> press-archiver-srv -addr :8080 -cfg ./credentials.cfg -cache ./cache -prefix /press

Options:
  -addr string
    	[host]:port to serve (default ":8080")
  -cache string
    	path to cache of articles
  -cfg string
    	path to credentials settings
  -prefix string
    	prefix of web server end-points (default "/")

$> press-archiver-srv -prefix /press &

$> open http://localhost:8080/press

License

  • press-archiver is released under the BSD-3-Clause license.
  • Newspaper by Smalllike from the Noun Project (CC BY 3.0).

Documentation

Overview

Package press provides tools to crawl and archive articles from press websites.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Cookies added in v0.8.0

func Cookies(ctx context.Context, options ...Option) (map[string][]*http.Cookie, error)

func Domain

func Domain(u *url.URL) string

Domain returns the domain of a URL.

ex:

https://www.example.org/foo -> example.org
https://example.org/bar     -> example.org
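The behaviour in the two examples above amounts to taking the URL's hostname and dropping a leading "www." — a minimal sketch of such a function (not the package's actual implementation):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// domain mimics press.Domain as documented above: it returns the URL's
// hostname (port stripped) with any leading "www." removed.
// Illustrative sketch only.
func domain(u *url.URL) string {
	host := u.Hostname() // drops any :port
	return strings.TrimPrefix(host, "www.")
}

func mustParseURL(s string) *url.URL {
	u, err := url.Parse(s)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	for _, s := range []string{
		"https://www.example.org/foo",
		"https://example.org/bar",
	} {
		fmt.Printf("%s -> %s\n", s, domain(mustParseURL(s)))
	}
}
```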

Types

type Article added in v0.8.0

type Article struct {
	Title string
	PDF   []byte
	HTML  []byte
}

type Auth

type Auth struct {
	// contains filtered or unexported fields
}

Auth stores credentials.

func NewAuth

func NewAuth(auth scfg.Block) (Auth, error)

NewAuth creates a new authentication ring.

func (Auth) Cookies added in v0.10.0

func (auth Auth) Cookies() map[string][]*http.Cookie

func (*Auth) Credential added in v0.10.0

func (auth *Auth) Credential(name string) *Credential

func (Auth) Names added in v0.10.0

func (auth Auth) Names() []string

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler fetches press articles and archives them.

func NewCrawler

func NewCrawler(opts ...Option) (*Crawler, error)

NewCrawler creates a new press article crawler with the provided options.

func (*Crawler) Fetch

func (c *Crawler) Fetch(ctx context.Context, tgt string) (a Article, err error)

Fetch retrieves the press article at the tgt URL.

type Credential added in v0.10.0

type Credential struct {
	Domain string

	Cookies []*http.Cookie // cookies associated with the (last) authentication
	// contains filtered or unexported fields
}

func (Credential) Expires added in v0.11.0

func (cred Credential) Expires() (expires time.Time)

type Option

type Option func(c *config) error

Option customizes configuration.

func WithAuth

func WithAuth(auth Auth) Option

WithAuth configures the crawler with the provided credentials.

func WithCookies added in v0.8.0

func WithCookies(cookies []*http.Cookie) Option

WithCookies configures the cookies the crawler's web browser uses.

func WithHeadless

func WithHeadless(v bool) Option

WithHeadless configures whether the crawler's underlying web browser runs in headless mode.

func WithNumCPUs added in v0.9.0

func WithNumCPUs(n int) Option

WithNumCPUs limits the number of active goroutines during web browsing. A negative value indicates no limit. A zero value indicates to use the number of available CPUs.

func WithTimeout

func WithTimeout(timeout time.Duration) Option

WithTimeout configures the crawler to use a global fetch-timeout.

Directories

Path Synopsis
cmd
press-archiver
Command press-archiver archives press articles.
press-archiver-cookies
Command press-archiver-cookies refreshes cookies for press articles.
press-archiver-ls
Command press-archiver-ls displays the contents of the archived press articles database.
press-archiver-rm
Command press-archiver-rm removes a set of archives from the database of press articles.
press-archiver-srv
Command press-archiver-srv serves PDFs from archived press articles.
