contentscripts

package
v0.0.0-...-60192f8
Warning

This package is not in the latest version of its module.

Published: Apr 26, 2024 License: AGPL-3.0 Imports: 29 Imported by: 0

README

Readeck Content Scripts

API

The main content script API consists of exported functions that operate on the information extracted so far.

priority

exports.priority = 0

This is an integer value that defaults to 0 when unset. The higher the number, the later the script runs. A script that overrides the site configuration with setConfig needs a value higher than 10 to ensure it runs last.
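The ordering rule can be illustrated with a small simulation. This is only a sketch: the runInPriorityOrder helper and the two script objects are hypothetical stand-ins written for this example, and the real scheduling happens inside Readeck's Go runtime.

```javascript
// Two hypothetical content scripts, reduced to their exported objects.
const siteTweaks = {
  name: "site-tweaks",
  priority: 20, // higher than 10, so it runs after lower-priority scripts
  setConfig(config) {
    config.titleSelectors = ["//h1"]
  },
}
const defaults = {
  name: "defaults",
  // no priority set: treated as 0
  setConfig(config) {
    config.titleSelectors = ["/html/head/title"]
  },
}

// Hypothetical helper: call setConfig from the lowest to the highest priority.
function runInPriorityOrder(scripts, config) {
  const ordered = [...scripts].sort(
    (a, b) => (a.priority ?? 0) - (b.priority ?? 0),
  )
  for (const script of ordered) {
    script.setConfig(config)
  }
  return ordered.map((script) => script.name)
}

const config = { titleSelectors: [] }
const order = runInPriorityOrder([siteTweaks, defaults], config)
console.log(order, config.titleSelectors)
```

Because the script with the higher priority runs last, its titleSelectors value is the one that remains in the configuration.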

isActive

exports.isActive()

This function must return a boolean indicating whether the script can run in the current context.
If the function is absent from the script, the script's other functions never run.

// Always run
exports.isActive = function() {
  return true
}

// Only run on a specific domain
exports.isActive = function() {
  return $.domain == "youtube.com"
}
setConfig

exports.setConfig(config)

This function receives a reference to a SiteConfiguration object. It can set properties of the object as long as the value types don't change.

exports.setConfig = function(config) {
  // Override TitleSelectors
  config.titleSelectors = ["/html/head/title"]

  // Append a body selector
  config.bodySelectors.push("//main")
}

processMeta

exports.processMeta()

This function runs after the page metadata is loaded.

Global variables and functions

$: extractor information

The global variable $ holds everything that's needed to read or change information on the current extraction process.

$.domain (read only)

The domain of the current extraction. Note that it's different from the host name. For example, if the host name is www.example.co.uk, the value of $.domain is example.co.uk.

The value is always in its Unicode form regardless of the initial input.

$.hostname (read only)

The host name of the current extraction.

The value is always in its Unicode form regardless of the initial input.

$.url (read only)

The URL of the current extraction. The value is a string that you can parse with new URL($.url) when needed.

$.meta

This variable is an object whose values are lists of strings. For example:

{
  "html.title": ["document title"]
}

You can read, set or delete any value in $.meta. You cannot use push() to add new values; assign a new list instead.

When setting values, you can use a list or a single string.

$.meta["html.title"] = "new title" // valid
$.meta["html.author"] = ["someone", "someone else"] // valid
$.authors

A list of the authors found in the document.

Note: When setting this value, it must be a list, and you cannot use $.authors.push() to add new values.

$.description

A string with the document description.

$.title

A string with the document title.

$.type

The document type. When setting this value, it must be one of "article", "photo" or "video".

$.html (write only)

Setting a string to this variable replaces the whole extracted content. This is an advanced option and should only be used for content that is not an article (photos or videos).

$.readability

Whether readability is enabled for this content. It can be useful to set it to false when setting HTML content with $.html.

Please note that even though readability can be disabled, it won't disable the last cleaning pass that removes unwanted tags and attributes.

unescapeURL
/**
 * @param {string} value - input URL
 * @return {string}
 */
function unescapeURL(value)

This function transforms an escaped URL into its unescaped version.

decodeXML
/**
 * @param {string} input
 * @return {Object}
 */
function decodeXML(input)

This function decodes XML text into an object that can be serialized into JSON or filtered.

requests

If you need to perform HTTP requests in a content script, you must use the requests global object.

This is by no means a full-featured or advanced HTTP client, but it lets you perform simple requests and retrieve JSON or text responses.

const rsp = requests.get("https://nativerest.net/echo/get")
rsp.raiseForStatus()
const data = rsp.json()

requests.get(url, [headers])

This function performs a GET HTTP request and returns a response object.

An optional header object can take header values for the request.

requests.post(url, data, [headers])

This function performs a POST HTTP request and returns a response object. The data parameter must be a string containing the data you want to send.

An optional header object can take header values for the request.

const rsp = requests.post(
  "http://example.net/",
  JSON.stringify({"a": "abc"}),
  {"Content-Type": "application/json"},
)

response object

response.status

This is the numeric status code.

response.headers

This contains all the response's headers.

response.raiseForStatus()

This function will throw an error if the status is not 2xx.

response.json()

This function parses the response's body as JSON and returns the resulting object.

response.text()

This function returns the response's text content.

Types

Site Configuration

The setConfig function receives a config object that can be modified.

config.titleSelectors - []string

XPath selectors for the document title.

config.bodySelectors - []string

XPath selectors for the document body.

config.dateSelectors - []string

XPath selectors for the document date.

config.authorSelectors - []string

XPath selectors for the document authors.

config.stripSelectors - []string

XPath selectors of elements that must be removed.

config.stripIdOrClass - []string

List of IDs or classes that belong to elements that must be removed.

config.stripImageSrc - []string

List of strings that, when present in the src attribute of an image, trigger the removal of the element.

config.singlePageLinkSelectors - []string

XPath selectors of elements whose href attribute refers to a link to the full document.

config.nextPageLinkSelectors - []string

XPath selectors of elements whose href attribute refers to a link to the next page.

config.replaceStrings - [][2]string

List of string replacement pairs.

config.httpHeaders - object

An object that contains HTTP headers sent with every subsequent request.

Documentation

Overview

Package contentscripts provides a JavaScript engine that runs built-in or user-defined scripts during the extraction process.


Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractAuthor

func ExtractAuthor(m *extract.ProcessMessage, next extract.Processor) extract.Processor

ExtractAuthor applies the "author" directives to find an author.

func ExtractBody

ExtractBody tries to find a body as defined by the "body" directives in the configuration file.

func ExtractDate

ExtractDate applies the "date" directives to find a date. If a date is found we try to parse it.

func FindContentPage

func FindContentPage(m *extract.ProcessMessage, next extract.Processor) extract.Processor

FindContentPage searches for SinglePageLinkSelectors in the page and, if it finds one, resets the process to its beginning with the newly found URL.

func FindNextPage

FindNextPage looks for NextPageLinkSelectors and if it finds a URL, it's added to the message and can be processed later with GoToNextPage.

func GoToNextPage

GoToNextPage checks if there is a "next_page" value in the process message. It then creates a new drop with the URL.

func LoadScripts

func LoadScripts(programs ...*Program) extract.Processor

LoadScripts starts the content script runtime and adds it to the extractor context.

func LoadSiteConfig

func LoadSiteConfig(m *extract.ProcessMessage, next extract.Processor) extract.Processor

LoadSiteConfig will try to find a matching site config for the first Drop (the extraction starting point).

If a configuration is found, it will be added to the context.

If the configuration indicates custom HTTP headers, they'll be added to the client.

func NewHTTPClient

func NewHTTPClient(vm *Runtime, client *http.Client) (*goja.Object, error)

NewHTTPClient returns a new (very) simple HTTP client for the JS runtime.

func ProcessMeta

ProcessMeta runs the processMeta functions exported by the content scripts.

func ReplaceStrings

func ReplaceStrings(m *extract.ProcessMessage, next extract.Processor) extract.Processor

ReplaceStrings applies all the replace_string directives in the site config file to the received body.

func StripTags

StripTags removes the tags from the DOM root node, according to "strip_tags" configuration directives.

Types

type FilterTest

type FilterTest struct {
	URL      string   `json:"url"`
	Contains []string `json:"contains"`
}

FilterTest holds the values for a filter's test.

type Program

type Program struct {
	*goja.Program
	Name     string
	Priority int
}

Program is a wrapper around goja.Program, with a script name.

func NewProgram

func NewProgram(name string, r io.Reader) (*Program, error)

NewProgram wraps a script into an anonymous function call exposing the "exports" object and returns a Program instance.

type Runtime

type Runtime struct {
	*goja.Runtime
	// contains filtered or unexported fields
}

Runtime contains a collection of content scripts.

func New

func New(programs ...*Program) (*Runtime, error)

New creates a new ContentScript instance.

func (*Runtime) AddScript

func (vm *Runtime) AddScript(name string, r io.Reader) error

AddScript wraps a script into an anonymous function call exposing the "exports" object and adds it to the script list.

func (*Runtime) GetLogger

func (vm *Runtime) GetLogger() *logrus.Entry

GetLogger returns the runtime's log entry or a default one when not set.

func (*Runtime) ProcessMeta

func (vm *Runtime) ProcessMeta() error

ProcessMeta runs every script and calls their respective "processMeta" exported function when it exists.

func (*Runtime) RunProgram

func (vm *Runtime) RunProgram(p *Program) (goja.Value, error)

RunProgram runs a Program instance in the VM and returns its result.

func (*Runtime) SetConfig

func (vm *Runtime) SetConfig(cf *SiteConfig) error

SetConfig runs every script and calls their respective "setConfig" exported function when it exists. The initial configuration is passed to each function as a pointer and can be modified in place.

func (*Runtime) SetLogger

func (vm *Runtime) SetLogger(entry *logrus.Entry)

SetLogger sets the runtime's log entry.

func (*Runtime) SetProcessMessage

func (vm *Runtime) SetProcessMessage(m *extract.ProcessMessage)

SetProcessMessage adds an extract.ProcessMessage to the content script context.

type SiteConfig

type SiteConfig struct {
	TitleSelectors          []string          `json:"title_selectors"            js:"titleSelectors"`
	BodySelectors           []string          `json:"body_selectors"             js:"bodySelectors"`
	DateSelectors           []string          `json:"date_selectors"             js:"dateSelectors"`
	AuthorSelectors         []string          `json:"author_selectors"           js:"authorSelectors"`
	StripSelectors          []string          `json:"strip_selectors"            js:"stripSelectors"`
	StripIDOrClass          []string          `json:"strip_id_or_class"          js:"stripIdOrClass"`
	StripImageSrc           []string          `json:"strip_image_src"            js:"stripImageSrc"`
	NativeAdSelectors       []string          `json:"native_ad_selectors"`
	Tidy                    bool              `json:"tidy"`
	Prune                   bool              `json:"prune"`
	AutoDetectOnFailure     bool              `json:"autodetect_on_failure"`
	SinglePageLinkSelectors []string          `json:"single_page_link_selectors" js:"singlePageLinkSelectors"`
	NextPageLinkSelectors   []string          `json:"next_page_link_selectors"   js:"nextPageLinkSelectors"`
	ReplaceStrings          [][2]string       `json:"replace_strings"            js:"replaceStrings"`
	HTTPHeaders             map[string]string `json:"http_headers"               js:"httpHeaders"`
	Tests                   []FilterTest      `json:"tests"`
	// contains filtered or unexported fields
}

SiteConfig holds the fivefilters configuration.

func NewConfigForURL

func NewConfigForURL(discovery *SiteConfigDiscovery, src *url.URL) (*SiteConfig, error)

NewConfigForURL loads the site configuration file(s) for a given URL.

func NewSiteConfig

func NewSiteConfig(r io.Reader) (*SiteConfig, error)

NewSiteConfig loads a configuration file from an io.Reader.

func (*SiteConfig) Files

func (cf *SiteConfig) Files() []string

Files returns the files used to create the configuration.

func (*SiteConfig) Merge

func (cf *SiteConfig) Merge(new *SiteConfig)

Merge merges a new configuration into the current one.

type SiteConfigDiscovery

type SiteConfigDiscovery struct {
	fs.FS
}

SiteConfigDiscovery is a wrapper around an fs.FS that provides a function to find site-config files based on a name.

var (
	SiteConfigFiles *SiteConfigDiscovery // SiteConfigFiles is the default site-config files discovery
)

func NewSiteconfigDiscovery

func NewSiteconfigDiscovery(root fs.FS) *SiteConfigDiscovery

NewSiteconfigDiscovery returns a new configuration discovery instance.

func (*SiteConfigDiscovery) FindConfigHostFile

func (d *SiteConfigDiscovery) FindConfigHostFile(name string) []string

FindConfigHostFile finds the files matching the given name.
