Documentation ¶
Overview ¶
Package contentscripts provides a JavaScript engine that runs built-in or user-defined scripts during the extraction process.
Index ¶
- func ExtractAuthor(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func ExtractBody(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func ExtractDate(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func FindContentPage(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func FindNextPage(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func GoToNextPage(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func LoadScripts(programs ...*Program) extract.Processor
- func LoadSiteConfig(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func NewHTTPClient(vm *Runtime, client *http.Client) (*goja.Object, error)
- func ProcessMeta(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func ReplaceStrings(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- func StripTags(m *extract.ProcessMessage, next extract.Processor) extract.Processor
- type FilterTest
- type Program
- type Runtime
- func (vm *Runtime) AddScript(name string, r io.Reader) error
- func (vm *Runtime) GetLogger() *logrus.Entry
- func (vm *Runtime) ProcessMeta() error
- func (vm *Runtime) RunProgram(p *Program) (goja.Value, error)
- func (vm *Runtime) SetConfig(cf *SiteConfig) error
- func (vm *Runtime) SetLogger(entry *logrus.Entry)
- func (vm *Runtime) SetProcessMessage(m *extract.ProcessMessage)
- type SiteConfig
- type SiteConfigDiscovery
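Most functions in the index above share the shape func(m *extract.ProcessMessage, next extract.Processor) extract.Processor. The following is a minimal sketch of how such a chain might be driven. ProcessMessage, Processor, step, stop and run are local stand-ins invented for illustration; the real extract package types and its driver are not shown on this page and may behave differently.

```go
package main

import "fmt"

// ProcessMessage and Processor are local stand-ins for the extract package
// types, whose definitions are not part of this page.
type ProcessMessage struct {
	Steps []string
}

// Processor mirrors the signature shared by most functions in the index.
type Processor func(m *ProcessMessage, next Processor) Processor

// step returns a Processor that records its name and hands over to next.
func step(name string) Processor {
	return func(m *ProcessMessage, next Processor) Processor {
		m.Steps = append(m.Steps, name)
		return next
	}
}

// stop ends the chain by returning nil instead of next.
func stop(m *ProcessMessage, next Processor) Processor {
	return nil
}

// run is an assumed driver: it calls each processor in order and stops as
// soon as one of them returns nil.
func run(m *ProcessMessage, procs ...Processor) {
	for i := 0; i < len(procs); i++ {
		var next Processor
		if i+1 < len(procs) {
			next = procs[i+1]
		}
		if procs[i](m, next) == nil {
			return
		}
	}
}

func main() {
	m := &ProcessMessage{}
	run(m, step("author"), step("date"), step("body"))
	fmt.Println(m.Steps)
}
```

In this sketch a processor continues the chain by returning next and interrupts it by returning nil; whether the real package uses the same convention is an assumption.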
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ExtractAuthor ¶
ExtractAuthor applies the "author" directives to find an author.
func ExtractBody ¶
ExtractBody tries to find a body as defined by the "body" directives in the configuration file.
func ExtractDate ¶
ExtractDate applies the "date" directives to find a date. If a date is found we try to parse it.
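As an illustration of the "try to parse it" step, here is a small sketch that tries a list of layouts in turn. The layouts list and the parseDate helper are assumptions made for this example; the formats the package actually accepts are not documented here.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// layouts is an assumed list of date formats, tried in order.
var layouts = []string{
	time.RFC3339,
	time.RFC1123Z,
	"2006-01-02",
	"January 2, 2006",
}

// parseDate returns the first successful parse of s, or false when no
// layout matches.
func parseDate(s string) (time.Time, bool) {
	s = strings.TrimSpace(s)
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, true
		}
	}
	return time.Time{}, false
}

func main() {
	t, ok := parseDate(" March 4, 2021 ")
	fmt.Println(t.Format("2006-01-02"), ok)
}
```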
func FindContentPage ¶
FindContentPage searches for SinglePageLinkSelectors in the page and, if it finds one, resets the process to its beginning with the newly found URL.
func FindNextPage ¶
FindNextPage looks for NextPageLinkSelectors. If it finds a URL, the URL is added to the message and can be processed later with GoToNextPage.
func GoToNextPage ¶
GoToNextPage checks if there is a "next_page" value in the process message. If there is one, it creates a new drop with that URL.
func LoadScripts ¶
LoadScripts starts the content script runtime and adds it to the extractor context.
func LoadSiteConfig ¶
LoadSiteConfig will try to find a matching site config for the first Drop (the extraction starting point).
If a configuration is found, it will be added to the context.
If the configuration indicates custom HTTP headers, they'll be added to the client.
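One common way to attach fixed headers to an HTTP client in Go is to wrap its transport. The sketch below shows that technique with the standard library only; headerTransport, withHeaders, recorder and sentHeader are hypothetical names, and this is not necessarily how LoadSiteConfig does it.

```go
package main

import (
	"fmt"
	"net/http"
)

// headerTransport adds a fixed set of headers to every outgoing request.
type headerTransport struct {
	base    http.RoundTripper
	headers map[string]string
}

func (t headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not modify the caller's request, so use a clone.
	r := req.Clone(req.Context())
	for k, v := range t.headers {
		r.Header.Set(k, v)
	}
	return t.base.RoundTrip(r)
}

// withHeaders returns a copy of client that sends the given headers.
func withHeaders(client *http.Client, headers map[string]string) *http.Client {
	base := client.Transport
	if base == nil {
		base = http.DefaultTransport
	}
	c := *client
	c.Transport = headerTransport{base: base, headers: headers}
	return &c
}

// recorder is a stub transport that captures the request instead of
// sending it over the network.
type recorder struct{ got *http.Request }

func (r *recorder) RoundTrip(req *http.Request) (*http.Response, error) {
	r.got = req
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// sentHeader reports the value of one header as seen by the transport.
func sentHeader(headers map[string]string, name string) (string, error) {
	rec := &recorder{}
	client := withHeaders(&http.Client{Transport: rec}, headers)
	resp, err := client.Get("http://example.invalid/")
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return rec.got.Header.Get(name), nil
}

func main() {
	v, err := sentHeader(map[string]string{"User-Agent": "demo/1.0"}, "User-Agent")
	fmt.Println(v, err)
}
```

Wrapping the transport keeps the headers out of every call site, which suits a map such as the HTTPHeaders field of SiteConfig.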
func NewHTTPClient ¶
NewHTTPClient returns a new (very) simple HTTP client for the JS runtime.
func ProcessMeta ¶
ProcessMeta runs the content scripts' exported processMeta functions.
func ReplaceStrings ¶
ReplaceStrings applies all the replace_string directives from the site config file to the received body.
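The SiteConfig type stores these directives as a [][2]string. A minimal sketch of applying such pairs to a body, assuming plain in-order substring replacement (applyReplacements is a hypothetical helper, not part of the package):

```go
package main

import (
	"fmt"
	"strings"
)

// applyReplacements replaces every occurrence of pair[0] with pair[1],
// one pair after the other.
func applyReplacements(body string, pairs [][2]string) string {
	for _, p := range pairs {
		body = strings.ReplaceAll(body, p[0], p[1])
	}
	return body
}

func main() {
	pairs := [][2]string{{"<br><br>", "</p><p>"}}
	fmt.Println(applyReplacements("<p>one<br><br>two</p>", pairs))
}
```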
Types ¶
type FilterTest ¶
FilterTest holds the values for a filter's test.
type Runtime ¶
Runtime contains a collection of content scripts.
func (*Runtime) AddScript ¶
AddScript wraps a script into an anonymous function call exposing the "exports" object and adds it to the script list.
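To make the wrapping idea concrete, here is a sketch that surrounds a script's source with an immediately invoked function receiving an "exports" object. The exact wrapper text used by AddScript is not documented on this page, so wrapSource and wrapScript below are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// wrapSource surrounds src with an anonymous function call that exposes
// an "exports" object and returns it.
func wrapSource(src string) string {
	return "(function(exports) {\n" + src + "\nreturn exports;\n})({})"
}

// wrapScript reads a script from r and wraps it with wrapSource.
func wrapScript(r io.Reader) (string, error) {
	b, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return wrapSource(string(b)), nil
}

func main() {
	js := `exports.processMeta = function() {}`
	wrapped, err := wrapScript(strings.NewReader(js))
	if err != nil {
		panic(err)
	}
	fmt.Println(wrapped)
}
```

Wrapping each script this way gives it a private scope, so functions such as processMeta or setConfig are only visible through the returned exports object.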
func (*Runtime) GetLogger ¶
GetLogger returns the runtime's log entry or a default one when not set.
func (*Runtime) ProcessMeta ¶
ProcessMeta runs every script and calls their respective "processMeta" exported function when it exists.
func (*Runtime) RunProgram ¶
RunProgram runs a Program instance in the VM and returns its result.
func (*Runtime) SetConfig ¶
func (vm *Runtime) SetConfig(cf *SiteConfig) error
SetConfig runs every script and calls their respective "setConfig" exported function when it exists. The initial configuration is passed to each function as a pointer and can be modified in place.
func (*Runtime) SetProcessMessage ¶
func (vm *Runtime) SetProcessMessage(m *extract.ProcessMessage)
SetProcessMessage adds an extract.ProcessMessage to the content script context.
type SiteConfig ¶
type SiteConfig struct {
	TitleSelectors          []string          `json:"title_selectors" js:"titleSelectors"`
	BodySelectors           []string          `json:"body_selectors" js:"bodySelectors"`
	DateSelectors           []string          `json:"date_selectors" js:"dateSelectors"`
	AuthorSelectors         []string          `json:"author_selectors" js:"authorSelectors"`
	StripSelectors          []string          `json:"strip_selectors" js:"stripSelectors"`
	StripIDOrClass          []string          `json:"strip_id_or_class" js:"stripIdOrClass"`
	StripImageSrc           []string          `json:"strip_image_src" js:"stripImageSrc"`
	NativeAdSelectors       []string          `json:"native_ad_selectors"`
	Tidy                    bool              `json:"tidy"`
	Prune                   bool              `json:"prune"`
	AutoDetectOnFailure     bool              `json:"autodetect_on_failure"`
	SinglePageLinkSelectors []string          `json:"single_page_link_selectors" js:"singlePageLinkSelectors"`
	NextPageLinkSelectors   []string          `json:"next_page_link_selectors" js:"nextPageLinkSelectors"`
	ReplaceStrings          [][2]string       `json:"replace_strings" js:"replaceStrings"`
	HTTPHeaders             map[string]string `json:"http_headers" js:"httpHeaders"`
	Tests                   []FilterTest      `json:"tests"`
	// contains filtered or unexported fields
}
SiteConfig holds the fivefilters configuration.
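The json tags in the struct definition determine the field names when a configuration is serialized with encoding/json. The sketch below uses a local copy of three fields to show the resulting shape; siteConfig and encode are illustrative names, and this says nothing about the input format NewSiteConfig expects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// siteConfig is a local, reduced copy of SiteConfig used for illustration.
type siteConfig struct {
	TitleSelectors []string    `json:"title_selectors"`
	ReplaceStrings [][2]string `json:"replace_strings"`
	Prune          bool        `json:"prune"`
}

// encode serializes cf using the json struct tags.
func encode(cf siteConfig) (string, error) {
	b, err := json.Marshal(cf)
	return string(b), err
}

func main() {
	out, err := encode(siteConfig{
		TitleSelectors: []string{"//h1"},
		ReplaceStrings: [][2]string{{"foo", "bar"}},
		Prune:          true,
	})
	fmt.Println(out, err)
}
```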
func NewConfigForURL ¶
func NewConfigForURL(discovery *SiteConfigDiscovery, src *url.URL) (*SiteConfig, error)
NewConfigForURL loads the site config file(s) for a given URL.
func NewSiteConfig ¶
func NewSiteConfig(r io.Reader) (*SiteConfig, error)
NewSiteConfig loads a configuration file from an io.Reader.
func (*SiteConfig) Files ¶
func (cf *SiteConfig) Files() []string
Files returns the files used to create the configuration.
func (*SiteConfig) Merge ¶
func (cf *SiteConfig) Merge(new *SiteConfig)
Merge merges a new configuration into the current one.
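The merge rules are not spelled out on this page. As one plausible sketch, the example below appends list values and lets the incoming configuration overwrite header entries with the same key; siteConfig and merge are local stand-ins, and the real Merge may treat fields differently.

```go
package main

import "fmt"

// siteConfig is a local, reduced stand-in for SiteConfig.
type siteConfig struct {
	BodySelectors []string
	HTTPHeaders   map[string]string
}

// merge folds other into cf. Assumed rules: slices are appended, map
// entries from other win over existing ones.
func (cf *siteConfig) merge(other *siteConfig) {
	cf.BodySelectors = append(cf.BodySelectors, other.BodySelectors...)
	if len(other.HTTPHeaders) > 0 && cf.HTTPHeaders == nil {
		cf.HTTPHeaders = map[string]string{}
	}
	for k, v := range other.HTTPHeaders {
		cf.HTTPHeaders[k] = v
	}
}

func main() {
	base := &siteConfig{
		BodySelectors: []string{"//article"},
		HTTPHeaders:   map[string]string{"User-Agent": "a"},
	}
	base.merge(&siteConfig{
		BodySelectors: []string{"//main"},
		HTTPHeaders:   map[string]string{"User-Agent": "b"},
	})
	fmt.Println(base.BodySelectors, base.HTTPHeaders)
}
```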
type SiteConfigDiscovery ¶
SiteConfigDiscovery is a wrapper around an fs.FS that provides a function to find site-config files based on a name.
var (
SiteConfigFiles *SiteConfigDiscovery // SiteConfigFiles is the default site-config files discovery
)
func NewSiteconfigDiscovery ¶
func NewSiteconfigDiscovery(root fs.FS) *SiteConfigDiscovery
NewSiteconfigDiscovery returns a new configuration discovery instance.
func (*SiteConfigDiscovery) FindConfigHostFile ¶
func (d *SiteConfigDiscovery) FindConfigHostFile(name string) []string
FindConfigHostFile finds the files matching the given name.
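A name-based lookup over an fs.FS could look like the sketch below. The matching rules here (an exact "<host>.txt" file, plus ".<parent-domain>.txt" files for subdomains) are an assumption for the example, as are the findHostFiles and demoFS helpers; the package's actual rules are not described on this page.

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// findHostFiles returns the candidate files for host that exist in root.
func findHostFiles(root fs.FS, host string) []string {
	candidates := []string{host + ".txt"}
	// Assumed convention: ".example.com.txt" applies to any subdomain.
	parts := strings.Split(host, ".")
	for i := 1; i < len(parts)-1; i++ {
		candidates = append(candidates, "."+strings.Join(parts[i:], ".")+".txt")
	}

	var found []string
	for _, name := range candidates {
		if info, err := fs.Stat(root, name); err == nil && !info.IsDir() {
			found = append(found, name)
		}
	}
	return found
}

// demoFS builds a small in-memory file system for the example.
func demoFS() fs.FS {
	return fstest.MapFS{
		"example.com.txt":  &fstest.MapFile{Data: []byte("body: //article")},
		".example.com.txt": &fstest.MapFile{Data: []byte("title: //h1")},
		"other.org.txt":    &fstest.MapFile{Data: []byte("body: //main")},
	}
}

func main() {
	fmt.Println(findHostFiles(demoFS(), "www.example.com"))
	fmt.Println(findHostFiles(demoFS(), "example.com"))
}
```

Because the lookup only depends on fs.FS, the same code works with os.DirFS, an embed.FS, or an in-memory file system in tests.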