extract

package
v0.0.0-...-d1a9080 Latest
Warning

This package is not in the latest version of its module.

Published: May 5, 2021 License: AGPL-3.0 Imports: 25 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewClient

func NewClient() *http.Client

NewClient returns a new http.Client with our custom transport.

func NewRemoteImage

func NewRemoteImage(src string, client *http.Client) (img.Image, error)

NewRemoteImage loads an image and returns a new img.Image instance.

func SetDeniedIPs

func SetDeniedIPs(netList []*net.IPNet) func(e *Extractor)

SetDeniedIPs sets a list of IP addresses or CIDR ranges that cannot be reached by the extraction client.

func SetHeader

func SetHeader(client *http.Client, name, value string)

SetHeader sets a header on a given client.

func SetLogFields

func SetLogFields(f *log.Fields) func(e *Extractor)

SetLogFields sets the default log fields for the extractor.

Types

type Drop

type Drop struct {
	URL          *url.URL
	Domain       string
	ContentType  string
	Charset      string
	DocumentType string

	Title       string
	Description string
	Authors     []string
	Site        string
	Lang        string
	Date        time.Time

	Header http.Header
	Meta   DropMeta
	Body   []byte `json:"-"`

	Pictures map[string]*Picture
}

Drop is the result of a content extraction of one resource.

func NewDrop

func NewDrop(src *url.URL) *Drop

NewDrop returns a Drop instance.

func (*Drop) AddAuthors

func (d *Drop) AddAuthors(values ...string)

AddAuthors adds authors to the author list, ignoring potential duplicates.
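
The duplicate handling can be pictured with a small sketch. The drop type below is a hypothetical stand-in reduced to the Authors field, and comparing authors as exact strings is an assumption. The package may normalize values differently.

```go
package main

import "fmt"

// drop is a hypothetical stand-in for the package's Drop type,
// reduced to the Authors field.
type drop struct {
	Authors []string
}

// addAuthors appends values to the author list, skipping any value that
// is already present. Assumption: duplicates are exact string matches.
func (d *drop) addAuthors(values ...string) {
	seen := make(map[string]struct{}, len(d.Authors))
	for _, a := range d.Authors {
		seen[a] = struct{}{}
	}
	for _, v := range values {
		if _, ok := seen[v]; ok {
			continue
		}
		seen[v] = struct{}{}
		d.Authors = append(d.Authors, v)
	}
}

func main() {
	d := &drop{}
	d.addAuthors("Ada", "Grace", "Ada")
	d.addAuthors("Grace", "Linus")
	fmt.Println(d.Authors) // prints [Ada Grace Linus]
}
```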

func (*Drop) IsHTML

func (d *Drop) IsHTML() bool

IsHTML returns true when the resource is of type HTML.

func (*Drop) IsMedia

func (d *Drop) IsMedia() bool

IsMedia returns true when the document type is a media type.

func (*Drop) Load

func (d *Drop) Load(client *http.Client) error

Load loads the remote URL and retrieves its data.

func (*Drop) UnescapedURL

func (d *Drop) UnescapedURL() string

UnescapedURL returns the Drop's URL unescaped, for storage.

type DropMeta

type DropMeta map[string][]string

DropMeta is a map of string lists that contains the collected metadata.

func (DropMeta) Add

func (m DropMeta) Add(name, value string)

Add adds a value to the raw metadata list.

func (DropMeta) Lookup

func (m DropMeta) Lookup(names ...string) []string

Lookup returns all the found values for the provided metadata names.

func (DropMeta) LookupGet

func (m DropMeta) LookupGet(names ...string) string

LookupGet returns the first value found for the provided metadata names.
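
A minimal sketch of how a map[string][]string with these three methods might behave. The dropMeta type and the metadata key names are hypothetical. The fallback order (names are tried in sequence and the first one present wins) is an assumption, not something the documentation above states.

```go
package main

import "fmt"

// dropMeta is a hypothetical stand-in for the package's DropMeta type.
type dropMeta map[string][]string

// add appends value to the list stored under name.
func (m dropMeta) add(name, value string) {
	m[name] = append(m[name], value)
}

// lookup returns the values of the first name found in the map.
// Assumption: names act as an ordered list of fallbacks.
func (m dropMeta) lookup(names ...string) []string {
	for _, name := range names {
		if values, ok := m[name]; ok {
			return values
		}
	}
	return nil
}

// lookupGet returns the first value found, or "" when nothing matches.
func (m dropMeta) lookupGet(names ...string) string {
	if values := m.lookup(names...); len(values) > 0 {
		return values[0]
	}
	return ""
}

func main() {
	m := dropMeta{}
	m.add("html.title", "Fallback title")   // hypothetical key
	m.add("graph.title", "Preferred title") // hypothetical key

	fmt.Println(m.lookupGet("graph.title", "html.title"))
	fmt.Println(m.lookupGet("graph.description") == "")
}
```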

type Error

type Error []error

Error holds all the non-fatal errors that were caught during extraction.

func (Error) Error

func (e Error) Error() string

type Extractor

type Extractor struct {
	URL       *url.URL
	HTML      []byte
	Text      string
	Visited   URLList
	Logs      []string
	Context   context.Context
	LogFields *log.Fields
	// contains filtered or unexported fields
}

Extractor is a page extractor.

func New

func New(src string, html []byte, options ...func(e *Extractor)) (*Extractor, error)

New returns an Extractor instance for a given URL, with a default HTTP client.

func (*Extractor) AddDrop

func (e *Extractor) AddDrop(src *url.URL)

AddDrop adds a new Drop to the drop list.

func (*Extractor) AddError

func (e *Extractor) AddError(err error)

AddError adds a new error to the extractor's error list.

func (*Extractor) AddProcessors

func (e *Extractor) AddProcessors(p ...Processor)

AddProcessors adds one or more extraction processors to the list.

func (*Extractor) Client

func (e *Extractor) Client() *http.Client

Client returns the extractor's HTTP client.

func (*Extractor) Drop

func (e *Extractor) Drop() *Drop

Drop returns the extractor's first drop, when there is one.

func (*Extractor) Drops

func (e *Extractor) Drops() []*Drop

Drops returns the extractor's drop list.

func (*Extractor) Errors

func (e *Extractor) Errors() Error

Errors returns the extractor's error list.

func (*Extractor) GetLogger

func (e *Extractor) GetLogger() *log.Logger

GetLogger returns a logger for the extractor. This standard logger copies everything to the extractor's Logs slice.

func (*Extractor) NewProcessMessage

func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage

NewProcessMessage returns a new ProcessMessage for a given step.

func (*Extractor) ReplaceDrop

func (e *Extractor) ReplaceDrop(src *url.URL) error

ReplaceDrop replaces the main Drop with a new one.

func (*Extractor) Run

func (e *Extractor) Run()

Run starts the extraction process.

type Picture

type Picture struct {
	Href string
	Type string
	Size [2]int
	// contains filtered or unexported fields
}

Picture is a remote picture.

func NewPicture

func NewPicture(src string, base *url.URL) (*Picture, error)

NewPicture returns a new Picture instance from a given URL and its base.

func (*Picture) Bytes

func (p *Picture) Bytes() []byte

Bytes returns the image data.

func (*Picture) Copy

func (p *Picture) Copy(size uint, toFormat string) (*Picture, error)

Copy returns a resized copy of the image, as a new Picture instance.

func (*Picture) Encoded

func (p *Picture) Encoded() string

Encoded returns a base64 encoded string of the image.

func (*Picture) Load

func (p *Picture) Load(client *http.Client, size uint, toFormat string) error

Load loads the remote image and fits it within the given size boundary.

func (*Picture) Name

func (p *Picture) Name(name string) string

Name returns the given name of the picture with the correct extension.

type ProcessList

type ProcessList []Processor

ProcessList holds the processes that will be applied.

type ProcessMessage

type ProcessMessage struct {
	Context   context.Context
	Extractor *Extractor
	Log       *log.Entry
	Dom       *html.Node
	// contains filtered or unexported fields
}

ProcessMessage holds the process message that is passed (and changed) by the subsequent processes.

func (*ProcessMessage) Cancel

func (m *ProcessMessage) Cancel(reason string, args ...interface{})

Cancel fully cancels the extraction process.

func (*ProcessMessage) Position

func (m *ProcessMessage) Position() int

Position returns the current process position.

func (*ProcessMessage) ResetContent

func (m *ProcessMessage) ResetContent()

ResetContent empties the message Dom and the body of every drop.

func (*ProcessMessage) ResetPosition

func (m *ProcessMessage) ResetPosition()

ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).
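
The reset counter described above could look roughly like the following. The message type, its fields, the position it resets to, and the maxReset value of 10 are all assumptions made for illustration. Only the existence of a counter bounded by maxReset comes from the documentation.

```go
package main

import "fmt"

// maxReset is a hypothetical limit; the package's real value is not
// documented here.
const maxReset = 10

// message is a hypothetical stand-in for ProcessMessage.
type message struct {
	position int
	resets   int
	canceled bool
	reason   string
}

// cancel marks the whole process as canceled.
func (m *message) cancel(reason string) {
	m.canceled = true
	m.reason = reason
}

// resetPosition restarts the process list, unless it has already been
// restarted maxReset times, in which case everything is canceled.
// Assumption: restarting means going back to position 0.
func (m *message) resetPosition() {
	m.resets++
	if m.resets > maxReset {
		m.cancel("too many resets")
		return
	}
	m.position = 0
}

func main() {
	m := &message{position: 3}
	for i := 0; i <= maxReset; i++ {
		m.resetPosition()
	}
	fmt.Println(m.canceled, m.reason)
}
```

A bound like this keeps a chain of redirects or URL replacements from looping forever.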

func (*ProcessMessage) SetValue

func (m *ProcessMessage) SetValue(name string, value interface{})

SetValue sets a new message value.

func (*ProcessMessage) Step

func (m *ProcessMessage) Step() ProcessStep

Step returns the current process step.

func (*ProcessMessage) Value

func (m *ProcessMessage) Value(name string) interface{}

Value returns a stored message value.

type ProcessStep

type ProcessStep int

ProcessStep defines a type of process applied during extraction.

const (
	// StepStart happens before the connection is made.
	StepStart ProcessStep = iota + 1

	// StepBody happens after receiving the resource body.
	StepBody

	// StepDom happens after parsing the resource DOM tree.
	StepDom

	// StepFinish happens at the very end of the extraction.
	StepFinish

	// StepPostProcess happens after looping over each Drop.
	StepPostProcess
)
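
Because the block starts at iota + 1, the constants take the values 1 through 5 and the zero value of ProcessStep matches no step. The snippet below redeclares the constants under lowercase, local names purely to show those values.

```go
package main

import "fmt"

// processStep mirrors the ProcessStep declaration with local names.
type processStep int

const (
	stepStart processStep = iota + 1
	stepBody
	stepDom
	stepFinish
	stepPostProcess
)

func main() {
	var unset processStep // zero value, not a valid step
	fmt.Println(unset, stepStart, stepBody, stepDom, stepFinish, stepPostProcess)
	// prints 0 1 2 3 4 5
}
```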

type Processor

type Processor func(*ProcessMessage, Processor) Processor

Processor is the process function.
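
The signature suggests a chain in which each processor receives the message together with the processor that follows it, and returns the one to run next. The sketch below explores that reading with hypothetical message and processor types. The convention that returning nil stops the chain is an assumption, and the run loop is not the package's implementation.

```go
package main

import "fmt"

// message is a hypothetical stand-in for ProcessMessage.
type message struct {
	trace []string
}

// processor has the same shape as the package's Processor type.
type processor func(m *message, next processor) processor

// run calls each processor with the one that follows it in the list.
// Assumption: a processor returns next to continue, or nil to stop.
func run(m *message, list []processor) {
	if len(list) == 0 {
		return
	}
	cur := list[0]
	for i := 0; cur != nil; i++ {
		var next processor
		if i+1 < len(list) {
			next = list[i+1]
		}
		cur = cur(m, next)
	}
}

// step builds a processor that records its name and continues.
func step(name string) processor {
	return func(m *message, next processor) processor {
		m.trace = append(m.trace, name)
		return next
	}
}

// stop records itself and ends the chain by returning nil.
func stop(m *message, next processor) processor {
	m.trace = append(m.trace, "stop")
	return nil
}

func main() {
	m := &message{}
	run(m, []processor{step("a"), stop, step("b")})
	fmt.Println(m.trace) // prints [a stop]
}
```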

type Transport

type Transport struct {
	// contains filtered or unexported fields
}

Transport is a wrapper around http.RoundTripper that lets you set default headers sent with every request.

func (*Transport) GetHeader

func (t *Transport) GetHeader(name string) string

GetHeader returns a header value from the transport.

func (*Transport) RoundTrip

func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip is the transport interceptor.

func (*Transport) SetHeader

func (t *Transport) SetHeader(name, value string)

SetHeader lets you set a default header for any subsequent request.

type URLList

type URLList map[string]bool

URLList holds a list of URLs.

func (URLList) Add

func (l URLList) Add(v *url.URL)

Add adds a new URL to the list.

func (URLList) IsPresent

func (l URLList) IsPresent(v *url.URL) bool

IsPresent returns true when the URL is present in the list.
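
A map[string]bool works as a simple set of visited URLs. In the sketch below, urlList is a hypothetical stand-in and keying the map on the URL's String() form is an assumption. The package may normalize URLs before storing them.

```go
package main

import (
	"fmt"
	"net/url"
)

// urlList is a hypothetical stand-in for the package's URLList type.
type urlList map[string]bool

// add records v in the set. Assumption: the key is v.String().
func (l urlList) add(v *url.URL) {
	l[v.String()] = true
}

// isPresent reports whether v was added before.
func (l urlList) isPresent(v *url.URL) bool {
	return l[v.String()]
}

// mustParse parses raw and panics on error; for example use only.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	visited := urlList{}
	first := mustParse("https://example.org/article")
	second := mustParse("https://example.org/other")

	visited.add(first)
	fmt.Println(visited.isPresent(first), visited.isPresent(second))
}
```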
