Documentation ¶
Index ¶
- func NewClient() *http.Client
- func NewRemoteImage(src string, client *http.Client) (img.Image, error)
- func SetDeniedIPs(netList []*net.IPNet) func(e *Extractor)
- func SetHeader(client *http.Client, name, value string)
- func SetLogFields(f *log.Fields) func(e *Extractor)
- type Drop
- type DropMeta
- type Error
- type Extractor
- func (e *Extractor) AddDrop(src *url.URL)
- func (e *Extractor) AddError(err error)
- func (e *Extractor) AddProcessors(p ...Processor)
- func (e *Extractor) Client() *http.Client
- func (e *Extractor) Drop() *Drop
- func (e *Extractor) Drops() []*Drop
- func (e *Extractor) Errors() Error
- func (e *Extractor) GetLogger() *log.Logger
- func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
- func (e *Extractor) ReplaceDrop(src *url.URL) error
- func (e *Extractor) Run()
- type Picture
- type ProcessList
- type ProcessMessage
- func (m *ProcessMessage) Cancel(reason string, args ...interface{})
- func (m *ProcessMessage) Position() int
- func (m *ProcessMessage) ResetContent()
- func (m *ProcessMessage) ResetPosition()
- func (m *ProcessMessage) SetValue(name string, value interface{})
- func (m *ProcessMessage) Step() ProcessStep
- func (m *ProcessMessage) Value(name string) interface{}
- type ProcessStep
- type Processor
- type Transport
- type URLList
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func NewRemoteImage ¶
NewRemoteImage loads an image and returns a new img.Image instance.
func SetDeniedIPs ¶
SetDeniedIPs sets a list of IP addresses or CIDR ranges that the extraction client is not allowed to reach.
func SetLogFields ¶
SetLogFields sets the default log fields for the extractor.
Types ¶
type Drop ¶
type Drop struct {
	URL          *url.URL
	Domain       string
	ContentType  string
	Charset      string
	DocumentType string
	Title        string
	Description  string
	Authors      []string
	Site         string
	Lang         string
	Date         time.Time
	Header       http.Header
	Meta         DropMeta
	Body         []byte `json:"-"`
	Pictures     map[string]*Picture
}
Drop is the result of a content extraction of one resource.
func (*Drop) AddAuthors ¶
AddAuthors adds authors to the author list, ignoring potential duplicates.
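The duplicate-skipping behavior can be sketched as follows. This is not the package's implementation: the drop type is a reduced stand-in, the variadic signature is assumed (the real one is not shown here), and duplicates are detected by exact string equality, which the real method may or may not do.

```go
package main

import "fmt"

// drop is a hypothetical stand-in for Drop; only Authors is reproduced.
type drop struct {
	Authors []string
}

// addAuthors appends each author that is not already in the list.
// Assumption: duplicates are detected by exact string equality.
func (d *drop) addAuthors(authors ...string) {
	for _, a := range authors {
		found := false
		for _, existing := range d.Authors {
			if existing == a {
				found = true
				break
			}
		}
		if !found {
			d.Authors = append(d.Authors, a)
		}
	}
}

func main() {
	d := &drop{}
	d.addAuthors("Ada", "Grace")
	d.addAuthors("Ada", "Linus")
	fmt.Println(d.Authors) // [Ada Grace Linus]
}
```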
func (*Drop) UnescapedURL ¶
UnescapedURL returns the Drop's URL unescaped, for storage.
type DropMeta ¶
DropMeta is a map of lists of strings that contains the collected metadata.
type Error ¶
type Error []error
Error holds all the non-fatal errors that were caught during extraction.
type Extractor ¶
type Extractor struct {
	URL       *url.URL
	HTML      []byte
	Text      string
	Visited   URLList
	Logs      []string
	Context   context.Context
	LogFields *log.Fields
	// contains filtered or unexported fields
}
Extractor is a page extractor.
func (*Extractor) AddProcessors ¶
AddProcessors adds one or more extraction processors to the list.
func (*Extractor) GetLogger ¶
GetLogger returns a logger for the extractor. This standard logger copies everything it logs to the extractor's Logs slice.
func (*Extractor) NewProcessMessage ¶
func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
NewProcessMessage returns a new ProcessMessage for a given step.
func (*Extractor) ReplaceDrop ¶
ReplaceDrop replaces the main Drop with a new one.
type Picture ¶
type Picture struct {
	Href string
	Type string
	Size [2]int
	// contains filtered or unexported fields
}
Picture is a remote picture
func NewPicture ¶
NewPicture returns a new Picture instance from a given URL and its base.
type ProcessList ¶
type ProcessList []Processor
ProcessList holds the processes that will be applied
type ProcessMessage ¶
type ProcessMessage struct {
	Context   context.Context
	Extractor *Extractor
	Log       *log.Entry
	Dom       *html.Node
	// contains filtered or unexported fields
}
ProcessMessage holds the message that is passed to, and may be changed by, each successive process.
func (*ProcessMessage) Cancel ¶
func (m *ProcessMessage) Cancel(reason string, args ...interface{})
Cancel fully cancels the extraction process.
func (*ProcessMessage) Position ¶
func (m *ProcessMessage) Position() int
Position returns the current process position
func (*ProcessMessage) ResetContent ¶
func (m *ProcessMessage) ResetContent()
ResetContent empties the message Dom and the body of every Drop.
func (*ProcessMessage) ResetPosition ¶
func (m *ProcessMessage) ResetPosition()
ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).
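The reset counter described above can be sketched roughly as follows. Everything here is illustrative: message is a reduced stand-in for ProcessMessage, its field names are invented, and the limit of 3 is an arbitrary value since the real maxReset is not given in this section.

```go
package main

import "fmt"

// maxReset is an illustrative limit; the package's real value is not shown.
const maxReset = 3

// message is a hypothetical stand-in for ProcessMessage.
type message struct {
	position   int
	resetCount int
	canceled   bool
}

// resetPosition rewinds the message so processing starts over, and
// cancels once more than maxReset resets have been requested.
func (m *message) resetPosition() {
	m.resetCount++
	if m.resetCount > maxReset {
		m.canceled = true
		return
	}
	m.position = 0
}

func main() {
	m := &message{position: 7}
	for i := 1; i <= maxReset+1; i++ {
		m.resetPosition()
		fmt.Printf("reset %d: canceled=%v\n", i, m.canceled)
	}
}
```

Bounding the number of resets guards against a resource that keeps redirecting the extraction to a new URL indefinitely.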
func (*ProcessMessage) SetValue ¶
func (m *ProcessMessage) SetValue(name string, value interface{})
SetValue sets a new message value.
func (*ProcessMessage) Step ¶
func (m *ProcessMessage) Step() ProcessStep
Step returns the current process step
func (*ProcessMessage) Value ¶
func (m *ProcessMessage) Value(name string) interface{}
Value returns a stored message value.
type ProcessStep ¶
type ProcessStep int
ProcessStep defines a type of process applied during extraction
const (
	// StepStart happens before the connection is made.
	StepStart ProcessStep = iota + 1
	// StepBody happens after receiving the resource body.
	StepBody
	// StepDom happens after parsing the resource DOM tree.
	StepDom
	// StepFinish happens at the very end of the extraction.
	StepFinish
	// StepPostProcess happens after looping over each Drop.
	StepPostProcess
)
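Because the constants start at iota + 1, the steps take the values 1 through 5 and the zero value of ProcessStep does not correspond to any step. The self-contained sketch below redeclares the type and constants locally to show this; the stepName helper is hypothetical and only exists for the illustration.

```go
package main

import "fmt"

// Local copy of the declarations above, so the sketch compiles on its own.
type ProcessStep int

const (
	StepStart ProcessStep = iota + 1
	StepBody
	StepDom
	StepFinish
	StepPostProcess
)

// stepName is a hypothetical helper that gives each step a printable name.
func stepName(s ProcessStep) string {
	switch s {
	case StepStart:
		return "start"
	case StepBody:
		return "body"
	case StepDom:
		return "dom"
	case StepFinish:
		return "finish"
	case StepPostProcess:
		return "post-process"
	default:
		return "unknown"
	}
}

func main() {
	for s := StepStart; s <= StepPostProcess; s++ {
		fmt.Printf("%d %s\n", int(s), stepName(s))
	}
	var zero ProcessStep
	fmt.Println(stepName(zero)) // unknown
}
```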
type Processor ¶
type Processor func(*ProcessMessage, Processor) Processor
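Processor is the function type of an extraction process: it takes the current ProcessMessage and a Processor, and returns a Processor.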
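One plausible way to drive a list of functions with this shape is sketched below. It rests on several assumptions that this section does not confirm: that the second argument is the next processor in the list, that the runner ignores the returned value, and that a cancel flag on the message stops the loop. The message and processor types are reduced local stand-ins, not the package's own.

```go
package main

import "fmt"

// message is a hypothetical, reduced stand-in for ProcessMessage.
type message struct {
	trace    []string
	canceled bool
}

// processor mirrors the shape of Processor.
type processor func(*message, processor) processor

// run calls each processor in order, handing it the following one.
// Assumption: the return value is ignored and only cancellation stops the loop.
func run(m *message, list []processor) {
	for i := range list {
		var next processor
		if i+1 < len(list) {
			next = list[i+1]
		}
		_ = list[i](m, next)
		if m.canceled {
			return
		}
	}
}

// logStep returns a processor that records its name and passes control on.
func logStep(name string) processor {
	return func(m *message, next processor) processor {
		m.trace = append(m.trace, name)
		return next
	}
}

// cancelStep marks the message as canceled so later processors are skipped.
func cancelStep(m *message, next processor) processor {
	m.canceled = true
	return nil
}

func main() {
	m := &message{}
	run(m, []processor{logStep("a"), cancelStep, logStep("b")})
	fmt.Println(m.trace, m.canceled) // [a] true
}
```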
type Transport ¶
type Transport struct {
// contains filtered or unexported fields
}
Transport is a wrapper around http.RoundTripper that lets you set default headers sent with every request.