archiver

Version: v0.0.0-...-60192f8

This package is not in the latest version of its module.

Published: Apr 26, 2024 | License: AGPL-3.0 | Imports: 20 | Imported by: 0

README

This package is a fork of Obelisk.

It adds some needed features such as:

  • A dedicated flag to fetch only images rather than all media
  • Callbacks for URL and content processing
  • Ability to use your own HTTP client
  • Ability to use any logger

Obelisk was originally written by RadhiFadlillah and released under the MIT License.

Documentation

Overview

Package archiver provides functions to archive the content of a full HTML page.

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultImageProcessor

func DefaultImageProcessor(_ context.Context, _ *Archiver,
	input io.Reader, contentType string, _ *url.URL,
) ([]byte, string, error)

DefaultImageProcessor is the default image processor. It simply reads and returns the content.

func DefaultURLProcessor

func DefaultURLProcessor(_ string, content []byte, contentType string) string

DefaultURLProcessor is the default URL processor. It returns the base64-encoded URL.
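The exact output format of the default processor is not spelled out here. As an illustration of what a function with this signature can do, the sketch below inlines the content as a base64 `data:` URI. `dataURIProcessor` is a hypothetical name, and the fallback content type is an assumption of this example.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURIProcessor is a hypothetical processor with the same signature as
// DefaultURLProcessor. It ignores the original URI and returns the content
// as a base64 data URI.
func dataURIProcessor(_ string, content []byte, contentType string) string {
	if contentType == "" {
		// Assumption: fall back to a generic binary type.
		contentType = "application/octet-stream"
	}
	return "data:" + contentType + ";base64," +
		base64.StdEncoding.EncodeToString(content)
}

func main() {
	out := dataURIProcessor("https://example.org/a.txt", []byte("hi"), "text/plain")
	fmt.Println(out) // data:text/plain;base64,aGk=
}
```

A processor could instead return a relative file name and store the content elsewhere; the signature only requires that a string come back.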

func GetContextNode

func GetContextNode(ctx context.Context) (node *html.Node, ok bool)

GetContextNode returns the html node stored in the context.

Types

type ArchiveFlag

type ArchiveFlag uint8

ArchiveFlag is an archiver feature to enable.

const (
	// EnableCSS enables extraction of CSS files and tags.
	EnableCSS ArchiveFlag = 1 << iota

	// EnableEmbeds enables extraction of embedded content.
	EnableEmbeds

	// EnableJS enables extraction of JavaScript contents.
	EnableJS

	// EnableMedia enables extraction of media content
	// other than images.
	EnableMedia

	// EnableImages enables extraction of images.
	EnableImages
)
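Because each constant is a distinct bit, flags can be combined with `|` and tested with `&`. The sketch below redeclares the type and constants locally so it runs on its own; `has` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// ArchiveFlag and its constants mirror the declarations documented above;
// they are redeclared here only so the sketch compiles by itself.
type ArchiveFlag uint8

const (
	EnableCSS ArchiveFlag = 1 << iota
	EnableEmbeds
	EnableJS
	EnableMedia
	EnableImages
)

// has is a hypothetical helper that reports whether every bit of want
// is set in flags.
func has(flags, want ArchiveFlag) bool {
	return flags&want == want
}

func main() {
	// Fetch stylesheets and images, but no scripts, embeds or other media.
	flags := EnableCSS | EnableImages

	fmt.Println(has(flags, EnableImages)) // true
	fmt.Println(has(flags, EnableJS))     // false
	fmt.Printf("%05b\n", flags)           // 10001
}
```

A value built this way would presumably be assigned to the `Flags` field of an `Archiver`.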

type Archiver

type Archiver struct {
	sync.RWMutex

	Cache   map[string]Asset
	Request *Request
	Result  []byte

	Flags ArchiveFlag

	ImageProcessor imageProcessor
	URLProcessor   urlProcessor
	EventHandler   eventHandler

	RequestTimeout        time.Duration
	SkipTLSVerification   bool
	MaxConcurrentDownload int64
	// contains filtered or unexported fields
}

Archiver is the core of obelisk. It downloads a web page and then embeds its assets.

func New

func New(req *Request) (*Archiver, error)

New creates a new Archiver using a Request instance.

func (*Archiver) Archive

func (arc *Archiver) Archive(ctx context.Context) error

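Archive starts the archival process for the specified request. It returns an error if the process fails. Although the original comment mentions returning the archival result and content type, the signature returns only an error; the output is presumably made available through the Archiver's Result field.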

func (*Archiver) SendEvent

func (arc *Archiver) SendEvent(ctx context.Context, event Event)

SendEvent is the function used to send an archiver event.

type Asset

type Asset struct {
	Data        []byte
	ContentType string
}

Asset is an asset that is used in a web page.

type Event

type Event interface {
	Fields() map[string]interface{}
}

Event is the interface for events emitted by the archiver.
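Any type with a `Fields` method satisfies this interface, which is what makes it easy to forward events to any structured logger. The sketch below is illustrative only: the interface is redeclared locally, and `eventSaved` and `formatEvent` are hypothetical names. The signature of the package's unexported `eventHandler` type is not shown on this page, so the handler here is an assumption.

```go
package main

import "fmt"

// Event mirrors the interface documented above.
type Event interface {
	Fields() map[string]interface{}
}

// eventSaved is a hypothetical custom event.
type eventSaved struct {
	URI  string
	Size int
}

// Fields exposes the event data as key/value pairs.
func (e eventSaved) Fields() map[string]interface{} {
	return map[string]interface{}{"uri": e.URI, "size": e.Size}
}

// formatEvent is a hypothetical consumer that renders two known fields.
// A real handler would typically pass the whole map to a logger.
func formatEvent(e Event) string {
	f := e.Fields()
	return fmt.Sprintf("uri=%v size=%v", f["uri"], f["size"])
}

func main() {
	ev := eventSaved{URI: "https://example.org/logo.png", Size: 2048}
	fmt.Println(formatEvent(ev)) // uri=https://example.org/logo.png size=2048
}
```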

type EventError

type EventError struct {
	Err error
	URI string
}

EventError is the event emitted when errors occur.

func (*EventError) Fields

func (e *EventError) Fields() map[string]interface{}

Fields returns the field map.

type EventFetchURL

type EventFetchURL struct {
	// contains filtered or unexported fields
}

EventFetchURL is the event emitted when the archiver loads a remote resource.

func (*EventFetchURL) Fields

func (e *EventFetchURL) Fields() map[string]interface{}

Fields returns the field map.

type EventInfo

type EventInfo map[string]interface{}

EventInfo is a simple event for any type of data.

func (EventInfo) Fields

func (e EventInfo) Fields() map[string]interface{}

Fields returns the field map.

type EventStartHTML

type EventStartHTML string

EventStartHTML is the event emitted at the beginning of the archiving process.

func (EventStartHTML) Fields

func (e EventStartHTML) Fields() map[string]interface{}

Fields returns the field map.

type Request

type Request struct {
	Input  io.Reader
	URL    *url.URL
	Client *http.Client
}

Request is the data of an archival request.
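The `Client` field is where a caller can supply their own HTTP client. The sketch below shows one way the fields might be filled; it redeclares `Request` locally so it runs on its own, and `newRequest` is a hypothetical helper. The meaning of `Input` is not documented on this page; the example assumes it carries HTML that was already fetched for `URL`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// Request mirrors the struct documented above; it is redeclared here
// only so the sketch compiles by itself.
type Request struct {
	Input  io.Reader
	URL    *url.URL
	Client *http.Client
}

// newRequest is a hypothetical helper that parses the URL and attaches
// an HTTP client with the given timeout.
func newRequest(rawURL string, input io.Reader, timeout time.Duration) (*Request, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, fmt.Errorf("parse %q: %w", rawURL, err)
	}
	return &Request{
		Input:  input,
		URL:    u,
		Client: &http.Client{Timeout: timeout},
	}, nil
}

func main() {
	// Assumption: Input holds the page's HTML, fetched beforehand.
	page := strings.NewReader(`<html><body><img src="/logo.png"></body></html>`)

	req, err := newRequest("https://example.org/article", page, 10*time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.Host, req.Client.Timeout) // example.org 10s
}
```

With the real package, a value like this would be passed to `New`, and `Archive` would then be called on the returned `Archiver`.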
