warc

Version: v0.8.38
Warning: This package is not in the latest version of its module.
Published: Apr 25, 2024 License: CC0-1.0 Imports: 30 Imported by: 0

README

warc

WARNING: This project is no longer a work in progress and generates valid WARCs when used correctly, but it still needs to be integrated and tested carefully.

Introduction

warc provides methods for reading and writing WARC files in Go. This module is based on nlevitt's WARC module.

Install

go get github.com/CorentinB/warc

License

warc is released under CC0 license. You can find a copy of the CC0 License in the LICENSE file.

Documentation

Constants

const MaxInMemorySize = 1000000

MaxInMemorySize is the maximum number of bytes (currently 1,000,000, i.e. 1 MB) to hold in memory before starting to write to disk.

Variables

var (
	IPv6 *availableIPs
	IPv4 *availableIPs
)
var (

	// Create a counter to keep track of the number of bytes written to WARC files
	DataTotal *ratecounter.Counter
)

Functions

func GenerateWarcFileName added in v0.8.26

func GenerateWarcFileName(prefix string, compression string, atomicSerial *int64) (fileName string)

GenerateWarcFileName generates a WARC file name following the naming recommendation of the WARC specification: Prefix-Timestamp-Serial-Crawlhost.warc.gz

func GetSHA1

func GetSHA1(r io.Reader) string

func GetSHA256 added in v0.8.37

func GetSHA256(r io.Reader) string

func GetSHA256Base16 added in v0.8.37

func GetSHA256Base16(r io.Reader) string

Types

type CustomHTTPClient added in v0.7.0

type CustomHTTPClient struct {
	http.Client
	WARCWriter             chan *RecordBatch
	WARCWriterDoneChannels []chan bool
	WaitGroup              *WaitGroupWithCount

	ErrChan chan *Error

	TempDir               string
	FullOnDisk            bool
	MaxReadBeforeTruncate int
	DataTotal             *ratecounter.Counter
	// contains filtered or unexported fields
}

func NewWARCWritingHTTPClient added in v0.7.0

func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)

func (*CustomHTTPClient) Close added in v0.7.0

func (c *CustomHTTPClient) Close() error

func (*CustomHTTPClient) WriteMetadataRecord added in v0.8.36

func (c *CustomHTTPClient) WriteMetadataRecord(WARCTargetURI, contentType, payload string)

type DedupeOptions added in v0.8.0

type DedupeOptions struct {
	LocalDedupe   bool
	CDXDedupe     bool
	CDXURL        string
	CDXCookie     string
	SizeThreshold int
}

type Error added in v0.8.29

type Error struct {
	Err  error
	Func string
}

type HTTPClientSettings added in v0.8.14

type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	DedupeOptions         DedupeOptions
	Proxy                 string
	DecompressBody        bool
	SkipHTTPStatusCodes   []int
	VerifyCerts           bool
	TempDir               string
	FullOnDisk            bool
	MaxReadBeforeTruncate int
	FollowRedirects       bool
	TCPTimeout            time.Duration
	TLSHandshakeTimeout   time.Duration
	RandomLocalIP         bool
}

type Header

type Header map[string]string

Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.

func NewHeader

func NewHeader() Header

NewHeader creates a new WARC header.

func (Header) Del

func (h Header) Del(key string)

Del deletes the value associated with key.

func (Header) Get

func (h Header) Get(key string) string

Get returns the value associated with the given key. If there is no value associated with the key, Get returns "".

func (Header) Set

func (h Header) Set(key, value string)

Set sets the header field associated with key to value.
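
To illustrate what case-insensitive header access means in practice, here is a minimal sketch of a map-backed header type. It is not the package's implementation: the lowercase header type and the choice to normalize keys with strings.ToLower are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// header is a hypothetical stand-in for a case-insensitive field map.
// Keys are normalized to lower case on every access (an assumption).
type header map[string]string

// Set stores value under the normalized key.
func (h header) Set(key, value string) { h[strings.ToLower(key)] = value }

// Get returns the value for key, or "" if the key is absent.
func (h header) Get(key string) string { return h[strings.ToLower(key)] }

// Del removes the value stored under key, if any.
func (h header) Del(key string) { delete(h, strings.ToLower(key)) }

func main() {
	h := header{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type")) // prints "response"

	h.Del("WARC-TYPE")
	fmt.Printf("%q\n", h.Get("WARC-Type")) // prints ""
}
```

Normalizing at the boundary keeps the map itself simple: any spelling of a field name reaches the same entry.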

type ReadSeekCloser added in v0.8.9

type ReadSeekCloser interface {
	io.Reader
	io.Seeker
	ReaderAt
	io.Closer
	FileName() string
}

ReadSeekCloser is an io.Reader + io.Seeker + ReaderAt + io.Closer, plus a FileName method.

type ReadWriteSeekCloser added in v0.8.9

type ReadWriteSeekCloser interface {
	ReadSeekCloser
	io.Writer
}

ReadWriteSeekCloser is a ReadSeekCloser (io.Reader + io.Seeker + ReaderAt + io.Closer + FileName) that is also an io.Writer.

func NewSpooledTempFile added in v0.8.9

func NewSpooledTempFile(filePrefix string, tempDir string, fullOnDisk bool) ReadWriteSeekCloser

NewSpooledTempFile returns a ReadWriteSeekCloser with one important constraint: you can Write into it, but once you have called Read or Seek on it, further Write calls are forbidden and return an error.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader stores the bufio.Reader and gzip.Reader for a WARC file.

func NewReader

func NewReader(reader io.Reader) (*Reader, error)

NewReader returns a new WARC reader

func (*Reader) Close

func (r *Reader) Close()

Close closes the reader.

func (*Reader) ReadRecord

func (r *Reader) ReadRecord() (*Record, error)

ReadRecord reads the next record from the opened WARC file

type ReaderAt added in v0.8.9

type ReaderAt interface {
	ReadAt(p []byte, off int64) (n int, err error)
}

ReaderAt is the interface for ReadAt - read at position, without moving pointer.

type Record

type Record struct {
	Header  Header
	Content ReadWriteSeekCloser
}

Record represents a WARC record.

func NewRecord

func NewRecord(tempDir string, fullOnDisk bool) *Record

NewRecord creates a new WARC record.

type RecordBatch

type RecordBatch struct {
	Records     []*Record
	Done        chan bool
	CaptureTime string
}

RecordBatch is a structure that contains a set of records to be written at the same time, together with a common capture timestamp.

func NewRecordBatch

func NewRecordBatch() *RecordBatch

NewRecordBatch creates a record batch and initializes its capture time.

type RotatorSettings

type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// WarcSize is in MegaBytes
	WarcSize float64
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
}

RotatorSettings stores the settings needed by recordWriter to write WARC files.

func NewRotatorSettings

func NewRotatorSettings() *RotatorSettings

NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.

func (*RotatorSettings) NewWARCRotator

func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)

NewWARCRotator creates and returns a channel used to send record batches to the recordWriter function, which runs in a goroutine and writes them to WARC files.

type WaitGroupWithCount added in v0.8.18

type WaitGroupWithCount struct {
	sync.WaitGroup
	// contains filtered or unexported fields
}

func (*WaitGroupWithCount) Add added in v0.8.18

func (wg *WaitGroupWithCount) Add(delta int)

func (*WaitGroupWithCount) Done added in v0.8.18

func (wg *WaitGroupWithCount) Done()

func (*WaitGroupWithCount) Size added in v0.8.18

func (wg *WaitGroupWithCount) Size() int
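
WaitGroupWithCount is listed without a description. Judging only from its method set, it appears to be a sync.WaitGroup that also exposes how many units of work are outstanding. A minimal sketch of that idea, using a hypothetical countingWaitGroup type and an atomic counter (both assumptions, since the real fields are unexported), might look like this:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingWaitGroup is a hypothetical WaitGroup that tracks its own size.
type countingWaitGroup struct {
	sync.WaitGroup
	count int64 // number of outstanding units of work
}

// Add adjusts the counter, then delegates to the embedded WaitGroup.
func (wg *countingWaitGroup) Add(delta int) {
	atomic.AddInt64(&wg.count, int64(delta))
	wg.WaitGroup.Add(delta)
}

// Done decrements the counter, then delegates to the embedded WaitGroup.
func (wg *countingWaitGroup) Done() {
	atomic.AddInt64(&wg.count, -1)
	wg.WaitGroup.Done()
}

// Size reports how many units of work have been added but not yet done.
func (wg *countingWaitGroup) Size() int {
	return int(atomic.LoadInt64(&wg.count))
}

func main() {
	var wg countingWaitGroup
	wg.Add(3)
	fmt.Println(wg.Size()) // prints 3

	for i := 0; i < 3; i++ {
		go wg.Done()
	}
	wg.Wait()
	fmt.Println(wg.Size()) // prints 0
}
```

Decrementing the counter before calling the embedded Done ensures that Size already reads 0 by the time Wait returns.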

type Writer

type Writer struct {
	FileName     string
	Compression  string
	GZIPWriter   *gzip.Writer
	PGZIPWriter  *pgzip.Writer
	ZSTDWriter   *zstd.Encoder
	FileWriter   *bufio.Writer
	ParallelGZIP bool
}

Writer writes WARC records to WARC files.

func NewWriter

func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string) (*Writer, error)

NewWriter creates a new WARC writer.

func (*Writer) CloseCompressedWriter added in v0.8.20

func (w *Writer) CloseCompressedWriter()

func (*Writer) WriteInfoRecord

func (w *Writer) WriteInfoRecord(payload map[string]string) (recordID string, err error)

WriteInfoRecord writes an information record to the WARC file.

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(r *Record) (recordID string, err error)

WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header, a blank line, the record content block, and two trailing CRLF sequences:

Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content
CRLF
CRLF
