warc

Version: v0.5.10-0...-cec16a9
Published: Apr 12, 2021 License: CC0-1.0 Imports: 15 Imported by: 0

README

WARNING: This project is still a WIP. It is NOT ready to be used in any project.

Introduction

warc provides methods for reading and writing WARC files in Go. This module is based on nlevitt's WARC module.

Install

go get github.com/CorentinB/warc

License

warc is released under the CC0 license. You can find a copy of the license in the LICENSE file.

Documentation

Overview

Package warc provides methods for reading and writing WARC files (https://iipc.github.io/warc-specifications/) in Go. This module is based on nlevitt's WARC module (https://github.com/nlevitt/warc).

Functions

func GetSHA1

func GetSHA1(content []byte) string

GetSHA1 returns the SHA1 digest of a []byte. It can be used to fill the WARC-Block-Digest header.

func GetSHA1FromFile

func GetSHA1FromFile(path string) (string, error)

GetSHA1FromFile returns the SHA1 digest of a file. It can be used to fill the WARC-Block-Digest header.
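
The following is a minimal sketch of how such digests can be computed with the standard library alone. The names hexSHA1 and hexSHA1FromFile are hypothetical stand-ins, and the lowercase hex encoding is an assumption: this page does not state which encoding or prefix GetSHA1 and GetSHA1FromFile actually return.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// hexSHA1 is a hypothetical stand-in for GetSHA1: it hashes a byte slice
// and returns the digest as lowercase hex (the encoding is an assumption).
func hexSHA1(content []byte) string {
	sum := sha1.Sum(content)
	return hex.EncodeToString(sum[:])
}

// hexSHA1FromFile is a hypothetical stand-in for GetSHA1FromFile: it streams
// the file through the hash so a large payload is never fully held in memory.
func hexSHA1FromFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	fmt.Println(hexSHA1([]byte("hello")))
}
```

Streaming the file through the hash, rather than reading it whole, is the usual choice when payloads may be large.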

Types

type Header map[string]string

Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.

func NewHeader

func NewHeader() Header

NewHeader creates a new WARC header.

func (Header) Del

func (h Header) Del(key string)

Del deletes the value associated with key.

func (Header) Get

func (h Header) Get(key string) string

Get returns the value associated with the given key. If there is no value associated with the key, Get returns "".

func (Header) Set

func (h Header) Set(key, value string)

Set sets the header field associated with key to value.
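
As an illustration of how a map-backed type can offer case-insensitive methods, here is a small sketch. The type header is a hypothetical stand-in for Header, and normalizing keys with strings.ToLower is an assumption; the package may canonicalize keys differently.

```go
package main

import (
	"fmt"
	"strings"
)

// header is a hypothetical stand-in for warc.Header.
type header map[string]string

// Every method normalizes the key with strings.ToLower before touching the
// map. This normalization is an assumption, not the package's documented rule.
func (h header) Set(key, value string) { h[strings.ToLower(key)] = value }
func (h header) Get(key string) string { return h[strings.ToLower(key)] }
func (h header) Del(key string)        { delete(h, strings.ToLower(key)) }

func main() {
	h := header{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type")) // lookup ignores case

	h.Del("WARC-TYPE")
	fmt.Printf("%q\n", h.Get("WARC-Type")) // a missing key yields ""
}
```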

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader stores the bufio.Reader and gzip.Reader for a WARC file.

func NewReader

func NewReader(reader io.Reader) (*Reader, error)

NewReader returns a new WARC reader.

func (*Reader) Close

func (r *Reader) Close()

Close closes the reader.

func (*Reader) ReadRecord

func (r *Reader) ReadRecord(onDisk bool) (*Record, error)

ReadRecord reads the next record from the opened WARC file. If onDisk is true, the record's payload is written to a temporary file on disk and its path is stored in Record.PayloadPath; otherwise the payload is kept in memory.
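
The trade-off between the two modes can be sketched with the standard library. This is not the package's implementation: payload, storePayload, and cleanup are hypothetical names, and the temp file pattern is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// payload mirrors the Content and PayloadPath fields of Record.
// It is a hypothetical type, not the package's Record.
type payload struct {
	Content     io.Reader
	PayloadPath string
}

// storePayload keeps data in memory, or writes it to a temp file when
// onDisk is true and records the file's path instead.
func storePayload(data []byte, onDisk bool) (*payload, error) {
	if !onDisk {
		return &payload{Content: bytes.NewReader(data)}, nil
	}
	f, err := os.CreateTemp("", "warc-payload-*")
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		os.Remove(f.Name())
		return nil, err
	}
	return &payload{PayloadPath: f.Name()}, nil
}

// cleanup removes the temp file, if one was created.
func (p *payload) cleanup() {
	if p.PayloadPath != "" {
		os.Remove(p.PayloadPath)
	}
}

func main() {
	p, err := storePayload([]byte("body"), true)
	if err != nil {
		panic(err)
	}
	defer p.cleanup()
	fmt.Println("payload stored at", p.PayloadPath)
}
```

In this sketch the caller is responsible for removing the temp file; whether the package cleans up its own temp files is not stated on this page.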

type Record

type Record struct {
	Header      Header
	Content     io.Reader
	PayloadPath string
}

Record represents a WARC record.

func NewRecord

func NewRecord() *Record

NewRecord creates a new WARC record.

type RecordBatch

type RecordBatch struct {
	Records     []*Record
	Done        chan bool
	CaptureTime string
}

RecordBatch holds a set of records to be written at the same time, along with a common capture timestamp.

func NewRecordBatch

func NewRecordBatch() *RecordBatch

NewRecordBatch creates a record batch and initializes its capture time.

type RotatorSettings

type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames. The WARC 1.1 specification
	// recommends naming files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// WarcSize is in megabytes
	WarcSize float64
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
}

RotatorSettings stores the settings that recordWriter needs to write WARC files.
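
To make the Prefix-Timestamp-Serial-Crawlhost.warc.gz naming pattern concrete, here is a small sketch. The helper warcFileName is hypothetical, and the 14-digit UTC timestamp and zero-padded 5-digit serial are assumptions; this page does not say how the package formats either part.

```go
package main

import (
	"fmt"
	"time"
)

// warcFileName joins the four parts of the recommended layout.
// The 5-digit zero-padded serial is an assumption.
func warcFileName(prefix, timestamp string, serial int, host string) string {
	return fmt.Sprintf("%s-%s-%05d-%s.warc.gz", prefix, timestamp, serial, host)
}

func main() {
	// A 14-digit UTC timestamp (YYYYMMDDhhmmss) is assumed here.
	ts := time.Now().UTC().Format("20060102150405")
	fmt.Println(warcFileName("CRAWL", ts, 1, "crawler01"))
}
```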

func NewRotatorSettings

func NewRotatorSettings() *RotatorSettings

NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.

func (*RotatorSettings) NewWARCRotator

func (s *RotatorSettings) NewWARCRotator() (recordWriterChannel chan *RecordBatch, done chan bool, err error)

NewWARCRotator creates and returns a channel used to send record batches to the recordWriter function, which runs in a goroutine and writes them to WARC files.
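
The channel-and-goroutine pattern described here can be sketched in isolation. Everything below is hypothetical (batch, startWriter, the string records), and the shutdown protocol of closing the channel and then waiting on the second channel is an assumption about how such an API is typically used, not documented behavior of NewWARCRotator.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for RecordBatch.
type batch struct {
	Records []string
	Done    chan bool
}

// startWriter launches a goroutine that drains the batch channel. It returns
// that channel and a second one that fires once the writer has exited.
func startWriter(write func(string)) (chan *batch, chan bool) {
	in := make(chan *batch)
	finished := make(chan bool)
	go func() {
		for b := range in {
			for _, r := range b.Records {
				write(r)
			}
			b.Done <- true // tell the producer this batch was handled
		}
		finished <- true // the input channel was closed and drained
	}()
	return in, finished
}

func main() {
	var written []string
	in, finished := startWriter(func(r string) { written = append(written, r) })

	b := &batch{Records: []string{"request", "response"}, Done: make(chan bool)}
	in <- b
	<-b.Done // wait until the batch has been written

	close(in)
	<-finished
	fmt.Println(written)
}
```

Funnelling all writes through one goroutine keeps file access serialized, so producers never need their own locking.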

type Writer

type Writer struct {
	FileName    string
	Compression string
	GZIPWriter  *gzip.Writer
	ZSTDWriter  *zstd.Encoder
	FileWriter  *bufio.Writer
}

Writer writes WARC records to WARC files.

func NewWriter

func NewWriter(writer io.Writer, fileName string, compression string) (*Writer, error)

NewWriter creates a new WARC writer.

func (*Writer) WriteInfoRecord

func (w *Writer) WriteInfoRecord(payload map[string]string) (recordID string, err error)

WriteInfoRecord writes a warcinfo record to the WARC file.

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(r *Record) (recordID string, err error)

WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header, a blank line, the content block, and two trailing CRLFs:

Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content
CRLF
CRLF
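
The record layout above can be reproduced with a short sketch. The function serialize is hypothetical and is not the package's WriteRecord; the version string "WARC/1.1" and the sorted header order are assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// serialize renders one record: version line, header lines, blank line,
// content, then two CRLFs. Keys are sorted only for deterministic output.
func serialize(version string, header map[string]string, content []byte) []byte {
	keys := make([]string, 0, len(header))
	for k := range header {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var buf bytes.Buffer
	buf.WriteString(version + "\r\n")
	for _, k := range keys {
		fmt.Fprintf(&buf, "%s: %s\r\n", k, header[k])
	}
	buf.WriteString("\r\n") // blank line ends the header
	buf.Write(content)
	buf.WriteString("\r\n\r\n") // two CRLFs end the record
	return buf.Bytes()
}

func main() {
	out := serialize("WARC/1.1", map[string]string{
		"WARC-Type":      "resource",
		"Content-Length": "5",
	}, []byte("hello"))
	fmt.Printf("%q\n", out)
}
```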

Directories

Path Synopsis
cmd
