webarchive

v1.0.3 Latest
Published: Apr 3, 2023 License: Apache-2.0 Imports: 11 Imported by: 3

README

A reader for the WARC and ARC web archive formats.

Note: This package was written for use in https://github.com/richardlehane/siegfried and has a number of quirks relating to that use case. If you're after a general-purpose Go WARC package, you might be better served by a different package.

Example usage:

f, err := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
if err != nil {
  log.Fatal(err)
}
// NewReader(io.Reader) can be used to read WARC, ARC or gzipped WARC or ARC files
rdr, err := webarchive.NewReader(f)
if err != nil {
  log.Fatal(err)
}
// use Next() to iterate through all records in the WARC or ARC file
for record, err := rdr.Next(); err == nil; record, err = rdr.Next() {
  // records implement the io.Reader interface
  i, err := io.Copy(io.Discard, record)
  if err != nil {
    log.Fatal(err)
  }
  fmt.Printf("Read: %d bytes\n", i)
  // records also have URL(), MIME(), Date() and Size() methods
  fmt.Printf("URL: %s, MIME: %s, Date: %v, Size: %d\n", 
    record.URL(), record.MIME(), record.Date(), record.Size())
  // the Fields() method returns all the fields in the WARC or ARC record
  for key, values := range record.Fields() {
    fmt.Printf("Field key: %s, Field values: %v\n", key, values)
  }
}
f.Close()
f, err = os.Open("examples/IAH-20080430204825-00000-blackbook.warc.gz")
if err != nil {
  log.Fatal(err)
}
defer f.Close()
// readers can Reset() to reuse the underlying buffers
if err = rdr.Reset(f); err != nil {
  log.Fatal(err)
}
// the Close() method should be used if you pass in gzipped files, it is a nop for 
// non-gzipped files
defer rdr.Close()
// NextPayload() skips records that are not resource, conversion or response
// records and merges continuations into single records. It also strips HTTP
// headers from response records. After stripping, those HTTP headers are
// available alongside the WARC headers in the record.Fields() map.
for record, err := rdr.NextPayload(); err == nil; record, err = rdr.NextPayload() {
  // webarchive.DecodePayload(record) decodes any encodings (transfer or 
  // content) declared in a record's HTTP header.
  // webarchive.DecodePayloadT(record) just decodes transfer encodings.
  // Both decode chunked, deflate and gzip encodings.
  record = webarchive.DecodePayload(record)
  i, err := io.Copy(io.Discard, record)
  if err != nil {
    log.Fatal(err)
  }
  fmt.Printf("Read: %d bytes\n", i)
  // any skipped HTTP headers can be retrieved from the Fields() map
  for key, values := range record.Fields() {
    fmt.Printf("Field key: %s, Field values: %v\n", key, values)
  }
}

Install with go get github.com/richardlehane/webarchive

Documentation

Constants

const ARCTime = "20060102150405"

ARCTime is a time format string for the ARC time format
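
As a quick illustration, the layout can be handed straight to time.Parse. The sketch below redeclares the layout locally as arcTime (a stand-in for webarchive.ARCTime) so that it runs with the standard library alone; the helper name parseARCTime is hypothetical and not part of the package.

```go
package main

import (
	"fmt"
	"time"
)

// arcTime mirrors the documented webarchive.ARCTime layout so this
// sketch does not need to import the package.
const arcTime = "20060102150405"

// parseARCTime is a hypothetical helper that parses a 14-digit ARC
// timestamp (YYYYMMDDhhmmss). With no zone in the layout, time.Parse
// returns the time in UTC.
func parseARCTime(s string) (time.Time, error) {
	return time.Parse(arcTime, s)
}

func main() {
	ts, err := parseARCTime("20080430204825")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(ts.Format(time.RFC3339)) // 2008-04-30T20:48:25Z
}
```

The same layout works in the other direction: someTime.Format(arcTime) produces a timestamp in ARC form.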

Variables

var (
	ErrReset         = errors.New("webarchive: attempted reset on nil MultiReader, use NewReader() first")
	ErrNotWebarchive = errors.New("webarchive: not a valid ARC or WARC file")
	ErrVersionBlock  = errors.New("webarchive: invalid ARC version block")
	ErrARCHeader     = errors.New("webarchive: invalid ARC header")
	ErrNotSlicer     = errors.New("webarchive: underlying reader must be a slicer to expose Slice and EOFSlice methods")
	ErrWARCHeader    = errors.New("webarchive: invalid WARC header")
	ErrWARCRecord    = errors.New("webarchive: error parsing WARC record")
	ErrDiscard       = errors.New("webarchive: failed to do full read during discard")
)

Types

type ARC

type ARC struct {
	FileDesc   string    // Original pathname of the archive file
	Address    string    // IP address of machine that created the archive file
	FileDate   time.Time // Date the archive file was created
	Version    int       // ARC version (1 or 2) - this will affect the fields available in the Fields() map
	OriginCode string    // Name of gathering organization
}

An ARC struct represents the version block at the start of an ARC file. It provides information about the ARC file as a whole, such as its version, the file path of the archive file, and the date the archive file was created.
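
To make the mapping from version block to struct fields concrete, here is a minimal sketch of parsing the first two lines of a version block. It assumes the ARC version 1 layout (a filedesc line of path, IP address, date, content type and length, followed by a line of version, reserved field and origin code). The versionBlock type and parseVersionBlock function are hypothetical stand-ins, the sample input is made up, and none of this is the package's own parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// versionBlock is a local stand-in for the documented ARC struct.
type versionBlock struct {
	FileDesc   string
	Address    string
	FileDate   time.Time
	Version    int
	OriginCode string
}

// parseVersionBlock reads the first two lines of an ARC version block.
// The field order is an assumption based on the ARC version 1 layout.
func parseVersionBlock(block string) (versionBlock, error) {
	var vb versionBlock
	lines := strings.Split(strings.TrimSpace(block), "\n")
	if len(lines) < 2 {
		return vb, fmt.Errorf("version block needs at least two lines, got %d", len(lines))
	}
	first := strings.Fields(lines[0])
	if len(first) < 3 || !strings.HasPrefix(first[0], "filedesc://") {
		return vb, fmt.Errorf("invalid filedesc line: %q", lines[0])
	}
	date, err := time.Parse("20060102150405", first[2])
	if err != nil {
		return vb, fmt.Errorf("invalid archive date: %w", err)
	}
	second := strings.Fields(lines[1])
	if len(second) < 3 {
		return vb, fmt.Errorf("invalid version line: %q", lines[1])
	}
	version, err := strconv.Atoi(second[0])
	if err != nil {
		return vb, fmt.Errorf("invalid version number: %w", err)
	}
	vb = versionBlock{
		FileDesc:   first[0],
		Address:    first[1],
		FileDate:   date,
		Version:    version,
		OriginCode: second[2],
	}
	return vb, nil
}

func main() {
	// Made-up sample data, not taken from a real archive.
	sample := "filedesc://example.arc 0.0.0.0 20080430204825 text/plain 76\n" +
		"1 0 ExampleOrg\n" +
		"URL IP-address Archive-date Content-type Archive-length\n"
	vb, err := parseVersionBlock(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s v%d from %s (%s)\n", vb.FileDesc, vb.Version, vb.OriginCode, vb.FileDate.Format(time.RFC3339))
}
```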

type ARCReader

type ARCReader struct {
	*ARC
	// contains filtered or unexported fields
}

ARCReader is the ARC implementation of a webarchive Reader

func NewARCReader

func NewARCReader(r io.Reader) (*ARCReader, error)

NewARCReader creates a new ARC reader from the supplied io.Reader. Use instead of NewReader if you are only working with ARC files.

Example
f, err := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
if errors.Is(err, os.ErrNotExist) {
	fmt.Print("text/dns\nfiledesc://IAH-20080430204825-00000-blackbook.arc\n20080430204825\nwww.archive.org.	589	IN	A	207.241.229.39\n298")
	return
}
rdr, err := NewARCReader(f)
if err != nil {
	log.Fatal("failure creating an arc reader")
}
rec, err := rdr.NextPayload()
if err != nil {
	log.Fatal("failure seeking")
}
buf := make([]byte, 56)
io.ReadFull(rec, buf)
var count int
arec, ok := rec.(ARCRecord)
if !ok {
	log.Fatal("failure doing ARCRecord interface assertion")
}
fmt.Println(arec.MIME())
for _, err = rdr.NextPayload(); err != io.EOF; _, err = rdr.NextPayload() {
	if err != nil {
		log.Fatal(err)
	}
	count++
}
fmt.Printf("%s\n%s%d", rdr.FileDesc, buf, count)
Output:

text/dns
filedesc://IAH-20080430204825-00000-blackbook.arc
20080430204825
www.archive.org.	589	IN	A	207.241.229.39
298

func (ARCReader) Close

func (r ARCReader) Close() error

Close closes the underlying gzip reader if the WARC or ARC file is gzipped. If not a gzip file, this is a nop.

func (ARCReader) EofSlice

func (r ARCReader) EofSlice(off int64, l int) ([]byte, error)

EofSlice returns a byte slice with size l from a given offset from the end of the content of the record.

func (ARCReader) IsSlicer

func (r ARCReader) IsSlicer() bool

func (*ARCReader) Next

func (a *ARCReader) Next() (Record, error)

Next iterates to the next Record. Returns io.EOF at the end of file.
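
Because the end of the file is signalled with io.EOF, a loop that simply stops on any non-nil error cannot tell a clean finish from a failure. The sketch below shows one way to separate the two cases. It uses a hypothetical nexter interface and sliceIter type that yield strings rather than Records, so it runs without the package; only the loop shape is the point.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errBroken is a made-up error used to simulate a damaged stream.
var errBroken = errors.New("broken stream")

// nexter is a hypothetical stand-in for the Next method of a Reader.
type nexter interface {
	Next() (string, error)
}

// sliceIter yields its items in order, then returns fail, or io.EOF
// if fail is nil.
type sliceIter struct {
	items []string
	fail  error
}

func (s *sliceIter) Next() (string, error) {
	if len(s.items) == 0 {
		if s.fail != nil {
			return "", s.fail
		}
		return "", io.EOF
	}
	item := s.items[0]
	s.items = s.items[1:]
	return item, nil
}

// drain counts items until io.EOF and reports any other error.
func drain(n nexter) (int, error) {
	count := 0
	for {
		_, err := n.Next()
		if errors.Is(err, io.EOF) {
			return count, nil // clean end of input
		}
		if err != nil {
			return count, err // a real failure
		}
		count++
	}
}

func main() {
	n, err := drain(&sliceIter{items: []string{"a", "b", "c"}})
	fmt.Println(n, err) // 3 <nil>

	n, err = drain(&sliceIter{items: []string{"a"}, fail: errBroken})
	fmt.Println(n, err) // 1 broken stream
}
```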

func (*ARCReader) NextPayload

func (a *ARCReader) NextPayload() (Record, error)

NextPayload iterates to the next payload record. As ARC files do not differentiate between different types of records, the effect of NextPayload for an ARC reader is just to strip HTTP headers. These stripped headers are then made available in the Fields() map.

func (ARCReader) Read

func (r ARCReader) Read(p []byte) (int, error)

Read reads the content of the record. When iterating with NextPayload, the read will start after any stripped HTTP headers. Otherwise, the read starts immediately after the WARC or ARC header block.

func (*ARCReader) Reset

func (a *ARCReader) Reset(r io.Reader) error

Reset allows re-use of an ARC reader

func (ARCReader) Size

func (r ARCReader) Size() int64

Size returns the size in bytes of the content. When iterating with NextPayload, Size returns the size after HTTP headers have been stripped. So the size reported here may be different from that reported in the ARC or WARC's header block.

func (ARCReader) Slice

func (r ARCReader) Slice(off int64, l int) ([]byte, error)

Slice returns a byte slice with size l from a given offset from the start of the content of the record. When iterating with NextPayload, the slice zero offset starts after any stripped HTTP headers. Otherwise, the zero offset is immediately after the WARC or ARC header block.
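
The two slicing methods can be pictured with an in-memory sketch. The content type below is a hypothetical stand-in for a record's content. Reading EofSlice as "l bytes ending off bytes before the end" is one plausible interpretation of the description above, and returning io.EOF with a truncated slice for out-of-range requests is an assumption made for this sketch, not documented package behaviour.

```go
package main

import (
	"fmt"
	"io"
)

// content is a hypothetical in-memory stand-in for a record's content.
type content []byte

// Slice returns l bytes starting off bytes from the start of the content.
func (c content) Slice(off int64, l int) ([]byte, error) {
	if off < 0 || l < 0 || off > int64(len(c)) {
		return nil, io.EOF
	}
	end := off + int64(l)
	if end > int64(len(c)) {
		return c[off:], io.EOF // short slice: fewer than l bytes remain
	}
	return c[off:end], nil
}

// EofSlice returns l bytes ending off bytes before the end of the content.
func (c content) EofSlice(off int64, l int) ([]byte, error) {
	end := int64(len(c)) - off
	if off < 0 || l < 0 || end < 0 {
		return nil, io.EOF
	}
	start := end - int64(l)
	if start < 0 {
		return c[:end], io.EOF // short slice: fewer than l bytes precede end
	}
	return c[start:end], nil
}

func main() {
	c := content("0123456789")

	head, _ := c.Slice(2, 3)
	fmt.Printf("%s\n", head) // 234

	tail, _ := c.EofSlice(0, 4)
	fmt.Printf("%s\n", tail) // 6789

	mid, _ := c.EofSlice(2, 3)
	fmt.Printf("%s\n", mid) // 567
}
```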

type ARCRecord

type ARCRecord interface {
	IP() string
	Record
}

ARCRecord represents the common fields shared by ARC version 1 and ARC version 2 URL record blocks. ARC version 2 URL record blocks have additional fields not exposed here. These fields are available in the Fields() map. To access the IP() method of an ARCRecord, do an interface assertion on a Record.

Example:

record, _ := reader.Next()
arcrecord, ok := record.(ARCRecord)
if ok {
	fmt.Println(arcrecord.IP())
}

type Content

type Content interface {
	Size() int64
	Read(p []byte) (n int, err error)
	Slice(off int64, l int) ([]byte, error)
	EofSlice(off int64, l int) ([]byte, error)
	// contains filtered or unexported methods
}

Content represents the content portion of a WARC or ARC record.

type Header

type Header interface {
	URL() string
	Date() time.Time
	MIME() string
	Fields() map[string][]string
	// contains filtered or unexported methods
}

Header represents the common header fields shared by ARC and WARC records.

type MultiReader

type MultiReader struct {
	Reader
	// contains filtered or unexported fields
}

MultiReader is the concrete type returned by webarchive.NewReader. A MultiReader can represent either a WARC or an ARC reader (or both, if ARC and WARC files are given to the same reader using Reset).

Example:

f, _ := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
rdr, _ := NewReader(f)
f.Close()
f, _ = os.Open("examples/IAH-20080430204825-00000-blackbook.warc.gz")
rdr.Reset(f)
rdr.Close()
f.Close()

func (*MultiReader) Reset

func (m *MultiReader) Reset(r io.Reader) error

Reset allows re-use of a MultiReader. A MultiReader created with a WARC file can be reset with an ARC file, and vice versa.

type Reader

type Reader interface {
	Reset(io.Reader) error
	Next() (Record, error)
	NextPayload() (Record, error) // skip non-response/resource records; merge continuations; strip non-body content from record
	Close() error
}

Reader represents the common methods shared by ARC, WARC and Multi readers.

func NewReader

func NewReader(r io.Reader) (Reader, error)

NewReader returns a new webarchive Reader reading from the io.Reader. The supplied io.Reader can be a WARC, ARC, WARC.GZ or ARC.GZ file.
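
How NewReader tells these inputs apart is not described here. As a rough sketch of the general idea, the leading bytes are enough to distinguish the cases: gzip streams start with the bytes 0x1f 0x8b, WARC records start with a "WARC/" version line, and ARC files start with a "filedesc://" version block. The sniff and sniffString functions below are hypothetical and use only the standard library; this is not the package's detection logic.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// sniff peeks at the start of the stream without consuming it and
// guesses the container format from well-known leading bytes.
func sniff(br *bufio.Reader) string {
	// Peek may return fewer bytes plus an error for short inputs;
	// the prefix checks below cope with a short slice.
	head, _ := br.Peek(len("filedesc://"))
	switch {
	case bytes.HasPrefix(head, []byte{0x1f, 0x8b}):
		return "gzip"
	case bytes.HasPrefix(head, []byte("WARC/")):
		return "warc"
	case bytes.HasPrefix(head, []byte("filedesc://")):
		return "arc"
	}
	return "unknown"
}

// sniffString is a convenience wrapper for in-memory samples.
func sniffString(s string) string {
	return sniff(bufio.NewReader(strings.NewReader(s)))
}

func main() {
	samples := []string{
		"WARC/1.0\r\n",
		"filedesc://example.arc 0.0.0.0",
		"\x1f\x8b\x08",
		"<html>",
	}
	for _, sample := range samples {
		fmt.Println(sniffString(sample))
	}
}
```

Because Peek does not advance the reader, the same bufio.Reader can be handed on to whichever parser matches. For gzipped input, the check would be repeated on the decompressed stream.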

Example
f, err := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
if err != nil {
	log.Fatal(err)
}
// NewReader(io.Reader) can be used to read WARC, ARC or gzipped WARC or ARC files
rdr, err := NewReader(f)
if err != nil {
	log.Fatal(err)
}
// use Next() to iterate through all records in the WARC or ARC file
for record, err := rdr.Next(); err == nil; record, err = rdr.Next() {
	// records implement the io.Reader interface
	i, err := io.Copy(io.Discard, record)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Read: %d bytes\n", i)
	// records also have URL(), MIME(), Date() and Size() methods
	fmt.Printf("URL: %s, MIME: %s, Date: %v, Size: %d\n",
		record.URL(), record.MIME(), record.Date(), record.Size())
	// the Fields() method returns all the fields in the WARC or ARC record
	for key, values := range record.Fields() {
		fmt.Printf("Field key: %s, Field values: %v\n", key, values)
	}
}
f.Close()
f, err = os.Open("examples/IAH-20080430204825-00000-blackbook.warc.gz")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
// readers can Reset() to reuse the underlying buffers
if err = rdr.Reset(f); err != nil {
	log.Fatal(err)
}
// the Close() method should be used if you pass in gzipped files, it is a nop for
// non-gzipped files
defer rdr.Close()
// NextPayload() skips records that are not resource, conversion or response
// records and merges continuations into single records. It also strips HTTP
// headers from response records. After stripping, those HTTP headers are
// available alongside the WARC headers in the record.Fields() map.
for record, err := rdr.NextPayload(); err == nil; record, err = rdr.NextPayload() {
	// DecodePayload(record) decodes any encodings (transfer or
	// content) declared in a record's HTTP header.
	// DecodePayloadT(record) just decodes transfer encodings.
	// Both decode chunked, deflate and gzip encodings.
	record = DecodePayload(record)
	i, err := io.Copy(io.Discard, record)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Read: %d bytes\n", i)
	// any skipped HTTP headers can be retrieved from the Fields() map
	for key, values := range record.Fields() {
		fmt.Printf("Field key: %s, Field values: %v\n", key, values)
	}
}
Output:

type Record

type Record interface {
	Header
	Content
}

Record represents both ARC and WARC records.

func DecodePayload

func DecodePayload(r Record) Record

DecodePayload decodes any encodings (transfer or content) declared in a record's HTTP header. Decodes chunked, deflate and gzip encodings.

func DecodePayloadT

func DecodePayloadT(r Record) Record

DecodePayloadT decodes any transfer encodings declared in a record's HTTP header. Decodes chunked, deflate and gzip encodings.
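
To show what undoing these encodings involves, the following standard-library sketch removes a chunked transfer encoding and then a gzip content encoding. It is an illustration of the idea only, not the package's implementation: decodeBody, decodeString and encodeBody are hypothetical helpers, and deflate is left out to keep the example short.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http/httputil"
	"strings"
)

// decodeBody undoes a chunked transfer encoding and/or a gzip content
// encoding. The transfer encoding is the outer layer, so it goes first.
func decodeBody(body io.Reader, chunked, gzipped bool) ([]byte, error) {
	if chunked {
		body = httputil.NewChunkedReader(body)
	}
	if gzipped {
		zr, err := gzip.NewReader(body)
		if err != nil {
			return nil, err
		}
		defer zr.Close()
		body = zr
	}
	return io.ReadAll(body)
}

// decodeString is a convenience wrapper for in-memory samples.
func decodeString(s string, chunked, gzipped bool) (string, error) {
	out, err := decodeBody(strings.NewReader(s), chunked, gzipped)
	return string(out), err
}

// encodeBody builds sample input: gzip the payload, then chunk it.
func encodeBody(payload string) []byte {
	var zbuf bytes.Buffer
	zw := gzip.NewWriter(&zbuf)
	zw.Write([]byte(payload))
	zw.Close()

	var out bytes.Buffer
	cw := httputil.NewChunkedWriter(&out)
	cw.Write(zbuf.Bytes())
	cw.Close()              // writes the terminating zero-length chunk
	out.WriteString("\r\n") // final CRLF that follows the (empty) trailers
	return out.Bytes()
}

func main() {
	plain, err := decodeString("5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n", true, false)
	fmt.Printf("%q %v\n", plain, err) // "hello world" <nil>

	both, err := decodeString(string(encodeBody("hello world")), true, true)
	fmt.Printf("%q %v\n", both, err) // "hello world" <nil>
}
```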

type WARCReader

type WARCReader struct {
	// contains filtered or unexported fields
}

WARCReader is the WARC implementation of a webarchive Reader

func NewWARCReader

func NewWARCReader(r io.Reader) (*WARCReader, error)

NewWARCReader creates a new WARC reader from the supplied io.Reader. Use instead of NewReader if you are only working with WARC files.

Example
f, err := os.Open("examples/IAH-20080430204825-00000-blackbook.warc")
if errors.Is(err, os.ErrNotExist) {
	fmt.Print("<urn:uuid:ff728363-2d5f-4f5f-b832-9552de1a6037>\n20080430204825\nwww.archive.org.	589	IN	A	207.241.229.39\n298")
	return
}
rdr, err := NewWARCReader(f)
if err != nil {
	log.Fatal("failure creating a WARC reader")
}
rec, err := rdr.NextPayload()
if err != nil {
	log.Fatal("failure seeking: " + err.Error())
}
buf := make([]byte, 55)
io.ReadFull(rec, buf)
var count int
wrec, ok := rec.(WARCRecord)
if !ok {
	log.Fatal("failure doing WARCRecord interface assertion")
}
fmt.Println(wrec.ID())
for _, err = rdr.NextPayload(); err != io.EOF; _, err = rdr.NextPayload() {
	if err != nil {
		log.Fatal(err)
	}
	count++
}
fmt.Printf("%s\n%d", buf, count)
Output:

<urn:uuid:ff728363-2d5f-4f5f-b832-9552de1a6037>
20080430204825
www.archive.org.	589	IN	A	207.241.229.39
298

func (WARCReader) Close

func (r WARCReader) Close() error

Close closes the underlying gzip reader if the WARC or ARC file is gzipped. If not a gzip file, this is a nop.

func (WARCReader) Date

func (h WARCReader) Date() time.Time

Date returns the archive date of the current Record.

func (WARCReader) EofSlice

func (r WARCReader) EofSlice(off int64, l int) ([]byte, error)

EofSlice returns a byte slice with size l from a given offset from the end of the content of the record.

func (WARCReader) Fields

func (h WARCReader) Fields() map[string][]string

Fields returns a map of all WARC fields for the current Record. If NextPayload was used, this map will also contain any stripped HTTP headers.

func (WARCReader) ID

func (h WARCReader) ID() string

ID returns the WARC Record ID.

func (WARCReader) IsSlicer

func (r WARCReader) IsSlicer() bool

func (WARCReader) MIME

func (h WARCReader) MIME() string

func (*WARCReader) Next

func (w *WARCReader) Next() (Record, error)

Next iterates to the next Record. Returns io.EOF at the end of file.

func (*WARCReader) NextPayload

func (w *WARCReader) NextPayload() (Record, error)

NextPayload iterates to the next payload record. It skips records that are not resource, conversion or response records, and merges continuations into single records. It also strips HTTP headers from response records. After stripping, those HTTP headers are available alongside the WARC headers in the record.Fields() map.
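
The header-stripping step can be sketched with the standard library's HTTP parser. In the example below, stripHTTP is a hypothetical helper that parses the HTTP response stored in a record block, merges its headers into an existing fields map, and returns only the body. It illustrates the behaviour described above under those assumptions and is not the package's code.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// stripHTTP parses the HTTP response held in block, adds its headers to
// fields, and returns the body that remains after the headers.
func stripHTTP(block string, fields map[string][]string) ([]byte, error) {
	// A nil request makes ReadResponse assume a GET request.
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(block)), nil)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	for key, values := range resp.Header {
		fields[key] = append(fields[key], values...)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// fields starts with a made-up WARC header, as a record's map might.
	fields := map[string][]string{"WARC-Type": {"response"}}
	raw := "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"

	body, err := stripHTTP(raw, fields)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The body length is smaller than the length of the full block,
	// which is why a size reported after stripping can differ from
	// the length declared in the record header.
	fmt.Printf("body=%q size=%d block=%d\n", body, len(body), len(raw))
	fmt.Println(fields["Content-Type"], fields["WARC-Type"])
}
```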

func (WARCReader) Read

func (r WARCReader) Read(p []byte) (int, error)

Read reads the content of the record. When iterating with NextPayload, the read will start after any stripped HTTP headers. Otherwise, the read starts immediately after the WARC or ARC header block.

func (*WARCReader) Reset

func (w *WARCReader) Reset(r io.Reader) error

Reset allows re-use of a WARC reader

func (WARCReader) Size

func (r WARCReader) Size() int64

Size returns the size in bytes of the content. When iterating with NextPayload, Size returns the size after HTTP headers have been stripped. So the size reported here may be different from that reported in the ARC or WARC's header block.

func (WARCReader) Slice

func (r WARCReader) Slice(off int64, l int) ([]byte, error)

Slice returns a byte slice with size l from a given offset from the start of the content of the record. When iterating with NextPayload, the slice zero offset starts after any stripped HTTP headers. Otherwise, the zero offset is immediately after the WARC or ARC header block.

func (WARCReader) Type

func (h WARCReader) Type() string

Type returns the WARC Type

func (WARCReader) URL

func (h WARCReader) URL() string

URL returns the URL of the current Record.

type WARCRecord

type WARCRecord interface {
	ID() string
	Type() string
	Record
}

WARCRecord allows access to specific WARC record fields. Other WARC fields not included here are accessible via the Fields() method. To access the ID() and Type() methods of a WARCRecord, do an interface assertion on a Record.

Example:

record, _ := reader.Next()
warcrecord, ok := record.(WARCRecord)
if ok {
	fmt.Println(warcrecord.ID())
}
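
When a reader may yield either kind of record, a type switch can replace a chain of single assertions. The sketch below uses small local interfaces (record, arcRecord, warcRecord) and fake types that only mirror the shape of the documented ones, so it runs without the package; all of these names are hypothetical.

```go
package main

import "fmt"

// Local stand-ins that mirror the shape of Record, ARCRecord and
// WARCRecord, reduced to the methods needed here.
type record interface{ URL() string }

type arcRecord interface {
	record
	IP() string
}

type warcRecord interface {
	record
	ID() string
	Type() string
}

type fakeARC struct{ url, ip string }

func (f fakeARC) URL() string { return f.url }
func (f fakeARC) IP() string  { return f.ip }

type fakeWARC struct{ url, id, typ string }

func (f fakeWARC) URL() string  { return f.url }
func (f fakeWARC) ID() string   { return f.id }
func (f fakeWARC) Type() string { return f.typ }

type plain struct{ url string }

func (p plain) URL() string { return p.url }

// describe picks the format-specific methods with a type switch.
func describe(r record) string {
	switch rec := r.(type) {
	case arcRecord:
		return "ARC record, IP " + rec.IP()
	case warcRecord:
		return "WARC " + rec.Type() + " record, ID " + rec.ID()
	default:
		return "unknown record for " + r.URL()
	}
}

func main() {
	fmt.Println(describe(fakeARC{url: "http://example.com/", ip: "192.0.2.1"}))
	fmt.Println(describe(fakeWARC{url: "http://example.com/", id: "<urn:uuid:example>", typ: "response"}))
	fmt.Println(describe(plain{url: "http://example.com/"}))
}
```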

Directories

Path Synopsis
utils
warcscan
Warcscan is a simple script to enable searching WARC files and retrieving individual WARC records.