Documentation ¶
Index ¶
- Constants
- Variables
- type ARC
- type ARCReader
- func (r ARCReader) Close() error
- func (r ARCReader) EofSlice(off int64, l int) ([]byte, error)
- func (r ARCReader) IsSlicer() bool
- func (a *ARCReader) Next() (Record, error)
- func (a *ARCReader) NextPayload() (Record, error)
- func (r ARCReader) Read(p []byte) (int, error)
- func (a *ARCReader) Reset(r io.Reader) error
- func (r ARCReader) Size() int64
- func (r ARCReader) Slice(off int64, l int) ([]byte, error)
- type ARCRecord
- type Content
- type Header
- type MultiReader
- type Reader
- type Record
- type WARCReader
- func (r WARCReader) Close() error
- func (h WARCReader) Date() time.Time
- func (r WARCReader) EofSlice(off int64, l int) ([]byte, error)
- func (h WARCReader) Fields() map[string][]string
- func (h WARCReader) ID() string
- func (r WARCReader) IsSlicer() bool
- func (h WARCReader) MIME() string
- func (w *WARCReader) Next() (Record, error)
- func (w *WARCReader) NextPayload() (Record, error)
- func (r WARCReader) Read(p []byte) (int, error)
- func (w *WARCReader) Reset(r io.Reader) error
- func (r WARCReader) Size() int64
- func (r WARCReader) Slice(off int64, l int) ([]byte, error)
- func (h WARCReader) Type() string
- func (h WARCReader) URL() string
- type WARCRecord
Examples ¶
Constants ¶
const ARCTime = "20060102150405"
ARCTime is a time format string for the ARC time format
Variables ¶
var (
	ErrReset         = errors.New("webarchive: attempted reset on nil MultiReader, use NewReader() first")
	ErrNotWebarchive = errors.New("webarchive: not a valid ARC or WARC file")
	ErrVersionBlock  = errors.New("webarchive: invalid ARC version block")
	ErrARCHeader     = errors.New("webarchive: invalid ARC header")
	ErrNotSlicer     = errors.New("webarchive: underlying reader must be a slicer to expose Slice and EOFSlice methods")
	ErrWARCHeader    = errors.New("webarchive: invalid WARC header")
	ErrWARCRecord    = errors.New("webarchive: error parsing WARC record")
	ErrDiscard       = errors.New("webarchive: failed to do full read during discard")
)
Functions ¶
This section is empty.
Types ¶
type ARC ¶
type ARC struct {
	FileDesc   string    // Original pathname of the archive file
	Address    string    // IP address of machine that created the archive file
	FileDate   time.Time // Date the archive file was created
	Version    int       // ARC version (1 or 2) - this will affect the fields available in the Fields() map
	OriginCode string    // Name of gathering organization
}
ARC structs represent the version blocks at the start of ARC files. They provide information about the ARC file as a whole, such as its version, the file path of the archive file, and the date the archive file was created.
type ARCReader ¶
type ARCReader struct {
	*ARC
	// contains filtered or unexported fields
}
ARCReader is the ARC implementation of a webarchive Reader
func NewARCReader ¶
NewARCReader creates a new ARC reader from the supplied io.Reader. Use instead of NewReader if you are only working with ARC files.
Example ¶
f, err := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
if errors.Is(err, os.ErrNotExist) {
	fmt.Print("text/dns\nfiledesc://IAH-20080430204825-00000-blackbook.arc\n20080430204825\nwww.archive.org. 589 IN A 207.241.229.39\n298")
	return
}
rdr, err := NewARCReader(f)
if err != nil {
	log.Fatal("failure creating an arc reader")
}
rec, err := rdr.NextPayload()
if err != nil {
	log.Fatal("failure seeking")
}
buf := make([]byte, 56)
io.ReadFull(rec, buf)
var count int
arec, ok := rec.(ARCRecord)
if !ok {
	log.Fatal("failure doing ARCRecord interface assertion")
}
fmt.Println(arec.MIME())
for _, err = rdr.NextPayload(); err != io.EOF; _, err = rdr.NextPayload() {
	if err != nil {
		log.Fatal(err)
	}
	count++
}
fmt.Printf("%s\n%s%d", rdr.FileDesc, buf, count)
Output:
text/dns
filedesc://IAH-20080430204825-00000-blackbook.arc
20080430204825
www.archive.org. 589 IN A 207.241.229.39
298
func (ARCReader) Close ¶
func (r ARCReader) Close() error
Close closes the underlying gzip reader if the WARC or ARC file is gzipped. If not a gzip file, this is a nop.
func (ARCReader) EofSlice ¶
EofSlice returns a byte slice with size l from a given offset from the end of the content of the record.
func (*ARCReader) NextPayload ¶
NextPayload iterates to the next payload record. As ARC files do not differentiate between different types of records, the effect of NextPayload for an ARC reader is just to strip HTTP headers. These stripped headers are then made available in the Fields() map.
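The header-stripping behaviour described above can be illustrated with the standard library alone. The sketch below is not the package's implementation: it assumes the record content is a complete HTTP response and uses net/http to split it, and splitHTTP is a hypothetical helper. Note that net/http canonicalises header names, which may differ from the keys the package stores in Fields().

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// splitHTTP (hypothetical helper) parses raw as an HTTP response and
// returns the headers and the body separately.
func splitHTTP(raw string) (map[string][]string, string, error) {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(raw)), nil)
	if err != nil {
		return nil, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, "", err
	}
	return resp.Header, string(body), nil
}

func main() {
	raw := "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"
	headers, body, err := splitHTTP(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The headers end up in a map[string][]string, the same shape as Fields().
	fmt.Println(headers["Content-Type"], len(body))
}
```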
func (ARCReader) Read ¶
Read reads the content of the record. When iterating with NextPayload, the read will start after any stripped HTTP headers. Otherwise, the read starts immediately after the WARC or ARC header block.
func (ARCReader) Size ¶
func (r ARCReader) Size() int64
Size returns the size in bytes of the content. When iterating with NextPayload, Size returns the size after HTTP headers have been stripped. So the size reported here may be different from that reported in the ARC or WARC's header block.
func (ARCReader) Slice ¶
Slice returns a byte slice with size l from a given offset from the start of the content of the record. When iterating with NextPayload, the slice zero offset starts after any stripped HTTP headers. Otherwise, the zero offset is immediately after the WARC or ARC header block.
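To make the two offset conventions concrete, here is a rough sketch on a plain byte slice. It is only one plausible reading of the descriptions of Slice and EofSlice: the helper names sliceFromStart and sliceFromEnd are hypothetical, and both the assumption that the end-relative slice finishes off bytes before the end of the content and the choice to return io.EOF with a shortened slice are illustrative rather than confirmed behaviour of the package.

```go
package main

import (
	"fmt"
	"io"
)

// sliceFromStart returns l bytes beginning off bytes after the start of content.
// Returning a shortened slice together with io.EOF is an assumption.
func sliceFromStart(content []byte, off int64, l int) ([]byte, error) {
	size := int64(len(content))
	if off < 0 || l < 0 || off > size {
		return nil, io.EOF
	}
	end := off + int64(l)
	if end > size {
		return content[off:], io.EOF
	}
	return content[off:end], nil
}

// sliceFromEnd returns l bytes that finish off bytes before the end of content.
// This interpretation of the end-relative offset is an assumption.
func sliceFromEnd(content []byte, off int64, l int) ([]byte, error) {
	size := int64(len(content))
	if off < 0 || l < 0 || off > size {
		return nil, io.EOF
	}
	end := size - off
	start := end - int64(l)
	if start < 0 {
		return content[:end], io.EOF
	}
	return content[start:end], nil
}

func main() {
	content := []byte("0123456789")
	head, _ := sliceFromStart(content, 2, 3)
	tail, _ := sliceFromEnd(content, 2, 3)
	fmt.Printf("%s %s\n", head, tail) // prints 234 567
}
```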
type ARCRecord ¶
ARCRecord represents the common fields shared by ARC version 1 and ARC version 2 URL record blocks. ARC version 2 URL record blocks have additional fields not exposed here. These fields are available in the Fields() map. To access the IP() method of an ARCRecord, do an interface assertion on a Record.
Example:
record, _ := reader.Next()
arcrecord, ok := record.(ARCRecord)
if ok {
	fmt.Println(arcrecord.IP())
}
type Content ¶
type Content interface {
	Size() int64
	Read(p []byte) (n int, err error)
	Slice(off int64, l int) ([]byte, error)
	EofSlice(off int64, l int) ([]byte, error)
	// contains filtered or unexported methods
}
Content represents the content portion of a WARC or ARC record.
type Header ¶
type Header interface {
	URL() string
	Date() time.Time
	MIME() string
	Fields() map[string][]string
	// contains filtered or unexported methods
}
Header represents the common header fields shared by ARC and WARC records.
type MultiReader ¶
type MultiReader struct {
	Reader
	// contains filtered or unexported fields
}
MultiReader is the concrete type returned by webarchive.NewReader. A MultiReader can act as either a WARC or an ARC reader, and can switch between the two if ARC and WARC files are given to the same Reader using Reset.
Example:
f, _ := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
rdr, _ := NewReader(f)
f.Close()
f, _ = os.Open("examples/IAH-20080430204825-00000-blackbook.warc.gz")
rdr.Reset(f)
rdr.Close()
f.Close()
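A reader that serves both formats has to tell them apart somehow. The sketch below shows one way this could be done, by inspecting the leading bytes: WARC records begin with "WARC/" and ARC files begin with a "filedesc://" version block. How MultiReader actually makes this decision is not stated here, so sniffFormat and the local errNotWebarchive sentinel (a stand-in for ErrNotWebarchive) are hypothetical.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// errNotWebarchive is a local stand-in for the package's ErrNotWebarchive.
var errNotWebarchive = errors.New("not a valid ARC or WARC file")

// sniffFormat (hypothetical helper) reports "WARC" or "ARC" based on the
// first bytes of an uncompressed file.
func sniffFormat(head []byte) (string, error) {
	switch {
	case bytes.HasPrefix(head, []byte("WARC/")):
		return "WARC", nil
	case bytes.HasPrefix(head, []byte("filedesc://")):
		return "ARC", nil
	}
	return "", errNotWebarchive
}

func main() {
	inputs := []string{
		"WARC/1.0\r\n",
		"filedesc://example.arc 0.0.0.0 20080430204825 text/plain 76\n",
		"<html>",
	}
	for _, head := range inputs {
		format, err := sniffFormat([]byte(head))
		if errors.Is(err, errNotWebarchive) {
			// Sentinel errors can be matched with errors.Is.
			fmt.Println("unrecognised input")
			continue
		}
		fmt.Println(format)
	}
}
```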
type Reader ¶
type Reader interface {
	Reset(io.Reader) error
	Next() (Record, error)
	NextPayload() (Record, error) // skip non-response/resource records; merge continuations; strip non-body content from record
	Close() error
}
Reader represents the common methods shared by ARC, WARC and Multi readers.
func NewReader ¶
NewReader returns a new webarchive Reader reading from the io.Reader. The supplied io.Reader can be a WARC, ARC, WARC.GZ or ARC.GZ file.
Example ¶
f, _ := os.Open("examples/IAH-20080430204825-00000-blackbook.arc")
// NewReader(io.Reader) can be used to read WARC, ARC or gzipped WARC or ARC files
rdr, err := NewReader(f)
if err != nil {
	log.Fatal(err)
}
// use Next() to iterate through all records in the WARC or ARC file
for record, err := rdr.Next(); err == nil; record, err = rdr.Next() {
	// records implement the io.Reader interface
	i, err := io.Copy(ioutil.Discard, record)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Read: %d bytes\n", i)
	// records also have URL(), MIME(), Date() and Size() methods
	fmt.Printf("URL: %s, MIME: %s, Date: %v, Size: %d\n", record.URL(), record.MIME(), record.Date(), record.Size())
	// the Fields() method returns all the fields in the WARC or ARC record
	for key, values := range record.Fields() {
		fmt.Printf("Field key: %s, Field values: %v\n", key, values)
	}
}
f.Close()
f, _ = os.Open("examples/IAH-20080430204825-00000-blackbook.warc.gz")
defer f.Close()
// readers can Reset() to reuse the underlying buffers
err = rdr.Reset(f)
// the Close() method should be used if you pass in gzipped files, it is a nop for
// non-gzipped files
defer rdr.Close()
// NextPayload() skips non-resource, conversion or response records and merges
// continuations into single records. It also strips HTTP headers from response
// records. After stripping, those HTTP headers are available alongside the WARC
// headers in the record.Fields() map.
for record, err := rdr.NextPayload(); err == nil; record, err = rdr.NextPayload() {
	// DecodePayload(record) decodes any encodings (transfer or
	// content) declared in a record's HTTP header.
	// DecodePayloadT(record) just decodes transfer encodings.
	// Both decode chunked, deflate and gzip encodings.
	record = DecodePayload(record)
	i, err := io.Copy(ioutil.Discard, record)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Read: %d bytes\n", i)
	// any skipped HTTP headers can be retrieved from the Fields() map
	for key, values := range record.Fields() {
		fmt.Printf("Field key: %s, Field values: %v\n", key, values)
	}
}
Output:
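NewReader accepts both plain and gzipped input, and Close only matters for the gzipped case. A minimal sketch of that pattern with the standard library follows: it peeks at the two gzip magic bytes and wraps the stream only when they are present. This is an assumption about how such detection could work, not a description of the package's internals, and openMaybeGzip, gzipBytes and readAll are hypothetical helpers.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// openMaybeGzip (hypothetical helper) returns a decompressing reader when r
// starts with the gzip magic bytes 0x1f 0x8b, and a pass-through reader
// whose Close does nothing otherwise.
func openMaybeGzip(r io.Reader) (io.ReadCloser, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(2)
	if err == nil && magic[0] == 0x1f && magic[1] == 0x8b {
		return gzip.NewReader(br)
	}
	return io.NopCloser(br), nil
}

// gzipBytes compresses s so the example has gzipped input to work with.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

// readAll opens data with openMaybeGzip and returns the decoded text.
func readAll(data []byte) (string, error) {
	rc, err := openMaybeGzip(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}

func main() {
	plain, _ := readAll([]byte("WARC/1.0\r\n"))
	zipped, _ := readAll(gzipBytes("WARC/1.0\r\n"))
	fmt.Println(plain == zipped) // prints true
}
```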
type Record ¶
Record represents both ARC and WARC records.
func DecodePayload ¶
DecodePayload decodes any encodings (transfer or content) declared in a record's HTTP header. Decodes chunked, deflate and gzip encodings.
func DecodePayloadT ¶
DecodePayloadT decodes any transfer encodings declared in a record's HTTP header. Decodes chunked, deflate and gzip encodings.
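The difference between transfer and content encodings can be sketched with the standard library. The example below undoes a chunked transfer encoding with net/http/httputil and then a gzip content encoding with compress/gzip. It is an illustration under stated assumptions, not the package's code: decodeBody, decodeToString and chunkedGzip are hypothetical helpers, the encodings are passed in as plain strings rather than read from a record, and deflate is left out to keep the sketch short.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http/httputil"
	"strings"
)

// decodeBody (hypothetical helper) removes the transfer encoding first and
// then the content encoding, which is the reverse of the order in which
// a server applies them.
func decodeBody(body io.Reader, transferEncoding, contentEncoding string) (io.Reader, error) {
	if strings.EqualFold(transferEncoding, "chunked") {
		body = httputil.NewChunkedReader(body)
	}
	if strings.EqualFold(contentEncoding, "gzip") {
		return gzip.NewReader(body)
	}
	return body, nil
}

// decodeToString decodes raw and returns the result as a string.
func decodeToString(raw []byte, te, ce string) (string, error) {
	r, err := decodeBody(bytes.NewReader(raw), te, ce)
	if err != nil {
		return "", err
	}
	b, err := io.ReadAll(r)
	return string(b), err
}

// chunkedGzip builds test input: s compressed with gzip, then chunked.
func chunkedGzip(s string) []byte {
	var zbuf bytes.Buffer
	zw := gzip.NewWriter(&zbuf)
	zw.Write([]byte(s))
	zw.Close()
	var out bytes.Buffer
	cw := httputil.NewChunkedWriter(&out)
	cw.Write(zbuf.Bytes())
	cw.Close()
	out.WriteString("\r\n") // final CRLF after the terminating chunk
	return out.Bytes()
}

func main() {
	chunked := []byte("5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
	a, _ := decodeToString(chunked, "chunked", "")
	b, _ := decodeToString(chunkedGzip("hello world"), "chunked", "gzip")
	fmt.Println(a == b, a) // prints true hello world
}
```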
type WARCReader ¶
type WARCReader struct {
// contains filtered or unexported fields
}
WARCReader is the WARC implementation of a webarchive Reader
func NewWARCReader ¶
func NewWARCReader(r io.Reader) (*WARCReader, error)
NewWARCReader creates a new WARC reader from the supplied io.Reader. Use instead of NewReader if you are only working with WARC files.
Example ¶
f, err := os.Open("examples/IAH-20080430204825-00000-blackbook.warc")
if errors.Is(err, os.ErrNotExist) {
	fmt.Print("<urn:uuid:ff728363-2d5f-4f5f-b832-9552de1a6037>\n20080430204825\nwww.archive.org. 589 IN A 207.241.229.39\n298")
	return
}
rdr, err := NewWARCReader(f)
if err != nil {
	log.Fatal("failure creating a WARC reader")
}
rec, err := rdr.NextPayload()
if err != nil {
	log.Fatal("failure seeking: " + err.Error())
}
buf := make([]byte, 55)
io.ReadFull(rec, buf)
var count int
wrec, ok := rec.(WARCRecord)
if !ok {
	log.Fatal("failure doing WARCRecord interface assertion")
}
fmt.Println(wrec.ID())
for _, err = rdr.NextPayload(); err != io.EOF; _, err = rdr.NextPayload() {
	if err != nil {
		log.Fatal(err)
	}
	count++
}
fmt.Printf("%s\n%d", buf, count)
Output:
<urn:uuid:ff728363-2d5f-4f5f-b832-9552de1a6037>
20080430204825
www.archive.org. 589 IN A 207.241.229.39
298
func (WARCReader) Close ¶
func (r WARCReader) Close() error
Close closes the underlying gzip reader if the WARC or ARC file is gzipped. If not a gzip file, this is a nop.
func (WARCReader) EofSlice ¶
EofSlice returns a byte slice with size l from a given offset from the end of the content of the record.
func (WARCReader) Fields ¶
Fields returns a map of all WARC fields for the current Record. If NextPayload was used, this map will also contain any stripped HTTP headers.
func (*WARCReader) Next ¶
func (w *WARCReader) Next() (Record, error)
Next iterates to the next Record. Returns io.EOF at the end of file.
func (*WARCReader) NextPayload ¶
func (w *WARCReader) NextPayload() (Record, error)
NextPayload iterates to the next payload record. It skips any record that is not a resource, conversion or response record, and merges continuations into single records. It also strips HTTP headers from response records. After stripping, those HTTP headers are available alongside the WARC headers in the record.Fields() map.
func (WARCReader) Read ¶
Read reads the content of the record. When iterating with NextPayload, the read will start after any stripped HTTP headers. Otherwise, the read starts immediately after the WARC or ARC header block.
func (*WARCReader) Reset ¶
func (w *WARCReader) Reset(r io.Reader) error
Reset allows re-use of a WARC reader
func (WARCReader) Size ¶
func (r WARCReader) Size() int64
Size returns the size in bytes of the content. When iterating with NextPayload, Size returns the size after HTTP headers have been stripped. So the size reported here may be different from that reported in the ARC or WARC's header block.
func (WARCReader) Slice ¶
Slice returns a byte slice with size l from a given offset from the start of the content of the record. When iterating with NextPayload, the slice zero offset starts after any stripped HTTP headers. Otherwise, the zero offset is immediately after the WARC or ARC header block.
type WARCRecord ¶
WARCRecord allows access to specific WARC record fields. Other WARC fields not included here are accessible via the Fields() method. To access the ID() and Type() methods of a WARCRecord, do an interface assertion on a Record.
Example:
record, _ := reader.Next()
warcrecord, ok := record.(WARCRecord)
if ok {
	fmt.Println(warcrecord.ID())
}