cdxj

v0.0.0-...-69ef566
Published: Nov 11, 2018 License: AGPL-3.0 Imports: 11 Imported by: 5

README

CDXJ

Go package implementing the CDXJ file format, which OpenWayback 3.0.0 (and later) uses to index web archive contents (notably in WARC and ARC files) and make them searchable via a resource resolution service. CDXJ builds on the CDX file format originally developed by the Internet Archive for the indexing behind the Wayback Machine. It simplifies the primary fields and adds a flexible JSON 'block' to each record, which allows great flexibility in the inclusion of additional data.

Copyright (C) 2017 Data Together
This program is free software: you can redistribute it and/or modify it under the terms of the GNU AFFERO General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the LICENSE file for details.

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests, and Pull Requests (PRs) for submitting changes.

Installation

Use in any Go package with:

import "github.com/datatogether/cdxj"

Development

Coming Soon

Documentation

Overview

Package cdxj implements the CDXJ file format, which OpenWayback 3.0.0 (and later) uses to index web archive contents (notably in WARC and ARC files) and make them searchable via a resource resolution service. CDXJ builds on the CDX file format originally developed by the Internet Archive for the indexing behind the Wayback Machine. It simplifies the primary fields and adds a flexible JSON 'block' to each record, which allows great flexibility in the inclusion of additional data.

Variables

var CanonicalizationScheme = purell.FlagsSafe

CanonicalizationScheme is the default method this package uses to canonicalize URLs.

Functions

func CanonicalizeURL

func CanonicalizeURL(rawurl string) (string, error)

CanonicalizeURL takes a raw URL string and returns its normalized version. Canonicalization is applied to URIs to remove trivial differences that do not indicate that the URIs reference different resources. Examples include removing session ID parameters and unnecessary port declarations (e.g. :80 when crawling HTTP). OpenWayback implements its own canonicalization process, which is typically applied to the searchable URIs in CDXJ files. You can, however, use any canonicalization scheme you like (including none), as long as the same canonicalization process is applied to the URIs when performing searches. Otherwise they may not match correctly.
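
To illustrate why the same canonicalization has to be applied at index time and at search time, here is a minimal standard-library sketch. The helper name canonicalize and the rules it applies (lowercase scheme and host, drop a default port) are assumptions made for this example; it is not the package's CanonicalizeURL, which is driven by CanonicalizationScheme.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// canonicalize is a hypothetical helper, not the package's CanonicalizeURL.
// It lowercases the scheme and host and drops a port that matches the
// scheme's default (80 for http, 443 for https).
func canonicalize(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	host := strings.ToLower(u.Hostname())
	if strings.Contains(host, ":") {
		// Hostname strips the brackets from IPv6 literals; restore them.
		host = "[" + host + "]"
	}
	port := u.Port()
	if (u.Scheme == "http" && port == "80") || (u.Scheme == "https" && port == "443") {
		port = ""
	}
	if port != "" {
		host += ":" + port
	}
	u.Host = host
	return u.String(), nil
}

func main() {
	indexed, err := canonicalize("HTTP://Example.COM:80/a/b?x=1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	searched, err := canonicalize("http://example.com/a/b?x=1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Both spellings reduce to the same key, so a lookup can match.
	fmt.Println(indexed, searched, indexed == searched)
}
```

If the search side skipped this step, the two spellings above would be compared as different strings and the lookup would miss.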

func SurtURL

func SurtURL(rawurl string) (string, error)

SurtURL applies the Sort-friendly URI Reordering Transform (SURT), a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names. A URI `<scheme://domain.tld/path?query>` has SURT form `<scheme://(tld,domain,)/path?query>`. Conversion to SURT form also involves making all characters lowercase, and changing the 'https' scheme to 'http'. Further, the '/' after a URI authority component -- for example, the third slash in a regular HTTP URI -- will only appear in the SURT form if it appeared in the plain URI form.
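
The rules described above can be sketched with the standard library alone. The function surt below is a hypothetical illustration, not the package's SurtURL: it keeps the scheme in its output, ignores any port or userinfo, and may differ from the real implementation in edge cases.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// surt is a hypothetical helper that follows the description above:
// lowercase everything, map https to http, and reverse the host labels
// into the form (tld,domain,). Ports and userinfo are ignored here, which
// is a simplification made for this sketch.
func surt(rawurl string) (string, error) {
	u, err := url.Parse(strings.ToLower(rawurl))
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", errors.New("surt: URL has no host")
	}
	scheme := u.Scheme
	if scheme == "https" {
		scheme = "http"
	}
	labels := strings.Split(u.Hostname(), ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	var b strings.Builder
	b.WriteString(scheme + "://(" + strings.Join(labels, ",") + ",)")
	// The path is copied as-is, so the slash after the authority only
	// appears if it was present in the input.
	b.WriteString(u.EscapedPath())
	if u.RawQuery != "" {
		b.WriteString("?" + u.RawQuery)
	}
	return b.String(), nil
}

func main() {
	for _, raw := range []string{
		"https://Example.com/Path?Q=1",
		"http://www.example.com/",
		"http://archive.org",
	} {
		s, err := surt(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", s)
	}
}
```

Because the most significant label comes first, sorting SURT keys groups all captures for a domain and its subdomains together.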

func UnSurtPath

func UnSurtPath(surturl string) (string, error)

UnSurtPath returns the path element of a SURT'ed URL.

func UnSurtURL

func UnSurtURL(surturl string) (string, error)

UnSurtURL turns a SURT'ed URL back into a normal URL. TODO: it should accept SURT URLs that contain a scheme.
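
As a rough picture of the reverse direction, the sketch below splits a scheme-less SURT key into a host and the remainder (path and query). The helper unsurt is hypothetical; the exact strings returned by UnSurtURL and UnSurtPath (for example, whether a scheme is added back) are not specified here, so treat the output format as an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// unsurt is a hypothetical helper that reverses the host labels of a
// scheme-less SURT key such as "(com,example,www,)/a/b?x=1". It returns the
// dotted host and whatever follows the closing parenthesis.
func unsurt(s string) (host, rest string, err error) {
	if !strings.HasPrefix(s, "(") {
		return "", "", errors.New("unsurt: missing opening parenthesis")
	}
	end := strings.Index(s, ")")
	if end < 0 {
		return "", "", errors.New("unsurt: missing closing parenthesis")
	}
	// Drop the trailing comma before splitting so no empty label remains.
	labels := strings.Split(strings.TrimSuffix(s[1:end], ","), ",")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return strings.Join(labels, "."), s[end+1:], nil
}

func main() {
	host, rest, err := unsurt("(com,example,www,)/a/b?x=1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(host, rest)
}
```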

func Validate

func Validate(r io.Reader) error

Validate checks that the contents of an io.Reader are in valid CDXJ format.

Types

type ByteRecords

type ByteRecords [][]byte

ByteRecords implements sort.Interface for a slice of marshaled CDXJ records (byte slices).

func (ByteRecords) Len

func (a ByteRecords) Len() int

func (ByteRecords) Less

func (a ByteRecords) Less(i, j int) bool

func (ByteRecords) Swap

func (a ByteRecords) Swap(i, j int)
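
For readers unfamiliar with the Len/Less/Swap trio, the sketch below shows the same pattern on a local type. The type byteRecords and the helper sortLines are hypothetical, and the plain bytewise ordering used in Less is an assumption; the package's ByteRecords may compare records differently.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// byteRecords is a local stand-in for a slice of marshaled CDXJ lines.
type byteRecords [][]byte

func (a byteRecords) Len() int { return len(a) }

// Less orders records bytewise (an assumption for this sketch).
func (a byteRecords) Less(i, j int) bool { return bytes.Compare(a[i], a[j]) < 0 }

func (a byteRecords) Swap(i, j int) { a[i], a[j] = a[j], a[i] }

// sortLines copies the input into a byteRecords, sorts it with sort.Sort,
// and returns the sorted lines as strings.
func sortLines(lines []string) []string {
	recs := make(byteRecords, len(lines))
	for i, l := range lines {
		recs[i] = []byte(l)
	}
	sort.Sort(recs)
	out := make([]string, len(recs))
	for i, r := range recs {
		out[i] = string(r)
	}
	return out
}

func main() {
	for _, l := range sortLines([]string{
		"(org,archive,)/ 2017-01-01T00:00:00Z response {}",
		"(com,example,)/a 2017-01-01T00:00:00Z response {}",
		"(com,example,)/ 2017-01-01T00:00:00Z response {}",
	}) {
		fmt.Println(l)
	}
}
```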

type Index

type Index []*Record

Index is a list of CDXJ records. Beyond a standard slice of record pointers, it includes methods for common patterns for adding and removing records.

func (Index) AddRecord

func (index Index) AddRecord(rec *Record) Index

AddRecord adds a single record to the list, unless its exact URI is already present.
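
Because AddRecord has a value receiver and returns an Index, callers presumably need to keep the returned value, much as with the built-in append. The sketch below shows that pattern on local stand-in types; record, index, and addRecord are hypothetical, and the linear scan is only one possible way to detect an existing URI.

```go
package main

import "fmt"

// record and index are local stand-ins for the package's Record and Index.
type record struct {
	URI string
}

type index []*record

// addRecord returns the index unchanged when a record with the same URI is
// already present, and otherwise returns the index with rec appended.
func (idx index) addRecord(rec *record) index {
	for _, r := range idx {
		if r.URI == rec.URI {
			return idx
		}
	}
	return append(idx, rec)
}

func main() {
	idx := index{}
	// The result must be assigned back; the receiver itself is not modified.
	idx = idx.addRecord(&record{URI: "(com,example,)/"})
	idx = idx.addRecord(&record{URI: "(com,example,)/"})
	idx = idx.addRecord(&record{URI: "(com,example,)/about"})
	fmt.Println(len(idx))
}
```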

func (Index) AddWARCRecord

func (index Index) AddWARCRecord(rec *warc.Record) (Index, error)

AddWARCRecord creates a CDXJ record from a WARC record and adds it to the index.

func (Index) AddWARCRecords

func (index Index) AddWARCRecords(recs warc.Records) (Index, error)

AddWARCRecords adds a list of WARC records to the index, creating a CDXJ record for each WARC record.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

A Reader reads records from a CDXJ-encoded io.Reader.

As returned by NewReader, a Reader expects input in the CDXJ format.

func NewReader

func NewReader(r io.Reader) *Reader

NewReader returns a new Reader that reads from r.

func (*Reader) Read

func (r *Reader) Read() (*Record, error)

Read reads a record from the reader. io.EOF is returned when the last record is reached.

func (*Reader) ReadAll

func (r *Reader) ReadAll() ([]*Record, error)

ReadAll consumes the entire reader, returning a slice of records
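
The Read/ReadAll pair follows a familiar Go pattern: call Read in a loop until it reports io.EOF. The sketch below reproduces that pattern over plain lines using only the standard library. lineReader, readAll, and readAllString are hypothetical names, records are left as unparsed strings, and the choice to skip blank lines and to report io.EOF only after the final line has been returned is an assumption of this sketch.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// lineReader is a local stand-in that returns one non-empty line per call.
type lineReader struct {
	br *bufio.Reader
}

func newLineReader(r io.Reader) *lineReader {
	return &lineReader{br: bufio.NewReader(r)}
}

// read returns the next non-empty line. A final line without a trailing
// newline is still returned; the following call then reports io.EOF.
func (lr *lineReader) read() (string, error) {
	for {
		line, err := lr.br.ReadString('\n')
		line = strings.TrimRight(line, "\r\n")
		if line != "" {
			return line, nil
		}
		if err != nil {
			return "", err
		}
	}
}

// readAll drains the reader, treating io.EOF as the normal end of input.
func readAll(r io.Reader) ([]string, error) {
	lr := newLineReader(r)
	var out []string
	for {
		line, err := lr.read()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, line)
	}
}

// readAllString is a convenience wrapper used by the example below.
func readAllString(s string) ([]string, error) {
	return readAll(strings.NewReader(s))
}

func main() {
	lines, err := readAllString("(com,example,)/ 2017-01-01T00:00:00Z response {}\n\n(org,archive,)/ 2017-01-02T00:00:00Z response {}")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(lines), "records")
}
```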

type Record

type Record struct {
	// Searchable URI
	// By *searchable*, we mean that the following transformations have been applied to it:
	// 1. Canonicalization - See Appendix A
	// 2. Sort-friendly URI Reordering Transform (SURT)
	// 3. The scheme is dropped from the SURT format
	URI string
	// Timestamp should correspond to the WARC-Date timestamp as of WARC 1.1.
	// The timestamp shall represent the instant that data capture for record
	// creation began.
	// All timestamps should be in UTC.
	Timestamp time.Time
	// Indicates what type of record the current line refers to.
	// This field is fully compatible with WARC 1.0 definition of
	// WARC-Type (chapter 5.5 and chapter 6).
	RecordType warc.RecordType
	// This should contain fully valid JSON data. The only limitation, beyond those
	// imposed by JSON encoding rules, is that this may not contain any newline
	// characters, either in Unix (0x0A) or Windows form (0x0D0A).
	// The first occurrence of a 0x0A constitutes the end of this field (and the record).
	JSON map[string]interface{}
}

Record is an entry in a CDXJ index, consisting of URI, timestamp, record type, and metadata fields. Following the header lines, each additional line should represent exactly one resource in a web archive, typically in a WARC (ISO 28500) or ARC file, although the exact storage of the resource is not defined by this specification. Each such line shall be referred to as a *record*.

func NewMetadataRecord

func NewMetadataRecord(url string, ts time.Time, data map[string]interface{}) *Record

NewMetadataRecord is a convenience method to create a record with the Metadata record type

func NewRecord

func NewRecord(url string, ts time.Time, rt warc.RecordType, data map[string]interface{}) *Record

NewRecord creates a new cdxj record

func NewRecordFromWARCRecord

func NewRecordFromWARCRecord(rec *warc.Record) (*Record, error)

NewRecordFromWARCRecord generates a cdxj record from a WARC record

func NewRequestRecord

func NewRequestRecord(url string, ts time.Time, data map[string]interface{}) *Record

NewRequestRecord is a convenience method to create a record with the Request record type

func NewResourceRecord

func NewResourceRecord(url string, ts time.Time, data map[string]interface{}) *Record

NewResourceRecord is a convenience method to create a record with the Resource record type

func NewResponseRecord

func NewResponseRecord(url string, ts time.Time, data map[string]interface{}) *Record

NewResponseRecord is a convenience method to create a record with the Response record type

func NewRevisitRecord

func NewRevisitRecord(url string, ts time.Time, data map[string]interface{}) *Record

NewRevisitRecord is a convenience method to create a record with the Revisit record type

func (*Record) MarshalCDXJ

func (r *Record) MarshalCDXJ() ([]byte, error)

MarshalCDXJ outputs a CDXJ representation of r

func (*Record) UnmarshalCDXJ

func (r *Record) UnmarshalCDXJ(data []byte) (err error)

UnmarshalCDXJ reads a cdxj record from a byte slice
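
To make the one-record-per-line idea concrete, here is a standard-library sketch that writes and parses a single line. Everything in it is an assumption made for illustration: the local record type, the helper names, the space-separated field order (URI, timestamp, record type, JSON block), and the RFC 3339 timestamp layout. The bytes produced by MarshalCDXJ may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// record is a local stand-in for the package's Record, with the record
// type simplified to a string.
type record struct {
	URI        string
	Timestamp  time.Time
	RecordType string
	JSON       map[string]interface{}
}

// marshalLine renders a record as one newline-terminated line. json.Marshal
// emits compact output and escapes newlines inside strings, so the JSON
// block never contains a raw 0x0A.
func marshalLine(r record) ([]byte, error) {
	block, err := json.Marshal(r.JSON)
	if err != nil {
		return nil, err
	}
	ts := r.Timestamp.UTC().Format(time.RFC3339)
	return []byte(fmt.Sprintf("%s %s %s %s\n", r.URI, ts, r.RecordType, block)), nil
}

// unmarshalLine splits on the first three spaces only, so spaces inside the
// JSON block are preserved.
func unmarshalLine(line []byte) (record, error) {
	parts := strings.SplitN(strings.TrimRight(string(line), "\r\n"), " ", 4)
	if len(parts) != 4 {
		return record{}, errors.New("cdxj sketch: expected four fields")
	}
	ts, err := time.Parse(time.RFC3339, parts[1])
	if err != nil {
		return record{}, err
	}
	var block map[string]interface{}
	if err := json.Unmarshal([]byte(parts[3]), &block); err != nil {
		return record{}, err
	}
	return record{URI: parts[0], Timestamp: ts, RecordType: parts[2], JSON: block}, nil
}

// exampleRecord builds a fixed record for the demonstration below.
func exampleRecord() record {
	return record{
		URI:        "(com,example,)/",
		Timestamp:  time.Date(2017, time.June, 1, 12, 0, 0, 0, time.UTC),
		RecordType: "response",
		JSON:       map[string]interface{}{"status": 200},
	}
}

func main() {
	line, err := marshalLine(exampleRecord())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(line))
	back, err := unmarshalLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(back.URI, back.RecordType)
}
```

Note that numbers decoded into a map[string]interface{} come back as float64, which is standard encoding/json behavior.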

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

Writer writes CDXJ records to an io.Writer. Create one with NewWriter. You *must* call Close to write the records to the specified writer.

func NewWriter

func NewWriter(w io.Writer) *Writer

NewWriter allocates a new CDXJ Writer

func (*Writer) Close

func (w *Writer) Close() error

Close dumps the writer to the underlying io.Writer

func (*Writer) Write

func (w *Writer) Write(r *Record) error

Write writes a record to the writer.
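
The requirement to call Close suggests that records are held in memory until then. One plausible reason is that CDXJ lines are usually kept in sorted order, which can only be done once every record is known; that motivation is an assumption here, not something the documentation states. The sketch below shows the buffering pattern with hypothetical names (sortingWriter, writeAll) and plain string lines instead of records.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sort"
)

// sortingWriter is a local stand-in that buffers lines until close is called.
type sortingWriter struct {
	w     io.Writer
	lines [][]byte
}

func newSortingWriter(w io.Writer) *sortingWriter {
	return &sortingWriter{w: w}
}

// write only buffers the line; nothing reaches the underlying writer yet.
func (sw *sortingWriter) write(line string) {
	sw.lines = append(sw.lines, []byte(line+"\n"))
}

// close sorts the buffered lines bytewise and writes them out in order.
func (sw *sortingWriter) close() error {
	sort.Slice(sw.lines, func(i, j int) bool {
		return bytes.Compare(sw.lines[i], sw.lines[j]) < 0
	})
	for _, l := range sw.lines {
		if _, err := sw.w.Write(l); err != nil {
			return err
		}
	}
	sw.lines = nil
	return nil
}

// writeAll writes the given lines through a sortingWriter into memory and
// returns the resulting output.
func writeAll(lines ...string) (string, error) {
	var buf bytes.Buffer
	sw := newSortingWriter(&buf)
	for _, l := range lines {
		sw.write(l)
	}
	if err := sw.close(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := writeAll("(org,archive,)/ b", "(com,example,)/ a")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

With a writer built this way, forgetting to call close produces no output at all, which matches the warning in the Writer documentation.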
