Documentation ¶
Overview ¶
Package cdxj implements the CDXJ file format used by OpenWayback 3.0.0 (and later) to index web archive contents (notably in WARC and ARC files) and make them searchable via a resource resolution service. The format builds on the CDX file format originally developed by the Internet Archive for the indexing behind the Wayback Machine. It simplifies the primary fields while adding a flexible JSON 'block' to each record, allowing high flexibility in the inclusion of additional data.
Index ¶
- Variables
- func CanonicalizeURL(rawurl string) (string, error)
- func SurtURL(rawurl string) (string, error)
- func UnSurtPath(surturl string) (string, error)
- func UnSurtURL(surturl string) (string, error)
- func Validate(r io.Reader) error
- type ByteRecords
- type Index
- type Reader
- type Record
- func NewMetadataRecord(url string, ts time.Time, data map[string]interface{}) *Record
- func NewRecord(url string, ts time.Time, rt warc.RecordType, data map[string]interface{}) *Record
- func NewRecordFromWARCRecord(rec *warc.Record) (*Record, error)
- func NewRequestRecord(url string, ts time.Time, data map[string]interface{}) *Record
- func NewResourceRecord(url string, ts time.Time, data map[string]interface{}) *Record
- func NewResponseRecord(url string, ts time.Time, data map[string]interface{}) *Record
- func NewRevisitRecord(url string, ts time.Time, data map[string]interface{}) *Record
- type Writer
Variables ¶
var CanonicalizationScheme = purell.FlagsSafe
CanonicalizationScheme is the default method this package uses to canonicalize URLs.
Functions ¶
func CanonicalizeURL ¶
CanonicalizeURL takes raw URL strings and returns their normalized version. Canonicalization is applied to URIs to remove trivial differences that do not indicate that the URIs reference different resources. Examples include removing session ID parameters and unnecessary port declarations (e.g. :80 when crawling HTTP). OpenWayback implements its own canonicalization process. Typically, it will be applied to the searchable URIs in CDXJ files. You can, however, use any canonicalization scheme you care for (including none). You must simply ensure that the same canonicalization process is applied to the URIs when performing searches; otherwise they may not match correctly.
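To make the idea concrete, here is a minimal sketch of two of the normalizations described above (lowercasing the host and dropping a default port), written with the standard library only. The helper name `canonicalize` is hypothetical; this is not how the package implements CanonicalizeURL, which, judging by the CanonicalizationScheme variable, delegates to purell.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// canonicalize is a hypothetical helper for illustration only. It lowercases
// the host and removes a port that is the default for the scheme.
// It does not remove session IDs or apply any other normalization.
func canonicalize(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	// url.Parse already lowercases the scheme; the host keeps its case.
	host := strings.ToLower(u.Host)
	switch u.Scheme {
	case "http":
		host = strings.TrimSuffix(host, ":80")
	case "https":
		host = strings.TrimSuffix(host, ":443")
	}
	u.Host = host
	return u.String(), nil
}

func main() {
	// Two spellings of the same resource should normalize to the same string.
	a, _ := canonicalize("HTTP://Example.COM:80/a?b=1")
	b, _ := canonicalize("http://example.com/a?b=1")
	fmt.Println(a, b, a == b)
}
```

Whatever scheme is used when writing the index, the same function would have to be applied to lookup keys at search time for them to match.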
func SurtURL ¶
SurtURL is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names. A URI `<scheme://domain.tld/path?query>` has SURT form `<scheme://(tld,domain,)/path?query>`. Conversion to SURT form also involves making all characters lowercase, and changing the 'https' scheme to 'http'. Further, the '/' after a URI authority component -- for example, the third slash in a regular HTTP URI -- will only appear in the SURT form if it appeared in the plain URI form.
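The transformation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's SurtURL: the helper name `surt` is hypothetical, and ports, userinfo and fragments are ignored.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// surt is a hypothetical, simplified SURT transform: it lowercases the input,
// maps https to http, and reverses the host labels into "(tld,domain,)" form.
// Ports, userinfo and fragments are ignored in this sketch.
func surt(rawurl string) (string, error) {
	u, err := url.Parse(strings.ToLower(rawurl))
	if err != nil {
		return "", err
	}
	scheme := u.Scheme
	if scheme == "https" {
		scheme = "http"
	}
	// Reverse the dot-separated host labels: www.example.org -> org,example,www
	labels := strings.Split(u.Hostname(), ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	// The path is appended as-is, so a trailing "/" appears only if the
	// original URI had one.
	out := scheme + "://(" + strings.Join(labels, ",") + ",)" + u.EscapedPath()
	if u.RawQuery != "" {
		out += "?" + u.RawQuery
	}
	return out, nil
}

func main() {
	for _, raw := range []string{"https://Example.COM/Path?Q=1", "http://www.example.org"} {
		s, err := surt(raw)
		fmt.Println(s, err)
	}
}
```

Because the most significant label comes first, sorting SURT strings groups all captures from one domain and its subdomains together.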
func UnSurtPath ¶
UnSurtPath gives the path element of a SURT'ed url
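As a rough illustration, the path of a SURT-form URL is whatever follows the closing parenthesis of the reversed host. The sketch below assumes that reading; the helper name `surtPath` is hypothetical, and whether the package's UnSurtPath also returns the query string is not stated here, so this version drops it.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// surtPath is a hypothetical helper: it returns the text after the ")" that
// closes the reversed host, with any query or fragment removed (an assumption).
func surtPath(surturl string) (string, error) {
	i := strings.Index(surturl, ")")
	if i < 0 {
		return "", errors.New("not a SURT url: missing ')'")
	}
	rest := surturl[i+1:]
	if q := strings.IndexAny(rest, "?#"); q >= 0 {
		rest = rest[:q]
	}
	return rest, nil
}

func main() {
	p, err := surtPath("(com,example,)/a/b?x=1")
	fmt.Println(p, err)
}
```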
Types ¶
type ByteRecords ¶
type ByteRecords [][]byte
ByteRecords implements sort.Interface for a slice of marshaled CDXJ byte slices
func (ByteRecords) Len ¶
func (a ByteRecords) Len() int
func (ByteRecords) Less ¶
func (a ByteRecords) Less(i, j int) bool
func (ByteRecords) Swap ¶
func (a ByteRecords) Swap(i, j int)
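A type with these three methods satisfies sort.Interface and can be passed to sort.Sort. The sketch below uses a local stand-in type, `byteRecords`, and assumes that Less compares lines byte-wise with bytes.Compare; the package's actual ordering is not documented here.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// byteRecords is a local stand-in for ByteRecords, for illustration only.
type byteRecords [][]byte

func (a byteRecords) Len() int      { return len(a) }
func (a byteRecords) Swap(i, j int) { a[i], a[j] = a[j], a[i] }

// Less orders records lexicographically by their raw bytes (an assumption).
func (a byteRecords) Less(i, j int) bool { return bytes.Compare(a[i], a[j]) < 0 }

// sortedLines is a hypothetical helper that sorts marshaled lines and
// returns them as strings.
func sortedLines(lines ...string) []string {
	recs := make(byteRecords, len(lines))
	for i, l := range lines {
		recs[i] = []byte(l)
	}
	sort.Sort(recs)
	out := make([]string, len(recs))
	for i, r := range recs {
		out[i] = string(r)
	}
	return out
}

func main() {
	for _, l := range sortedLines(
		`(org,example,)/ 2016-01-01T00:00:00Z response {}`,
		`(com,example,)/b 2015-09-03T13:27:52Z response {}`,
		`(com,example,)/a 2015-09-03T13:27:52Z response {}`,
	) {
		fmt.Println(l)
	}
}
```

Since each line starts with the SURT-form URI, a byte-wise sort keeps captures of the same host and path adjacent, which is what makes prefix and binary searches over the index practical.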
type Index ¶
type Index []*Record
Index is a list of CDXJ records. Beyond a standard slice of record pointers, it includes methods for common patterns for adding and removing records.
func (Index) AddRecord ¶
AddRecord adds a single record to the list, unless its exact URI is already present
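The add-unless-present behaviour can be sketched with local stand-in types. The names `record`, `index` and `add` are hypothetical, and the real AddRecord signature is not shown on this page; this version returns the (possibly grown) slice, in the style of append.

```go
package main

import "fmt"

// record and index are local stand-ins for Record and Index.
type record struct{ URI string }
type index []*record

// add appends r unless a record with exactly the same URI already exists.
// It returns the resulting slice, as append does.
func (idx index) add(r *record) index {
	for _, existing := range idx {
		if existing.URI == r.URI {
			return idx
		}
	}
	return append(idx, r)
}

func main() {
	var idx index
	idx = idx.add(&record{URI: "(com,example,)/"})
	idx = idx.add(&record{URI: "(com,example,)/about"})
	idx = idx.add(&record{URI: "(com,example,)/"}) // duplicate URI, ignored
	fmt.Println(len(idx))
}
```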
func (Index) AddWARCRecord ¶
AddWARCRecord creates a cdxj record from a WARC record and adds it to the index
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
A Reader reads records from a CDXJ-encoded io.Reader.
As returned by NewReader, a Reader expects input conforming to RFC 4180. The exported fields can be changed to customize the details before the first call to Read or ReadAll.
type Record ¶
type Record struct {
	// Searchable URI
	// By *searchable*, we mean that the following transformations have been applied to it:
	// 1. Canonicalization - See Appendix A
	// 2. Sort-friendly URI Reordering Transform (SURT)
	// 3. The scheme is dropped from the SURT format
	URI string
	// Should correspond to the WARC-Date timestamp as of WARC 1.1.
	// The timestamp shall represent the instant that data capture for record
	// creation began.
	// All timestamps should be in UTC.
	Timestamp time.Time
	// Indicates what type of record the current line refers to.
	// This field is fully compatible with WARC 1.0 definition of
	// WARC-Type (chapter 5.5 and chapter 6).
	RecordType warc.RecordType
	// This should contain fully valid JSON data. The only limitation, beyond those
	// imposed by JSON encoding rules, is that this may not contain any newline
	// characters, either in Unix (0x0A) or Windows form (0x0D0A).
	// The first occurrence of a 0x0A constitutes the end of this field (and the record).
	JSON map[string]interface{}
}
Record is an entry in a CDXJ index, consisting of URI, timestamp, record type, and metadata fields. Following the header lines, each additional line should represent exactly one resource in a web archive, typically stored in a WARC (ISO 28500) or ARC file, although the exact storage of the resource is not defined by this specification. Each such line shall be referred to as a *record*.
func NewMetadataRecord ¶
NewMetadataRecord is a convenience method to create a record with the Metadata record type
func NewRecordFromWARCRecord ¶
NewRecordFromWARCRecord generates a cdxj record from a WARC record
func NewRequestRecord ¶
NewRequestRecord is a convenience method to create a record with the Request record type
func NewResourceRecord ¶
NewResourceRecord is a convenience method to create a record with the Resource record type
func NewResponseRecord ¶
NewResponseRecord is a convenience method to create a record with the Response record type
func NewRevisitRecord ¶
NewRevisitRecord is a convenience method to create a record with the Revisit record type
func (*Record) MarshalCDXJ ¶
MarshalCDXJ outputs a CDXJ representation of r
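For orientation, a record line could be assembled roughly as below. This is a sketch under assumptions, not the package's MarshalCDXJ: the local `record` type stands in for Record, RecordType is a plain string here, and the space separator and RFC 3339 timestamp layout are assumed rather than taken from this page.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// record is a local stand-in for Record; RecordType is a plain string here
// instead of warc.RecordType.
type record struct {
	URI        string
	Timestamp  time.Time
	RecordType string
	JSON       map[string]interface{}
}

// marshalLine joins the four fields with single spaces (assumed layout).
// json.Marshal escapes newlines inside strings and emits compact output,
// so the JSON block never contains a raw 0x0A.
func marshalLine(r record) (string, error) {
	block, err := json.Marshal(r.JSON)
	if err != nil {
		return "", err
	}
	ts := r.Timestamp.UTC().Format(time.RFC3339) // assumed timestamp layout
	return fmt.Sprintf("%s %s %s %s", r.URI, ts, r.RecordType, block), nil
}

// exampleLine builds one illustrative record line.
func exampleLine() (string, error) {
	return marshalLine(record{
		URI:        "(com,example,)/",
		Timestamp:  time.Date(2015, 9, 3, 13, 27, 52, 0, time.UTC),
		RecordType: "response",
		JSON:       map[string]interface{}{"a": 0, "b": "b", "c": false},
	})
}

func main() {
	line, err := exampleLine()
	fmt.Println(line, err)
}
```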
func (*Record) UnmarshalCDXJ ¶
UnmarshalCDXJ reads a cdxj record from a byte slice
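The reverse direction can be sketched the same way. The helper `parseLine` and the local `record` type are hypothetical; the sketch assumes four space-separated fields with an RFC 3339 timestamp, which this page does not confirm as the package's exact wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// record is a local stand-in for Record, with RecordType as a plain string.
type record struct {
	URI        string
	Timestamp  time.Time
	RecordType string
	JSON       map[string]interface{}
}

// parseLine splits a line into URI, timestamp, record type and JSON block.
// Only the first three spaces are treated as separators, so spaces inside
// the JSON block are preserved.
func parseLine(line string) (record, error) {
	parts := strings.SplitN(strings.TrimRight(line, "\r\n"), " ", 4)
	if len(parts) != 4 {
		return record{}, fmt.Errorf("expected 4 fields, got %d", len(parts))
	}
	ts, err := time.Parse(time.RFC3339, parts[1]) // assumed timestamp layout
	if err != nil {
		return record{}, fmt.Errorf("bad timestamp: %w", err)
	}
	var block map[string]interface{}
	if err := json.Unmarshal([]byte(parts[3]), &block); err != nil {
		return record{}, fmt.Errorf("bad JSON block: %w", err)
	}
	return record{URI: parts[0], Timestamp: ts, RecordType: parts[2], JSON: block}, nil
}

func main() {
	r, err := parseLine(`(com,example,)/ 2015-09-03T13:27:52Z response {"a": 0, "b": "b"}`)
	fmt.Println(r.URI, r.Timestamp, r.RecordType, r.JSON, err)
}
```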