metha

v0.1.24
Published: Dec 20, 2017 License: GPL-3.0 Imports: 27 Imported by: 0

README

metha

Command line OAI-PMH incremental harvester. Data is harvested in monthly chunks.

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base directory is ~/.metha by default and can be adjusted with the METHA_DIR environment variable.

$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

To show the harvesting directory, you can use the -dir flag:

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

Harvesting can be interrupted at any time. Data is currently harvested only up to the last full day, so there is a small latency.

Example: If the current date is Thu Apr 21 14:28:10 CEST 2016, the harvester requests all data between the repository's earliest date and 2016-04-20 23:59:59.
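
The upper bound in the example can be computed as sketched below. This is a minimal illustration and not metha's own code; the names lastFullDayEnd and formatUntil are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lastFullDayEnd returns the final second of the day before now,
// in now's location. It mirrors the README example, not metha's code.
func lastFullDayEnd(now time.Time) time.Time {
	y, m, d := now.Date()
	midnight := time.Date(y, m, d, 0, 0, 0, 0, now.Location())
	return midnight.Add(-time.Second)
}

// formatUntil parses an RFC 3339 timestamp and returns the computed
// upper bound in the layout "2006-01-02 15:04:05".
func formatUntil(rfc3339 string) (string, error) {
	now, err := time.Parse(time.RFC3339, rfc3339)
	if err != nil {
		return "", err
	}
	return lastFullDayEnd(now).Format("2006-01-02 15:04:05"), nil
}

func main() {
	// Thu Apr 21 14:28:10 CEST 2016 (CEST is UTC+2).
	until, err := formatUntil("2016-04-21T14:28:10+02:00")
	if err != nil {
		panic(err)
	}
	fmt.Println(until) // 2016-04-20 23:59:59
}
```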

The HTTP client is resilient: by default it retries a failed request up to 8 times. You can stream records to stdout:

$ metha-cat http://export.arxiv.org/oai2

This will stream all harvested records to stdout. You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or after 2016-01-01.

To just stream all data really fast, use find and unpigz (or zcat) over the harvesting directory.

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs unpigz -c

To display basic repository information:

$ metha-id http://export.arxiv.org/oai2

To list all harvested endpoints:

$ metha-ls

Installation

Use a release or

$ go get github.com/miku/metha/cmd/...

Limitations

Currently the set, the format and the endpoint URL are concatenated (separated by #) and base64 encoded to form the target directory, e.g.:

$ echo "U291bmRzI29haV9kYyNodHRwOi8vY29wYWMuamlzYy5hYy51ay9vYWktcG1o" | base64 -d
Sounds#oai_dc#http://copac.jisc.ac.uk/oai-pmh

If you have very long set names or a very long URL and the name of the target directory exceeds the file system limit (e.g. 255 chars on ext4), the harvest won't work.
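
The naming scheme and the length limit can be illustrated with a short sketch. It assumes the standard base64 alphabet, which reproduces the directory names shown in this README; metha's actual encoder may differ. The names dirName, fitsFileSystem and maxNameLen are hypothetical.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// maxNameLen is the file name length limit mentioned above for ext4.
const maxNameLen = 255

// dirName joins set, format and URL with "#" and base64-encodes the result.
// Assumption: standard base64 alphabet; metha's actual encoder may differ.
func dirName(set, format, baseURL string) string {
	key := set + "#" + format + "#" + baseURL
	return base64.StdEncoding.EncodeToString([]byte(key))
}

// fitsFileSystem reports whether the encoded name stays within maxNameLen.
func fitsFileSystem(name string) bool {
	return len(name) <= maxNameLen
}

func main() {
	name := dirName("", "oai_dc", "http://export.arxiv.org/oai2")
	fmt.Println(name, fitsFileSystem(name))
}
```

Because base64 turns every 3 input bytes into 4 output characters, a combined set, format and URL of more than roughly 190 bytes already exceeds a 255 character limit.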

Harvesting Roulette

$ URL=$(sort -R <(curl -Lsf https://git.io/vKXFv) | head -1); metha-sync $URL && metha-cat $URL

Errors this harvester can somewhat handle

  • responses with resumption tokens that lead to empty responses
  • gzipped responses that are not advertised as such
  • funny (illegal) control characters in XML responses
  • repositories that won't respond unless the dates are given with the exact granularity
  • repositories with endless token loops
  • repositories that do not support selective harvesting (use the -no-intervals flag)
  • limited repositories (metha will try up to 8 times with an exponential backoff)
  • repositories that throw occasional HTTP errors although most of the responses look good (use the -ignore-http-errors flag)
  • funny XML entities (non-strict XML)
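
The retry behaviour mentioned in the list (up to 8 attempts with exponential backoff) follows a common pattern, sketched here in isolation. This is not metha's implementation; retry and flaky are hypothetical helpers, and the base delay is an arbitrary choice.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to attempts times (attempts must be at least 1),
// doubling the pause after each failure.
func retry(attempts int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// flaky returns a function that fails the first n calls, then succeeds.
func flaky(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("temporary failure")
		}
		return nil
	}
}

func main() {
	err := retry(8, time.Millisecond, flaky(3))
	fmt.Println("error:", err)
}
```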

Misc

Show the formats of a random repository:

$ sort -R contrib/sites.tsv | head -1 | xargs -I {} metha-id {} | jq .formats

Documentation

Constants

const (
	// DefaultTimeout on requests.
	DefaultTimeout = 5 * time.Minute
	// DefaultMaxRetries is the default number of retries on a single request.
	DefaultMaxRetries = 8
)
const Day = 24 * time.Hour

Day has 24 hours.

const Version = "0.1.24"

Version of tools.

Variables

var (
	// StdClient is the standard lib http client.
	StdClient = Client{Doer: http.DefaultClient}
	// DefaultClient is the more resilient client, that will retry and timeout.
	DefaultClient = Client{Doer: CreateDoer(DefaultTimeout, DefaultMaxRetries)}
	// DefaultUserAgent to identify crawler, some endpoints do not like the Go
	// default (https://golang.org/src/net/http/request.go#L462), e.g.
	// https://calhoun.nps.edu/oai/request.
	DefaultUserAgent = fmt.Sprintf("metha/%s", Version)
	// ControlCharReplacer helps to deal with broken XML: http://eprints.vu.edu.au/perl/oai2. Add more
	// weird things to be cleaned before XML parsing here. Another faulty:
	// http://digitalcommons.gardner-webb.edu/do/oai/?from=2016-02-29&metadataPr
	// efix=oai_dc&until=2016-03-31&verb=ListRecords. Replace control chars
	// outside XML char range.
	ControlCharReplacer = strings.NewReplacer(
		"\u0001", "", "\u0002", "", "\u0003", "",
		"\u0004", "", "\u0005", "", "\u0006", "",
		"\u0007", "", "\u0008", "", "\u0009", "",
		"\u000B", "", "\u000C", "", "\u000E", "",
		"\u000F", "", "\u0010", "", "\u0011", "",
		"\u0012", "", "\u0013", "", "\u0014", "",
		"\u0015", "", "\u0016", "", "\u0017", "",
		"\u0018", "", "\u0019", "", "\u001A", "",
		"\u001B", "", "\u001C", "", "\u001D", "",
		"\u001E", "", "\u001F", "")
)
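
As an illustration of why such cleaning is needed, the following sketch strips control characters before decoding a small XML document. It uses strings.Map instead of a strings.Replacer and, unlike ControlCharReplacer above, keeps the tab character; stripControl and title are hypothetical names and not part of the package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// stripControl removes C0 control characters except tab, newline and
// carriage return, which XML 1.0 allows.
func stripControl(s string) string {
	return strings.Map(func(r rune) rune {
		if r < 0x20 && r != '\t' && r != '\n' && r != '\r' {
			return -1 // a negative value drops the rune
		}
		return r
	}, s)
}

// title decodes a tiny XML document after cleaning it.
func title(doc string) (string, error) {
	var v struct {
		Title string `xml:"title"`
	}
	err := xml.Unmarshal([]byte(stripControl(doc)), &v)
	return v.Title, err
}

func main() {
	t, err := title("<record><title>Broken\u0001 title</title></record>")
	fmt.Printf("%q %v\n", t, err)
}
```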
var (
	// BaseDir is where all data is stored.
	BaseDir = filepath.Join(UserHomeDir(), ".metha")

	// ErrAlreadySynced only signals completion.
	ErrAlreadySynced = errors.New("already synced")
	// ErrInvalidEarliestDate for unparsable earliest date.
	ErrInvalidEarliestDate = errors.New("invalid earliest date")
)
var (
	ErrInvalidVerb      = errors.New("invalid OAI verb")
	ErrMissingVerb      = errors.New("missing verb")
	ErrCannotGenerateID = errors.New("cannot generate ID")
	ErrMissingURL       = errors.New("missing URL")
	ErrParameterMissing = errors.New("missing required parameter")
)

Functions

func MoveAndCompress added in v0.1.6

func MoveAndCompress(src, dst string) error

MoveAndCompress will move src to dst, gzipping in the process.

func MustGlob

func MustGlob(pattern string) []string

MustGlob is like filepath.Glob, but panics on bad pattern.

func PrependSchema

func PrependSchema(s string) string

PrependSchema prepends http, if it is missing.

func UserHomeDir

func UserHomeDir() string

UserHomeDir returns the home directory of the user.

Types

type About

type About struct {
	Body []byte `xml:",innerxml" json:"body,omitempty"`
}

About has additional record information.

func (About) GoString

func (ab About) GoString() string

GoString is a formatter for About content.

type Client

type Client struct {
	Doer Doer
}

Client can execute requests.

func CreateClient

func CreateClient(timeout time.Duration, retries int) Client

CreateClient creates a client with timeout and retry properties.

func (*Client) Do

func (c *Client) Do(r *Request) (*Response, error)

Do executes a single OAIRequest. ResumptionToken handling must happen in the caller. Only Identify and GetRecord requests will return a complete response.
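
Since resumption token handling is left to the caller, a calling loop might look like the sketch below. It does not use the package's Request or Response types: page, fetchAll and fakeRepo are hypothetical stand-ins, so the example runs without a network connection. The request limit and the repeated-token check are assumptions about how a caller could guard against endless token loops.

```go
package main

import (
	"errors"
	"fmt"
)

// page is a hypothetical, simplified stand-in for one OAI list response.
type page struct {
	Records []string
	Token   string // empty when the list is complete
}

var errTokenLoop = errors.New("resumption token seen twice")

// fetchAll follows resumption tokens until the list is complete,
// a token repeats, or maxRequests is reached.
func fetchAll(fetch func(token string) (page, error), maxRequests int) ([]string, error) {
	var records []string
	seen := make(map[string]bool)
	token := ""
	for i := 0; i < maxRequests; i++ {
		p, err := fetch(token)
		if err != nil {
			return records, err
		}
		records = append(records, p.Records...)
		if p.Token == "" {
			return records, nil
		}
		if seen[p.Token] {
			return records, errTokenLoop
		}
		seen[p.Token] = true
		token = p.Token
	}
	return records, fmt.Errorf("stopped after %d requests", maxRequests)
}

// fakeRepo serves canned pages keyed by the incoming token.
func fakeRepo(pages map[string]page) func(string) (page, error) {
	return func(token string) (page, error) {
		p, ok := pages[token]
		if !ok {
			return page{}, errors.New("unknown token " + token)
		}
		return p, nil
	}
}

func main() {
	repo := fakeRepo(map[string]page{
		"":   {Records: []string{"r1", "r2"}, Token: "t1"},
		"t1": {Records: []string{"r3"}},
	})
	records, err := fetchAll(repo, 100)
	fmt.Println(records, err) // [r1 r2 r3] <nil>
}
```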

type Description

type Description struct {
	Body []byte `xml:",innerxml"`
}

Description holds information about a set.

func (Description) GoString

func (desc Description) GoString() string

GoString is a formatter for Description content.

type DirLaster

type DirLaster struct {
	Dir           string
	DefaultValue  string
	ExtractorFunc func(os.FileInfo) string
}

DirLaster extracts the maximum value from the files of a directory. The values are extracted per file via ExtractorFunc, which gets an os.FileInfo and returns a token. The tokens are sorted and the lexicographically largest element is returned.

func (DirLaster) Last

func (l DirLaster) Last() (string, error)

Last extracts the maximum value from a directory, given an extractor function.
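
The idea of taking the lexicographically largest extracted token can be sketched on a plain list of file names. Everything here is an assumption made for illustration: the file name shape, the datePrefix extractor, and the choice to skip files that yield an empty token. The package's DirLaster works on os.FileInfo values instead.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// last returns the lexicographically largest non-empty token extracted
// from names, or def if there is none.
func last(names []string, def string, extract func(string) string) string {
	var tokens []string
	for _, name := range names {
		if tok := extract(name); tok != "" {
			tokens = append(tokens, tok)
		}
	}
	if len(tokens) == 0 {
		return def
	}
	sort.Strings(tokens)
	return tokens[len(tokens)-1]
}

// datePrefix is a hypothetical extractor: it takes the leading date of
// names shaped like "2016-02-29-00000001.xml.gz" and skips other files.
func datePrefix(name string) string {
	if !strings.HasSuffix(name, ".xml.gz") || len(name) < 10 {
		return ""
	}
	return name[:10]
}

func main() {
	names := []string{
		"2016-01-31-00000001.xml.gz",
		"2016-02-29-00000001.xml.gz",
		"notes.txt",
	}
	fmt.Println(last(names, "1970-01-01", datePrefix)) // 2016-02-29
}
```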

type Doer

type Doer interface {
	Do(*http.Request) (*http.Response, error)
}

Doer is a minimal HTTP interface.

func CreateDoer

func CreateDoer(timeout time.Duration, retries int) Doer

CreateDoer will return http request clients with specific timeout and retry properties.

type GetRecord

type GetRecord struct {
	Record Record `xml:"record,omitempty" json:"record,omitempty"`
}

GetRecord returns a single record.

type HTTPError added in v0.1.8

type HTTPError struct {
	URL          *url.URL
	StatusCode   int
	RequestError error
}

HTTPError saves details of an HTTP error.

func (HTTPError) Error added in v0.1.8

func (e HTTPError) Error() string

Error prints the error message.

type Harvest

type Harvest struct {
	BaseURL string
	Format  string
	Set     string
	From    string
	Until   string

	MaxRequests                int
	DisableSelectiveHarvesting bool
	CleanBeforeDecode          bool
	IgnoreHTTPErrors           bool
	MaxEmptyResponses          int
	SuppressFormatParameter    bool
	DailyInterval              bool

	Identify *Identify
	Started  time.Time

	// Protects the (rare) case, where we are in the process of renaming
	// harvested files and get a termination signal at the same time.
	sync.Mutex
}

Harvest contains parameters for a mass-download. MaxRequests and CleanBeforeDecode are switches to handle broken token implementations and funny chars in responses. Some repos do not support selective harvesting (e.g. zvdd.org/oai2). Set "DisableSelectiveHarvesting" to try to grab metadata from these repositories. From and Until must always be given with 2006-01-02 layout. TODO(miku): make zero type work (lazily run identify).

func NewHarvest

func NewHarvest(baseURL string) (*Harvest, error)

NewHarvest creates a new harvest. A network connection will be used for an initial Identify request.

func (*Harvest) DateLayout

func (h *Harvest) DateLayout() string

DateLayout converts the repository endpoint's advertised granularity to Go date format strings.

func (*Harvest) Dir

func (h *Harvest) Dir() string

Dir returns the absolute path to the harvesting directory.

func (*Harvest) Files

func (h *Harvest) Files() []string

Files returns all files for a given harvest, without the temporary files.

func (*Harvest) MkdirAll

func (h *Harvest) MkdirAll() error

MkdirAll creates necessary directories.

func (*Harvest) Run

func (h *Harvest) Run() error

Run starts the harvest.

type Header

type Header struct {
	Status     string   `xml:"status,attr" json:"status,omitempty"`
	Identifier string   `xml:"identifier,omitempty" json:"identifier,omitempty"`
	DateStamp  string   `xml:"datestamp,omitempty" json:"datestamp,omitempty"`
	SetSpec    []string `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
}

A Header is part of other requests.

type Identify

type Identify struct {
	RepositoryName    string        `xml:"repositoryName,omitempty" json:"repositoryName,omitempty"`
	BaseURL           string        `xml:"baseURL,omitempty" json:"baseURL,omitempty"`
	ProtocolVersion   string        `xml:"protocolVersion,omitempty" json:"protocolVersion,omitempty"`
	AdminEmail        []string      `xml:"adminEmail,omitempty" json:"adminEmail,omitempty"`
	EarliestDatestamp string        `xml:"earliestDatestamp,omitempty" json:"earliestDatestamp,omitempty"`
	DeletedRecord     string        `xml:"deletedRecord,omitempty" json:"deletedRecord,omitempty"`
	Granularity       string        `xml:"granularity,omitempty" json:"granularity,omitempty"`
	Description       []Description `xml:"description,omitempty" json:"description,omitempty"`
}

Identify reports information about a repository.

type Interval

type Interval struct {
	Begin time.Time
	End   time.Time
}

Interval represents a span of time.

func (Interval) DailyIntervals added in v0.1.14

func (iv Interval) DailyIntervals() []Interval

DailyIntervals segments a given interval into daily chunks.

func (Interval) MonthlyIntervals

func (iv Interval) MonthlyIntervals() []Interval

MonthlyIntervals segments a given interval into monthly chunks.
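
One way to segment a time span into monthly chunks is sketched below. The exact chunk boundaries used by the package are not documented here, so the choice of ending each chunk one second before the next month is an assumption; span, monthly and day are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// span is a hypothetical stand-in for the package's Interval type.
type span struct {
	Begin, End time.Time
}

// monthly splits [begin, end] into chunks that end at month boundaries.
// Each chunk ends one second before the next one begins.
func monthly(begin, end time.Time) []span {
	var out []span
	for cur := begin; !cur.After(end); {
		// First instant of the following month; time.Date normalizes month 13.
		next := time.Date(cur.Year(), cur.Month()+1, 1, 0, 0, 0, 0, cur.Location())
		last := next.Add(-time.Second)
		if last.After(end) {
			last = end
		}
		out = append(out, span{Begin: cur, End: last})
		cur = next
	}
	return out
}

// day parses a date in the 2006-01-02 layout and panics on bad input.
func day(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	for _, iv := range monthly(day("2016-01-15"), day("2016-03-10")) {
		fmt.Println(iv.Begin.Format("2006-01-02"), iv.End.Format("2006-01-02"))
	}
}
```

For the span from 2016-01-15 to 2016-03-10 this yields three chunks: the rest of January, all of February, and March up to the 10th.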

func (Interval) String added in v0.1.14

func (iv Interval) String() string

String formats the interval.

type Laster

type Laster interface {
	Last() (string, error)
}

Laster extracts some maximum value as string.

type ListIdentifiers

type ListIdentifiers struct {
	Headers         []Header `xml:"header,omitempty" json:"header,omitempty"`
	ResumptionToken string   `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}

ListIdentifiers lists headers only.

type ListMetadataFormats

type ListMetadataFormats struct {
	MetadataFormat []MetadataFormat `xml:"metadataFormat,omitempty" json:"metadataFormat,omitempty"`
}

ListMetadataFormats lists supported metadata formats.

type ListRecords

type ListRecords struct {
	Records         []Record `xml:"record" json:"record"`
	ResumptionToken string   `xml:"resumptionToken" json:"resumptionToken"`
}

ListRecords lists records.

type ListSets

type ListSets struct {
	Set             []Set  `xml:"set,omitempty"  json:"set,omitempty"`
	ResumptionToken string `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}

ListSets lists available sets. TODO(miku): resumptiontoken can have expiration date, etc.

type Metadata

type Metadata struct {
	Body []byte `xml:",innerxml"`
}

Metadata contains the actual metadata, conforming to varying schemas.

func (Metadata) GoString

func (md Metadata) GoString() string

GoString is a formatter for Metadata content.

func (Metadata) MarshalJSON

func (md Metadata) MarshalJSON() ([]byte, error)

MarshalJSON marshals the metadata body.

type MetadataFormat

type MetadataFormat struct {
	MetadataPrefix    string `xml:"metadataPrefix,omitempty" json:"metadataPrefix,omitempty"`
	Schema            string `xml:"schema,omitempty" json:"schema,omitempty"`
	MetadataNamespace string `xml:"metadataNamespace,omitempty" json:"metadataNamespace,omitempty"`
}

MetadataFormat holds information about a format.

type MultiError

type MultiError struct {
	Errors []error
}

MultiError collects a number of errors.

func (*MultiError) Error

func (e *MultiError) Error() string

Error formats all error strings into a single string.

type OAIError

type OAIError struct {
	Code    string `xml:"code,attr" json:"code,omitempty"`
	Message string `xml:",chardata" json:"message,omitempty"`
}

OAIError is an OAI protocol error.

func (OAIError) Error

func (e OAIError) Error() string

Error formats code and message.

type Record

type Record struct {
	Header   Header   `xml:"header,omitempty" json:"header,omitempty"`
	Metadata Metadata `xml:"metadata,omitempty" json:"metadata,omitempty"`
	About    About    `xml:"about,omitempty" json:"about,omitempty"`
}

Record represents a single record.

type Repository

type Repository struct {
	BaseURL string
}

Repository represents an OAI endpoint.

func (Repository) Formats

func (r Repository) Formats() ([]MetadataFormat, error)

Formats returns a list of metadata formats.

func (Repository) Sets

func (r Repository) Sets() ([]Set, error)

Sets returns a list of sets.

type Request

type Request struct {
	BaseURL                 string
	Verb                    string
	Identifier              string
	MetadataPrefix          string
	From                    string
	Until                   string
	Set                     string
	ResumptionToken         string
	CleanBeforeDecode       bool
	SuppressFormatParameter bool
}

A Request can express any request that can be sent to an OAI server. Not all combinations of values will yield valid requests.

func (*Request) URL

func (r *Request) URL() (*url.URL, error)

URL returns the URL for a given request. Invalid verbs and missing parameters are reported here.

type RequestNode

type RequestNode struct {
	Verb           string `xml:"verb,attr" json:"verb,omitempty"`
	Set            string `xml:"set,attr" json:"set,omitempty"`
	MetadataPrefix string `xml:"metadataPrefix,attr" json:"metadataPrefix,omitempty"`
}

RequestNode carries the request information into the response.

type Response

type Response struct {
	ResponseDate string      `xml:"responseDate,omitempty" json:"responseDate,omitempty"`
	Request      RequestNode `xml:"request,omitempty" json:"request,omitempty"`
	Error        OAIError    `xml:"error,omitempty" json:"error,omitempty"`

	GetRecord           GetRecord           `xml:"GetRecord,omitempty" json:"GetRecord,omitempty"`
	Identify            Identify            `xml:"Identify,omitempty" json:"Identify,omitempty"`
	ListIdentifiers     ListIdentifiers     `xml:"ListIdentifiers,omitempty" json:"ListIdentifiers,omitempty"`
	ListMetadataFormats ListMetadataFormats `xml:"ListMetadataFormats,omitempty" json:"ListMetadataFormats,omitempty"`
	ListRecords         ListRecords         `xml:"ListRecords,omitempty" json:"ListRecords,omitempty"`
	ListSets            ListSets            `xml:"ListSets,omitempty" json:"ListSets,omitempty"`
}

Response is the envelope. It can hold any OAI response kind.

func Do

func Do(r *Request) (*Response, error)

Do is a shortcut for DefaultClient.Do.

func (*Response) GetResumptionToken

func (response *Response) GetResumptionToken() string

GetResumptionToken returns the resumption token or an empty string if it does not have a token.

func (*Response) HasResumptionToken

func (response *Response) HasResumptionToken() bool

HasResumptionToken determines if the response has a ResumptionToken.

type Set

type Set struct {
	SetSpec        string      `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
	SetName        string      `xml:"setName,omitempty" json:"setName,omitempty"`
	SetDescription Description `xml:"setDescription,omitempty" json:"setDescription,omitempty"`
}

A Set has a spec, name and description.

type Values

type Values struct {
	url.Values
}

Values enhances the builtin url.Values.

func NewValues

func NewValues() Values

NewValues creates a new Values container.

func (Values) EncodeVerbatim

func (v Values) EncodeVerbatim() string

EncodeVerbatim is like Encode(), but does not escape the keys and values.
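
The difference between escaped and verbatim encoding can be shown with a standalone sketch over url.Values. It is not the package's implementation: encodeVerbatim and tokenQuery are hypothetical, and sorting the keys is a choice made here to get deterministic output.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// encodeVerbatim joins key=value pairs with "&" without percent-encoding.
// Keys are sorted so the output is deterministic.
func encodeVerbatim(v url.Values) string {
	keys := make([]string, 0, len(v))
	for k := range v {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var parts []string
	for _, k := range keys {
		for _, val := range v[k] {
			parts = append(parts, k+"="+val)
		}
	}
	return strings.Join(parts, "&")
}

// tokenQuery builds a ListRecords query for a resumption token.
func tokenQuery(token string) url.Values {
	v := url.Values{}
	v.Set("verb", "ListRecords")
	v.Set("resumptionToken", token)
	return v
}

func main() {
	v := tokenQuery("a/b|c")
	fmt.Println(v.Encode())        // percent-encoded
	fmt.Println(encodeVerbatim(v)) // left as is
}
```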

Directories

Path Synopsis
cmd