dcdump

v0.1.3
Published: Jul 4, 2022 License: MIT Imports: 8 Imported by: 0

README

Datacite Dump Tool


As of Fall 2019 the DataCite API is a bit flaky: #237, #851, #188, #709, #897, #898.

This tool tries to pull a complete data dump from the API, until an official full dump becomes available.

This data was ingested into fatcat via fatcat_import.py in 01/2020.

Install and Build

You'll need the Go toolchain installed.

$ git clone https://git.archive.org/webgroup/dcdump.git
$ cd dcdump
$ make

Or install with the Go tool:

$ go install github.com/miku/dcdump/cmd/dcdump@latest

Usage

$ dcdump -h
Usage of dcdump:
  -A    do not include affiliation information
  -d string
        directory, where to put harvested files (default ".")
  -debug
        only print intervals then exit
  -e value
        end date for harvest (default 2022-07-04)
  -i string
        [w]eekly, [d]aily, [h]ourly, [e]very minute (default "d")
  -l int
        upper limit for number of requests (default 16777216)
  -p string
        file prefix for harvested files (default "dcdump-")
  -s value
        start date for harvest (default 2018-01-01)
  -sleep duration
        backoff after HTTP error (default 5m0s)
  -version
        show version
  -w int
        parallel workers (approximate) (default 4)

Affiliations

Affiliations are requested by default (turn it off with -A).

Example:

{
  "data": [
    {
      "id": "10.3886/e100985v1",
      "type": "dois",
      "attributes": {
        "doi": "10.3886/e100985v1",
        "identifiers": [
          {
            "identifier": "https://doi.org/10.3886/e100985v1",
            "identifierType": "DOI"
          }
        ],
        "creators": [
          {
            "name": "Porter, Joshua J.",
            "nameType": "Personal",
            "givenName": "Joshua J.",
            "familyName": "Porter",
            "affiliation": [
              {
                "name": "George Washington University"
              }
            ],
            "nameIdentifiers": []
          }
        ],
      ...

Examples

The dcdump tool uses DataCite API version 2. We query by time interval and page via cursor to circumvent the Index Deep Paging Problem (as of 12/2019 the limit is 10,000 records per query: 400 pages x 25 records per page).
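Cursor paging means requesting a first page with a cursor parameter and then following the `links.next` URL returned by the API until it is empty. A minimal sketch of how such a first-page request could be built; the helper name and the exact query parameters are assumptions for illustration, not necessarily what dcdump sends:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildPageURL constructs a hypothetical first-page request against the
// DataCite /dois endpoint, filtering by an update window and enabling
// cursor-based paging (page[cursor]=1 asks for the first cursor page).
func buildPageURL(from, until string) string {
	v := url.Values{}
	v.Set("query", fmt.Sprintf("updated:[%s TO %s]", from, until))
	v.Set("page[cursor]", "1")
	v.Set("page[size]", "25")
	return "https://api.datacite.org/dois?" + v.Encode()
}

func main() {
	fmt.Println(buildPageURL("2019-10-01T00:00:00Z", "2019-10-01T00:59:59Z"))
}
```

A harvester then decodes each response and keeps requesting `links.next` until no next link is present.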

To just list the intervals (depending on the -i flag), use the -debug flag:

$ dcdump -i h -s 2019-10-01 -e 2019-10-02 -debug
2019-10-01 00:00:00 +0000 UTC -- 2019-10-01 00:59:59.999999999 +0000 UTC
2019-10-01 01:00:00 +0000 UTC -- 2019-10-01 01:59:59.999999999 +0000 UTC
2019-10-01 02:00:00 +0000 UTC -- 2019-10-01 02:59:59.999999999 +0000 UTC
2019-10-01 03:00:00 +0000 UTC -- 2019-10-01 03:59:59.999999999 +0000 UTC
2019-10-01 04:00:00 +0000 UTC -- 2019-10-01 04:59:59.999999999 +0000 UTC
2019-10-01 05:00:00 +0000 UTC -- 2019-10-01 05:59:59.999999999 +0000 UTC
2019-10-01 06:00:00 +0000 UTC -- 2019-10-01 06:59:59.999999999 +0000 UTC
2019-10-01 07:00:00 +0000 UTC -- 2019-10-01 07:59:59.999999999 +0000 UTC
2019-10-01 08:00:00 +0000 UTC -- 2019-10-01 08:59:59.999999999 +0000 UTC
2019-10-01 09:00:00 +0000 UTC -- 2019-10-01 09:59:59.999999999 +0000 UTC
2019-10-01 10:00:00 +0000 UTC -- 2019-10-01 10:59:59.999999999 +0000 UTC
2019-10-01 11:00:00 +0000 UTC -- 2019-10-01 11:59:59.999999999 +0000 UTC
2019-10-01 12:00:00 +0000 UTC -- 2019-10-01 12:59:59.999999999 +0000 UTC
2019-10-01 13:00:00 +0000 UTC -- 2019-10-01 13:59:59.999999999 +0000 UTC
2019-10-01 14:00:00 +0000 UTC -- 2019-10-01 14:59:59.999999999 +0000 UTC
2019-10-01 15:00:00 +0000 UTC -- 2019-10-01 15:59:59.999999999 +0000 UTC
2019-10-01 16:00:00 +0000 UTC -- 2019-10-01 16:59:59.999999999 +0000 UTC
2019-10-01 17:00:00 +0000 UTC -- 2019-10-01 17:59:59.999999999 +0000 UTC
2019-10-01 18:00:00 +0000 UTC -- 2019-10-01 18:59:59.999999999 +0000 UTC
2019-10-01 19:00:00 +0000 UTC -- 2019-10-01 19:59:59.999999999 +0000 UTC
2019-10-01 20:00:00 +0000 UTC -- 2019-10-01 20:59:59.999999999 +0000 UTC
2019-10-01 21:00:00 +0000 UTC -- 2019-10-01 21:59:59.999999999 +0000 UTC
2019-10-01 22:00:00 +0000 UTC -- 2019-10-01 22:59:59.999999999 +0000 UTC
2019-10-01 23:00:00 +0000 UTC -- 2019-10-01 23:59:59.999999999 +0000 UTC
2019-10-02 00:00:00 +0000 UTC -- 2019-10-02 00:59:59.999999999 +0000 UTC
INFO[0000] 25 intervals

Start and end dates are relatively flexible; for example, minute slices for a single day:

$ dcdump -s 2019-05-01 -e '2019-05-01 23:59:59' -i e -debug
2019-05-01 00:00:00 +0000 UTC -- 2019-05-01 00:00:59.999999999 +0000 UTC
...
2019-05-01 23:59:00 +0000 UTC -- 2019-05-01 23:59:59.999999999 +0000 UTC
INFO[0000] 1440 intervals
...

Create a temporary directory first, so the harvested files do not pollute the current directory:

$ mkdir tmp

Start harvesting (minute intervals, into tmp, with two workers):

$ dcdump -i e -d tmp -w 2

The time windows are not adjusted dynamically. Worse, even a low-profile harvest (two workers, backoffs, retries) with minute intervals can still stall (perhaps on a 403 or 500).

If a specific time window fails repeatedly, you can mark it as done manually by touching the file, e.g.

$ touch tmp/dcdump-20190801114700-20190801114759.ndjson

The dcdump tool checks for the existence of the file before harvesting; this way it is possible to skip unfetchable slices.

After successful runs, concatenate the data to get a single newline-delimited dump of DataCite:

$ cat tmp/*ndjson | sort -u > datacite.ndjson

Again, this is ugly, but it should all become obsolete as soon as a public data dump is available.

Duration

One duration data point: a full harvest with minute intervals took about 80h.

$ dcdump -version
dcdump 5ae0556 2020-01-21T16:25:10Z

$ dcdump -i e
...
INFO[294683] 1075178 date slices succeeded

real    4911m23.343s
user    930m54.034s
sys     173m7.383s

After 80h, the total size amounts to about 78G.

Archive Items

Initial snapshot

A datacite snapshot from 11/2019 is available as part of the Bulk Bibliographic Metadata collection at Datacite Dump 20191122.

18210075 items, 72GB uncompressed.

Updates
$ curl -sL https://archive.org/download/datacite_dump_20211022/datacite_dump_20211022.json.zst | \
    zstdcat -c -T0 | jq -rc '.id'

10.1001/jama.289.8.989
10.1001/jama.293.14.1723-a
10.1001/jamainternmed.2013.9245
10.1001/jamaneurol.2015.4885
10.1002/2014gb004975
10.1002/2014gl061020
10.1002/2014jc009965
10.1002/2014jd022411
10.1002/2015gb005314
10.1002/2015gl065259
...
$ xz -T0 -cd datacite.ndjson.xz | wc
18210075 2562859030 72664858976

$ xz -T0 -cd datacite.ndjson.xz | sha1sum
6fa3bbb1fe07b42e021be32126617b7924f119fb  -


Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func HarvestBatch

func HarvestBatch(link string, maxRequests int, sleep time.Duration) (string, error)

HarvestBatch takes a link (like https://is.gd/0pwu5c) and follows subsequent pages, writing everything into a temporary file. It returns the path to the temporary file and an error. It fails if the HTTP status is >= 400, and has limited retry capabilities.

Types

type DOIResponse

type DOIResponse struct {
	Data []struct {
		Attributes    interface{} `json:"attributes"`
		Id            string      `json:"id"`
		Relationships struct {
			Client struct {
				Data struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"client"`
		} `json:"relationships"`
		Type string `json:"type"`
	} `json:"data"`
	Included []struct {
		Attributes struct {
			AlternateName interface{}   `json:"alternateName"`
			ClientType    string        `json:"clientType"`
			ContactEmail  string        `json:"contactEmail"`
			Created       string        `json:"created"`
			Description   interface{}   `json:"description"`
			Domains       string        `json:"domains"`
			HasPassword   bool          `json:"hasPassword"`
			IsActive      bool          `json:"isActive"`
			Issn          interface{}   `json:"issn"`
			Language      []interface{} `json:"language"`
			Name          string        `json:"name"`
			Opendoar      interface{}   `json:"opendoar"`
			Re3data       interface{}   `json:"re3data"`
			Symbol        string        `json:"symbol"`
			Updated       string        `json:"updated"`
			Url           interface{}   `json:"url"`
			Year          int64         `json:"year"`
		} `json:"attributes"`
		Id            string `json:"id"`
		Relationships struct {
			Prefixes struct {
				Data []struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"prefixes"`
			Provider struct {
				Data struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"provider"`
		} `json:"relationships"`
		Type string `json:"type"`
	} `json:"included"`
	Links struct {
		Next string `json:"next"`
		Self string `json:"self"`
	} `json:"links"`
	Meta struct {
		Affiliations []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"affiliations"`
		Certificates []interface{} `json:"certificates"`
		Clients      []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"clients"`
		Created []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"created"`
		LinkChecksCitationDoi  int64 `json:"linkChecksCitationDoi"`
		LinkChecksDcIdentifier int64 `json:"linkChecksDcIdentifier"`
		LinkChecksSchemaOrgId  int64 `json:"linkChecksSchemaOrgId"`
		LinkChecksStatus       []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"linkChecksStatus"`
		LinksChecked       int64 `json:"linksChecked"`
		LinksWithSchemaOrg []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"linksWithSchemaOrg"`
		Page     int64 `json:"page"`
		Prefixes []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"prefixes"`
		Providers []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"providers"`
		Registered []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"registered"`
		ResourceTypes []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"resourceTypes"`
		SchemaVersions []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"schemaVersions"`
		Sources []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"sources"`
		States []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"states"`
		Subjects []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"subjects"`
		Total      int64 `json:"total"`
		TotalPages int64 `json:"totalPages"`
	} `json:"meta"`
}

DOIResponse models the response of the https://api.datacite.org/dois endpoint. TODO(martin): Sort out the interface{} fields, if necessary.

Directories

Path Synopsis
cmd
dcdump
Tool to fetch a full list of DOIs from the datacite.org API, because as of Fall 2019 a full dump is not yet available (https://git.io/Je6bs, https://git.io/Je6Dg).
dateutil
Package dateutil provides a custom flag for dates.
