span-crossref-snapshot

command
v0.1.361 Latest
Published: Feb 9, 2024 License: GPL-3.0 Imports: 18 Imported by: 0

Documentation

Overview

Given a single file with Crossref works API messages, create a potentially smaller file that contains only the most recent version of each document.

Works in a three-stage, two-pass fashion: (1) extract, (2) identify, (3) extract. Performance data point (30M compressed records, 11m33.871s in total):

2017/07/24 18:26:10 stage 1: 8m13.799431646s
2017/07/24 18:26:55 stage 2: 45.746997314s
2017/07/24 18:29:30 stage 3: 2m34.23537293s

$ span-crossref-snapshot -z crossref.ndj.gz -o out.ndj.gz
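The three stages above can be sketched as follows. This is a minimal in-memory illustration, not the tool's actual implementation: the names `work`, `entry` and `snapshot` are hypothetical, and the choice of `indexed.timestamp` as the "most recent" criterion is an assumption. A real run over 30M records would need to stream from disk and would likely not keep everything in a map.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// work holds the only fields this sketch reads from a works message.
// Using indexed.timestamp as the version marker is an assumption.
type work struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// entry is what stage 1 extracts per record.
type entry struct {
	line int
	doi  string
	ts   int64
}

// snapshot returns the input lines that carry the most recent
// version of each DOI, in input order.
func snapshot(input string) ([]string, error) {
	// Stage 1 (first pass): extract line number, DOI and timestamp.
	// Note: bufio.Scanner has a default 64KB token limit.
	var entries []entry
	sc := bufio.NewScanner(strings.NewReader(input))
	for n := 0; sc.Scan(); n++ {
		var w work
		if err := json.Unmarshal(sc.Bytes(), &w); err != nil {
			return nil, fmt.Errorf("line %d: %w", n, err)
		}
		entries = append(entries, entry{line: n, doi: w.DOI, ts: w.Indexed.Timestamp})
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}

	// Stage 2: identify the line holding the latest version per DOI.
	// On equal timestamps the later line wins.
	latest := make(map[string]entry)
	for _, e := range entries {
		if cur, ok := latest[e.doi]; !ok || e.ts >= cur.ts {
			latest[e.doi] = e
		}
	}
	keep := make(map[int]bool, len(latest))
	for _, e := range latest {
		keep[e.line] = true
	}

	// Stage 3 (second pass): extract only the identified lines.
	var out []string
	sc = bufio.NewScanner(strings.NewReader(input))
	for n := 0; sc.Scan(); n++ {
		if keep[n] {
			out = append(out, sc.Text())
		}
	}
	return out, sc.Err()
}

func main() {
	input := `{"DOI":"10.1/a","indexed":{"timestamp":1}}
{"DOI":"10.1/b","indexed":{"timestamp":5}}
{"DOI":"10.1/a","indexed":{"timestamp":9}}
`
	lines, err := snapshot(input)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

With the sample input, the older record for `10.1/a` is dropped and the remaining two lines are printed in their original order. Splitting identification from extraction means the full record bodies only have to be held or copied for lines that survive.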

Anecdata. We started the new "span-crossref-sync" based workflow on 2022-05-30 and have been requesting daily slices from Crossref, starting with 2022-01-01. As of 2023-12-04 we had downloaded 701 files (zstd compressed). File sizes in bytes:

               sz
count         701
mean   2816941417
std    2175766872
min             0
25%    1138093994
50%    2739488108
75%    4058532166
max   13751449046

Median daily shipment of about 2.7GB. If we only consider days on which we actually saw data, that number increases to about 3GB.

               sz
count         636
mean   3104836373
std    2079246455
min           573
25%    1541247728
50%    2998517796
75%    4150155730
max   13751449046

At most about 13.8GB per day. Total sum of downloaded data is 1.796TiB compressed (zstd level 3); if we recompress at level 19 we get around 1.3TB. Uncompressed, the data amounts to 8.07TiB.
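As a quick sanity check on the total, the count and mean from the first table can be multiplied back into a sum. The sketch below does only that arithmetic. The function name `totalBytes` is hypothetical, and the result is approximate because the mean in the table is rounded to whole bytes.

```go
package main

import "fmt"

const (
	files     = 701        // count from the first table
	meanBytes = 2816941417 // mean file size in bytes from the first table
)

// totalBytes reconstructs the sum of all file sizes from count and mean.
func totalBytes(count, mean int64) int64 {
	return count * mean
}

func main() {
	total := totalBytes(files, meanBytes)
	// Decimal terabytes (10^12 bytes) and binary tebibytes (2^40 bytes).
	fmt.Printf("%.3f TB\n", float64(total)/1e12)
	fmt.Printf("%.3f TiB\n", float64(total)/float64(1<<40))
}
```

This prints roughly 1.975 TB and 1.796 TiB, which suggests that the 1.796 figure for the compressed total is measured in binary units.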
