crossrefindexer

Version: v0.1.8
Published: May 30, 2023 | License: MIT | Imports: 11 | Imported by: 0

README

crossrefindexer

Indexes metadata from Crossref into Elasticsearch. It is primarily intended for use with Biblio-Glutton and is currently a single-purpose application: the document format that Glutton expects is hardcoded. To change which data is indexed, modify the ToSimplifiedPublication function in the root package.

This application reads both regular JSON and newline-delimited JSON (NDJSON), either GZIP-compressed or uncompressed (TAR support is planned). Input can come from a single file, a directory or stdin. Configuration is done via command-line flags or environment variables.

Installation

Make sure Go is installed and that the directory Go installs binaries into ($GOBIN, or $GOPATH/bin if GOBIN is unset) is on your $PATH. Then run:

go install github.com/karatekaneen/crossrefindexer/cmd/crossrefindexer@latest

Usage

Configuration

To see the full list of configuration options, run crossrefindexer --help. The output below shows how it looked at the time of writing.

Usage: crossrefindexer

Small CLI application to uncompress and index Crossref metadata. It can read
from file, directories and stdin. It supports both compressed (gzip only at the
time of writing) and raw JSON/NDJSON.

Flags:
  -h, --help                     Show context-sensitive help.
      --remove-index             Remove existing index before starting. WARNING
                                 - you will not get any confirmation prompt
  -f, --file=STRING              Absolute or relative path to a single file
                                 to index. If you set to '-' it will read from
                                 stdin
      --dir=STRING               Absolute or relative path to a directory
                                 containing files to index
      --es.index="crossref"      The index to write to ($ES_INDEX)
      --es.flushbytes=5000000    How many bytes to buffer before flushing.
                                 Defaults to 5M ($ES_FLUSH_BYTES)
      --es.flushinterval=10s     How many seconds to wait before flushing
                                 ($ES_FLUSH_INTERVAL)
      --es.workers=4             Number of goroutines to run ($ES_WORKERS)
  -p, --es.password=STRING       Password to elasticsearch ($ES_PASSWORD)
  -u, --es.username=STRING       Username to elasticsearch ($ES_USER)
      --es.hosts=http://127.0.0.1:9200,...
                                 Elasticsearch hosts ($ES_HOSTS)
      --es.ca=ES.CA,...          CA cert to trust ($ES_CA_CERT)
      --es.noretry               Fail on first failure ($ES_NO_RETRY)
      --es.max-retries=5         Max number of retries after failure
                                 ($ES_MAX_RETRIES)
      --es.compress              If the request body should be compressed
                                 ($ES_COMPRESS)
      --format="unknown"         The format of the uncompressed files. Will try
                                 to detect if not provided but is required if
                                 using stdin. Can be json, ndjson or unknown
  -c, --compression="unknown"    How the data file is compressed. For files it
                                 will use the file extension if not provided.
                                 For dirs it will be ignored. Can be unknown,
                                 none or gzip
      --loglevel="info"          Log verbosity. Can be debug, info, warn, error

Read from stdin

When reading from stdin you must specify both format and compression.

# the part with "-f -" means that it is reading from stdin
cat testdata/2022/0.json.gz | crossrefindexer -f - --format json -c gzip

Read from single file

# Compression is detected from the file extension
crossrefindexer -f testdata/2022/0.json.gz --format json

Read from directory

# Compression is detected from the file extension.
# It supports multiple formats in the same directory.
crossrefindexer --dir testdata/2022 --format json
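
The extension-based detection mentioned above could look roughly like the following sketch. The detectCompression function and the exact set of extensions it recognizes are assumptions for illustration; the tool's real rules may differ. The returned strings mirror the values accepted by the --compression flag.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// detectCompression guesses the compression from the file extension.
// Hypothetical helper: the recognized extensions are an assumption.
func detectCompression(path string) string {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".gz", ".gzip":
		return "gzip"
	case ".json", ".ndjson":
		return "none"
	default:
		return "unknown"
	}
}

func main() {
	for _, p := range []string{"testdata/2022/0.json.gz", "0.json", "notes.txt"} {
		fmt.Printf("%s: %s\n", p, detectCompression(p))
	}
}
```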

TODO

  • Support TAR files

Documentation

Functions

func ParseData

func ParseData(container DataContainer, out chan Crossref) error

ParseData reads the data described in the container and sends each parsed record on the out channel.

Types

type Affiliation

type Affiliation struct {
	Name string `json:"name"`
}

type Author

type Author struct {
	Given       *string        `json:"given"`
	Family      *string        `json:"family"`
	Sequence    *string        `json:"sequence"`
	Affiliation *[]Affiliation `json:"affiliation"`
}

type ContentDomain

type ContentDomain struct {
	Domain               []string `json:"domain"`
	CrossmarkRestriction bool     `json:"crossmark-restriction"`
}

type Crossref

type Crossref struct {
	Abstract            *string       `json:"abstract"` // Gap
	Author              []Author      `json:"author"`
	ContainerTitle      []string      `json:"container-title"`
	ContentDomain       ContentDomain `json:"content-domain"`
	Created             Indexed       `json:"created"`
	Deposited           Indexed       `json:"deposited"`
	Doi                 string        `json:"DOI"`
	Indexed             Indexed       `json:"indexed"`
	IsReferencedByCount int           `json:"is-referenced-by-count"`
	Issn                []string      `json:"ISSN"`
	IssnType            []IssnType    `json:"issn-type"`
	Issue               string        `json:"issue"`
	Issued              DateParts     `json:"issued"`
	JournalIssue        *JournalIssue `json:"journal-issue"` // Gap
	Language            *string       `json:"language"`      // Gap
	Link                *[]Link       `json:"link"`          // Gap
	Member              string        `json:"member"`
	OriginalTitle       *[]any        `json:"original-title"` // 2021
	Page                string        `json:"page"`
	Prefix              string        `json:"prefix"`
	Published           *DateParts    `json:"published"`        // Gap
	PublishedOnline     *DateParts    `json:"published-online"` // Gap
	PublishedOther      *DateParts    `json:"published-other"`  // Gap
	PublishedPrint      *DateParts    `json:"published-print"`
	Publisher           string        `json:"publisher"`
	Reference           *[]Reference  `json:"reference"` // Gap
	ReferenceCount      int           `json:"reference-count"`
	ReferencesCount     int           `json:"references-count"`
	Relation            *Relation     `json:"relation"` // 2021
	Resource            Resource      `json:"resource"`
	Score               float64       `json:"score"`
	ShortContainerTitle *[]string     `json:"short-container-title"` // 2021
	ShortTitle          *[]any        `json:"short-title"`           // 2021
	Source              string        `json:"source"`
	Subject             []string      `json:"subject"`
	Subtitle            *[]any        `json:"subtitle"` // 2021
	Title               []string      `json:"title"`
	Type                string        `json:"type"`
	URL                 string        `json:"URL"`
	UpdatePolicy        *string       `json:"update-policy"` // Gap
	Volume              string        `json:"volume"`
	License             []License     `json:"license"`
	AlternativeID       []string      `json:"alternative-id"`
}

Pointer versus value semantics reflect whether a field is optional or required in the JSON: pointer fields may be absent, value fields are expected to be present.
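
In practice this means pointer fields must be nil-checked before use. The following sketch uses a hypothetical two-field excerpt of the struct (partial) and hypothetical helpers (decode, abstractOrDefault) to show how encoding/json treats an absent key in each case.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// partial is a hypothetical two-field excerpt of the Crossref struct.
type partial struct {
	Abstract *string `json:"abstract"` // pointer: stays nil when the key is absent
	Doi      string  `json:"DOI"`      // value: stays "" when the key is absent
}

// decode unmarshals a JSON document into a partial.
func decode(s string) (partial, error) {
	var p partial
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

// abstractOrDefault dereferences the optional field safely.
func abstractOrDefault(p partial, def string) string {
	if p.Abstract == nil {
		return def
	}
	return *p.Abstract
}

func main() {
	without, _ := decode(`{"DOI":"10.1000/x"}`)
	with, _ := decode(`{"DOI":"10.1000/y","abstract":"Some text"}`)
	fmt.Println(abstractOrDefault(without, "(no abstract)"))
	fmt.Println(abstractOrDefault(with, "(no abstract)"))
}
```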

type DataContainer

type DataContainer struct {
	Data        io.Reader // The data to index if passed by stdin or similar
	Path        string    // Path to the file to read
	Format      Format    // Format of the file, either "json" or "ndjson"
	Compression string    // The kind of compression. Currently only supports "none" or "gzip"
}

func Load

func Load(
	logger *zap.SugaredLogger,
	path, dir string, format Format, compression string,
	data io.Reader,
) ([]DataContainer, error)

Load structures the data that should be indexed. It returns a slice of items to be processed.

func (*DataContainer) Valid

func (d *DataContainer) Valid() error

type DateParts

type DateParts struct {
	DateParts [][]int `json:"date-parts"`
}
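
Crossref encodes dates as nested arrays of the form [[year, month, day]], where month and day may be omitted. A minimal sketch of extracting the year from such a value follows; dateParts mirrors the type above, and the year helper is a hypothetical name, not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dateParts mirrors the DateParts type for this self-contained example.
type dateParts struct {
	DateParts [][]int `json:"date-parts"`
}

// year returns the first element of the first date-parts entry,
// or 0 when no year is available.
func year(d dateParts) int {
	if len(d.DateParts) == 0 || len(d.DateParts[0]) == 0 {
		return 0
	}
	return d.DateParts[0][0]
}

// parseYear decodes a JSON date-parts object and extracts the year.
func parseYear(s string) (int, error) {
	var d dateParts
	if err := json.Unmarshal([]byte(s), &d); err != nil {
		return 0, err
	}
	return year(d), nil
}

func main() {
	y, err := parseYear(`{"date-parts":[[2021,3,15]]}`)
	fmt.Println(y, err)
}
```

The length checks matter because an entry can be shorter than three elements or missing entirely.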

type Format

type Format string

const (
	FormatUnknown Format = "unknown"
	FormatJSON    Format = "json"
	FormatNDJSON  Format = "ndjson"
)

type Indexed

type Indexed struct {
	DateParts [][]int   `json:"date-parts"`
	DateTime  time.Time `json:"date-time"`
	Timestamp int64     `json:"timestamp"`
}

type IssnType

type IssnType struct {
	Value string `json:"value"`
	Type  string `json:"type"`
}

type JournalIssue

type JournalIssue struct {
	Issue           *string    `json:"issue"`
	PublishedOnline *DateParts `json:"published-online"`
	PublishedPrint  *DateParts `json:"published-print"`
}

type License

type License struct {
	URL            string  `json:"URL"`
	Start          Indexed `json:"start"`
	DelayInDays    int     `json:"delay-in-days"`
	ContentVersion string  `json:"content-version"`
}

type Link

type Link struct {
	URL                 string `json:"URL"`
	ContentType         string `json:"content-type"`
	ContentVersion      string `json:"content-version"`
	IntendedApplication string `json:"intended-application"`
}

type Primary

type Primary struct {
	URL string `json:"URL"`
}

type Reference

type Reference struct {
	Key           string  `json:"key"`
	VolumeTitle   string  `json:"volume-title,omitempty"`
	Author        string  `json:"author"`
	Year          string  `json:"year"`
	FirstPage     string  `json:"first-page,omitempty"`
	ArticleTitle  string  `json:"article-title,omitempty"`
	DoiAssertedBy string  `json:"doi-asserted-by,omitempty"`
	Doi           string  `json:"DOI,omitempty"`
	Volume        string  `json:"volume,omitempty"`
	JournalTitle  string  `json:"journal-title,omitempty"`
	Issue         string  `json:"issue,omitempty"`
	Unstructured  *string `json:"unstructured,omitempty"`
}

type Relation

type Relation struct {
	Cites []any `json:"cites"`
}

type Resource

type Resource struct {
	Primary Primary `json:"primary"`
}

type SimplifiedPublication

type SimplifiedPublication struct {
	Title              []string `json:"title"`
	DOI                string   `json:"DOI"`
	FirstPage          string   `json:"first_page"`
	Journal            []string `json:"journal"`
	AbbreviatedJournal []string `json:"abbreviated_journal"`
	Volume             string   `json:"volume"`
	Issue              string   `json:"issue"`
	Year               int      `json:"year"`
	Bibliographic      string   `json:"bibliographic"`
}

func ToSimplifiedPublication

func ToSimplifiedPublication(pub *Crossref) SimplifiedPublication
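
Since this function is the place to change what gets indexed, a sketch of what such a mapping might look like can help. Everything below is hypothetical: the trimmed crossrefWork and simplified types, the simplify function, and each field mapping (for example taking the first page from the part of Page before a hyphen) are assumptions, not the package's actual implementation. The Bibliographic field is left out because its construction is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// crossrefWork is a hypothetical excerpt of the Crossref type.
type crossrefWork struct {
	Title               []string
	Doi                 string
	Page                string
	ContainerTitle      []string
	ShortContainerTitle *[]string // optional
	Volume              string
	Issue               string
	Issued              [][]int // date-parts: [[year, month, day]]
}

// simplified is a hypothetical excerpt of SimplifiedPublication.
type simplified struct {
	Title              []string
	DOI                string
	FirstPage          string
	Journal            []string
	AbbreviatedJournal []string
	Volume             string
	Issue              string
	Year               int
}

// simplify maps the full record onto the reduced one (assumed mapping).
func simplify(w *crossrefWork) simplified {
	s := simplified{
		Title:   w.Title,
		DOI:     w.Doi,
		Journal: w.ContainerTitle,
		Volume:  w.Volume,
		Issue:   w.Issue,
	}
	// "123-130" becomes "123"; a value without a hyphen is kept whole.
	s.FirstPage, _, _ = strings.Cut(w.Page, "-")
	if w.ShortContainerTitle != nil {
		s.AbbreviatedJournal = *w.ShortContainerTitle
	}
	if len(w.Issued) > 0 && len(w.Issued[0]) > 0 {
		s.Year = w.Issued[0][0]
	}
	return s
}

func main() {
	abbr := []string{"J. Ex."}
	s := simplify(&crossrefWork{
		Doi:                 "10.1000/x",
		Page:                "123-130",
		ShortContainerTitle: &abbr,
		Issued:              [][]int{{2021, 3}},
	})
	fmt.Printf("%+v\n", s)
}
```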

Directories

Path Synopsis
cmd
