marctools

v1.6.3
Published: Mar 9, 2019 License: GPL-2.0 Imports: 14 Imported by: 1

Various MARC command line utilities.

Installation

For native RPM or DEB packages see: Releases

If you have a local Go installation, you can just

go get github.com/ubleipzig/marctools/cmd/{marctojson,marctotsv,...}

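Executables available:

  • marccount
  • marcdb
  • marcdump
  • marcmap
  • marcsplit
  • marctojson
  • marctotsv
  • marcuniq
  • marcxmltojson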

Autogenerated docs: https://godoc.org/github.com/ubleipzig/marctools

marccount

Prints the number of records found in a file and then exits.

$ marccount fixtures/journals.mrc
10
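
Counting does not require parsing the records themselves. In binary MARC (ISO 2709) the first five bytes of each record hold its total length as ASCII digits, so a counter can hop from leader to leader. The following is a minimal sketch of that idea, not the code marccount actually runs; countRecords is a hypothetical name and it works on an in-memory byte slice for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// countRecords is a hypothetical helper: it counts records by reading the
// five-digit record length at the start of each leader and jumping ahead.
func countRecords(data []byte) (int, error) {
	count := 0
	for pos := 0; pos < len(data); {
		if len(data)-pos < 5 {
			return count, errors.New("truncated leader")
		}
		length, err := strconv.Atoi(string(data[pos : pos+5]))
		if err != nil {
			return count, err
		}
		if length < 5 || pos+length > len(data) {
			return count, fmt.Errorf("invalid record length %d at offset %d", length, pos)
		}
		pos += length
		count++
	}
	return count, nil
}

func main() {
	// Two fake records of 10 and 7 bytes; only the length prefix is meaningful.
	n, err := countRecords([]byte("00010aaaaa" + "00007bb"))
	fmt.Println(n, err)
}
```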

marcdb

Turns a MARC file into an sqlite3 database for random access. Supports secondary keys, so an additional value (e.g. a date) can be stored as part of the key if needed.

$ marcdb
Usage: marcdb [OPTIONS] MARCFILE
  -cpuprofile="": write cpu profile to file
  -encode=false: base64 encode record before inserting it
  -o="": output sqlite3 filename
  -secondary="": add a secondary value to the row
  -v=false: prints current program version

$ marcdb -secondary todo -o journals.db fixtures/journals.mrc
$ sqlite3 journals.db ".schema"
CREATE TABLE store (id TEXT, secondary TEXT, record BLOB, PRIMARY KEY (id, secondary));
CREATE INDEX idx_store_id ON store (id);

Note: sqlite3 version 3.8.6 has a convenient I/O helper to extract binary data properly on the command line.

$ sqlite3 journals.db "select record from store where id = 'testsample1'" \
    > testsample1.mrc
$ sqlite3 journals.db "select record from store where id = 'testsample1' \
                       and secondary = 'todo'" > testsample1.mrc

If the -encode flag is set, the record will be base64 encoded before insert:

$ marcdb -encode -o journals.db fixtures/journals.mrc
$ sqlite3 journals.db "select record from store where id = 'testsample1'"
MDE1NzFjYXMgYTIyMDAzNjExYSA0NSAgMDAxMDAxMjAwMDAw... DE0Hh0=
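
A consumer of an encoded database has to reverse the encoding before treating the value as MARC. The sketch below assumes the standard base64 alphabet with padding (the output above is consistent with that, but the README does not state it) and uses a hypothetical helper, decodeRecordLength, to decode a value and read the record length from the first five bytes.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
)

// decodeRecordLength is a hypothetical helper: it decodes a base64 value
// and returns the record length stored in the first five bytes.
func decodeRecordLength(encoded string) (int, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return 0, err
	}
	if len(raw) < 5 {
		return 0, fmt.Errorf("need at least 5 bytes, got %d", len(raw))
	}
	return strconv.Atoi(string(raw[:5]))
}

func main() {
	// The first twelve characters of the encoded record shown above.
	n, err := decodeRecordLength("MDE1NzFjYXMg")
	fmt.Println(n, err) // prints "1571 <nil>"
}
```

The decoded prefix is "01571cas ", so the length is 1571 bytes.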

marcdump

Dumps MARC to stdout, similar to yaz-marcdump:

$ marcdump fixtures/testbug2.mrc
001 testbug2
005 20110419140028.0
008 110214s1992    it a     b    001 0 ita d
020 [  ] [(a) 8820737493]
035 [  ] [(a) (OCoLC)ocm30585539]
040 [  ] [(a) RBN], [(c) RBN], [(d) OCLCG], [(d) PVU]
041 [1 ] [(a) ita], [(a) lat], [(h) lat]
043 [  ] [(a) e-it---]
050 [14] [(a) DG848.15], [(b) .V53 1992]
049 [  ] [(a) PVUM]
100 [1 ] [(a) Vico, Giambattista,], [(d) 1668-1744.]
240 [10] [(a) Principum Neapolitanorum coniurationis anni MDCCI ...
245 [13] [(a) La congiura dei Principi Napoletani 1701 :], [(b) (pr ...
250 [  ] [(a) Fictional edition.]
260 [  ] [(a) Morano :], [(b) Centro di Studi Vichiani,], [(c) 1992.]
300 [  ] [(a) 296 p. :], [(b) ill. ;], [(c) 24 cm.]
490 [1 ] [(a) Opere di Giambattista Vico ;], [(v) 2/1]
500 [  ] [(a) Italian and Latin.]
504 [  ] [(a) Includes bibliographical references (p. [277]-281) and index.]
520 [3 ] [(a) Sample abstract.]
590 [  ] [(a) April11phi]
651 [ 0] [(a) Naples (Kingdom)], [(x) History], [(y) Spanish rule, ....
700 [1 ] [(a) Pandolfi, Claudia.]
800 [1 ] [(a) Vico, Giambattista,], [(d) 1668-1744.], [(t) Works.], ...
856 [40] [(u) http://fictional.com/sample/url]
994 [  ] [(a) C0], [(b) PVU]

marcmap

Dumps a list of id, offset, length tuples to stdout (TSV) or to a sqlite3 database:

By default write to stdout:

$ marcmap fixtures/journals.mrc
testsample1 0   1571
testsample2 1571    1195
testsample3 2766    1057
testsample4 3823    1361
testsample5 5184    1707
testsample6 6891    1532
testsample7 8423    1426
testsample8 9849    1251
testsample9 11100   2173
testsample10    13273   1195

Dump listing into an sqlite database with -o FILENAME:

$ marcmap -o seekmap.db fixtures/journals.mrc
$ sqlite3 seekmap.db 'select id, offset, length from seekmap'
testsample1|0|1571
testsample2|1571|1195
testsample3|2766|1057
testsample4|3823|1361
testsample5|5184|1707
testsample6|6891|1532
testsample7|8423|1426
testsample8|9849|1251
testsample9|11100|2173
testsample10|13273|1195
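
With an (offset, length) pair from the seekmap, a single record can be read without scanning the whole file. A minimal sketch, assuming the caller already looked up the pair; readRecord is a hypothetical helper and the sample reader stands in for an opened MARC file.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// sample stands in for a MARC file; a real caller would pass an *os.File,
// which also implements io.ReaderAt.
var sample = strings.NewReader("aaaabbbbbbcc")

// readRecord is a hypothetical helper: it returns exactly length bytes
// starting at offset, or an error if the source is too short.
func readRecord(r io.ReaderAt, offset, length int64) ([]byte, error) {
	if offset < 0 || length < 0 {
		return nil, errors.New("offset and length must not be negative")
	}
	buf := make([]byte, length)
	// SectionReader plus ReadFull avoids treating an EOF at the exact end
	// of the file as a failure.
	if _, err := io.ReadFull(io.NewSectionReader(r, offset, length), buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	rec, err := readRecord(sample, 4, 6)
	fmt.Printf("%s %v\n", rec, err)
}
```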

marcsplit

Splits a MARC file into smaller pieces.

$ marcsplit
Usage of marcsplit:
  -C=1: number of records per file
  -cpuprofile="": write cpu profile to file
  -d=".": directory to write to
  -s="split-": split file prefix
  -v=false: prints current program version

$ marcsplit -d /tmp -C 3 -s "example-prefix-" fixtures/journals.mrc
$ ls -1 /tmp/example-prefix-0000000*
/tmp/example-prefix-00000000
/tmp/example-prefix-00000001
/tmp/example-prefix-00000002
/tmp/example-prefix-00000003
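
Ten records at three records per file give four files, the last one holding a single record. The naming can be sketched as below; the eight-digit, zero-padded counter is inferred from the listing above rather than taken from the source, and splitNames is a hypothetical helper.

```go
package main

import "fmt"

// splitNames is a hypothetical helper: it returns the file names needed to
// hold total records in chunks of at most size records each.
func splitNames(prefix string, total, size int) []string {
	if total <= 0 || size <= 0 {
		return nil
	}
	n := (total + size - 1) / size // ceiling division
	names := make([]string, 0, n)
	for i := 0; i < n; i++ {
		// Counter format inferred from the example output.
		names = append(names, fmt.Sprintf("%s%08d", prefix, i))
	}
	return names
}

func main() {
	for _, name := range splitNames("example-prefix-", 10, 3) {
		fmt.Println(name)
	}
}
```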

marctojson

Converts MARC to JSON. This is a bit slower than yaz-marcdump -i marc -o json, but offers more flexibility in the output format: it is possible to filter fields, include or omit the leader, and add meta information. The output format is also terser. It keeps all the information (including order) from MARC, but tries to be as brief as possible, e.g. there are no explicit subfield keys and each field tag is used only once as a key. Here is a short side-by-side comparison.

$ marctojson
Usage of marctojson:
  -b=10000: batch size for intercom
  -cpuprofile="": write cpu profile to file
  -i=false: ignore marc errors (not recommended)
  -l=false: dump the leader as well
  -m="": a key=value pair to pass to meta
  -p=false: plain mode: dump without content and meta
  -r="": only dump the given tags (e.g. 001,003)
  -recordkey="record": key name of the record
  -v=false: prints current program version and exit
  -w=4: number of workers

Default conversion (abbreviated, pretty-printed):

$ marctojson fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      ...
      "245" : [
         {
            "ind1" : "1",
            "c" : [
               "Giambattista Vico ; a cura di Claudia Pandolfi."
            ],
            "a" : [
               "La congiura dei Principi Napoletani 1701 :"
            ],
            "ind2" : "3",
            "b" : [
               "(prima e seconda stesura) /"
            ]
         }
      ],
      ...
      "250" : [
         {
            "ind2" : " ",
            "a" : [
               "Fictional edition."
            ],
            "ind1" : " "
         }
      ],
      "020" : [
         {
            "ind2" : " ",
            "ind1" : " ",
            "a" : [
               "8820737493"
            ]
         }
      ],
      "490" : [
         {
            "v" : [
               "2/1"
            ],
            "ind2" : " ",
            "a" : [
               "Opere di Giambattista Vico ;"
            ],
            "ind1" : "1"
         }
      ],
      "240" : [
         {
            "a" : [
               "Principum Neapolitanorum coniurationis anni MDCCI historia."
            ],
            "ind1" : "1",
            "l" : [
               "Italian & Latin"
            ],
            "ind2" : "0"
         }
      ],
      "001" : "testbug2"
   },
   "meta" : {}
}

Dump the leader as well with -l and only dump field 040 with -r 040:

$ marctojson -l -r 040 fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      "040" : [
         {
            "ind2" : " ",
            "c" : [
               "RBN"
            ],
            "a" : [
               "RBN"
            ],
            "d" : [
               "OCLCG",
               "PVU"
            ],
            "ind1" : " "
         }
      ],
      "leader" : {
         "status" : "c",
         "sfcl" : "2",
         "lol" : "4",
         "losp" : "5",
         "type" : "a",
         "ba" : "337",
         "impldef" : "m Ma ",
         "length" : "1234",
         "ic" : "2",
         "raw" : "01234cam a2200337Ma 4500",
         "cs" : "a"
      }
   },
   "meta" : {}
}

Restrict JSON to 001 and 245, and use plain mode with -p, which emits the fields directly, without the meta and record (formerly content) keys:

$ marctojson -r "001, 245" -p fixtures/testbug2.mrc | jsonpp
{
   "001" : "testbug2",
   "245" : [
      {
         "ind1" : "1",
         "a" : [
            "La congiura dei Principi Napoletani 1701 :"
         ],
         "ind2" : "3",
         "c" : [
            "Giambattista Vico ; a cura di Claudia Pandolfi."
         ],
         "b" : [
            "(prima e seconda stesura) /"
         ]
      }
   ]
}

Add some value (here the current date) to the meta map:

$ marctojson -r "001, 245" -m date="$(date)" fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      "001" : "testbug2",
      "245" : [
         {
            "ind2" : "3",
            "c" : [
               "Giambattista Vico ; a cura di Claudia Pandolfi."
            ],
            "ind1" : "1",
            "a" : [
               "La congiura dei Principi Napoletani 1701 :"
            ],
            "b" : [
               "(prima e seconda stesura) /"
            ]
         }
      ]
   },
   "meta" : {
      "date" : "Wed Jul 23 17:21:24 CEST 2014"
   }
}

Since marctools version 1.6, the record key can be set by the user with -recordkey, and the default key for the record data changed from content to record.

$ marctojson -r "001, 245" -recordkey data fixtures/testbug2.mrc | jsonpp
{
  "data": {
    "001": "testbug2",
    "245": [
      {
        "a": [
          "La congiura dei Principi Napoletani 1701 :"
        ],
        "b": [
          "(prima e seconda stesura) /"
        ],
        "c": [
          "Giambattista Vico ; a cura di Claudia Pandolfi."
        ],
        "ind1": "1",
        "ind2": "3"
      }
    ]
  },
  "meta": {}
}
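
On the consuming side, the shape shown in the examples above (control fields as strings, data fields as arrays of objects with ind1, ind2 and one string array per subfield code) can be decoded lazily with encoding/json. This is a sketch under the assumption that the default record key is used; document, parse and subfieldValues are hypothetical names, not part of marctools.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// document mirrors the default (non-plain) output: a record map plus meta.
type document struct {
	Record map[string]json.RawMessage `json:"record"`
	Meta   map[string]string          `json:"meta"`
}

// parse decodes one JSON document as emitted by marctojson.
func parse(s string) (document, error) {
	var doc document
	err := json.Unmarshal([]byte(s), &doc)
	return doc, err
}

// subfieldValues returns all values of subfield code in data field tag.
// It is not meant for control fields such as 001, which are plain strings.
func subfieldValues(doc document, tag, code string) ([]string, error) {
	raw, ok := doc.Record[tag]
	if !ok {
		return nil, nil
	}
	// Indicators are strings and subfields are string arrays, so each
	// value is kept raw until the requested code is found.
	var fields []map[string]json.RawMessage
	if err := json.Unmarshal(raw, &fields); err != nil {
		return nil, err
	}
	var values []string
	for _, field := range fields {
		sub, ok := field[code]
		if !ok {
			continue
		}
		var vs []string
		if err := json.Unmarshal(sub, &vs); err != nil {
			return nil, err
		}
		values = append(values, vs...)
	}
	return values, nil
}

// sample is an abbreviated version of the output shown above.
const sample = `{"record":{"001":"testbug2","245":[{"a":["La congiura dei Principi Napoletani 1701 :"],"ind1":"1","ind2":"3"}]},"meta":{}}`

func main() {
	doc, err := parse(sample)
	if err != nil {
		panic(err)
	}
	titles, err := subfieldValues(doc, "245", "a")
	fmt.Printf("%q %v\n", titles, err)
}
```

If a custom -recordkey is in use, the struct tag on Record would need to change accordingly.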

marctotsv

Converts selected MARC tags to tab-separated values (TSV).

$ marctotsv
Usage: marctotsv [OPTIONS] MARCFILE TAG [TAG, TAG, ...]
  -cpuprofile="": write cpu profile to file
  -f="<NULL>": fill missing values with this
  -i=false: ignore marc errors (not recommended)
  -k=false: skip incomplete lines (missing values)
  -s="": separator to use for multiple values
  -v=false: prints current program version and exit
  -w=4: number of workers

Extract a single column:

$ marctotsv fixtures/journals.mrc 001
testsample1
testsample2
testsample3
testsample4
testsample5
testsample6
testsample7
testsample8
testsample9
testsample10

Extract two columns:

$ marctotsv fixtures/journals.mrc 001 245.a
testsample1 Journal of rational emotive therapy :
testsample2 Rational living.
testsample3 Psychotherapy in private practice.
testsample4 Journal of quantitative criminology.
testsample5 The Journal of parapsychology.
testsample6 Journal of mathematics and mechanics.
testsample7 The Journal of psychology.
testsample8 Journal of psychosomatic research.
testsample9 The journal of sex research
testsample10    Journal of phenomenological psychology.

Use a custom value for undefined fields with -f UNDEF:

$ marctotsv -f UNDEF fixtures/journals.mrc  001 245.a 245.b
testsample1 Journal of rational emotive therapy :   the journal of the In ...
testsample2 Rational living.    UNDEF
testsample3 Psychotherapy in private practice.  UNDEF
testsample4 Journal of quantitative criminology.    UNDEF
testsample5 The Journal of parapsychology.  UNDEF
testsample6 Journal of mathematics and mechanics.   UNDEF
testsample7 The Journal of psychology.  UNDEF
testsample8 Journal of psychosomatic research.  UNDEF
testsample9 The journal of sex research UNDEF
testsample10    Journal of phenomenological psychology. UNDEF

Only keep complete rows with -k:

$ marctotsv -k fixtures/journals.mrc  001 245.a 245.b
testsample1 Journal of rational emotive therapy :   the journal of the In ...

Include all values, separated by a pipe via -s "|":

$ marctotsv -s "|" fixtures/journals.mrc  001 710.a
testsample1 Institute for Rational-Emotive Therapy (New York, N.Y.)
testsample2 Institute for Rational-Emotive Therapy (New York, N.Y.)|Inst ...
testsample3 <NULL>
testsample4 LINK (Online service)
testsample5 Duke University.|ProQuest Psychology Journals.
testsample6 Indiana University.|Indiana University.
testsample7 ProQuest Psychology Journals.
testsample8 ScienceDirect (Online service).
testsample9 Society for the Scientific Study of Sex (U.S.)|Society for ...
testsample10    Ingenta (Firm).
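
Reading such output back is a matter of splitting each line on tabs and then splitting each column on the separator. A minimal sketch, assuming the separator does not occur inside a value and that the fill value marks a missing column; parseLine is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine is a hypothetical helper: it splits one marctotsv line into
// columns and each column into its values. A column equal to null is
// returned as a nil slice.
func parseLine(line, sep, null string) [][]string {
	cols := strings.Split(line, "\t")
	out := make([][]string, len(cols))
	for i, col := range cols {
		switch {
		case col == null:
			out[i] = nil // missing value
		case sep == "":
			out[i] = []string{col} // no separator given: keep the column whole
		default:
			out[i] = strings.Split(col, sep)
		}
	}
	return out
}

func main() {
	line := "testsample5\tDuke University.|ProQuest Psychology Journals."
	for _, values := range parseLine(line, "|", "<NULL>") {
		fmt.Printf("%q\n", values)
	}
}
```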

marcuniq

$ marcuniq
Usage: marcuniq [OPTIONS] MARCFILE
  -i=false: ignore marc errors (not recommended)
  -o="": output file (or stdout if none given)
  -v=false: prints current program version
  -x="": comma separated list of ids to exclude (or filename with one id per line)

Exclude two IDs and dump to file:

$ marcuniq -x "testsample1,testsample2" -o filtered.mrc fixtures/journals.mrc
excluded ids interpreted as string
2 ids to exclude loaded
10 records read
8 records written, 0 skipped, 2 excluded, 0 without ID (001)

$ marctotsv filtered.mrc 001
testsample3
testsample4
testsample5
testsample6
testsample7
testsample8
testsample9
testsample10

marcxmltojson

Converts MARCXML to JSON. Note that MARCXML is not subject to certain size limits that apply to binary MARC.

$ marcxmltojson
Usage: marcxmltojson [OPTIONS] MARCFILE
  -cpuprofile="": write cpu profile to file
  -i=false: ignore marc errors (not recommended)
  -l=false: dump the leader as well
  -m="": a key=value pair to pass to meta
  -p=false: plain mode: dump without content and meta
  -r="": only dump the given tags (e.g. 001,003)
  -v=false: prints current program version and exit
  -w=4: number of workers

Parameters are the same as for marctojson. Both commands might be merged into one in a future release.


Development

To run the tests just type:

make

To open a coverage report in your browser, run:

make cover

To package a DEB, adjust debian/marctools/DEBIAN/control, e.g. update the version, then run:

make deb

To package an RPM, adjust packaging/marctools.spec, e.g. update the version, then run:

make rpm

To package an RPM on CentOS 6.2 with libc 2.12, set up a VM with veewee and vagrant. Then run:

vagrant up
make vm-setup

Subsequently build RPMs against libc 2.12 with

make rpm-compatible

Previous versions

Versions 1.0 up to 1.3.8 (named gomarckit) used a non-standard project layout and lacked tests. Their version history is preserved under the 1.3.8-maint branch.

Todo

  • Perform and include some performance benchmarks in README.
  • The MARC21 library used might issue more system calls than needed, e.g. in the main Record create loop each data and control field issues a read system call. It could be more efficient to read MARC in larger blocks and distribute the Record parsing itself to the workers.
  • Add more tests for more fancy MARC files (encodings, broken dirents, etc.).

Documentation

Constants

const AppVersion = "1.6.3"

AppVersion is displayed by all command line tools

Functions

func BatchWorker

func BatchWorker(in chan []*marc22.Record, out chan []byte, wg *sync.WaitGroup, options JSONConversionOptions)

BatchWorker converts batches of MARC records to JSON

func FanInWriter

func FanInWriter(writer io.Writer, in chan []byte, done chan bool)

FanInWriter writes the channel content to the writer

func IdentifierList added in v1.6.3

func IdentifierList(filename string, safe bool) []string

IdentifierList returns a slice of strings containing all ids of the given MARC file. Set safe to true to use the slower but safer method of parsing each record. The fast method breaks when a record has multiple 001 fields (invalid, but seen in real-world data).

func KeyValueStringToMap

func KeyValueStringToMap(s string) (map[string]string, error)

KeyValueStringToMap turns a string like "key1=value1, key2=value2" into a map.
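
The behaviour described above can be sketched as follows. This is a hypothetical stand-in named parseKeyValues, not the package's implementation; in particular, trimming whitespace, returning an error for a pair without "=" and accepting an empty input are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeyValues is a hypothetical stand-in for KeyValueStringToMap.
// Assumptions: pairs are comma separated, whitespace around keys and
// values is trimmed, and a pair without "=" is an error.
func parseKeyValues(s string) (map[string]string, error) {
	result := make(map[string]string)
	if strings.TrimSpace(s) == "" {
		return result, nil
	}
	for _, pair := range strings.Split(s, ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("missing = in %q", pair)
		}
		result[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return result, nil
}

func main() {
	m, err := parseKeyValues("key1=value1, key2=value2")
	fmt.Println(m["key1"], m["key2"], err)
}
```

A value that itself contains a comma would be split incorrectly by this sketch.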

func MarcMap

func MarcMap(infile string, writer io.Writer, safe bool)

MarcMap writes (id, offset, length) TSV of a given MARC file to an io.Writer

func MarcMapEntries

func MarcMapEntries(infile string, safe bool) chan MapEntry

MarcMapEntries returns a chan of MapEntry structs.

func MarcMapSqlite

func MarcMapSqlite(infile, outfile string, safe bool)

MarcMapSqlite writes an sqlite3 database of (id, offset, length) entries for a given MARC file to the given output file

func MarcSplit

func MarcSplit(infile string, size int64)

MarcSplit splits a file into parts, each containing at most size records

func MarcSplitDirectory

func MarcSplitDirectory(infile string, size int64, directory string)

MarcSplitDirectory splits a file into parts, each containing at most size records, and writes them to the specified directory

func MarcSplitDirectoryPrefix

func MarcSplitDirectoryPrefix(infile string, size int64, directory, prefix string)

MarcSplitDirectoryPrefix splits a file into parts, each containing at most size records, and writes them to the specified directory, using a specific prefix

func RecordCount

func RecordCount(filename string) int64

RecordCount counts the number of records in a MARC file

func RecordLength

func RecordLength(reader io.Reader) (length int64, err error)

RecordLength returns the length of the marc record as stored in the leader

func RecordMap

func RecordMap(record *marc22.Record, filter map[string]bool, includeLeader bool) map[string]interface{}

RecordMap converts a record to a map, optionally keeping only the tags given in filter. If includeLeader is true, the leader is converted as well.

func RecordToSlice

func RecordToSlice(record *marc22.Record,
	tags []string,
	fillna, separator string,
	skipIncompleteLines bool) []string

RecordToSlice returns a string slice with the values of the given tags

func RecordToTSV

func RecordToTSV(record *marc22.Record,
	tags []string,
	fillna, separator string,
	skipIncompleteLines bool) string

RecordToTSV turns a single record into a single TSV line

func StringToMapSet

func StringToMapSet(s string) map[string]bool

StringToMapSet takes a string of the form "val1,val2, val3" and turns it into a poor man's set, that is, a map[string]bool.

func Worker

func Worker(in chan *marc22.Record, out chan []byte, wg *sync.WaitGroup, options JSONConversionOptions)

Worker takes a Work item and sends the result (serialized json) on the out channel
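
The signatures of Worker and FanInWriter suggest a fan-out, fan-in pipeline: several workers serialize records concurrently and a single goroutine writes the results. The wiring below is a sketch of that general pattern, not code from marctools; strings stand in for MARC records, upper-casing stands in for JSON serialization, and all names are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// worker converts items from in and sends the serialized result to out.
func worker(in <-chan string, out chan<- []byte, wg *sync.WaitGroup) {
	defer wg.Done()
	for item := range in {
		out <- []byte(strings.ToUpper(item) + "\n") // placeholder for serialization
	}
}

// fanInWriter drains out into w and signals done once out is closed.
func fanInWriter(w io.Writer, out <-chan []byte, done chan<- bool) {
	for b := range out {
		w.Write(b) // write errors are ignored in this sketch
	}
	done <- true
}

// run wires n workers to one writer and returns when all output is written.
func run(w io.Writer, items []string, n int) {
	if n < 1 {
		n = 1
	}
	in := make(chan string)
	out := make(chan []byte)
	done := make(chan bool)
	var wg sync.WaitGroup
	go fanInWriter(w, out, done)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go worker(in, out, &wg)
	}
	for _, item := range items {
		in <- item
	}
	close(in)  // no more input: workers leave their loops
	wg.Wait()  // all workers have sent their last result
	close(out) // lets the writer finish
	<-done
}

// collect runs the pipeline and returns everything that was written.
func collect(items []string, n int) string {
	var buf bytes.Buffer
	run(&buf, items, n)
	return buf.String()
}

func main() {
	fmt.Print(collect([]string{"a", "b", "c"}, 4))
}
```

With more than one worker the order of the output lines is not guaranteed.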

Types

type JSONConversionOptions added in v1.6.3

type JSONConversionOptions struct {
	FilterMap     map[string]bool   // which tags to include
	MetaMap       map[string]string // meta information
	IncludeLeader bool
	PlainMode     bool // only dump the content
	IgnoreErrors  bool
	RecordKey     string
}

JSONConversionOptions specifies parameters for the MARC to JSON conversion

type MapEntry

type MapEntry struct {
	ID     string
	Offset int64
	Length int64
}

MapEntry contains location information of a single record in a MARC file

type StringSet

type StringSet struct {
	// contains filtered or unexported fields
}

StringSet is a map disguised as a set

func NewStringSet

func NewStringSet() *StringSet

NewStringSet returns an empty set

func (*StringSet) Add

func (set *StringSet) Add(s string) bool

Add adds a string to a set, returns true if added, false if it already existed (noop)

func (*StringSet) Contains

func (set *StringSet) Contains(s string) bool

Contains returns true if given string is in the set, false otherwise

func (*StringSet) Size

func (set *StringSet) Size() int

Size returns current number of elements in the set
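
The documented semantics of Add, Contains and Size can be illustrated with a small map-backed set. This is a sketch of the idea only; stringSet and newStringSet are hypothetical lower-case stand-ins, and the real type's internals are not shown on this page.

```go
package main

import "fmt"

// stringSet is a hypothetical stand-in for StringSet, backed by a map.
type stringSet struct {
	items map[string]struct{}
}

// newStringSet returns an empty set.
func newStringSet() *stringSet {
	return &stringSet{items: make(map[string]struct{})}
}

// Add inserts v and reports whether it was new.
func (s *stringSet) Add(v string) bool {
	if _, ok := s.items[v]; ok {
		return false
	}
	s.items[v] = struct{}{}
	return true
}

// Contains reports whether v is in the set.
func (s *stringSet) Contains(v string) bool {
	_, ok := s.items[v]
	return ok
}

// Size returns the number of elements.
func (s *stringSet) Size() int {
	return len(s.items)
}

func main() {
	seen := newStringSet()
	// Keep only the first occurrence of each id.
	for _, id := range []string{"testsample1", "testsample2", "testsample1"} {
		if seen.Add(id) {
			fmt.Println(id)
		}
	}
	fmt.Println(seen.Size())
}
```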

Directories

Path              Synopsis
cmd/marccount     Count records in a MARC file
cmd/marcmap       Create a seekmap of the form (sorted by OFFSET) ID OFFSET LENGTH
cmd/marcsnapshot  Keep the newest records among multiple versions in a set of files
cmd/marcsplit     Go version of "yaz-marcdump -s prefix -C 1000 file.mrc"
cmd/marctojson    Performance data point: Converting 6537611 records (7G) into /dev/null take about 9m31s on a Core i5-3470 (about 11k records/s).
cmd/marctotsv     Convert marc to tsv.
