marctools

package module

Version: v1.6.3 | Published: Mar 9, 2019 | License: GPL-2.0 | Imports: 14 | Imported by: 1

Repository

github.com/ubleipzig/marctools

README ¶

marctools

Various MARC command line utilities.

Installation

For native RPM or DEB packages see: Releases

If you have a local Go installation, you can just run:

go get github.com/ubleipzig/marctools/cmd/{marctojson,marctotsv,...}

Executables available:

Autogenerated docs: https://godoc.org/github.com/ubleipzig/marctools

marccount

Prints the number of records found in a file and then exits.

$ marccount fixtures/journals.mrc
10

marcdb

Turns a MARC file into a sqlite3 database for random access. Supports secondary keys, so you can add an additional value as a key if needed (e.g. a date).

$ marcdb
Usage: marcdb [OPTIONS] MARCFILE
  -cpuprofile="": write cpu profile to file
  -encode=false: base64 encode record before inserting it
  -o="": output sqlite3 filename
  -secondary="": add a secondary value to the row
  -v=false: prints current program version

$ marcdb -secondary todo -o journals.db fixtures/journals.mrc
$ sqlite3 journals.db ".schema"
CREATE TABLE store (id TEXT, secondary TEXT, record BLOB, PRIMARY KEY (id, secondary));
CREATE INDEX idx_store_id ON store (id);

Note: sqlite3 version 3.8.6 has a convenient I/O helper to extract binary data properly on the command line.

$ sqlite3 journals.db "select record from store where id = 'testsample1'" \
    > testsample1.mrc
$ sqlite3 journals.db "select record from store where id = 'testsample1' \
                       and secondary = 'todo'" > testsample1.mrc

If the -encode flag is set, the record will be base64 encoded before insert:

$ marcdb -encode -o journals.db fixtures/journals.mrc
$ sqlite3 journals.db "select record from store where id = 'testsample1'"
MDE1NzFjYXMgYTIyMDAzNjExYSA0NSAgMDAxMDAxMjAwMDAw... DE0Hh0=
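
Because -encode stores each record as base64 text, a consumer has to decode the value before treating it as binary MARC. The following is a minimal sketch and not part of marctools. It assumes the standard padded base64 alphabet (Go's base64.StdEncoding), decodes only the 48-character prefix printed above (the rest of the value is elided there), and reads the five-digit record length from the start of the leader. The helper name leaderLength is hypothetical.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
)

// leaderLength decodes a base64 string and returns the record length
// stored in the first five bytes of the MARC leader.
// Hypothetical helper for illustration; not part of marctools.
// Assumption: the value was encoded with the standard base64 alphabet.
func leaderLength(encoded string) (int, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return 0, err
	}
	if len(raw) < 5 {
		return 0, fmt.Errorf("need at least 5 bytes, got %d", len(raw))
	}
	return strconv.Atoi(string(raw[:5]))
}

func main() {
	// Prefix of the encoded testsample1 record as printed above; the
	// remainder of the value is elided in the README.
	prefix := "MDE1NzFjYXMgYTIyMDAzNjExYSA0NSAgMDAxMDAxMjAwMDAw"
	n, err := leaderLength(prefix)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n) // prints 1571
}
```

The decoded length agrees with the length that marcmap reports for testsample1 further down.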

marcdump

Dumps MARC to stdout, similar to yaz-marcdump:

$ marcdump fixtures/testbug2.mrc
001 testbug2
005 20110419140028.0
008 110214s1992    it a     b    001 0 ita d
020 [  ] [(a) 8820737493]
035 [  ] [(a) (OCoLC)ocm30585539]
040 [  ] [(a) RBN], [(c) RBN], [(d) OCLCG], [(d) PVU]
041 [1 ] [(a) ita], [(a) lat], [(h) lat]
043 [  ] [(a) e-it---]
050 [14] [(a) DG848.15], [(b) .V53 1992]
049 [  ] [(a) PVUM]
100 [1 ] [(a) Vico, Giambattista,], [(d) 1668-1744.]
240 [10] [(a) Principum Neapolitanorum coniurationis anni MDCCI ...
245 [13] [(a) La congiura dei Principi Napoletani 1701 :], [(b) (pr ...
250 [  ] [(a) Fictional edition.]
260 [  ] [(a) Morano :], [(b) Centro di Studi Vichiani,], [(c) 1992.]
300 [  ] [(a) 296 p. :], [(b) ill. ;], [(c) 24 cm.]
490 [1 ] [(a) Opere di Giambattista Vico ;], [(v) 2/1]
500 [  ] [(a) Italian and Latin.]
504 [  ] [(a) Includes bibliographical references (p. [277]-281) and index.]
520 [3 ] [(a) Sample abstract.]
590 [  ] [(a) April11phi]
651 [ 0] [(a) Naples (Kingdom)], [(x) History], [(y) Spanish rule, ....
700 [1 ] [(a) Pandolfi, Claudia.]
800 [1 ] [(a) Vico, Giambattista,], [(d) 1668-1744.], [(t) Works.], ...
856 [40] [(u) http://fictional.com/sample/url]
994 [  ] [(a) C0], [(b) PVU]

marcmap

Dumps a list of (id, offset, length) tuples to stdout (TSV) or to a sqlite3 database.

By default, it writes to stdout:

$ marcmap fixtures/journals.mrc
testsample1 0   1571
testsample2 1571    1195
testsample3 2766    1057
testsample4 3823    1361
testsample5 5184    1707
testsample6 6891    1532
testsample7 8423    1426
testsample8 9849    1251
testsample9 11100   2173
testsample10    13273   1195

Dump the listing into a sqlite3 database with -o FILENAME:

$ marcmap -o seekmap.db fixtures/journals.mrc
$ sqlite3 seekmap.db 'select id, offset, length from seekmap'
testsample1|0|1571
testsample2|1571|1195
testsample3|2766|1057
testsample4|3823|1361
testsample5|5184|1707
testsample6|6891|1532
testsample7|8423|1426
testsample8|9849|1251
testsample9|11100|2173
testsample10|13273|1195

marcsplit

Splits a MARC file into smaller pieces.

$ marcsplit
Usage of marcsplit:
  -C=1: number of records per file
  -cpuprofile="": write cpu profile to file
  -d=".": directory to write to
  -s="split-": split file prefix
  -v=false: prints current program version

$ marcsplit -d /tmp -C 3 -s "example-prefix-" fixtures/journals.mrc
$ ls -1 /tmp/example-prefix-0000000*
/tmp/example-prefix-00000000
/tmp/example-prefix-00000001
/tmp/example-prefix-00000002
/tmp/example-prefix-00000003
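
The listing above is consistent with simple arithmetic: ten records split into files of at most three records give four files, numbered from zero. The sketch below reproduces just that naming scheme. The eight-digit zero-padded counter is inferred from the listing, not taken from the marctools source, and splitNames is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// splitNames returns the output paths for splitting total records into
// files of at most size records each. The eight-digit, zero-padded
// counter is an assumption inferred from the listing above.
func splitNames(total, size int, dir, prefix string) []string {
	if total <= 0 || size <= 0 {
		return nil
	}
	count := (total + size - 1) / size // ceiling division
	names := make([]string, 0, count)
	for i := 0; i < count; i++ {
		names = append(names, filepath.Join(dir, fmt.Sprintf("%s%08d", prefix, i)))
	}
	return names
}

func main() {
	for _, name := range splitNames(10, 3, "/tmp", "example-prefix-") {
		fmt.Println(name)
	}
}
```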

marctojson

Converts MARC to JSON. This is a bit slower than yaz-marcdump -i marc -o json, but offers a bit more flexibility in the output format: It is possible to filter fields, omit the leader and to add additional meta information. Also, the output format is terser. It keeps all the information (including order) from MARC, but tries to be as brief as possible, e.g. there are no explicit subfield keys and fields are used only once as keys. Here is a short side-by-side comparison.

$ marctojson
Usage of marctojson:
  -b=10000: batch size for intercom
  -cpuprofile="": write cpu profile to file
  -i=false: ignore marc errors (not recommended)
  -l=false: dump the leader as well
  -m="": a key=value pair to pass to meta
  -p=false: plain mode: dump without content and meta
  -r="": only dump the given tags (e.g. 001,003)
  -recordkey="record": key name of the record
  -v=false: prints current program version and exit
  -w=4: number of workers

Default conversion (abbreviated, pretty-printed):

$ marctojson fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      ...
      "245" : [
         {
            "ind1" : "1",
            "c" : [
               "Giambattista Vico ; a cura di Claudia Pandolfi."
            ],
            "a" : [
               "La congiura dei Principi Napoletani 1701 :"
            ],
            "ind2" : "3",
            "b" : [
               "(prima e seconda stesura) /"
            ]
         }
      ],
      ...
      "250" : [
         {
            "ind2" : " ",
            "a" : [
               "Fictional edition."
            ],
            "ind1" : " "
         }
      ],
      "020" : [
         {
            "ind2" : " ",
            "ind1" : " ",
            "a" : [
               "8820737493"
            ]
         }
      ],
      "490" : [
         {
            "v" : [
               "2/1"
            ],
            "ind2" : " ",
            "a" : [
               "Opere di Giambattista Vico ;"
            ],
            "ind1" : "1"
         }
      ],
      "240" : [
         {
            "a" : [
               "Principum Neapolitanorum coniurationis anni MDCCI historia."
            ],
            "ind1" : "1",
            "l" : [
               "Italian & Latin"
            ],
            "ind2" : "0"
         }
      ],
      "001" : "testbug2"
   },
   "meta" : {}
}

Dump the leader as well with -l and only dump field 040 with -r 040:

$ marctojson -l -r 040 fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      "040" : [
         {
            "ind2" : " ",
            "c" : [
               "RBN"
            ],
            "a" : [
               "RBN"
            ],
            "d" : [
               "OCLCG",
               "PVU"
            ],
            "ind1" : " "
         }
      ],
      "leader" : {
         "status" : "c",
         "sfcl" : "2",
         "lol" : "4",
         "losp" : "5",
         "type" : "a",
         "ba" : "337",
         "impldef" : "m Ma ",
         "length" : "1234",
         "ic" : "2",
         "raw" : "01234cam a2200337Ma 4500",
         "cs" : "a"
      }
   },
   "meta" : {}
}

Restrict JSON to 001 and 245, and use plain mode with -p, which omits the meta key and the record (formerly content) key:

$ marctojson -r "001, 245" -p fixtures/testbug2.mrc | jsonpp
{
   "001" : "testbug2",
   "245" : [
      {
         "ind1" : "1",
         "a" : [
            "La congiura dei Principi Napoletani 1701 :"
         ],
         "ind2" : "3",
         "c" : [
            "Giambattista Vico ; a cura di Claudia Pandolfi."
         ],
         "b" : [
            "(prima e seconda stesura) /"
         ]
      }
   ]
}

Add some value (here the current date) to the meta map:

$ marctojson -r "001, 245" -m date="$(date)" fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      "001" : "testbug2",
      "245" : [
         {
            "ind2" : "3",
            "c" : [
               "Giambattista Vico ; a cura di Claudia Pandolfi."
            ],
            "ind1" : "1",
            "a" : [
               "La congiura dei Principi Napoletani 1701 :"
            ],
            "b" : [
               "(prima e seconda stesura) /"
            ]
         }
      ]
   },
   "meta" : {
      "date" : "Wed Jul 23 17:21:24 CEST 2014"
   }
}

In marctools version 1.6, the record key can be supplied by the user, and the default key for the record data was changed from content to record.

$ marctojson -r "001, 245" -recordkey data fixtures/testbug2.mrc | jsonpp
{
  "data": {
    "001": "testbug2",
    "245": [
      {
        "a": [
          "La congiura dei Principi Napoletani 1701 :"
        ],
        "b": [
          "(prima e seconda stesura) /"
        ],
        "c": [
          "Giambattista Vico ; a cura di Claudia Pandolfi."
        ],
        "ind1": "1",
        "ind2": "3"
      }
    ]
  },
  "meta": {}
}

marctotsv

Converts selected MARC tags to tab-separated values (TSV).

$ marctotsv
Usage: marctotsv [OPTIONS] MARCFILE TAG [TAG, TAG, ...]
  -cpuprofile="": write cpu profile to file
  -f="<NULL>": fill missing values with this
  -i=false: ignore marc errors (not recommended)
  -k=false: skip incomplete lines (missing values)
  -s="": separator to use for multiple values
  -v=false: prints current program version and exit
  -w=4: number of workers

Extract a single column:

$ marctotsv fixtures/journals.mrc 001
testsample1
testsample2
testsample3
testsample4
testsample5
testsample6
testsample7
testsample8
testsample9
testsample10

Extract two columns:

$ marctotsv fixtures/journals.mrc 001 245.a
testsample1 Journal of rational emotive therapy :
testsample2 Rational living.
testsample3 Psychotherapy in private practice.
testsample4 Journal of quantitative criminology.
testsample5 The Journal of parapsychology.
testsample6 Journal of mathematics and mechanics.
testsample7 The Journal of psychology.
testsample8 Journal of psychosomatic research.
testsample9 The journal of sex research
testsample10    Journal of phenomenological psychology.

Use a custom value for undefined fields with -f UNDEF:

$ marctotsv -f UNDEF fixtures/journals.mrc  001 245.a 245.b
testsample1 Journal of rational emotive therapy :   the journal of the In ...
testsample2 Rational living.    UNDEF
testsample3 Psychotherapy in private practice.  UNDEF
testsample4 Journal of quantitative criminology.    UNDEF
testsample5 The Journal of parapsychology.  UNDEF
testsample6 Journal of mathematics and mechanics.   UNDEF
testsample7 The Journal of psychology.  UNDEF
testsample8 Journal of psychosomatic research.  UNDEF
testsample9 The journal of sex research UNDEF
testsample10    Journal of phenomenological psychology. UNDEF

Only keep complete rows with -k:

$ marctotsv -k fixtures/journals.mrc  001 245.a 245.b
testsample1 Journal of rational emotive therapy :   the journal of the In ...

Include all values, separated by a pipe via -s "|":

$ marctotsv -s "|" fixtures/journals.mrc  001 710.a
testsample1 Institute for Rational-Emotive Therapy (New York, N.Y.)
testsample2 Institute for Rational-Emotive Therapy (New York, N.Y.)|Inst ...
testsample3 <NULL>
testsample4 LINK (Online service)
testsample5 Duke University.|ProQuest Psychology Journals.
testsample6 Indiana University.|Indiana University.
testsample7 ProQuest Psychology Journals.
testsample8 ScienceDirect (Online service).
testsample9 Society for the Scientific Study of Sex (U.S.)|Society for ...
testsample10    Ingenta (Firm).
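
When consuming this output downstream, each line has to be split twice: first into columns, then into the individual values of a column. A minimal sketch, assuming columns are separated by tab characters and that values never contain the chosen separator themselves; splitRow is a hypothetical helper, not part of marctools.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRow splits one output line into columns on tabs, then splits each
// column into its values on sep. A column equal to fill (the -f value,
// "<NULL>" by default) yields no values.
func splitRow(line, sep, fill string) [][]string {
	cols := strings.Split(line, "\t")
	row := make([][]string, len(cols))
	for i, col := range cols {
		if col == fill {
			continue // leave row[i] nil for a missing value
		}
		if sep == "" {
			row[i] = []string{col} // no separator: keep the column whole
			continue
		}
		row[i] = strings.Split(col, sep)
	}
	return row
}

func main() {
	line := "testsample5\tDuke University.|ProQuest Psychology Journals."
	for _, values := range splitRow(line, "|", "<NULL>") {
		fmt.Printf("%q\n", values)
	}
}
```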

marcuniq

$ marcuniq
Usage: marcuniq [OPTIONS] MARCFILE
  -i=false: ignore marc errors (not recommended)
  -o="": output file (or stdout if none given)
  -v=false: prints current program version
  -x="": comma separated list of ids to exclude (or filename with one id per line)

Exclude two IDs and dump to file:

$ marcuniq -x "testsample1,testsample2" -o filtered.mrc fixtures/journals.mrc
excluded ids interpreted as string
2 ids to exclude loaded
10 records read
8 records written, 0 skipped, 2 excluded, 0 without ID (001)

$ marctotsv filtered.mrc 001
testsample3
testsample4
testsample5
testsample6
testsample7
testsample8
testsample9
testsample10

marcxmltojson

Converts MARCXML to JSON. Note that MARCXML is not subject to certain size limits that apply to binary MARC.

$ marcxmltojson
Usage: marcxmltojson [OPTIONS] MARCFILE
  -cpuprofile="": write cpu profile to file
  -i=false: ignore marc errors (not recommended)
  -l=false: dump the leader as well
  -m="": a key=value pair to pass to meta
  -p=false: plain mode: dump without content and meta
  -r="": only dump the given tags (e.g. 001,003)
  -v=false: prints current program version and exit
  -w=4: number of workers

Parameters are the same as for marctojson. Both commands might be merged into one in a future release.

Development

To run the tests just type:

make

To open a coverage report in your browser, run:

make cover

To package a DEB, adjust debian/marctools/DEBIAN/control, e.g. update the version, then run:

make deb

To package an RPM, adjust packaging/marctools.spec, e.g. update the version, then run:

make rpm

To package an RPM on CentOS 6.2 with libc 2.12, set up a VM with veewee and vagrant. Then run:

vagrant up
make vm-setup

Subsequently build RPMs against libc 2.12 with

make rpm-compatible

Previous versions

Versions 1.0 up to 1.3.8 (named gomarckit) used a non-standard project layout and lacked tests. Their version history is preserved under the 1.3.8-maint branch.

Todo

Perform and include some performance benchmarks in README.
The MARC21 library used might issue more system calls than needed, e.g. in the main Record create loop each data and control field will issue a read system call. It could be more efficient to read MARC in larger blocks and distribute the Record parsing itself to the workers.
Add more tests for unusual MARC files (encodings, broken dirents, etc.).

Documentation ¶

Index ¶

Constants
func BatchWorker(in chan []*marc22.Record, out chan []byte, wg *sync.WaitGroup, ...)
func FanInWriter(writer io.Writer, in chan []byte, done chan bool)
func IdentifierList(filename string, safe bool) []string
func KeyValueStringToMap(s string) (map[string]string, error)
func MarcMap(infile string, writer io.Writer, safe bool)
func MarcMapEntries(infile string, safe bool) chan MapEntry
func MarcMapSqlite(infile, outfile string, safe bool)
func MarcSplit(infile string, size int64)
func MarcSplitDirectory(infile string, size int64, directory string)
func MarcSplitDirectoryPrefix(infile string, size int64, directory, prefix string)
func RecordCount(filename string) int64
func RecordLength(reader io.Reader) (length int64, err error)
func RecordMap(record *marc22.Record, filter map[string]bool, includeLeader bool) map[string]interface{}
func RecordToSlice(record *marc22.Record, tags []string, fillna, separator string, ...) []string
func RecordToTSV(record *marc22.Record, tags []string, fillna, separator string, ...) string
func StringToMapSet(s string) map[string]bool
func Worker(in chan *marc22.Record, out chan []byte, wg *sync.WaitGroup, ...)
type JSONConversionOptions
type MapEntry
type StringSet
- func NewStringSet() *StringSet
- func (set *StringSet) Add(s string) bool
- func (set *StringSet) Contains(s string) bool
- func (set *StringSet) Size() int

Constants ¶

const AppVersion = "1.6.3"

AppVersion is displayed by all command line tools

Functions ¶

func BatchWorker ¶

func BatchWorker(in chan []*marc22.Record, out chan []byte, wg *sync.WaitGroup, options JSONConversionOptions)

BatchWorker converts batches of MARC records to JSON

func FanInWriter ¶

func FanInWriter(writer io.Writer, in chan []byte, done chan bool)

FanInWriter writes the channel content to the writer

func IdentifierList ¶ added in v1.6.3

func IdentifierList(filename string, safe bool) []string

IdentifierList returns a slice of strings containing all IDs of the given MARC file. Set safe to true to use the slower but safer method of parsing each record. The fast method breaks when there are multiple 001 fields (invalid, but seen in real-world data).

func KeyValueStringToMap ¶

func KeyValueStringToMap(s string) (map[string]string, error)

KeyValueStringToMap turns a string like "key1=value1, key2=value2" into a map.
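
To make the input format concrete, here is an independent sketch of such a parser. It is not the marctools implementation: parseKeyValues is a hypothetical name, and how the real KeyValueStringToMap treats edge cases (empty input, values containing commas, repeated keys) is not stated in its documentation and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeyValues turns "key1=value1, key2=value2" into a map.
// Independent sketch; edge-case behavior is an assumption and may
// differ from marctools' KeyValueStringToMap.
func parseKeyValues(s string) (map[string]string, error) {
	result := make(map[string]string)
	if strings.TrimSpace(s) == "" {
		return result, nil
	}
	for _, pair := range strings.Split(s, ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		result[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return result, nil
}

func main() {
	m, err := parseKeyValues("key1=value1, key2=value2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m["key1"], m["key2"])
}
```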

func MarcMap ¶

func MarcMap(infile string, writer io.Writer, safe bool)

MarcMap writes (id, offset, length) TSV of a given MARC file to an io.Writer

func MarcMapEntries ¶

func MarcMapEntries(infile string, safe bool) chan MapEntry

MarcMapEntries returns a chan of MapEntry structs.

func MarcMapSqlite ¶

func MarcMapSqlite(infile, outfile string, safe bool)

MarcMapSqlite writes a sqlite3 database of (id, offset, length) entries for a given MARC file to the given output file

func MarcSplit ¶

func MarcSplit(infile string, size int64)

MarcSplit splits a file into parts, each containing at most size records

func MarcSplitDirectory ¶

func MarcSplitDirectory(infile string, size int64, directory string)

MarcSplitDirectory splits a file into parts, each containing at most size records, and writes them to the specified directory

func MarcSplitDirectoryPrefix ¶

func MarcSplitDirectoryPrefix(infile string, size int64, directory, prefix string)

MarcSplitDirectoryPrefix splits a file into parts, each containing at most size records, and writes them to the specified directory, using a specific prefix

func RecordCount ¶

func RecordCount(filename string) int64

RecordCount counts the number of records in a MARC file

func RecordLength ¶

func RecordLength(reader io.Reader) (length int64, err error)

RecordLength returns the length of the marc record as stored in the leader

func RecordMap ¶

func RecordMap(record *marc22.Record, filter map[string]bool, includeLeader bool) map[string]interface{}

RecordMap converts a record to a map, optionally keeping only the tags given in filter. If includeLeader is true, the leader is converted as well.

func RecordToSlice ¶

func RecordToSlice(record *marc22.Record,
	tags []string,
	fillna, separator string,
	skipIncompleteLines bool) []string

RecordToSlice returns a string slice with the values of the given tags

func RecordToTSV ¶

func RecordToTSV(record *marc22.Record,
	tags []string,
	fillna, separator string,
	skipIncompleteLines bool) string

RecordToTSV turns a single record into a single TSV line

func StringToMapSet ¶

func StringToMapSet(s string) map[string]bool

StringToMapSet takes a string of the form "val1,val2, val3" and turns it into a poor man's set, that is, a map[string]bool.
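
This is the kind of value the -r flag accepts on the command line (for example "001, 245"). A minimal independent sketch of the conversion follows; toMapSet is a hypothetical name, and trimming whitespace and skipping empty entries are assumptions rather than documented behavior of StringToMapSet.

```go
package main

import (
	"fmt"
	"strings"
)

// toMapSet splits a comma-separated string and records each trimmed,
// non-empty value as a key. Trimming and skipping empty entries are
// assumptions of this sketch.
func toMapSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, value := range strings.Split(s, ",") {
		value = strings.TrimSpace(value)
		if value != "" {
			set[value] = true
		}
	}
	return set
}

func main() {
	filter := toMapSet("001, 245")
	fmt.Println(filter["245"], filter["650"]) // prints: true false
}
```

A map[string]bool is convenient here because a lookup of a missing key yields false, so membership tests need no second return value.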

func Worker ¶

func Worker(in chan *marc22.Record, out chan []byte, wg *sync.WaitGroup, options JSONConversionOptions)

Worker takes a Work item and sends the result (serialized json) on the out channel

Types ¶

type JSONConversionOptions ¶ added in v1.6.3

type JSONConversionOptions struct {
	FilterMap     map[string]bool   // which tags to include
	MetaMap       map[string]string // meta information
	IncludeLeader bool
	PlainMode     bool // only dump the content
	IgnoreErrors  bool
	RecordKey     string
}

JSONConversionOptions specifies parameters for the MARC to JSON conversion

type MapEntry ¶

type MapEntry struct {
	ID     string
	Offset int64
	Length int64
}

MapEntry contains location information of a single record in a MARC file

type StringSet ¶

type StringSet struct {
	// contains filtered or unexported fields
}

StringSet is a map disguised as a set

func NewStringSet ¶

func NewStringSet() *StringSet

NewStringSet returns an empty set

func (*StringSet) Add ¶

func (set *StringSet) Add(s string) bool

Add adds a string to a set, returns true if added, false if it already existed (noop)

func (*StringSet) Contains ¶

func (set *StringSet) Contains(s string) bool

Contains returns true if given string is in the set, false otherwise

func (*StringSet) Size ¶

func (set *StringSet) Size() int

Size returns current number of elements in the set
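
Since the fields of StringSet are unexported, the following is only a guess at how such a type can be built: a struct wrapping a map, with Add reporting whether the value was new. All names are lowercase stand-ins and hypothetical; the usage in main shows how the boolean result of add helps detect duplicate record IDs.

```go
package main

import "fmt"

// stringSet is a set of strings backed by a map.
// Stand-in for illustration; the real StringSet layout is not documented.
type stringSet struct {
	items map[string]struct{}
}

// newStringSet returns an empty set.
func newStringSet() *stringSet {
	return &stringSet{items: make(map[string]struct{})}
}

// add inserts s and reports whether it was newly added.
func (set *stringSet) add(s string) bool {
	if _, ok := set.items[s]; ok {
		return false
	}
	set.items[s] = struct{}{}
	return true
}

// contains reports whether s is in the set.
func (set *stringSet) contains(s string) bool {
	_, ok := set.items[s]
	return ok
}

// size returns the number of elements in the set.
func (set *stringSet) size() int {
	return len(set.items)
}

func main() {
	seen := newStringSet()
	for _, id := range []string{"testsample1", "testsample2", "testsample1"} {
		if !seen.add(id) {
			fmt.Println("duplicate:", id)
		}
	}
	fmt.Println(seen.size(), seen.contains("testsample2"))
}
```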

Directories ¶

Path	Synopsis
cmd
marccount	Count records in a MARC file
marcdb
marcdump
marcmap	Create a seekmap of the form (sorted by OFFSET) ID OFFSET LENGTH
marcsnapshot	Keep the newest records among multiple versions in a set of files
marcsplit	Go version of "yaz-marcdump -s prefix -C 1000 file.mrc"
marctojson	Performance data point: Converting 6537611 records (7G) into /dev/null takes about 9m31s on a Core i5-3470 (about 11k records/s).
marctotsv	Convert MARC to TSV.
marcuniq
marcxmltojson
