chronam-ocr-debatcher

command module
v0.0.1
Published: Jan 25, 2019 License: MIT Imports: 10 Imported by: 0

README

Chronicling America OCR debatcher

This program takes paths to .tar.bz2 batches of OCR files from the Chronicling America bulk data downloads and converts each batch into a CSV file, which you can load into a database or process however you like. Batches are processed concurrently.

Usage:

./chronam-ocr-debatcher [--processes=8] <path/to/a/batch.tar.bz2 ...>

You can download binaries from the releases page.

Documentation

Overview

This utility converts Chronicling America OCR batches into CSVs of the OCR text. Its arguments are paths to Chronicling America OCR batches, which are stored as .tar.bz2 files. Each batch contains directories of text files (which we care about) and XML files (which we don't). The path to each file comprises (with modification) an ID for that page on Chronicling America. The utility reads in each batch, extracts the page text, and writes each batch as a CSV file with columns for the batch ID, page ID, and text. Batches are processed in parallel.
