goscan2pdf

Published: Jul 23, 2016. License: MIT.

Repository

github.com/johnsto/ocrpdf

A tool to convert scanned documents into searchable PDFs

goscan2pdf recognises and extracts text from scanned documents, then combines the scans into (generally) searchable PDFs that look like the original document. It uses the open source Tesseract library for OCR, Leptonica for image manipulation, gofpdf for document generation, and kingpin for CLI support.

I demand a GUI!

You're probably better off with gscan2pdf, which was the original inspiration for this tool.

Installation

Use go install github.com/johnsto/ocrpdf/goscan2pdf to install the goscan2pdf tool.

Both the Leptonica and Tesseract libraries must be installed.

Leptonica

Ensure that you have the Leptonica 1.71 library installed:

Debian: apt-get install liblept3 (Jessie or newer)
Fedora: yum install leptonica-devel
OpenSUSE: zypper install leptonica-devel
Arch: pacman -S leptonica
Windows: uh...

Tesseract

Ensure that you have the Tesseract 3.02.02 library and data files installed:

Debian: apt-get install libtesseract3 tesseract-ocr (Jessie or newer)
Fedora: yum install tesseract tesseract-devel
OpenSUSE: zypper install tesseract tesseract-devel
Arch: pacman -S tesseract tesseract-data-eng
Windows: er...

Usage

Converting a scanned image is as simple as:

goscan2pdf scan.jpg

By default, goscan2pdf takes the filename of the first input scan as the output document name; in this case, scan.pdf.
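
The default naming rule can be sketched in Go as follows. This is a minimal illustration, not the tool's actual code: defaultOutputName is a hypothetical helper, and it assumes the rule is simply "replace the first input's extension with .pdf".

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// defaultOutputName is a hypothetical helper (not taken from goscan2pdf).
// It assumes the output name is the first input's name with its
// extension replaced by ".pdf".
func defaultOutputName(firstInput string) string {
	ext := filepath.Ext(firstInput) // e.g. ".jpg"; empty if there is none
	return strings.TrimSuffix(firstInput, ext) + ".pdf"
}

func main() {
	fmt.Println(defaultOutputName("scan.jpg"))         // scan.pdf
	fmt.Println(defaultOutputName("scans/page1.tiff")) // scans/page1.pdf
}
```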

You can also set the document size and title, enable compression, pass multiple pages as input, and choose the output filename:

goscan2pdf -s letter \
           -t "2015 Taxes" \
           --compress \
           taxes1.jpg taxes2.jpg taxes3.jpg \
           taxes.pdf

See --help for a listing of all available options.

Automatic contrast enhancement is performed by default to improve the legibility of the text. You can disable it with the --contrast=0 flag.

Image support

Any image format that Leptonica supports can be read, including TIFF, JPEG and PNG. However, each image in the saved PDF is stored as either JPEG or PNG, depending on the format of the corresponding input image. You can force a specific output format using the --format parameter.
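
As a rough sketch of that selection, assuming JPEG inputs stay JPEG and every other input is stored as PNG (the README does not spell out the exact rule): embedFormat, its forced parameter, and the "jpeg"/"png" strings are all hypothetical names used for illustration, not the tool's real identifiers or --format values.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// embedFormat is a hypothetical helper; goscan2pdf's real logic may differ.
// forced stands in for a format override and is empty when none is given.
func embedFormat(inputPath, forced string) string {
	if forced != "" {
		return forced // an explicit override wins over the input format
	}
	switch strings.ToLower(filepath.Ext(inputPath)) {
	case ".jpg", ".jpeg":
		return "jpeg" // assumption: lossy inputs are kept as JPEG
	default:
		return "png" // assumption: all other inputs are stored as PNG
	}
}

func main() {
	fmt.Println(embedFormat("taxes1.jpg", ""))    // jpeg
	fmt.Println(embedFormat("scan.tif", ""))      // png
	fmt.Println(embedFormat("scan.tif", "jpeg")) // jpeg
}
```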

PDF Structure

Pages in the output PDF contain two layers, one with the recognised text, and one with the scanned image. The image is positioned and arranged on top of the text.

In PDF viewers like evince, this arrangement lets you search and select text as if it were an invisible layer on top of the image.
