pdftextractor

package module

Version: v0.0.0-...-991d331 | Published: Oct 27, 2022 | License: MIT | Imports: 11 | Imported by: 0

Repository

github.com/syhv-git/pdftextractor

README

PDFtExtractor

This package extracts text from PDF files, with a focus on performance. The function ExtractText takes two parameters: the path to the PDF file, and a boolean that controls whether text is also extracted from images drawn in the document. It returns the text contents as a byte slice.

The package currently has issues handling PDF text objects and does not decode PDF glyphs properly.

Usage

To use this package in your project, run the following command in your module:

go get -u github.com/syhv-git/pdftextractor

You will also need to install tesseract-ocr and libtesseract-dev (tesseract-ocr-dev on systems that use apk). Additional language data can be added to /usr/share/tesseract-ocr/$VERSION/tessdata/.

There may be dependency issues with the Gosseract package. If so, the Linux Mint package from the same developer is required.

Roadmap

- Decode PDF string objects and extract the raw text
  - Narrowed the issues down to font encoding and CMaps
  - Everything before decodeText() works as expected
- Optimize the codebase
- Test with PDFs containing images
- Test with an image-based PDF file
- Test interoperability with other PDF versions
- Test various encoding types and font styles
- Test various string object encodings

Documentation

Index

func ExtractText(src string, incl bool) []byte

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractText

func ExtractText(src string, incl bool) []byte

ExtractText extracts the text of the source PDF and, optionally, text from drawn images. Text from images is out of order relative to the text extracted from text objects.
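
A minimal calling sketch follows. Because the signature returns only a byte slice and the documentation does not say how failures are reported, this sketch assumes an empty result means nothing was extracted; that is an assumption, not documented behavior. The extractor type and extractOrError helper are hypothetical, and a stub stands in for ExtractText so the example runs without Tesseract installed.

```go
package main

import (
	"fmt"
)

// extractor has the same signature as ExtractText(src string, incl bool) []byte.
type extractor func(src string, incl bool) []byte

// extractOrError calls extract and turns an empty result into an error.
// Treating empty output as a failure is this sketch's assumption.
func extractOrError(extract extractor, src string, incl bool) (string, error) {
	data := extract(src, incl)
	if len(data) == 0 {
		return "", fmt.Errorf("no text extracted from %q", src)
	}
	return string(data), nil
}

func main() {
	// stub replaces the real function for demonstration. In a project that
	// imports github.com/syhv-git/pdftextractor, pass
	// pdftextractor.ExtractText here instead.
	stub := func(src string, incl bool) []byte {
		return []byte("text from " + src)
	}

	// incl=false: skip text from drawn images.
	text, err := extractOrError(stub, "sample.pdf", false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

Passing the function as a value keeps the calling code testable without a PDF file or the OCR dependencies.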

Types

This section is empty.

Directories

Path	Synopsis
pdfium-ocr
