pdftextractor

package module
v0.0.0-...-991d331 Latest
Published: Oct 27, 2022 License: MIT Imports: 11 Imported by: 0

README

PDFtExtractor


This package extracts text from PDF files, with a focus on performance. The function ExtractText takes two parameters: the path to the PDF file, and a boolean that controls whether text is also extracted from drawn images. It returns the text contents as a byte slice.

The package currently has issues handling PDF text objects and does not properly decode PDF glyphs.

Usage

To use this package in your project, run the following command in your module:

go get -u github.com/syhv-git/pdftextractor

You will also need to install tesseract-ocr and libtesseract-dev (tesseract-ocr-dev for apk). Additional language data can be added to /usr/share/tesseract-ocr/$VERSION/tessdata/.
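
To check which language data files are already present, a small helper along these lines may be useful. It is only a sketch: the glob pattern assumes the tessdata layout mentioned above, and installedLanguages and langFromPath are hypothetical names that are not part of this package.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// langFromPath turns ".../tessdata/eng.traineddata" into "eng".
// Hypothetical helper, not part of pdftextractor.
func langFromPath(path string) string {
	return strings.TrimSuffix(filepath.Base(path), ".traineddata")
}

// installedLanguages lists the language codes of every file that matches
// the glob pattern. It returns an empty slice when nothing matches.
func installedLanguages(pattern string) ([]string, error) {
	matches, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	langs := make([]string, 0, len(matches))
	for _, m := range matches {
		langs = append(langs, langFromPath(m))
	}
	sort.Strings(langs)
	return langs, nil
}

func main() {
	// Assumption: the "*" stands in for $VERSION in the path given above.
	langs, err := installedLanguages("/usr/share/tesseract-ocr/*/tessdata/*.traineddata")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println("installed language data:", langs)
}
```

An empty result suggests either that Tesseract is not installed or that its data lives somewhere other than the assumed path.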

There may be dependency issues with the Gosseract package. Resolving them requires the Linux Mint package from the same developer.

Roadmap

  • Decode PDF string objects and extract the raw text

    • Narrowed issues down to font encoding and cmaps
    • Everything before decodeText() works as expected
  • Optimize the codebase

  • Test with PDFs containing images

  • Test with an image-based PDF file

  • Test interoperability with other PDF versions

  • Test various encoding types and font styles

  • Test various string object encodings
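
For context on the font encoding and cmap item above, the following is a minimal sketch of how a ToUnicode CMap can map the character codes of a PDF hex string to Unicode. It is not this package's decodeText() implementation. It assumes every code and every destination is exactly two bytes and handles only beginbfchar blocks, whereas real CMaps also use bfrange entries and variable-length codes. The names parseBfChar, hexCode and decodeHexString are hypothetical.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// hexCode converts a token such as "<0041>" to its 16-bit value.
func hexCode(tok string) (uint16, error) {
	b, err := hex.DecodeString(strings.Trim(tok, "<>"))
	if err != nil {
		return 0, err
	}
	if len(b) != 2 {
		return 0, fmt.Errorf("expected 2 bytes in %q, got %d", tok, len(b))
	}
	return uint16(b[0])<<8 | uint16(b[1]), nil
}

// parseBfChar reads the "<src> <dst>" pairs found between beginbfchar and
// endbfchar. Simplification: one pair per line, two bytes per value.
func parseBfChar(cmap string) (map[uint16]rune, error) {
	table := make(map[uint16]rune)
	inBlock := false
	for _, line := range strings.Split(cmap, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasSuffix(line, "beginbfchar"):
			inBlock = true
		case line == "endbfchar":
			inBlock = false
		case inBlock && line != "":
			fields := strings.Fields(line)
			if len(fields) != 2 {
				return nil, fmt.Errorf("malformed bfchar line %q", line)
			}
			src, err := hexCode(fields[0])
			if err != nil {
				return nil, err
			}
			dst, err := hexCode(fields[1])
			if err != nil {
				return nil, err
			}
			table[src] = rune(dst)
		}
	}
	return table, nil
}

// decodeHexString maps each 2-byte code of a hex string through table.
// Codes missing from the table become U+FFFD.
func decodeHexString(s string, table map[uint16]rune) (string, error) {
	b, err := hex.DecodeString(strings.Trim(s, "<>"))
	if err != nil {
		return "", err
	}
	if len(b)%2 != 0 {
		return "", fmt.Errorf("odd number of bytes in %q", s)
	}
	var out strings.Builder
	for i := 0; i < len(b); i += 2 {
		code := uint16(b[i])<<8 | uint16(b[i+1])
		r, ok := table[code]
		if !ok {
			r = '\uFFFD'
		}
		out.WriteRune(r)
	}
	return out.String(), nil
}

func main() {
	cmap := "2 beginbfchar\n<0001> <0048>\n<0002> <0069>\nendbfchar"
	table, err := parseBfChar(cmap)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	text, err := decodeHexString("<00010002>", table)
	fmt.Println(text, err)
}
```

The point of the sketch is that the bytes inside a string object are font-specific codes, not text, so they only become readable once they are looked up in the font's mapping.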

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractText

func ExtractText(src string, incl bool) []byte

ExtractText extracts the text of the source PDF and optionally extracts text from drawn images. Text from images is out of order relative to the text extracted from text objects.

Types

This section is empty.

