pdftotext

Version: v1.0.0 | Published: Apr 19, 2024 | License: MIT | Imports: 8 | Imported by: 0

README

pdftotext-go

Extracts text from PDF files together with the corresponding page numbers. Wraps the command-line tool pdftotext (part of poppler-utils).

Usage

  1. poppler-utils (version >=22.05.0) must be installed and available on the PATH.
  2. Run go get "github.com/heussd/pdftotext-go".
  3. See the tests for code examples.

Why poppler version >=22.05.0?

Version 22.05.0 of poppler introduced a new parameter, -tsv, which extracts PDF content together with its metadata as TSV. This library depends on that output and cannot work without it.
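
The version requirement above can be checked with a small comparison. The following is a minimal sketch, not the library's own check: parseVersion and atLeast are hypothetical helpers, and the sample strings are assumptions about what a version string might look like. It only assumes that the string contains a dotted major.minor.patch triple.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionPattern matches the first dotted triple, such as "22.05.0".
var versionPattern = regexp.MustCompile(`(\d+)\.(\d+)\.(\d+)`)

// parseVersion (hypothetical helper) pulls the first major.minor.patch
// triple out of s. Assumption: the string contains such a triple.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	m := versionPattern.FindStringSubmatch(s)
	if m == nil {
		return v, fmt.Errorf("no version number found in %q", s)
	}
	for i := range v {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return v, err
		}
		v[i] = n
	}
	return v, nil
}

// atLeast reports whether v is greater than or equal to want,
// comparing major, then minor, then patch.
func atLeast(v, want [3]int) bool {
	for i := range v {
		if v[i] != want[i] {
			return v[i] > want[i]
		}
	}
	return true
}

func main() {
	minimum := [3]int{22, 5, 0} // first poppler release with -tsv

	// Sample inputs are illustrative assumptions, not captured output.
	samples := []string{"pdftotext version 22.05.0", "pdftotext version 21.12.0"}
	for _, s := range samples {
		v, err := parseVersion(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q supports -tsv: %t\n", s, atLeast(v, minimum))
	}
}
```

In a real program the input string would presumably come from CheckPopplerVersion, whose exact output format is not documented here.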

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckPopplerVersion

func CheckPopplerVersion() (fullVersionString string, err error)

Types

type PdfPage

type PdfPage struct {
	Content string
	// PDF page number (1-based)
	Number int
}

func Extract

func Extract(pdfBytes []byte) (pdfPages []PdfPage, err error)

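Extract returns the text content of a PDF in a simplified format: a slice of PdfPage values, each holding the page text and its 1-based page number.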

func ExtractOrError

func ExtractOrError(pdfBytes []byte) (pages []PdfPage, err error)

ExtractOrError works just like Extract, but reports issues through the returned error.
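
Because each PdfPage carries its page number, a typical follow-up step is to find the pages on which a term occurs. The sketch below illustrates this under stated assumptions: the PdfPage type is a local copy of the documented struct so that the example compiles without the library or poppler installed, pagesContaining is a hypothetical helper, and the sample pages are made up. With the real package, the slice would come from Extract or ExtractOrError.

```go
package main

import (
	"fmt"
	"strings"
)

// PdfPage is a local mirror of the documented type, copied here only so
// the sketch is self-contained.
type PdfPage struct {
	Content string
	// PDF page number (1-based)
	Number int
}

// pagesContaining (hypothetical helper) returns the numbers of all pages
// whose content contains term, ignoring case.
func pagesContaining(pages []PdfPage, term string) []int {
	var hits []int
	needle := strings.ToLower(term)
	for _, p := range pages {
		if strings.Contains(strings.ToLower(p.Content), needle) {
			hits = append(hits, p.Number)
		}
	}
	return hits
}

func main() {
	// Made-up pages standing in for the result of an extraction call.
	pages := []PdfPage{
		{Content: "Introduction to Poppler", Number: 1},
		{Content: "Nothing relevant here", Number: 2},
		{Content: "poppler-utils ships pdftotext", Number: 3},
	}
	fmt.Println(pagesContaining(pages, "poppler")) // prints [1 3]
}
```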

type PopplerTsvRow

type PopplerTsvRow struct {
	Level    int     `col:"0"`
	PageNum  int     `col:"1"`
	ParNum   int     `col:"2"`
	BlockNum int     `col:"3"`
	LineNum  int     `col:"4"`
	WordNum  int     `col:"5"`
	Left     float64 `col:"6"`
	Top      float64 `col:"7"`
	Width    float64 `col:"8"`
	Height   float64 `col:"9"`
	Conf     int     `col:"10"`
	Text     string  `col:"11"`
}

func ExtractInPopplerTsv

func ExtractInPopplerTsv(pdfBytes []byte) (tsvRows []PopplerTsvRow, err error)

ExtractInPopplerTsv gives access to the raw stdout content from Poppler, parsed into PopplerTsvRow values.
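
The TSV rows can be regrouped into per-page text. The following is a rough sketch under several assumptions: the PopplerTsvRow type is a local copy of the documented struct (without the col tags) so that the example compiles on its own, textByPage is a hypothetical helper, the sample rows are invented, and word rows are assumed to carry Level 5. Verify that level value against real pdftotext -tsv output before relying on it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// PopplerTsvRow is a local mirror of the documented type (struct tags
// omitted), copied here only so the sketch is self-contained.
type PopplerTsvRow struct {
	Level    int
	PageNum  int
	ParNum   int
	BlockNum int
	LineNum  int
	WordNum  int
	Left     float64
	Top      float64
	Width    float64
	Height   float64
	Conf     int
	Text     string
}

// wordLevel is an ASSUMPTION: rows describing single words are taken to
// have Level 5. Check this against actual TSV output.
const wordLevel = 5

// textByPage (hypothetical helper) joins the words of each page with
// spaces, keeping the order in which the rows appear.
func textByPage(rows []PopplerTsvRow) map[int]string {
	words := make(map[int][]string)
	for _, r := range rows {
		if r.Level != wordLevel || r.Text == "" {
			continue // skip structural rows and empty cells
		}
		words[r.PageNum] = append(words[r.PageNum], r.Text)
	}
	out := make(map[int]string, len(words))
	for page, w := range words {
		out[page] = strings.Join(w, " ")
	}
	return out
}

func main() {
	// Invented rows standing in for the result of ExtractInPopplerTsv.
	rows := []PopplerTsvRow{
		{Level: 1, PageNum: 1},
		{Level: 5, PageNum: 1, WordNum: 1, Text: "Hello"},
		{Level: 5, PageNum: 1, WordNum: 2, Text: "world"},
		{Level: 1, PageNum: 2},
		{Level: 5, PageNum: 2, WordNum: 1, Text: "Second"},
	}

	byPage := textByPage(rows)
	pageNums := make([]int, 0, len(byPage))
	for n := range byPage {
		pageNums = append(pageNums, n)
	}
	sort.Ints(pageNums) // map iteration order is random, so sort first
	for _, n := range pageNums {
		fmt.Printf("page %d: %s\n", n, byPage[n])
	}
}
```

The Left, Top, Width and Height fields are ignored here; they would matter for layout-aware processing such as column detection.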
