pdf

package module
v0.0.8 Latest
Warning: This package is not in the latest version of its module.

Published: Apr 8, 2024 License: BSD-3-Clause Imports: 16 Imported by: 0

README

PDF Reader

A simple Go library for reading the text content of PDF files. Fork tree:

Reader.GetText returns the text content annotated with text size and weight information. Text is returned in stream order: irrespective of where it appears on the page, the returned text follows the order in which it appears in the PDF stream.

The library attempts to return text blocks that are displayed separately in the PDF as separate paragraphs.

e.g. with tabular PDF content:

Col 1 header         | Col 2 header
Text in row 1 col 1  | Text in row 1 col 2
Text in row 2 col 1  | Text in row 2 col 2

Reader.GetText returns content as:

Col 1 header

Col 2 header

Text in row 1 col 1

Text in row 1 col 2

Text in row 2 col 1

Text in row 2 col 2
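
Because each cell comes back as its own paragraph, a caller who knows the column count can regroup the paragraphs into rows. The following is a minimal sketch under stated assumptions: it assumes the content is already available as a plain string with paragraphs separated by blank lines (as in the listing above) and that cells arrive in row-major order, as in this example. The helper names splitParagraphs and groupRows are hypothetical and not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// splitParagraphs splits s on blank lines and drops empty paragraphs.
// Hypothetical helper: it works on a plain string, not on the package's text type.
func splitParagraphs(s string) []string {
	var out []string
	for _, p := range strings.Split(s, "\n\n") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// groupRows regroups a flat list of cells into rows of cols cells each.
// A trailing partial row is kept as-is.
func groupRows(cells []string, cols int) [][]string {
	if cols <= 0 {
		return nil
	}
	var rows [][]string
	for len(cells) > 0 {
		n := cols
		if len(cells) < n {
			n = len(cells)
		}
		rows = append(rows, cells[:n])
		cells = cells[n:]
	}
	return rows
}

func main() {
	content := "Col 1 header\n\nCol 2 header\n\n" +
		"Text in row 1 col 1\n\nText in row 1 col 2\n\n" +
		"Text in row 2 col 1\n\nText in row 2 col 2"
	for _, row := range groupRows(splitParagraphs(content), 2) {
		fmt.Println(strings.Join(row, " | "))
	}
}
```

This only works when the column count is known in advance and the stream order matches the visual row order, which a PDF does not guarantee.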

Documentation

Overview

Package pdf implements reading of PDF files.

PDF is Adobe's Portable Document Format, ubiquitous on the internet. A PDF document is a complex data format built on a fairly simple structure. This package exposes the simple structure along with some wrappers to extract basic information. If more complex information is needed, it is possible to extract that information by interpreting the structure exposed by this package.

Specifically, a PDF is a data structure built from Values, each of which has one of the following Kinds:

Null, for the null object.
Integer, for an integer.
Real, for a floating-point number.
Bool, for a boolean value.
Name, for a name constant (as in /Helvetica).
String, for a string constant.
Dict, for a dictionary of name-value pairs.
Array, for an array of values.
Stream, for an opaque data stream and associated header dictionary.

The accessors on Value—Int64, Float64, Bool, Name, and so on—return a view of the data as the given type. When there is no appropriate view, the accessor returns a zero result. For example, the Name accessor returns the empty string if called on a Value v for which v.Kind() != Name. Returning zero values this way, especially from the Dict and Array accessors, which themselves return Values, makes it possible to traverse a PDF quickly without writing any error checking. On the other hand, it means that mistakes can go unreported.
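
The zero-result convention described above can be illustrated with a small self-contained sketch. The Kind and Value types below are simplified, hypothetical stand-ins written for this example; they are not this package's types, and the Key method name is an assumption. The point is only to show how accessors that return zero values allow chained traversal without error checks, and how a mistake passes silently.

```go
package main

import "fmt"

// Kind and Value are hypothetical stand-ins for illustration only.
type Kind int

const (
	Null Kind = iota // zero Kind, so the zero Value is a Null value
	Integer
	Name
	Dict
)

type Value struct {
	kind Kind
	n    int64
	name string
	dict map[string]Value
}

func (v Value) Kind() Kind { return v.kind }

// Int64 returns the integer view, or 0 when v is not an Integer.
func (v Value) Int64() int64 {
	if v.kind != Integer {
		return 0
	}
	return v.n
}

// Name returns the name view, or "" when v is not a Name.
func (v Value) Name() string {
	if v.kind != Name {
		return ""
	}
	return v.name
}

// Key returns the entry for key, or a Null Value when v is not a Dict
// or the key is absent.
func (v Value) Key(key string) Value {
	if v.kind != Dict {
		return Value{}
	}
	return v.dict[key]
}

func main() {
	font := Value{kind: Dict, dict: map[string]Value{
		"BaseFont": {kind: Name, name: "Helvetica"},
	}}
	doc := Value{kind: Dict, dict: map[string]Value{"Font": font}}

	// Chained traversal with no error checks.
	fmt.Printf("%q\n", doc.Key("Font").Key("BaseFont").Name())
	// A misspelled key yields the empty string rather than an error.
	fmt.Printf("%q\n", doc.Key("Fnt").Key("BaseFont").Name())
}
```

The second lookup shows the trade-off: the typo in the key is indistinguishable from a document that has no such entry.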

The basic structure of the PDF file is exposed as the graph of Values.

Most richer data structures in a PDF file are dictionaries with specific interpretations of the name-value pairs. The Font and Page wrappers make the interpretation of a specific Value as the corresponding type easier. They are only helpers, though: they are implemented only in terms of the Value API and could be moved outside the package. Equally important, traversal of other PDF data structures can be implemented in other packages as needed.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Page

type Page struct {
	// contains filtered or unexported fields
}

A Page represents a single page in a PDF file. The methods interpret a Page dictionary stored in V.

func (*Page) Text

func (p *Page) Text() (result text.Text, err error)

Text returns the structured text on the page.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

A Reader is a single PDF file open for reading.

func NewReader

func NewReader(f io.ReaderAt, size int64) (*Reader, error)

NewReader opens a file for reading, using the data in f with the given total size.
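
NewReader needs an io.ReaderAt together with the total size of the data. As a minimal sketch using only the standard library, the two helpers below show common ways to obtain that pair: from a byte slice already in memory and from a file on disk. The helper names fromBytes and fromFile are hypothetical, and the call to NewReader itself is left as a comment because this example does not import the package.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// fromBytes wraps in-memory data; *bytes.Reader implements io.ReaderAt.
func fromBytes(data []byte) (io.ReaderAt, int64) {
	return bytes.NewReader(data), int64(len(data))
}

// fromFile opens path and reports its size via Stat.
// *os.File implements io.ReaderAt. The caller must close the returned file.
func fromFile(path string) (*os.File, int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, 0, err
	}
	fi, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, 0, err
	}
	return f, fi.Size(), nil
}

func main() {
	ra, size := fromBytes([]byte("%PDF-1.7\n"))
	buf := make([]byte, 5)
	if _, err := ra.ReadAt(buf, 0); err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("%d bytes, header %q\n", size, buf)
	// The pair (ra, size) has the shape that NewReader(f, size) expects.
}
```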

func NewReaderEncrypted

func NewReaderEncrypted(f io.ReaderAt, size int64, pw string) (*Reader, error)

NewReaderEncrypted opens a file for reading, using the data in f with the given total size. If the PDF is encrypted, NewReaderEncrypted uses pw as the password to decrypt it. If the file cannot be decrypted, NewReaderEncrypted returns an error.

func Open

func Open(file string) (*Reader, error)

Open opens a file for reading. Reader.Close should be called when done with the Reader.

func (*Reader) Close

func (r *Reader) Close() error

Close closes the underlying Reader if it is an io.Closer.

func (*Reader) NPages

func (r *Reader) NPages() int

NPages returns the number of pages in the PDF file.

func (*Reader) Page

func (r *Reader) Page(i int) (text.Text, error)

Page returns the page for the given page number. Page numbers are indexed starting at 1, not 0. If the page is not found, Page returns an error.
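
Since page numbers start at 1, a loop over all pages runs from 1 to NPages() inclusive. The sketch below exercises that loop against a hypothetical pageSource interface that mirrors the documented shape of NPages and Page; string stands in for text.Text, whose definition is not shown here, and fakeDoc is an in-memory stand-in rather than a real Reader.

```go
package main

import (
	"errors"
	"fmt"
)

// pageSource is a hypothetical interface mirroring the documented
// NPages and Page methods, with string in place of text.Text.
type pageSource interface {
	NPages() int
	Page(i int) (string, error)
}

// collectPages visits pages 1..NPages() and stops at the first error.
func collectPages(src pageSource) ([]string, error) {
	n := src.NPages()
	pages := make([]string, 0, n)
	for i := 1; i <= n; i++ { // 1-based, inclusive upper bound
		p, err := src.Page(i)
		if err != nil {
			return pages, fmt.Errorf("page %d: %w", i, err)
		}
		pages = append(pages, p)
	}
	return pages, nil
}

// fakeDoc is an in-memory stand-in used only to exercise the loop.
type fakeDoc []string

func (d fakeDoc) NPages() int { return len(d) }

func (d fakeDoc) Page(i int) (string, error) {
	if i < 1 || i > len(d) {
		return "", errors.New("page not found")
	}
	return d[i-1], nil
}

func main() {
	pages, err := collectPages(fakeDoc{"first", "second"})
	fmt.Println(len(pages), err)
}
```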

func (*Reader) Text

func (r *Reader) Text() (text.Text, error)

Text returns a structured Text for all pages of the PDF.

Notes

Bugs

  • The library makes no attempt at efficiency. A value cache maintained in the Reader would probably help significantly.
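
As a rough illustration of the kind of value cache the note above has in mind, the sketch below memoizes the result of resolving an object by its identifier. Everything here is hypothetical: objectID, valueCache, and the resolve callback are invented for this example, string stands in for a parsed value, and nothing is implied about how the Reader actually resolves objects.

```go
package main

import (
	"fmt"
	"sync"
)

// objectID is a hypothetical key identifying an indirect object.
type objectID struct{ num, gen uint32 }

// valueCache memoizes resolve results; string stands in for a parsed value.
type valueCache struct {
	mu      sync.Mutex
	entries map[objectID]string
	resolve func(objectID) string
	misses  int // number of times resolve was actually called
}

func newValueCache(resolve func(objectID) string) *valueCache {
	return &valueCache{
		entries: make(map[objectID]string),
		resolve: resolve,
	}
}

// get returns the cached value for id, resolving it on first use.
func (c *valueCache) get(id objectID) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[id]; ok {
		return v
	}
	c.misses++
	v := c.resolve(id)
	c.entries[id] = v
	return v
}

func main() {
	cache := newValueCache(func(id objectID) string {
		// Placeholder for an expensive parse of the object at id.
		return fmt.Sprintf("object %d %d", id.num, id.gen)
	})
	id := objectID{num: 7}
	cache.get(id)
	cache.get(id)
	fmt.Println("resolve calls:", cache.misses)
}
```

Holding the lock across resolve keeps the sketch simple but serializes all lookups; a real cache would need to weigh that against memory growth and concurrent access patterns.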

Directories

Path Synopsis
internal
