extractor

package
v0.0.0-...-90a1d6d
Warning

This package is not in the latest version of its module.

Published: Apr 8, 2020 License: AGPL-3.0, AGPL-3.0-only Imports: 14 Imported by: 0

Documentation

Overview

Package extractor provides a simple interface for quickly extracting content from PDF pages. It currently supports extracting text and images.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor stores and offers functionality for extracting content from PDF pages.

func New

func New(page *model.PdfPage) (*Extractor, error)

New returns an Extractor instance for extracting content from the input PDF page.

func (*Extractor) ExtractPageImages

func (e *Extractor) ExtractPageImages(options *ImageExtractOptions) (*PageImages, error)

ExtractPageImages returns the images on the extractor's page, including the data, position and size of each image. A set of options controlling page image extraction can be passed in; `options` can be nil to use the defaults. By default, inline stencil masks are not extracted.

func (*Extractor) ExtractPageText

func (e *Extractor) ExtractPageText() (*PageText, int, int, error)

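ExtractPageText returns the text contents of `e` (an Extractor for a page) as a PageText. The two int results are the number of characters extracted and the number of characters that could not be decoded, corresponding to `numChars` and `numMisses` in ExtractTextWithStats.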

func (*Extractor) ExtractText

func (e *Extractor) ExtractText() (string, error)

ExtractText processes and extracts all text data in the page's content streams and returns it as a string. It takes into account the character encodings in the PDF file, which are decoded by CharcodeBytesToUnicode. Characters that can't be decoded are replaced with MissingCodeRune ('\ufffd' = �).

func (*Extractor) ExtractTextWithStats

func (e *Extractor) ExtractTextWithStats() (extracted string, numChars int, numMisses int, err error)

ExtractTextWithStats works like ExtractText but returns the number of characters in the output (`numChars`) and the number of characters that were not decoded (`numMisses`).
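As a rough illustration of how the two counts relate to the returned string, the sketch below recomputes them from text that has already been extracted. `textStats` and `missingCodeRune` are hypothetical names that are not part of this package, and the sketch assumes `numChars` counts runes in the output; the library's own counting may differ.

```go
package main

import "fmt"

// missingCodeRune mirrors the replacement character named in the docs
// ('\ufffd'). It is redeclared here so the sketch compiles on its own.
const missingCodeRune = '\ufffd'

// textStats is a hypothetical helper, not part of the extractor package.
// It counts the runes in s and how many of them are the replacement rune.
func textStats(s string) (numChars, numMisses int) {
	for _, r := range s {
		numChars++
		if r == missingCodeRune {
			numMisses++
		}
	}
	return numChars, numMisses
}

func main() {
	// Stand-in for the string returned by ExtractTextWithStats.
	extracted := "Total: 42 \ufffd\ufffd units"
	numChars, numMisses := textStats(extracted)
	fmt.Printf("chars=%d misses=%d miss ratio=%.2f\n",
		numChars, numMisses, float64(numMisses)/float64(numChars))
}
```

A high miss ratio is a reasonable signal that the page's fonts lack usable encoding information and that the extracted text should be treated with caution.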

type ImageExtractOptions

type ImageExtractOptions struct {
	IncludeInlineStencilMasks bool
}

ImageExtractOptions contains options for controlling image extraction from PDF pages.

type ImageMark

type ImageMark struct {
	Image *model.Image

	// Dimensions of the image as displayed in the PDF.
	Width  float64
	Height float64

	// Position of the image in PDF coordinates (lower left corner).
	X float64
	Y float64

	// Angle in degrees, if rotated.
	Angle float64
}

ImageMark represents an image drawn on a page together with its position and size. All coordinates are in device coordinates.

type PageImages

type PageImages struct {
	Images []ImageMark
}

PageImages represents extracted images on a PDF page with spatial information: display position and size.

type PageText

type PageText struct {
	// contains filtered or unexported fields
}

PageText represents the layout of text on a device page.

func (PageText) Marks

func (pt PageText) Marks() *TextMarkArray

Marks returns the TextMark collection for a page. It represents all the text on the page.

func (PageText) String

func (pt PageText) String() string

String returns a string describing `pt`.

func (PageText) Text

func (pt PageText) Text() string

Text returns the extracted page text.

func (PageText) ToText

func (pt PageText) ToText() string

ToText returns the page text as a single string.

Deprecated: This function is deprecated and will be removed in a future major version. Please use Text() instead.

type RenderMode

type RenderMode int

RenderMode specifies the text rendering mode (Tmode), which determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three. Stroking, filling, and clipping shall have the same effects for a text object as they do for a path object (see 8.5.3, "Path-Painting Operators" and 8.5.4, "Clipping Path Operators").

const (
	RenderModeStroke RenderMode = 1 << iota // Stroke
	RenderModeFill                          // Fill
	RenderModeClip                          // Clip
)

Render mode type.
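Because the constants are declared with `1 << iota`, each one occupies its own bit (1, 2 and 4), so they can be treated as flags. The sketch below redeclares the type so that it compiles standalone; `has` is a hypothetical helper, and whether the library combines modes exactly this way is an assumption based on the constant values alone.

```go
package main

import "fmt"

// RenderMode mirrors the declaration in the docs so the sketch is standalone.
type RenderMode int

const (
	RenderModeStroke RenderMode = 1 << iota // 1
	RenderModeFill                          // 2
	RenderModeClip                          // 4
)

// has reports whether every bit of flag is set in m.
// It is a hypothetical helper, not part of the extractor package.
func has(m, flag RenderMode) bool {
	return m&flag == flag
}

func main() {
	// A mode that both fills and strokes glyph outlines.
	mode := RenderModeFill | RenderModeStroke
	fmt.Println("fill:", has(mode, RenderModeFill))
	fmt.Println("stroke:", has(mode, RenderModeStroke))
	fmt.Println("clip:", has(mode, RenderModeClip))
}
```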

type TextMark

type TextMark struct {
	// Text is the extracted text. It has been decoded to Unicode via ToUnicode().
	Text string
	// Original is the text in the PDF. It has not been decoded like `Text`.
	Original string
	// BBox is the bounding box of the text.
	BBox model.PdfRectangle
	// Font is the font the text was drawn with.
	Font *model.PdfFont
	// FontSize is the font size the text was drawn with.
	FontSize float64
	// Offset is the offset of the start of TextMark.Text in the extracted text. If you do this
	//   text, textMarks := pageText.Text(), pageText.Marks()
	//   marks := textMarks.Elements()
	// then marks[i].Offset is the offset of marks[i].Text in text.
	Offset int
	// Meta is set true for spaces and line breaks that we insert in the extracted text. We insert
	// spaces (line breaks) when we see characters that are over a threshold horizontal (vertical)
	// distance apart. See wordJoiner (lineJoiner) in PageText.computeViews().
	Meta bool
}

TextMark represents extracted text on a page, with information about its textual content, formatting (font and size) and position. It is the smallest unit of text on a PDF page, typically a single character.

getBBox() in test_text.go shows how to compute bounding boxes of substrings of extracted text. The following code extracts the text on PDF page `page` into `text` then finds the bounding box `bbox` of substring `term` in `text`.

ex, err := New(page)
// handle errors
pageText, _, _, err := ex.ExtractPageText()
// handle errors
text := pageText.Text()
textMarks := pageText.Marks()

// strings.Index returns -1 if term does not occur in text.
start := strings.Index(text, term)
end := start + len(term)
spanMarks, err := textMarks.RangeOffset(start, end)
// handle errors
bbox, ok := spanMarks.BBox()
// handle !ok
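The example above relies on `marks[i].Offset` pointing at `marks[i].Text` within the extracted text. The following standalone sketch models that invariant with a hypothetical `mark` type (not the package's TextMark), one mark per character and a Meta space inserted between words. It assumes byte offsets, which is what the use of `strings.Index` and `len(term)` above suggests.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical stand-in for TextMark with only the fields used here.
type mark struct {
	Text   string
	Offset int  // byte offset of Text in the assembled text (assumption)
	Meta   bool // true for separators that were inserted, not drawn
}

// buildText joins words with single spaces, emitting one mark per character
// and one Meta mark per inserted space.
func buildText(words []string) (string, []mark) {
	var sb strings.Builder
	var marks []mark
	for i, w := range words {
		if i > 0 {
			marks = append(marks, mark{Text: " ", Offset: sb.Len(), Meta: true})
			sb.WriteByte(' ')
		}
		for _, r := range w {
			s := string(r)
			marks = append(marks, mark{Text: s, Offset: sb.Len()})
			sb.WriteString(s)
		}
	}
	return sb.String(), marks
}

// offsetsConsistent checks that every mark's Text is found at its Offset.
func offsetsConsistent(text string, marks []mark) bool {
	for _, m := range marks {
		if !strings.HasPrefix(text[m.Offset:], m.Text) {
			return false
		}
	}
	return true
}

func main() {
	text, marks := buildText([]string{"PDF", "text"})
	fmt.Printf("%q consistent=%v\n", text, offsetsConsistent(text, marks))
	for _, m := range marks {
		fmt.Printf("%q offset=%d meta=%v\n", m.Text, m.Offset, m.Meta)
	}
}
```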

func (TextMark) String

func (tm TextMark) String() string

String returns a string describing `tm`.

type TextMarkArray

type TextMarkArray struct {
	// contains filtered or unexported fields
}

TextMarkArray is a collection of TextMarks.

func (*TextMarkArray) Append

func (ma *TextMarkArray) Append(mark TextMark)

Append appends `mark` to the mark array.

func (*TextMarkArray) BBox

func (ma *TextMarkArray) BBox() (model.PdfRectangle, bool)

BBox returns the smallest axis-aligned rectangle that encloses all the TextMarks in `ma`.
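The enclosing-rectangle computation can be sketched as a running min/max over corners. `rect` and `enclose` below are hypothetical stand-ins, not the package's types, and the meaning given to the boolean result (false when there is nothing to enclose) is an assumption, since the documentation does not state it.

```go
package main

import (
	"fmt"
	"math"
)

// rect is a hypothetical stand-in for model.PdfRectangle:
// (Llx, Lly) is the lower-left corner and (Urx, Ury) the upper-right.
type rect struct{ Llx, Lly, Urx, Ury float64 }

// enclose returns the smallest axis-aligned rectangle containing every
// rectangle in rects. The bool is false when rects is empty (assumed meaning).
func enclose(rects []rect) (rect, bool) {
	if len(rects) == 0 {
		return rect{}, false
	}
	out := rects[0]
	for _, r := range rects[1:] {
		out.Llx = math.Min(out.Llx, r.Llx)
		out.Lly = math.Min(out.Lly, r.Lly)
		out.Urx = math.Max(out.Urx, r.Urx)
		out.Ury = math.Max(out.Ury, r.Ury)
	}
	return out, true
}

func main() {
	box, ok := enclose([]rect{{0, 0, 1, 1}, {2, -1, 3, 0.5}})
	fmt.Println(box, ok)
}
```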

func (*TextMarkArray) Elements

func (ma *TextMarkArray) Elements() []TextMark

Elements returns the TextMarks in `ma`.

func (*TextMarkArray) Len

func (ma *TextMarkArray) Len() int

Len returns the number of TextMarks in `ma`.

func (*TextMarkArray) RangeOffset

func (ma *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)

RangeOffset returns the TextMarks in `ma` that have `start` <= TextMark.Offset < `end`.
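The range is half-open: a mark whose Offset equals `end` is excluded, so adjacent ranges never share a mark. A minimal model of that selection rule, using a hypothetical `mark` type and `rangeOffset` function rather than the package's own, might look like this. The conditions under which the real method returns an error are not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical stand-in for TextMark with only the fields used here.
type mark struct {
	Text   string
	Offset int
}

// rangeOffset returns the marks with start <= Offset < end, in input order.
func rangeOffset(marks []mark, start, end int) []mark {
	var out []mark
	for _, m := range marks {
		if start <= m.Offset && m.Offset < end {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	text, term := "hello world", "world"

	// One mark per character; the text is ASCII, so byte index == offset.
	var marks []mark
	for i, r := range text {
		marks = append(marks, mark{Text: string(r), Offset: i})
	}

	start := strings.Index(text, term)
	end := start + len(term)
	for _, m := range rangeOffset(marks, start, end) {
		fmt.Print(m.Text)
	}
	fmt.Println()
}
```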

func (TextMarkArray) String

func (ma TextMarkArray) String() string

String returns a string describing `ma`.
