pdfsearch

Version: v0.0.0
Published: Oct 28, 2019 License: MIT Imports: 14 Imported by: 0

README

Pure Go Full Text Search of PDF Files

This library implements full text search for PDF files.

There are some command-line programs that demonstrate the library's functionality.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch
cd pdfsearch/examples
go build pdf_search_demo.go
go build pdf_search_verify.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.

pdf_search_demo.go shows how to use the APIs in index_search.go to

  • create indexes over PDF files,
  • search those indexes using full-text search, and
  • mark up PDF files with the locations of the search matches on pages.

It supports three types of index:

  • On-disk. These can be as large as your disk but are slower.
  • In-memory with the index stored in a Go struct. Faster but limited to (virtual) memory size.
  • In-memory with the index serialized to a []byte. Useful for non-Go callers such as web apps.

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

Documentation

Constants

const (
	// DefaultMaxResults is the default maximum number of results returned.
	DefaultMaxResults = 10
	// DefaultPersistRoot is the default root for on-disk indexes.
	DefaultPersistRoot = "pdf.store"
)
const (
	PageBottomRight = iota
	PageBottomCenter
	PageBottomLeft
	PageCenterRight
	PageCenter
	PageCenterLeft
	PageTopRight
	PageTopCenter
	PageTopLeft
	PageCustomPosition
)
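Because the position constants are assigned with iota, their values run from 0 (PageBottomRight) to 9 (PageCustomPosition). The sketch below redeclares them locally so it compiles on its own and adds a name lookup. Giving the constants the `PagePosition` type and adding a `String` method are assumptions made for illustration; the package is not documented as providing either.

```go
package main

import "fmt"

// PagePosition is redeclared here so the sketch is self-contained.
type PagePosition int

// Assumption: the constants are typed here; the listing above shows them
// declared with a bare iota.
const (
	PageBottomRight PagePosition = iota
	PageBottomCenter
	PageBottomLeft
	PageCenterRight
	PageCenter
	PageCenterLeft
	PageTopRight
	PageTopCenter
	PageTopLeft
	PageCustomPosition
)

// positionNames maps each constant value to its identifier.
var positionNames = [...]string{
	PageBottomRight:    "PageBottomRight",
	PageBottomCenter:   "PageBottomCenter",
	PageBottomLeft:     "PageBottomLeft",
	PageCenterRight:    "PageCenterRight",
	PageCenter:         "PageCenter",
	PageCenterLeft:     "PageCenterLeft",
	PageTopRight:       "PageTopRight",
	PageTopCenter:      "PageTopCenter",
	PageTopLeft:        "PageTopLeft",
	PageCustomPosition: "PageCustomPosition",
}

// String is a hypothetical convenience method for logging positions.
func (p PagePosition) String() string {
	if p >= 0 && int(p) < len(positionNames) {
		return positionNames[p]
	}
	return fmt.Sprintf("PagePosition(%d)", int(p))
}

func main() {
	for p := PageBottomRight; p <= PageCustomPosition; p++ {
		fmt.Printf("%d = %s\n", int(p), p)
	}
}
```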

Variables

This section is empty.

Functions

func AddImageToPdf

func AddImageToPdf(rs io.ReadSeeker, w io.Writer, image goimage.Image, url string, pageNum int,
	loc ImageLocation) error

AddImageToPdf adds an image to a specific page of a PDF. NOTE: The same image is added at the same position on every page it is applied to.

  • rs: io.ReadSeeker for input PDF.
  • w: io.Writer for output (modified) PDF.
  • image: Image to be applied to pages.
  • pageNum: (1-offset) page number to apply image to. Specify 0 for all pages. Specify a negative number to count back from the last page. For example -1 = last page, -2 = second last page.

The image's aspect ratio is maintained.
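The `pageNum` convention described above (1-offset, 0 for all pages, negative values counting back from the last page) can be sketched as follows. `resolvePages` is a hypothetical helper written for illustration; how AddImageToPdf handles out-of-range values is not documented, so returning an error for them is an assumption.

```go
package main

import "fmt"

// resolvePages is a hypothetical helper (not part of pdfsearch). It converts
// pageNum into the 1-based page numbers it selects in a numPages document.
func resolvePages(pageNum, numPages int) ([]int, error) {
	switch {
	case pageNum == 0:
		// 0 selects every page.
		pages := make([]int, numPages)
		for i := range pages {
			pages[i] = i + 1
		}
		return pages, nil
	case pageNum < 0:
		// -1 is the last page, -2 the second last, and so on.
		pageNum = numPages + pageNum + 1
	}
	// Assumption: anything outside 1..numPages is reported as an error.
	if pageNum < 1 || pageNum > numPages {
		return nil, fmt.Errorf("page %d is outside 1..%d", pageNum, numPages)
	}
	return []int{pageNum}, nil
}

func main() {
	for _, n := range []int{2, 0, -1, -2, 9} {
		pages, err := resolvePages(n, 5)
		fmt.Printf("pageNum=%d pages=%v err=%v\n", n, pages, err)
	}
}
```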

func ExposeErrors

func ExposeErrors()

ExposeErrors turns off recovery from panics in called libraries.

func IndexPdfMem

func IndexPdfMem(pathList []string, rsList []io.ReadSeeker, report func(string)) ([]byte, error)

IndexPdfMem returns a byte array that contains an index for the PDFs read by the io.ReadSeekers in `rsList`. The names of the PDFs are in the corresponding positions in `pathList`. `report` is a supplied function that is called to report progress.

func MarkupPdfResults

func MarkupPdfResults(results PdfMatchSet, outPath string) error

MarkupPdfResults adds rectangles to the text positions of all matches on their PDF pages, combines these pages together and writes the resulting PDF to `outPath`. The PDF will have at most 100 pages because no one is likely to read through search results over more than 100 pages. There will be at most 10 results per page.
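The two limits described above (at most 100 pages, at most 10 results per page) can be illustrated with a small filter. This is only a sketch of one way such caps could be applied. The `match` type and `capMatches` function are hypothetical, and keeping matches in first-seen order is an assumption; the library's actual selection logic is not shown in this documentation.

```go
package main

import "fmt"

// match is a hypothetical stand-in for one search hit on one page.
type match struct {
	File string
	Page int
}

// pageKey identifies a page within a file.
type pageKey struct {
	File string
	Page int
}

// capMatches is a hypothetical helper (not part of pdfsearch). It keeps
// matches in input order while allowing at most maxPages distinct pages and
// at most maxPerPage matches on each page.
func capMatches(matches []match, maxPages, maxPerPage int) []match {
	perPage := map[pageKey]int{}
	var kept []match
	for _, m := range matches {
		k := pageKey{m.File, m.Page}
		n, seen := perPage[k]
		if !seen && len(perPage) >= maxPages {
			continue // page budget exhausted; skip matches on new pages
		}
		if n >= maxPerPage {
			continue // this page already has its quota of matches
		}
		perPage[k] = n + 1
		kept = append(kept, m)
	}
	return kept
}

func main() {
	var matches []match
	for i := 0; i < 12; i++ {
		matches = append(matches, match{File: "a.pdf", Page: 1})
	}
	matches = append(matches, match{File: "a.pdf", Page: 2})
	kept := capMatches(matches, 100, 10)
	fmt.Printf("kept %d of %d matches\n", len(kept), len(matches))
}
```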

Types

type ImageLocation

type ImageLocation struct {
	PagePosition         // Enumerated position.
	XPosMm       float64 // Custom x coordinate in points (positive from right, negative from left).
	YPosMm       float64 // Custom y coordinate in points (positive from bottom, negative from top).
	WidthMm      float64 // Width of the image in mm.
	HeightMm     float64 // Height of the image in mm.
	MarginXMm    float64 // Horizontal page margin in mm.
	MarginYMm    float64 // Vertical page margin in mm.
}

ImageLocation specifies the location of a square image on a page.
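The field comments above mention both millimetres and points. For callers who need to move between the two, the conversion follows from 1 inch = 25.4 mm = 72 PDF points. The sketch below shows only that arithmetic; the helper names are hypothetical, and it makes no claim about which unit each ImageLocation field actually expects.

```go
package main

import "fmt"

// pointsPerMm is the number of PDF points in one millimetre (72 per inch).
const pointsPerMm = 72.0 / 25.4

// mmToPoints is a hypothetical helper converting millimetres to points.
func mmToPoints(mm float64) float64 { return mm * pointsPerMm }

// pointsToMm is a hypothetical helper converting points to millimetres.
func pointsToMm(pt float64) float64 { return pt / pointsPerMm }

func main() {
	// A4 paper is 210 mm x 297 mm.
	fmt.Printf("A4 width:  %.2f pt\n", mmToPoints(210))
	fmt.Printf("A4 height: %.2f pt\n", mmToPoints(297))
	fmt.Printf("72 pt:     %.2f mm\n", pointsToMm(72))
}
```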

type PagePosition

type PagePosition int

PagePosition is an enumerated position on a page.

type PdfIndex

type PdfIndex struct {
	// contains filtered or unexported fields
}

PdfIndex is an opaque struct that describes an index over some PDF files. It consists of

  • a bleve index (bleveIdx),
  • a mapping between the PDF files and the bleve index (blevePdf), and
  • controls and statistics.

func FromBytes

func FromBytes(data []byte) (PdfIndex, error)

FromBytes extracts a PdfIndex from the bytes in `data`.

func IndexPdfFiles

func IndexPdfFiles(pathList []string, persist bool, persistDir string, report func(string)) (
	PdfIndex, error)

IndexPdfFiles returns an index for the PDF files in `pathList`. If `persist` is false, the index is stored in memory. If `persist` is true, the index is stored on disk in `persistDir`. `report` is a supplied function that is called to report progress.

func IndexPdfReaders

func IndexPdfReaders(pathList []string, rsList []io.ReadSeeker, persist bool, persistDir string,
	report func(string)) (PdfIndex, error)

IndexPdfReaders returns a PdfIndex over the PDF contents read by the io.ReadSeekers in `rsList`. The names of the PDFs are in the corresponding positions in `pathList`. If `persist` is false, the index is stored in memory. If `persist` is true, the index is stored on disk in `persistDir`. `report` is a supplied function that is called to report progress.

func ReuseIndex

func ReuseIndex(persistDir string) PdfIndex

ReuseIndex returns an existing on-disk PdfIndex with directory `persistDir`.

func (PdfIndex) Duration

func (p PdfIndex) Duration() string

Duration returns a string describing how long indexing took and where the time was spent.

func (PdfIndex) Equals

func (p PdfIndex) Equals(q PdfIndex) bool

Equals returns true if `p` contains the same information as `q`.

func (PdfIndex) NumFiles

func (p PdfIndex) NumFiles() int

func (PdfIndex) NumPages

func (p PdfIndex) NumPages() int

func (PdfIndex) Search

func (p PdfIndex) Search(term string, maxResults int) (PdfMatchSet, error)

Search does a full-text search over PdfIndex `p` for `term` and returns up to `maxResults` matches. This is the main search function.

func (PdfIndex) StorageName

func (p PdfIndex) StorageName() string

StorageName returns a descriptive name for index storage mode.

func (PdfIndex) String

func (p PdfIndex) String() string

String returns a string describing `p`.

func (PdfIndex) ToBytes

func (p PdfIndex) ToBytes() ([]byte, error)

ToBytes serializes `p` to a byte array.

type PdfMatchSet

type PdfMatchSet doclib.PdfMatchSet

PdfMatchSet makes doclib.PdfMatchSet public.

func SearchMem

func SearchMem(data []byte, term string, maxResults int) (PdfMatchSet, error)

SearchMem does a full-text search over the PdfIndex in `data` for `term` and returns up to `maxResults` matches. `data` is the serialized PdfIndex returned from IndexPdfMem.

func (PdfMatchSet) Best

func (s PdfMatchSet) Best() PdfMatchSet

Best makes doclib.PdfMatchSet.Best public.

func (PdfMatchSet) Equals

func (s PdfMatchSet) Equals(t PdfMatchSet) bool

Equals makes doclib.PdfMatchSet.Equals public.

func (PdfMatchSet) Files

func (s PdfMatchSet) Files() []string

Files makes doclib.PdfMatchSet.Files public.

func (PdfMatchSet) String

func (s PdfMatchSet) String() string

String makes doclib.PdfMatchSet.String public.

Directories

Path             Synopsis
cmd_utils        PaperCut specific functions.
internal/doclib  This source implements the main function IndexPdfReaders().
