pdfsearch

package module

v0.0.0 (latest) Published: Oct 28, 2019 License: MIT Imports: 14 Imported by: 0

Repository

github.com/papercutsoftware/pdfsearch

README ¶

Pure Go Full Text Search of PDF Files

This library implements full text search for PDF files.

The public APIs are in index_search.go.

There are some command-line programs that demonstrate the library's functionality.

examples/pdf_search_demo.go demonstrates the main APIs.
examples/pdf_search_verify.go verifies the consistency of the in-memory and on-disk APIs.
examples/index.go builds an index over a set of PDFs.
examples/search.go searches the index built by examples/index.go.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch
cd pdfsearch/examples
go build pdf_search_demo.go
go build pdf_search_verify.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.
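
The usage line takes the PDF path from -f and treats the remaining arguments as a single search term. A minimal sketch of that argument handling with the standard flag package follows. parseArgs is a hypothetical helper written for illustration, and the real pdf_search_demo.go may parse its arguments differently.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

// parseArgs is a hypothetical helper (not part of pdfsearch). It reads the PDF
// path from -f and joins the remaining arguments into one search term.
func parseArgs(args []string) (pdfPath, term string, err error) {
	fs := flag.NewFlagSet("pdf_search_demo", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // Return parse errors instead of printing them.
	fs.StringVar(&pdfPath, "f", "", "path of the PDF file to search")
	if perr := fs.Parse(args); perr != nil {
		return "", "", perr
	}
	term = strings.Join(fs.Args(), " ")
	if pdfPath == "" || term == "" {
		return "", "", errors.New("usage: pdf_search_demo -f <PDF path> <search term>")
	}
	return pdfPath, term, nil
}

func main() {
	// The arguments from the example above. A real program would pass os.Args[1:].
	args := []string{"-f", "PDF32000_2008.pdf", "cubic", "Bézier", "curve"}
	pdfPath, term, err := parseArgs(args)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("searching %s for: %s\n", pdfPath, term)
}
```

Joining the trailing arguments means the search term does not need to be quoted on the command line.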

pdf_search_demo.go shows how to use the APIs in index_search.go to

create indexes over PDF files,
search those indexes using full-text search, and
mark up PDF files with the locations of the search matches on pages.

It supports three types of index:

On-disk. These can be as large as your disk but are slower.
In-memory with the index stored in a Go struct. Faster but limited to (virtual) memory size.
In-memory with the index serialized to a []byte. Useful for non-Go callers such as web apps.

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

Documentation ¶

Index ¶

Constants
func AddImageToPdf(rs io.ReadSeeker, w io.Writer, image goimage.Image, url string, pageNum int, ...) error
func ExposeErrors()
func IndexPdfMem(pathList []string, rsList []io.ReadSeeker, report func(string)) ([]byte, error)
func MarkupPdfResults(results PdfMatchSet, outPath string) error
type ImageLocation
type PagePosition
type PdfIndex
type PdfMatchSet
func SearchMem(data []byte, term string, maxResults int) (PdfMatchSet, error)

Constants ¶

const (
	// DefaultMaxResults is the default maximum number of results returned.
	DefaultMaxResults = 10
	// DefaultPersistRoot is the default root for on-disk indexes.
	DefaultPersistRoot = "pdf.store"
)

const (
	PageBottomRight = iota
	PageBottomCenter
	PageBottomLeft
	PageCenterRight
	PageCenter
	PageCenterLeft
	PageTopRight
	PageTopCenter
	PageTopLeft
	PageCustomPosition
)
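
Because the block starts at iota, the positions take the values 0 through 9 in declaration order. The sketch below mirrors the enumeration locally so it runs without importing the package. The `position` type, the `positionNames` table, and the String method are additions for illustration only and are not part of pdfsearch.

```go
package main

import "fmt"

// position is a local stand-in for the documented enumeration. The constant
// names and their order are copied from the const block above.
type position int

const (
	PageBottomRight position = iota
	PageBottomCenter
	PageBottomLeft
	PageCenterRight
	PageCenter
	PageCenterLeft
	PageTopRight
	PageTopCenter
	PageTopLeft
	PageCustomPosition
)

// positionNames maps each value to its constant name (illustrative only).
var positionNames = [...]string{
	"PageBottomRight", "PageBottomCenter", "PageBottomLeft",
	"PageCenterRight", "PageCenter", "PageCenterLeft",
	"PageTopRight", "PageTopCenter", "PageTopLeft",
	"PageCustomPosition",
}

// String returns the constant name, or a numeric form for unknown values.
func (p position) String() string {
	if p < 0 || int(p) >= len(positionNames) {
		return fmt.Sprintf("position(%d)", int(p))
	}
	return positionNames[p]
}

func main() {
	for p := PageBottomRight; p <= PageCustomPosition; p++ {
		fmt.Printf("%d %s\n", int(p), p)
	}
}
```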

Variables ¶

This section is empty.

Functions ¶

func AddImageToPdf ¶

func AddImageToPdf(rs io.ReadSeeker, w io.Writer, image goimage.Image, url string, pageNum int,
	loc ImageLocation) error

AddImageToPdf adds an image to a specific page of a PDF. NOTE: This function adds the same image at the same position on every page.

rs: io.ReadSeeker for input PDF.
w: io.Writer for output (modified) PDF.
image: Image to be applied to pages.
pageNum: (1-offset) page number to apply image to. Specify 0 for all pages. Specify a negative number to count back from the last page. For example -1 = last page, -2 = second last page.

The image's aspect ratio is maintained.
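
The `pageNum` convention (1-offset, 0 for all pages, negative values counting back from the end) can be illustrated with a small resolver. resolvePages is a hypothetical helper and not the library's implementation. Returning an error for out-of-range values is an assumption, since the documentation does not say how such values are handled.

```go
package main

import "fmt"

// resolvePages is a hypothetical helper (not part of pdfsearch). It applies the
// documented pageNum convention to a document with numPages pages and returns
// the selected 1-offset page numbers.
func resolvePages(pageNum, numPages int) ([]int, error) {
	switch {
	case numPages <= 0:
		return nil, fmt.Errorf("document has no pages")
	case pageNum == 0: // 0 selects every page.
		pages := make([]int, numPages)
		for i := range pages {
			pages[i] = i + 1
		}
		return pages, nil
	case pageNum < 0: // -1 is the last page, -2 the second last, and so on.
		pageNum = numPages + pageNum + 1
	}
	// Assumption: values outside 1..numPages are rejected.
	if pageNum < 1 || pageNum > numPages {
		return nil, fmt.Errorf("page out of range for a %d-page document", numPages)
	}
	return []int{pageNum}, nil
}

func main() {
	for _, n := range []int{0, 2, -1, -2} {
		pages, err := resolvePages(n, 5)
		fmt.Println(n, pages, err)
	}
}
```

For a 5-page document, 0 resolves to pages 1 through 5, 2 to page 2, -1 to page 5, and -2 to page 4.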

func ExposeErrors ¶

func ExposeErrors()

ExposeErrors turns off recovery from panics in called libraries.

func IndexPdfMem ¶

func IndexPdfMem(pathList []string, rsList []io.ReadSeeker, report func(string)) ([]byte, error)

IndexPdfMem returns a byte slice that contains an index for the PDF io.ReadSeekers in `rsList`. The names of the PDFs are in the corresponding positions in `pathList`. `report` is a caller-supplied function that is called to report progress.

func MarkupPdfResults ¶

func MarkupPdfResults(results PdfMatchSet, outPath string) error

MarkupPdfResults adds rectangles to the text positions of all matches on their PDF pages, combines these pages together and writes the resulting PDF to `outPath`. The PDF will have at most 100 pages because no one is likely to read through more than 100 pages of search results. There will be at most 10 results per page.

Types ¶

type ImageLocation ¶

type ImageLocation struct {
	PagePosition         // Enumerated position.
	XPosMm       float64 // Custom x coordinate in points (positive from right, negative from left).
	YPosMm       float64 // Custom y coordinate in points (positive from bottom, negative from top).
	WidthMm      float64 // Width of the image in mm.
	HeightMm     float64 // Height of the image in mm.
	MarginXMm    float64 // Horizontal page margin in mm.
	MarginYMm    float64 // Vertical page margin in mm.
}

ImageLocation specifies the location of a square image on a page.

type PagePosition ¶

type PagePosition int

PagePosition is an enumerated position on a page.

type PdfIndex ¶

type PdfIndex struct {
	// contains filtered or unexported fields
}

PdfIndex is an opaque struct that describes an index over some PDF files. It consists of
- a bleve index (bleveIdx),
- a mapping between the PDF files and the bleve index (blevePdf), and
- controls and statistics.

func FromBytes ¶

func FromBytes(data []byte) (PdfIndex, error)

FromBytes extracts a PdfIndex from the bytes in `data`.

func IndexPdfFiles ¶

func IndexPdfFiles(pathList []string, persist bool, persistDir string, report func(string)) (
	PdfIndex, error)

IndexPdfFiles returns an index for the PDF files in `pathList`. If `persist` is false, the index is stored in memory. If `persist` is true, the index is stored on disk in `persistDir`. `report` is a supplied function that is called to report progress.

func IndexPdfReaders ¶

func IndexPdfReaders(pathList []string, rsList []io.ReadSeeker, persist bool, persistDir string,
	report func(string)) (PdfIndex, error)

IndexPdfReaders returns a PdfIndex over the PDF contents read by the io.ReadSeekers in `rsList`. The names of the PDFs are in the corresponding positions in `pathList`. If `persist` is false, the index is stored in memory. If `persist` is true, the index is stored on disk in `persistDir`. `report` is a caller-supplied function that is called to report progress.

func ReuseIndex ¶

func ReuseIndex(persistDir string) PdfIndex

ReuseIndex returns an existing on-disk PdfIndex with directory `persistDir`.

func (PdfIndex) Duration ¶

func (p PdfIndex) Duration() string

Duration returns a string describing how long indexing took and where the time was spent.

func (PdfIndex) Equals ¶

func (p PdfIndex) Equals(q PdfIndex) bool

Equals returns true if `p` contains the same information as `q`.

func (PdfIndex) NumFiles ¶

func (p PdfIndex) NumFiles() int

func (PdfIndex) NumPages ¶

func (p PdfIndex) NumPages() int

func (PdfIndex) Search ¶

func (p PdfIndex) Search(term string, maxResults int) (PdfMatchSet, error)

Search does a full-text search over PdfIndex `p` for `term` and returns up to `maxResults` matches. This is the main search function.

func (PdfIndex) StorageName ¶

func (p PdfIndex) StorageName() string

StorageName returns a descriptive name for index storage mode.

func (PdfIndex) String ¶

func (p PdfIndex) String() string

String returns a string describing `p`.

func (PdfIndex) ToBytes ¶

func (p PdfIndex) ToBytes() ([]byte, error)

ToBytes serializes `p` to a byte slice.

type PdfMatchSet ¶

type PdfMatchSet doclib.PdfMatchSet

PdfMatchSet makes doclib.PdfMatchSet public.

func SearchMem ¶

func SearchMem(data []byte, term string, maxResults int) (PdfMatchSet, error)

SearchMem does a full-text search over the PdfIndex in `data` for `term` and returns up to `maxResults` matches. `data` is the serialized PdfIndex returned from IndexPdfMem.

func (PdfMatchSet) Best ¶

func (s PdfMatchSet) Best() PdfMatchSet

Best makes doclib.PdfMatchSet.Best public.

func (PdfMatchSet) Equals ¶

func (s PdfMatchSet) Equals(t PdfMatchSet) bool

Equals makes doclib.PdfMatchSet.Equals public.

func (PdfMatchSet) Files ¶

func (s PdfMatchSet) Files() []string

Files makes doclib.PdfMatchSet.Files public.

func (PdfMatchSet) String ¶

func (s PdfMatchSet) String() string

String makes doclib.PdfMatchSet.String public.

Directories ¶

Path	Synopsis
examples
cmd_utils	PaperCut specific functions.
internal
doclib	This source implements the main function IndexPdfReaders().
serial
serial/locations
serial/pdf_index
utils
