pdf_parser

Version: v0.1.97
This package is not in the latest version of its module.
Published: Dec 18, 2023 License: MIT Imports: 10 Imported by: 0

README

pdf_parser by flotzilla - updated by redskal/@sam_phisher

I was looking for a simple PDF parser library and found this one, but it only worked with files on disk. My use case required me to process data in memory, so I've adapted it.

It's pretty much a copy-pasta of the relevant functions which were then modified slightly. Hopefully I'll get around to reimplementing this properly from scratch, or with some major refactoring. For now, it serves the purpose for my PoC.

Usage should be something like:

import "github.com/redskal/pdf_parser"

pdf, err := pdf_parser.ParsePdfInMemory(byteArrayOfPdfFile)

pdf.GetTitle()
pdf.GetAuthor()
pdf.GetCreator()
pdf.GetISBN()
pdf.GetPublishers() // returns []string
pdf.GetLanguages()  // returns []string
pdf.GetDescription()
pdf.GetPagesCount()

Original README is below.


Pdf metadata parser

Go library for parsing pdf metadata information

License

MIT

Usage
import "github.com/flotzilla/pdf_parser"

// parse file
pdf, err := pdf_parser.ParsePdf("filepath/file.pdf")

// main functions
pdf.GetTitle()
pdf.GetAuthor()
pdf.GetCreator()
pdf.GetISBN()
pdf.GetPublishers() // returns []string
pdf.GetLanguages()  // returns []string
pdf.GetDescription()
pdf.GetPagesCount()

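Using a custom github.com/sirupsen/logrus logger

import (
	"os"
	"path/filepath"

	"github.com/flotzilla/pdf_parser"
	logger "github.com/sirupsen/logrus"
)

// configure a logrus logger that writes JSON to stdout
lg := logger.New()
lg.SetOutput(os.Stdout)
lg.SetFormatter(&logger.JSONFormatter{})

// hand the logger to the parser, then parse as usual
pdf_parser.SetLogger(lg)
file, _ := filepath.Abs("filepath/file.pdf")
pdf, err := pdf_parser.ParsePdf(file)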

Documentation

Overview

Added by @sam_phisher to facilitate processing PDFs in-memory.

We should now be able to parse PDF data as byte arrays. Also fixed some wonky code, but it still needs major refactoring.

Constants

const BufferSize = 50
const BufferSize300 = 300

Variables

This section is empty.

Functions

This section is empty.

Types

type InfoObject

type InfoObject struct {
	Title        string
	Author       string
	Creator      string
	CreationDate string
	Producer     string
	ModDate      string
}
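
CreationDate and ModDate are plain strings. PDF date values conventionally take the form D:YYYYMMDDHHmmSSOHH'mm', with everything after the year optional. The sketch below converts such a string to a time.Time. It assumes the fields hold the raw PDF date string, which is not confirmed here, and parsePdfDate is a hypothetical helper, not part of pdf_parser.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parsePdfDate converts a PDF date string such as "D:20231218093000+01'00'"
// into a time.Time. Hypothetical helper, not part of pdf_parser.
func parsePdfDate(s string) (time.Time, error) {
	s = strings.TrimPrefix(strings.TrimSpace(s), "D:")
	// Drop the apostrophes in the UTC offset: +01'00' becomes +0100.
	s = strings.ReplaceAll(s, "'", "")
	// Trailing fields are optional, so try the most specific layout first.
	layouts := []string{
		"20060102150405-0700",
		"20060102150405Z",
		"20060102150405",
		"200601021504",
		"2006010215",
		"20060102",
		"200601",
		"2006",
	}
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized PDF date %q", s)
}

func main() {
	t, err := parsePdfDate("D:20231218093000+01'00'")
	fmt.Println(t, err)
}
```

Producers vary in how strictly they follow this format, so treating a parse failure as "date unknown" rather than a fatal error is probably the safer choice.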

type MetaDataRdf

type MetaDataRdf struct {
	Title       string
	Description string
	Creator     string
	Date        string
	Isbn        string

	Publishers []string
	Languages  []string
}

type Metadata

type Metadata struct {
	Type          string
	Subtype       string
	Length        int64
	DL            int64
	RawStreamData []byte
	RdfMeta       *MetaDataRdf
}

type ObjectIdentifier

type ObjectIdentifier struct {
	ObjectNumber     int
	GenerationNumber int
	KeyWord          string
}
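
The fields of ObjectIdentifier line up with a PDF indirect reference, which is written as an object number, a generation number and the keyword R (for example "12 0 R"). The sketch below shows that mapping with a local copy of the struct so it compiles on its own. parseIndirectRef is a hypothetical helper, and storing "R" in KeyWord is an assumption about how the package fills the field.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ObjectIdentifier is a local copy of the documented type.
type ObjectIdentifier struct {
	ObjectNumber     int
	GenerationNumber int
	KeyWord          string
}

// parseIndirectRef splits a reference such as "12 0 R" into its parts.
// Hypothetical helper, not part of pdf_parser.
func parseIndirectRef(s string) (ObjectIdentifier, error) {
	fields := strings.Fields(s)
	if len(fields) != 3 {
		return ObjectIdentifier{}, fmt.Errorf("expected 3 fields, got %d in %q", len(fields), s)
	}
	num, err := strconv.Atoi(fields[0])
	if err != nil {
		return ObjectIdentifier{}, fmt.Errorf("object number: %w", err)
	}
	gen, err := strconv.Atoi(fields[1])
	if err != nil {
		return ObjectIdentifier{}, fmt.Errorf("generation number: %w", err)
	}
	return ObjectIdentifier{ObjectNumber: num, GenerationNumber: gen, KeyWord: fields[2]}, nil
}

func main() {
	ref, err := parseIndirectRef("12 0 R")
	fmt.Printf("%+v %v\n", ref, err)
}
```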

type ObjectSubsection

type ObjectSubsection struct {
	Id                      int // objectId
	ObjectsCount            int
	FirstSubsectionObjectId int
	LastSubsectionObjectId  int
	Elements                map[int]*ObjectSubsectionElement
}

Object subsection that contains the list of objects for this object.

type ObjectSubsectionElement

type ObjectSubsectionElement struct {
	Id               int
	ObjectNumber     int
	GenerationNumber int
	KeyWord          string
}

type PdfInfo

type PdfInfo struct {
	PdfVersion               string
	OriginalXrefOffset       int64
	OriginalTrailerSection   TrailerSection
	AdditionalTrailerSection []*TrailerSection
	XrefTable                []*XrefTable
	Root                     RootObject
	Info                     InfoObject
	Metadata                 Metadata
	PagesCount               int
}
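
PdfVersion and OriginalXrefOffset correspond to two fixed points in a PDF file: the "%PDF-x.y" header on the first line and the byte offset that follows the last "startxref" keyword near the end. As a rough illustration of reading both from a byte slice, here is a minimal sketch. pdfVersion and startXrefOffset are hypothetical helpers and do not show how pdf_parser itself extracts these values.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// pdfVersion returns the version from a "%PDF-1.7" style header.
// Hypothetical helper, not part of pdf_parser.
func pdfVersion(data []byte) (string, error) {
	const marker = "%PDF-"
	if !bytes.HasPrefix(data, []byte(marker)) {
		return "", errors.New("missing PDF header")
	}
	rest := data[len(marker):]
	end := bytes.IndexAny(rest, "\r\n")
	if end < 0 {
		end = len(rest)
	}
	return string(bytes.TrimSpace(rest[:end])), nil
}

// startXrefOffset finds the last "startxref" keyword and parses the
// byte offset that follows it. Hypothetical helper, not part of pdf_parser.
func startXrefOffset(data []byte) (int64, error) {
	const keyword = "startxref"
	idx := bytes.LastIndex(data, []byte(keyword))
	if idx < 0 {
		return 0, errors.New("startxref keyword not found")
	}
	fields := bytes.Fields(data[idx+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("no offset after startxref")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}

func main() {
	doc := []byte("%PDF-1.7\n...body...\nstartxref\n1234\n%%EOF\n")
	version, _ := pdfVersion(doc)
	offset, _ := startXrefOffset(doc)
	fmt.Println(version, offset) // prints "1.7 1234"
}
```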

func ParsePdf

func ParsePdf(fileName string) (*PdfInfo, error)

Parses PDF metadata from the file at fileName.

func ParsePdfInMemory

func ParsePdfInMemory(fileBytes []byte) (*PdfInfo, error)

Parses PDF metadata from a file already loaded into memory as a byte slice.

func (*PdfInfo) GetAuthor

func (pdf *PdfInfo) GetAuthor() string

func (*PdfInfo) GetCover

func (pdf *PdfInfo) GetCover(filepath string) bool

func (*PdfInfo) GetCreator

func (pdf *PdfInfo) GetCreator() string

func (*PdfInfo) GetDate

func (pdf *PdfInfo) GetDate() string

func (*PdfInfo) GetDescription

func (pdf *PdfInfo) GetDescription() string

func (*PdfInfo) GetISBN

func (pdf *PdfInfo) GetISBN() string

func (*PdfInfo) GetLanguage

func (pdf *PdfInfo) GetLanguage() string

func (*PdfInfo) GetLanguages

func (pdf *PdfInfo) GetLanguages() []string

func (*PdfInfo) GetPagesCount

func (pdf *PdfInfo) GetPagesCount() int

func (*PdfInfo) GetPublisherInfo

func (pdf *PdfInfo) GetPublisherInfo() string

func (*PdfInfo) GetPublishers

func (pdf *PdfInfo) GetPublishers() []string

func (*PdfInfo) GetTitle

func (pdf *PdfInfo) GetTitle() string
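
The accessors above return plain values with no error, so callers have to decide what to do with empty fields. One way to consume them is through a small interface, which also keeps the calling code testable without a real PDF. In this sketch metadataReader, summarize and fakeInfo are hypothetical names; it assumes *PdfInfo satisfies the interface via the signatures listed above and that a missing field comes back as an empty string.

```go
package main

import (
	"fmt"
	"strings"
)

// metadataReader lists a few accessors with the signatures documented above.
// Assumption: *pdf_parser.PdfInfo satisfies it.
type metadataReader interface {
	GetTitle() string
	GetAuthor() string
	GetPublishers() []string
	GetPagesCount() int
}

// summarize formats a one-line description, substituting placeholders
// for empty fields. Hypothetical helper, not part of pdf_parser.
func summarize(m metadataReader) string {
	title := m.GetTitle()
	if title == "" {
		title = "(untitled)" // assumption: missing fields are returned as ""
	}
	author := m.GetAuthor()
	if author == "" {
		author = "(unknown author)"
	}
	line := fmt.Sprintf("%s by %s, %d pages", title, author, m.GetPagesCount())
	if pubs := m.GetPublishers(); len(pubs) > 0 {
		line += ", published by " + strings.Join(pubs, "; ")
	}
	return line
}

// fakeInfo stands in for a parsed PDF so the sketch runs without a file.
type fakeInfo struct {
	title, author string
	publishers    []string
	pages         int
}

func (f fakeInfo) GetTitle() string        { return f.title }
func (f fakeInfo) GetAuthor() string       { return f.author }
func (f fakeInfo) GetPublishers() []string { return f.publishers }
func (f fakeInfo) GetPagesCount() int      { return f.pages }

func main() {
	doc := fakeInfo{title: "Sample Document", pages: 12, publishers: []string{"Example Press"}}
	fmt.Println(summarize(doc))
}
```

With the real package, the value returned by ParsePdf or ParsePdfInMemory would be passed to summarize in place of fakeInfo.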

type RootObject

type RootObject struct {
	Type       string
	Pages      *ObjectIdentifier
	Metadata   *ObjectIdentifier
	PageLabels *ObjectIdentifier
	Lang       string
}

type TrailerSection

type TrailerSection struct {
	IdRaw string
	Info  ObjectIdentifier
	Root  ObjectIdentifier
	Size  string
	Prev  int64
}

type XrefTable

type XrefTable struct {
	Objects           map[int]*ObjectSubsectionElement
	ObjectSubsections map[int]*ObjectSubsection
	SectionStart      int64
}
