pdfparser

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 24, 2023 License: MIT Imports: 13 Imported by: 0

README

pdfparser

A small library for parsing text PDFs in hierarchical format (see annex for example).

The parser will return hierarchized data in JSON format with the positioning and style in the original pdf. A small server has also been included for easier testing and trying the library out.

Build Status

Exported functions

Five functions are exported for your convinience:

// ParsePdf parses a multipart text pdf file and returns text in JSON hierarchical order.
// It takes a multipart fileheader (perfect if you are working with a web server
// and don't want to save the file), and returns the output in JSON format, in a byte stream.
// The JSON is formatted from the TextNode struct, so it follows it's format
func ParsePdf(file *multipart.FileHeader) []byte
// ParsePdfFile parses a text pdf file and returns text JSON in hierarchical order.
// It takes a file name, and returns the output in JSON format, in a byte stream.
// The JSON is formatted from the TextNode struct, so it follows it's format
func ParsePdfFile(filename string) []byte
// ParsePdfToTree parses a text pdf file and returns text in tree hierarchical order.
// It takes a multipart fileheader (perfect if you are working with a web server
// and don't want to save the file), and returns the output in tree format, from the TextNode struct.
func ParsePdfToTree(file *multipart.FileHeader) *TextNode
// ParsePdfFileToTree parses a text pdf file and returns text in tree hierarchical order.
// It takes a file name, and returns the output in tree format, from the TextNode struct.
func ParsePdfFileToTree(filename string) *TextNode
// Creates a router for PDF processing on port 8080.
func CreateRouter()

PDF processing server

Run the server

The server only has one main route: POST /parsepdf. It takes a pdf file and returns the JSON-formatted data. To run the server, you can either run the Docker container or if you have go installed, you can run go run *.go in your command line.

A little docker tutorial

You can run the docker container by first building it, then running it

docker build -t "sgtmilk/pdfparser:<TAG>" . --build-arg ENV=production
docker run -p 8080:8080 sgtmilk/pdfparser:<TAG>
Testing with Postman

If you would like to test the server, there are two routes:

  • GET /ping is the testing route, to see if the base server works. It will return pong in json format.
  • POST /parsepdf is the route for extracting data from a pdf file. You can attach a file to the request by going to the Body tab, and selecting the form-data format. Then, in the key, select File type and type file. In the value, select your file.

Annex

[
    {
        "Value": "This is a document. ",
        "Font": "Calibri-Light-27",
        "Children": [
            {
                "Value": "I like trains. ",
                "Font": "Calibri-Light-15",
                "Children": [
                    {
                        "Value": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Font": "Open Sans-10",
                        "Children": null,
                        "Position": {
                            "Top": 219.67000000000002,
                            "Bottom": 157.87000000000006,
                            "Left": 69.19391999999996,
                            "Right": 430.27
                        }
                    }
                ],
                "Position": {
                    "Top": 126.55000000000001,
                    "Bottom": 126.55000000000001,
                    "Left": 72.024,
                    "Right": 192.98695999999998
                }
            },
            {
                "Value": "Do you like trains? ",
                "Font": "Calibri-Light-15",
                "Children": [
                    {
                        "Value": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Font": "Open Sans-10",
                        "Children": null,
                        "Position": {
                            "Top": 109.36000000000007,
                            "Bottom": 47.68000000000001,
                            "Left": 69.60576000000002,
                            "Right": 411.31
                        }
                    }
                ],
                "Position": {
                    "Top": 126.55000000000001,
                    "Bottom": 126.55000000000001,
                    "Left": 72.024,
                    "Right": 192.98695999999998
                }
            }
        ],
        "Position": {
            "Top": 271.51000000000005,
            "Bottom": 271.51000000000005,
            "Left": 67.24283999999999,
            "Right": 285.12896
        }
    }
]

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CreateRouter

func CreateRouter()

Creates a router for PDF processing on port 8080. See README for details

func ParsePdf

func ParsePdf(file *multipart.FileHeader) []byte

ParsePdf parses a multipart text pdf file and returns text in JSON hierarchical order. It takes a multipart fileheader (perfect if you are working with a web server and don't want to save the file), and returns the output in JSON format, in a byte stream. The JSON is formatted from the TextNode struct, so it follows it's format

func ParsePdfFile

func ParsePdfFile(filename string) []byte

ParsePdfFile parses a text pdf file and returns text JSON in hierarchical order. It takes a file name, and returns the output in JSON format, in a byte stream. The JSON is formatted from the TextNode struct, so it follows it's format

Types

type Rect

type Rect struct {
	Top    float64
	Bottom float64
	Left   float64
	Right  float64
}

An agglomeration of positional attributes of a rectangle

type TextNode

type TextNode struct {
	Value    string     // the text value, what is written
	Font     string     // the font of the text
	Children []TextNode // the children of this piece of text (ex: this is the title of ...), or nil if there are none
	Position Rect       // The positioning of this piece of text
}

An agglomeration of multiple attributes of the structure and style of the text

func ParsePdfFileToTree

func ParsePdfFileToTree(filename string) *TextNode

ParsePdfFileToTree parses a text pdf file and returns text in tree hierarchical order. It takes a file name, and returns the output in tree format, from the TextNode struct.

func ParsePdfToTree

func ParsePdfToTree(file *multipart.FileHeader) *TextNode

ParsePdfToTree parses a text pdf file and returns text in tree hierarchical order. It takes a multipart fileheader (perfect if you are working with a web server and don't want to save the file), and returns the output in tree format, from the TextNode struct.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL