pdfparser

Version: v0.1.0 (latest). Published: Apr 24, 2023. License: MIT. Imports: 13. Imported by: 0.

Repository

github.com/SgtMilk/pdfparser

README ¶

pdfparser

A small library for parsing text PDFs into a hierarchical format (see the annex for an example).

The parser returns hierarchical data in JSON format, including each piece of text's position and style in the original PDF. A small server is also included to make testing and trying out the library easier.

Exported functions

Five functions are exported for your convenience:

// ParsePdf parses a multipart text pdf file and returns text in JSON hierarchical order.
// It takes a multipart fileheader (perfect if you are working with a web server
// and don't want to save the file), and returns the output in JSON format, in a byte stream.
// The JSON is formatted from the TextNode struct, so it follows its format.
func ParsePdf(file *multipart.FileHeader) []byte

// ParsePdfFile parses a text pdf file and returns text JSON in hierarchical order.
// It takes a file name, and returns the output in JSON format, in a byte stream.
// The JSON is formatted from the TextNode struct, so it follows its format.
func ParsePdfFile(filename string) []byte

// ParsePdfToTree parses a text pdf file and returns text in tree hierarchical order.
// It takes a multipart fileheader (perfect if you are working with a web server
// and don't want to save the file), and returns the output in tree format, from the TextNode struct.
func ParsePdfToTree(file *multipart.FileHeader) *TextNode

// ParsePdfFileToTree parses a text pdf file and returns text in tree hierarchical order.
// It takes a file name, and returns the output in tree format, from the TextNode struct.
func ParsePdfFileToTree(filename string) *TextNode

// CreateRouter creates a router for PDF processing on port 8080.
func CreateRouter()

PDF processing server

Run the server

The server has one main route, POST /parsepdf, which takes a PDF file and returns the JSON-formatted data. To run the server, either run the Docker container or, if you have Go installed, run go run *.go on the command line.

A little docker tutorial

You can run the Docker container by first building it, then running it:

docker build -t "sgtmilk/pdfparser:<TAG>" . --build-arg ENV=production
docker run -p 8080:8080 sgtmilk/pdfparser:<TAG>

Testing with Postman

If you would like to test the server, there are two routes:

GET /ping is the testing route, used to check that the base server works. It returns pong in JSON format.
POST /parsepdf is the route for extracting data from a PDF file. To attach a file to the request in Postman, open the Body tab and select the form-data format. In the key field, select the File type and enter file as the key name. In the value field, select your file.

Annex

[
    {
        "Value": "This is a document. ",
        "Font": "Calibri-Light-27",
        "Children": [
            {
                "Value": "I like trains. ",
                "Font": "Calibri-Light-15",
                "Children": [
                    {
                        "Value": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Font": "Open Sans-10",
                        "Children": null,
                        "Position": {
                            "Top": 219.67000000000002,
                            "Bottom": 157.87000000000006,
                            "Left": 69.19391999999996,
                            "Right": 430.27
                        }
                    }
                ],
                "Position": {
                    "Top": 126.55000000000001,
                    "Bottom": 126.55000000000001,
                    "Left": 72.024,
                    "Right": 192.98695999999998
                }
            },
            {
                "Value": "Do you like trains? ",
                "Font": "Calibri-Light-15",
                "Children": [
                    {
                        "Value": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Font": "Open Sans-10",
                        "Children": null,
                        "Position": {
                            "Top": 109.36000000000007,
                            "Bottom": 47.68000000000001,
                            "Left": 69.60576000000002,
                            "Right": 411.31
                        }
                    }
                ],
                "Position": {
                    "Top": 126.55000000000001,
                    "Bottom": 126.55000000000001,
                    "Left": 72.024,
                    "Right": 192.98695999999998
                }
            }
        ],
        "Position": {
            "Top": 271.51000000000005,
            "Bottom": 271.51000000000005,
            "Left": 67.24283999999999,
            "Right": 285.12896
        }
    }
]
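
If you receive this JSON from the server or from ParsePdfFile, it can be decoded back into Go values. The following is a minimal sketch, assuming the top level is an array of nodes as in the annex. Rect and TextNode are redeclared locally (copied from the documented fields) so the example compiles without importing the library, and decodeNodes and sample are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rect and TextNode are local copies of the documented structs.
type Rect struct {
	Top, Bottom, Left, Right float64
}

type TextNode struct {
	Value    string
	Font     string
	Children []TextNode
	Position Rect
}

// decodeNodes unmarshals parser output shaped like the annex:
// a JSON array of TextNode objects.
func decodeNodes(data []byte) ([]TextNode, error) {
	var nodes []TextNode
	if err := json.Unmarshal(data, &nodes); err != nil {
		return nil, fmt.Errorf("decode parser output: %w", err)
	}
	return nodes, nil
}

// sample is a shortened, hand-trimmed version of the annex output.
const sample = `[{
	"Value": "This is a document. ",
	"Font": "Calibri-Light-27",
	"Children": [{
		"Value": "I like trains. ",
		"Font": "Calibri-Light-15",
		"Children": null,
		"Position": {"Top": 126.55, "Bottom": 126.55, "Left": 72.024, "Right": 192.98}
	}],
	"Position": {"Top": 271.51, "Bottom": 271.51, "Left": 67.24, "Right": 285.12}
}]`

func main() {
	nodes, err := decodeNodes([]byte(sample))
	if err != nil {
		panic(err)
	}
	root := nodes[0]
	fmt.Printf("%q (%s) has %d child node(s)\n", root.Value, root.Font, len(root.Children))
}
```

A JSON null in Children decodes to a nil slice, so ranging over the children of a leaf node is safe without an extra check.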

Documentation ¶

Index ¶

func CreateRouter()
func ParsePdf(file *multipart.FileHeader) []byte
func ParsePdfFile(filename string) []byte
type Rect
type TextNode
- func ParsePdfFileToTree(filename string) *TextNode
- func ParsePdfToTree(file *multipart.FileHeader) *TextNode

Functions ¶

func CreateRouter ¶

func CreateRouter()

CreateRouter creates a router for PDF processing on port 8080. See the README for details.

func ParsePdf ¶

func ParsePdf(file *multipart.FileHeader) []byte

ParsePdf parses a multipart text pdf file and returns text in JSON hierarchical order. It takes a multipart fileheader (perfect if you are working with a web server and don't want to save the file), and returns the output in JSON format, in a byte stream. The JSON is formatted from the TextNode struct, so it follows its format.

func ParsePdfFile ¶

func ParsePdfFile(filename string) []byte

ParsePdfFile parses a text pdf file and returns text JSON in hierarchical order. It takes a file name, and returns the output in JSON format, in a byte stream. The JSON is formatted from the TextNode struct, so it follows its format.

Types ¶

type Rect ¶

type Rect struct {
	Top    float64
	Bottom float64
	Left   float64
	Right  float64
}

Rect groups the positional attributes of a rectangle.
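
As an illustration of how the four fields relate, here is a small sketch that derives a width and height from a Rect. It assumes Top is greater than or equal to Bottom, which holds for every node in the README annex but is not stated by the library. Rect is redeclared locally and size is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Rect is a local copy of the documented struct.
type Rect struct {
	Top, Bottom, Left, Right float64
}

// size returns the width and height of r.
// Assumption: Top >= Bottom, as in the annex sample values.
func size(r Rect) (width, height float64) {
	return r.Right - r.Left, r.Top - r.Bottom
}

func main() {
	// Rounded values from the first body paragraph in the annex.
	body := Rect{Top: 219.67, Bottom: 157.87, Left: 69.19, Right: 430.27}
	w, h := size(body)
	fmt.Printf("width=%.2f height=%.2f\n", w, h)
}
```

In the annex, the heading nodes have equal Top and Bottom values, so this helper would report a height of zero for them. Treat the result as a rough extent rather than a true bounding box.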

type TextNode ¶

type TextNode struct {
	Value    string     // the text value, what is written
	Font     string     // the font of the text
	Children []TextNode // the children of this piece of text (ex: this is the title of ...), or nil if there are none
	Position Rect       // The positioning of this piece of text
}

TextNode groups the structure and style attributes of a piece of text.

func ParsePdfFileToTree ¶

func ParsePdfFileToTree(filename string) *TextNode

ParsePdfFileToTree parses a text pdf file and returns text in tree hierarchical order. It takes a file name, and returns the output in tree format, from the TextNode struct.

func ParsePdfToTree ¶

func ParsePdfToTree(file *multipart.FileHeader) *TextNode

ParsePdfToTree parses a text pdf file and returns text in tree hierarchical order. It takes a multipart fileheader (perfect if you are working with a web server and don't want to save the file), and returns the output in tree format, from the TextNode struct.
