pagedata

package

v0.11.3 (latest) | Published: Jun 15, 2020 | License: AGPL-3.0 | Imports: 13 | Imported by: 0

Repository

github.com/timdrysdale/gradex-cli

README ¶

pdfpagedata

Read/write gradex pagedata from PDF pages

Why

Tracking pages in PDF documents when they are split into separate files can be done in a few ways:

-- append something to the filename for each page
-- put info in the metadata
-- use some page text, placed off the page
-- write a protocol buffer into a stream object

Unfortunately, processing thousands of pages, through different hands, needs more safety than fragile filenames can provide. Plus we will have duplicate files with non-duplicate annotations. So what do we do when two people re-upload their different files with the same name, and then either overwrite or make some modification to the filename? How do we recover from an unfortunate name choice here?

We could stash info in the metadata, but that tends to be file-level, so it is not clear how to handle duplicate custom metadata fields when multiple files from different documents are joined, then split, then joined again, etc. Besides, I've seen editors mess with the metadata, and I don't fancy users editing it either.

Off-page page text seems fragile too, but I get some comfort from reading that people can NOT crop content away even when they want to. A test has been included for this very purpose, which is passing.

Wrinkles

Text written in the same place gets read back out in some sort of merged way, so pageData is written in a tiny font (like 0.00001) and randomly scattered around a location that is far off the page. Tag destruction is detected (such as from clashes), and multiple page datas on a page are supported.

A collision is possible ... we could always consider writing each hidden data twice ...
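The behaviour described above (several page datas per page, detection of destroyed tags, and the idea of writing each hidden data twice) can be sketched as a scan over the text read back from a page. This is only an illustration: the function name extractAll is hypothetical, the real parser in this package may work differently, and the hash check implied by the package constants is left out here.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag strings copied from the package constants.
const (
	StartTag = "<gradex-pagedata>"
	EndTag   = "</gradex-pagedata>"
)

// extractAll is a hypothetical helper. It returns every distinct payload
// found between a StartTag and the next EndTag, and counts blocks whose
// end tag is missing (treated here as "tag destruction").
func extractAll(text string) (payloads []string, damaged int) {
	seen := make(map[string]bool)
	for {
		start := strings.Index(text, StartTag)
		if start < 0 {
			return payloads, damaged
		}
		text = text[start+len(StartTag):]
		end := strings.Index(text, EndTag)
		next := strings.Index(text, StartTag)
		// No end tag, or a new start tag before the end tag: the block is damaged.
		if end < 0 || (next >= 0 && next < end) {
			damaged++
			continue
		}
		payload := text[:end]
		text = text[end+len(EndTag):]
		// Identical copies (e.g. data deliberately written twice) are kept once.
		if !seen[payload] {
			seen[payload] = true
			payloads = append(payloads, payload)
		}
	}
}

func main() {
	text := "junk" + StartTag + "A" + EndTag + StartTag + "A" + EndTag +
		StartTag + "B-truncated" + StartTag + "C" + EndTag
	payloads, damaged := extractAll(text)
	fmt.Println(payloads, damaged)
}
```

Writing each block twice, as suggested above, costs little with this kind of reader: a surviving copy is enough, and duplicates collapse into one entry.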

Future

Writing a protocol buffer into a stream object seems like a more robust way (and it avoids crop and collision worries), but it would probably take half a day to a day to develop, so it is a roadmap item for now.

Documentation ¶

Index ¶

Constants
func AddPageDataToPDF(inputPath string, outputPath string, pdMap map[int]PageData) error
func GetLen(input map[int]PageData) int
func GetLinkMap(pageDataMap map[int]PageData) (map[int]Link, error)
func MarshalOneToCreator(c *creator.Creator, pd *PageData) error
func PrettyPrintStruct(layout interface{}) error
func TriageFile(inputPath string) (map[int]Summary, error)
func UnMarshalAllFromFile(inputPath string) (map[int]PageData, error)
type Field
type FileDetail
type ItemDetail
type Link
type PageData
type PageDetail
type ProcessDetail
type Summary

Constants ¶

const (
	IsPage    = "page"
	IsRegion  = "region"
	IsCover   = "cover"
	IsMontage = "montage"

	IsAnonymous = "anonymous"
	IsIdentity  = "identity"
)

const (
	StartTag        = "<gradex-pagedata>"
	EndTag          = "</gradex-pagedata>"
	StartTagOffset  = len(StartTag)
	EndTagOffset    = len(EndTag)
	StartHash       = "<hash>"
	EndHash         = "</hash>"
	StartHashOffset = len(StartHash)
	EndHashOffset   = len(EndHash)
)
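The tag and hash constants suggest that each payload is framed by tags and carries a hash so that damage can be detected. The sketch below shows one way such framing could work. The layout (hash first, then payload, both inside the outer tags), the choice of SHA-256, and the helper names digest, wrap and unwrap are all assumptions made for illustration, not the package's actual format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Constants copied from the package documentation.
const (
	StartTag        = "<gradex-pagedata>"
	EndTag          = "</gradex-pagedata>"
	StartTagOffset  = len(StartTag)
	EndTagOffset    = len(EndTag)
	StartHash       = "<hash>"
	EndHash         = "</hash>"
	StartHashOffset = len(StartHash)
	EndHashOffset   = len(EndHash)
)

// digest is a hypothetical helper; the real package may use a different hash.
func digest(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// wrap frames a payload as StartTag + <hash>H</hash> + payload + EndTag.
// This layout is assumed, not taken from the package source.
func wrap(payload string) string {
	return StartTag + StartHash + digest(payload) + EndHash + payload + EndTag
}

// unwrap reverses wrap and fails if a tag is missing or the hash does not match.
func unwrap(framed string) (string, error) {
	start := strings.Index(framed, StartTag)
	if start < 0 {
		return "", errors.New("start tag not found")
	}
	rest := framed[start+StartTagOffset:]
	end := strings.Index(rest, EndTag)
	if end < 0 {
		return "", errors.New("end tag not found")
	}
	body := rest[:end]
	if !strings.HasPrefix(body, StartHash) {
		return "", errors.New("hash start tag not found")
	}
	body = body[StartHashOffset:]
	hashEnd := strings.Index(body, EndHash)
	if hashEnd < 0 {
		return "", errors.New("hash end tag not found")
	}
	want := body[:hashEnd]
	payload := body[hashEnd+EndHashOffset:]
	if digest(payload) != want {
		return "", errors.New("hash mismatch")
	}
	return payload, nil
}

func main() {
	payload, err := unwrap(wrap(`{"revision":1}`))
	fmt.Println(payload, err)
}
```

The offset constants make the slicing arithmetic readable: each one is simply the length of the tag that has to be skipped.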

Variables ¶

This section is empty.

Functions ¶

func AddPageDataToPDF ¶ added in v0.8.4

func AddPageDataToPDF(inputPath string, outputPath string, pdMap map[int]PageData) error

modified from https://github.com/unidoc/unipdf-examples/blob/master/text/pdf_insert_text.go

func GetLen ¶

func GetLen(input map[int]PageData) int

func GetLinkMap ¶ added in v0.5.0

func GetLinkMap(pageDataMap map[int]PageData) (map[int]Link, error)

A non-nil error means there is a broken sequence on at least one page; the link map has the details.
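The page does not document how a sequence is judged to be broken. As a rough sketch only, one plausible reading is that each page detail names the UUID it follows, and a page is linked when that chain can be walked from the current detail back through every previous one. The detail type below is a trimmed stand-in for PageDetail, and buildLink is a hypothetical helper, not the package's implementation.

```go
package main

import "fmt"

// detail is a trimmed stand-in for PageDetail, keeping only the linking fields.
type detail struct {
	UUID    string
	Follows string
}

// Link mirrors the documented Link type.
type Link struct {
	First    string
	Last     string
	Sequence []string
	IsLinked bool
}

// buildLink walks the Follows chain backwards from current.
// IsLinked is true only if every reference resolves and every
// previous detail is visited exactly once (an assumed definition).
func buildLink(current detail, previous []detail) Link {
	byUUID := make(map[string]detail, len(previous))
	for _, d := range previous {
		byUUID[d.UUID] = d
	}
	visited := make(map[string]bool)
	var reversed []string
	linked := true
	d := current
	for {
		reversed = append(reversed, d.UUID)
		visited[d.UUID] = true
		if d.Follows == "" {
			break // reached the start of the chain
		}
		prev, ok := byUUID[d.Follows]
		if !ok || visited[prev.UUID] {
			linked = false // dangling reference or cycle
			break
		}
		d = prev
	}
	if len(reversed) != len(previous)+1 {
		linked = false // some previous details were never reached
	}
	// Reverse so the sequence runs oldest to newest.
	seq := make([]string, len(reversed))
	for i, u := range reversed {
		seq[len(reversed)-1-i] = u
	}
	return Link{First: seq[0], Last: seq[len(seq)-1], Sequence: seq, IsLinked: linked}
}

func main() {
	previous := []detail{{UUID: "a"}, {UUID: "b", Follows: "a"}}
	fmt.Printf("%+v\n", buildLink(detail{UUID: "c", Follows: "b"}, previous))
}
```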

func MarshalOneToCreator ¶

func MarshalOneToCreator(c *creator.Creator, pd *PageData) error

func PrettyPrintStruct ¶

func PrettyPrintStruct(layout interface{}) error

func TriageFile ¶

func TriageFile(inputPath string) (map[int]Summary, error)

func UnMarshalAllFromFile ¶

func UnMarshalAllFromFile(inputPath string) (map[int]PageData, error)

Types ¶

type Field ¶

type Field struct {
	Key   string `json:"k"`
	Value string `json:"v"`
}

type FileDetail ¶

type FileDetail struct {
	Path   string `json:"path"`
	UUID   string `json:"UUID"`
	Number int    `json:"number"`
	Of     int    `json:"of"`
}

type ItemDetail ¶

type ItemDetail struct {
	What    string `json:"what"`
	When    string `json:"when"`
	Who     string `json:"who"`
	UUID    string `json:"UUID"`
	WhoType string `json:"whoType"`
}

WhoType records the kind of identifier held in Who, e.g. exam number: EN, matriculation number: UUN, etc.

type Link ¶ added in v0.5.0

type Link struct {
	First    string
	Last     string
	Sequence []string
	IsLinked bool
}

type PageData ¶

type PageData struct {
	Current  PageDetail   `json:"current"`
	Previous []PageDetail `json:"previous"`
	Revision int          `json:"revision"`
}

type PageDetail ¶

type PageDetail struct {
	Is                  string            `json:"is"` //page, region
	Own                 FileDetail        `json:"own"`
	Original            FileDetail        `json:"original"`
	Current             FileDetail        `json:"current"`
	Item                ItemDetail        `json:"item"`
	Process             ProcessDetail     `json:"process"`
	UUID                string            `json:"UUID"` //for mapping the previous page datas later
	Follows             string            `json:"follows"`
	Revision            int               `json:"revision"` //if we want to rewrite history ....
	Data                []Field           `json:"data"`
	Comments            []comment.Comment `json:"comments"`
	OmittedCommentCount int               `json:"omittedCommentCount"`
}

Use custom data for group authorship if individual authorship must be tracked here; otherwise use a group ID, e.g. group-<uuid>, with the individual authors recorded elsewhere, along with the original submission.
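To make the JSON field names concrete, here is a small round trip using local, trimmed copies of the documented types. Only a few fields are reproduced, the helpers encode and decode are hypothetical, and the key "group" with the value "group-1234" is an invented example of the group ID convention described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field matches the documented type, including its short JSON keys.
type Field struct {
	Key   string `json:"k"`
	Value string `json:"v"`
}

// PageDetail is a trimmed local copy; most documented fields are omitted.
type PageDetail struct {
	Is      string  `json:"is"`
	UUID    string  `json:"UUID"`
	Follows string  `json:"follows"`
	Data    []Field `json:"data"`
}

// PageData matches the documented shape, using the trimmed PageDetail.
type PageData struct {
	Current  PageDetail   `json:"current"`
	Previous []PageDetail `json:"previous"`
	Revision int          `json:"revision"`
}

// encode is a hypothetical helper that returns "" on failure.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

// decode is a hypothetical helper that parses JSON into a PageData.
func decode(s string) (PageData, error) {
	var pd PageData
	err := json.Unmarshal([]byte(s), &pd)
	return pd, err
}

func main() {
	group := Field{Key: "group", Value: "group-1234"}
	fmt.Println(encode(group)) // {"k":"group","v":"group-1234"}

	pd := PageData{Current: PageDetail{Is: "page", UUID: "u1", Data: []Field{group}}}
	back, err := decode(encode(pd))
	fmt.Println(back.Current.Data[0].Value, err)
}
```

The single-letter keys for Field keep each entry short, which plausibly matters when the whole structure has to be hidden in page text.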

type ProcessDetail ¶

type ProcessDetail struct {
	Name     string  `json:"name"`
	UUID     string  `json:"UUID"` // process batch UUID
	UnixTime int64   `json:"unixTime"`
	For      string  `json:"for"`
	ToDo     string  `json:"toDo"`
	By       string  `json:"by"`
	Data     []Field `json:"data"`
}

type Summary ¶

type Summary struct {
	Is   string //page, region, cover-page etc
	What string //item
	For  string //proc
	ToDo string //proc
}

Used in triaging files at ingest/staging
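The comments on Summary hint that Is comes from the page detail, What from the item, and For and ToDo from the process. A minimal sketch of that mapping follows, using trimmed local copies of the types. The helper names summarise and summariseAll and the sample values are hypothetical, and the real TriageFile reads the page data from a PDF first, which is not shown.

```go
package main

import "fmt"

// Trimmed local copies of the documented types.
type ItemDetail struct {
	What string
}

type ProcessDetail struct {
	For  string
	ToDo string
}

type PageDetail struct {
	Is      string
	Item    ItemDetail
	Process ProcessDetail
}

type PageData struct {
	Current PageDetail
}

// Summary mirrors the documented type.
type Summary struct {
	Is   string // page, region, cover-page etc
	What string // item
	For  string // proc
	ToDo string // proc
}

// summarise is a hypothetical helper mapping one page's current detail to a Summary.
func summarise(pd PageData) Summary {
	return Summary{
		Is:   pd.Current.Is,
		What: pd.Current.Item.What,
		For:  pd.Current.Process.For,
		ToDo: pd.Current.Process.ToDo,
	}
}

// summariseAll applies summarise to every page in the map, keyed by page number.
func summariseAll(pages map[int]PageData) map[int]Summary {
	out := make(map[int]Summary, len(pages))
	for n, pd := range pages {
		out[n] = summarise(pd)
	}
	return out
}

func main() {
	pages := map[int]PageData{
		1: {Current: PageDetail{
			Is:      "page",
			Item:    ItemDetail{What: "exam-1"},
			Process: ProcessDetail{For: "marker-a", ToDo: "marking"},
		}},
	}
	fmt.Printf("%+v\n", summariseAll(pages)[1])
}
```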
