opencrucible

v0.0.7
Published: Sep 5, 2023 License: MIT Imports: 19 Imported by: 0

README

opencrucible

Detects file types and extracts text from a range of file formats. Similar to the Apache Tika project, but written in Go.

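List of supported formats:

| Format        | FileParser | MIME Type                                                                 | Metadata |
|---------------|------------|---------------------------------------------------------------------------|----------|
| TXT           | X          | text/plain; charset=utf-8                                                 |          |
| RTF           | X          | text/rtf                                                                  |          |
| DOC (partial) | X          | application/x-ole-storage                                                 |          |
| ODT           | X          | application/vnd.oasis.opendocument.text                                   | X        |
| DOCX          | X          | application/vnd.openxmlformats-officedocument.wordprocessingml.document   | X        |
| PPTX          | X          | application/vnd.openxmlformats-officedocument.presentationml.presentation | X        |
| PDF           | X          | application/pdf                                                           | X        |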

Documentation

Constants

const Version = "0.0.7"

Version exposes the current package version.

Variables

This section is empty.

Functions

func DOCFileParseToString added in v0.0.7

func DOCFileParseToString(FileToParse string) (string, error)

func DOCX2Text added in v0.0.3

func DOCX2Text(file io.ReaderAt, size int64) (string, error)

DOCX2Text extracts the text of a Word document. The size argument is the full size of the input file.
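The (io.ReaderAt, size) pair in this signature is the same shape that the standard library's zip.NewReader accepts. The sketch below shows one way to produce that pair from an in-memory byte slice and read a single archive entry with it. It uses only the standard library; readZipEntry, readZipEntryBytes and buildZip are hypothetical helpers and not part of opencrucible, and nothing here claims to reproduce how DOCX2Text works internally.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// readZipEntry is a hypothetical helper (not part of opencrucible). It takes
// the same (io.ReaderAt, size) pair as DOCX2Text and returns one archive entry.
func readZipEntry(r io.ReaderAt, size int64, name string) ([]byte, error) {
	zr, err := zip.NewReader(r, size)
	if err != nil {
		return nil, err
	}
	f, err := zr.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return io.ReadAll(f)
}

// readZipEntryBytes adapts a byte slice: bytes.Reader implements io.ReaderAt,
// and the size is simply the slice length. For an *os.File, the size would
// come from its Stat().Size() instead.
func readZipEntryBytes(data []byte, name string) ([]byte, error) {
	return readZipEntry(bytes.NewReader(data), int64(len(data)), name)
}

// buildZip creates a one-entry archive in memory so the sketch needs no files.
func buildZip(name, content string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create(name)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(content)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := buildZip("word/document.xml", "<document/>")
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	got, err := readZipEntryBytes(data, "word/document.xml")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(string(got))
}
```

Passing the size explicitly is needed because a ZIP reader locates the central directory from the end of the input, which an io.ReaderAt alone cannot reveal.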

func DOCXFileParseToString added in v0.0.3

func DOCXFileParseToString(FileToParse string) (string, error)

func DetectFileType

func DetectFileType(StreamToDetect []byte) (string, string, error)

func DetectFileTypeMIME

func DetectFileTypeMIME(StreamToDetect []byte) (string, string, error)
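Both detection functions take the raw bytes of a file. The source does not document what the two returned strings contain, so the sketch below does not call them. It only illustrates content-based MIME sniffing with the standard library's http.DetectContentType, which may differ from the approach opencrucible takes. sniffMIME is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniffMIME is a hypothetical helper (not part of opencrucible). It reports
// the MIME type that the standard library infers from the leading bytes.
func sniffMIME(data []byte) string {
	// DetectContentType considers at most the first 512 bytes and falls back
	// to "application/octet-stream" when nothing matches.
	return http.DetectContentType(data)
}

func main() {
	fmt.Println(sniffMIME([]byte("plain text")))
	fmt.Println(sniffMIME([]byte("%PDF-1.7\n")))
	fmt.Println(sniffMIME([]byte("PK\x03\x04")))
}
```

Note that signature sniffing alone reports any ZIP-based container, including DOCX and PPTX, as a generic ZIP archive.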

func IsFileDOCX added in v0.0.3

func IsFileDOCX(data []byte) bool

IsFileDOCX checks whether the data indicates a DOCX file. DOCX files start with the signature 50 4B 03 04. This is the generic ZIP signature, so other ZIP-based files match as well.

func IsFilePPTX added in v0.0.5

func IsFilePPTX(data []byte) bool

IsFilePPTX checks whether the data indicates a PPTX file. PPTX files start with the signature 50 4B 03 04. Warning: this signature collides with ZIP, DOCX and other ZIP-based files.
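Because the 50 4B 03 04 signature is shared by every ZIP-based format, a caller that needs to tell DOCX from PPTX has to look inside the archive. The following is a minimal sketch of one way to do that with the standard library, assuming the conventional part names word/document.xml and ppt/presentation.xml. officeKind and archiveWith are hypothetical helpers and not part of opencrucible.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// zipSignature is the local file header magic: 50 4B 03 04.
var zipSignature = []byte{0x50, 0x4B, 0x03, 0x04}

// officeKind is a hypothetical helper (not part of opencrucible). It checks
// the signature first, then looks for a well-known entry name to narrow the
// container type. The entry names are conventions, assumed here.
func officeKind(data []byte) string {
	if !bytes.HasPrefix(data, zipSignature) {
		return "not-zip"
	}
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "zip-unreadable"
	}
	for _, f := range zr.File {
		switch f.Name {
		case "word/document.xml":
			return "docx"
		case "ppt/presentation.xml":
			return "pptx"
		}
	}
	return "zip"
}

// archiveWith builds an in-memory ZIP holding one empty entry with the given
// name, so the sketch can run without sample files.
func archiveWith(name string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	if _, err := zw.Create(name); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	fmt.Println(officeKind(archiveWith("word/document.xml")))
	fmt.Println(officeKind(archiveWith("ppt/presentation.xml")))
	fmt.Println(officeKind(archiveWith("notes.txt")))
	fmt.Println(officeKind([]byte("hello")))
}
```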

func ODTFileParseToString added in v0.0.4

func ODTFileParseToString(FileToParse string) (string, error)

func ODTParseToString added in v0.0.3

func ODTParseToString(StreamToParse []byte) (string, error)

func OfficeFileMetadata added in v0.0.6

func OfficeFileMetadata(FileToParse string) (*metagoffice.XMLContent, error)

func PDFFileMetadata added in v0.0.5

func PDFFileMetadata(FileToParse string) (*pdf_parser.PdfInfo, error)

See: https://www.lazy-tech.net/project/pdf_metadata_parsing_golang

func PDFFileParseToString added in v0.0.4

func PDFFileParseToString(FileToParse string) (string, error)

func PPTX2Text added in v0.0.5

func PPTX2Text(file io.ReaderAt, size int64) (string, error)

PPTX2Text extracts the text of a PowerPoint document. The size argument is the full size of the input file.

func PPTXFileParseToString added in v0.0.5

func PPTXFileParseToString(FileToParse string) (string, error)

func RTFFileParseToString added in v0.0.4

func RTFFileParseToString(FileToParse string) (string, error)

func RTFParseToString

func RTFParseToString(StreamToParse []byte) (string, error)

func TXTFileParseToString added in v0.0.4

func TXTFileParseToString(FileToParse string) (string, error)

func TXTParseToString

func TXTParseToString(StreamToParse []byte) (string, error)

Types

type PPTXDocument added in v0.0.5

type PPTXDocument struct {
	Slides []PPTXSlide
}

PPTXDocument is a PPTX document loaded into memory

func (PPTXDocument) AsText added in v0.0.5

func (doc PPTXDocument) AsText() (text string)

AsText returns the text on all slides

type PPTXSlide added in v0.0.5

type PPTXSlide struct {
	SlideNumber int
	//ThumbnailBase64 string
	TextContent string
}

PPTXSlide is a single PPTX slide

type SlideNumberSorter added in v0.0.5

type SlideNumberSorter []PPTXSlide

SlideNumberSorter is used for sorting

func (SlideNumberSorter) Len added in v0.0.5

func (a SlideNumberSorter) Len() int

func (SlideNumberSorter) Less added in v0.0.5

func (a SlideNumberSorter) Less(i, j int) bool

func (SlideNumberSorter) Swap added in v0.0.5

func (a SlideNumberSorter) Swap(i, j int)
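The Len, Less and Swap methods together satisfy the standard library's sort.Interface, so a slice of slides can be passed to sort.Sort. The sketch below mirrors that pattern with local stand-in types. The slide and bySlideNumber types and the joinSlides helper are hypothetical, and the comparison inside Less is an assumption, since the source shows only the method signatures.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// slide is a local stand-in with the same fields as the documented PPTXSlide.
type slide struct {
	SlideNumber int
	TextContent string
}

// bySlideNumber mirrors the documented SlideNumberSorter method set. The
// ordering in Less is an assumption; the source does not show its body.
type bySlideNumber []slide

func (a bySlideNumber) Len() int           { return len(a) }
func (a bySlideNumber) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a bySlideNumber) Less(i, j int) bool { return a[i].SlideNumber < a[j].SlideNumber }

// joinSlides sorts a copy of the slides by number and joins their text with
// newlines, leaving the caller's slice untouched.
func joinSlides(slides []slide) string {
	ordered := append([]slide(nil), slides...)
	sort.Sort(bySlideNumber(ordered))
	parts := make([]string, len(ordered))
	for i, s := range ordered {
		parts[i] = s.TextContent
	}
	return strings.Join(parts, "\n")
}

func main() {
	fmt.Println(joinSlides([]slide{{3, "Summary"}, {1, "Title"}, {2, "Agenda"}}))
}
```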

type WordDocument added in v0.0.3

type WordDocument struct {
	Paragraphs []WordParagraph
}

WordDocument is a full Word document

func WordParse added in v0.0.3

func WordParse(doc string) (WordDocument, error)

WordParse parses a Word file

func (WordDocument) AsText added in v0.0.3

func (w WordDocument) AsText() string

AsText returns all text in the document

type WordParagraph added in v0.0.3

type WordParagraph struct {
	Style WordStyle `xml:"pPr>pStyle"`
	Rows  []WordRow `xml:"r"`
}

WordParagraph is a single paragraph

type WordRow added in v0.0.3

type WordRow struct {
	Text string `xml:"t"`
}

WordRow ...

type WordStyle added in v0.0.3

type WordStyle struct {
	Val string `xml:"val,attr"`
}

WordStyle ...
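The struct tags on WordParagraph, WordRow and WordStyle map onto WordprocessingML elements: paragraph style from pPr>pStyle, runs from r, and run text from t. The sketch below redeclares those shapes locally with the same tags and decodes a small hand-written fragment with encoding/xml. The wordBody wrapper, its body>p path, parseWordXML and paragraphText are hypothetical and are not taken from the package, so this is an illustration of how such tags behave, not a copy of WordParse.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Local copies of the documented struct shapes, with the same field tags.
type wordStyle struct {
	Val string `xml:"val,attr"`
}

type wordRow struct {
	Text string `xml:"t"`
}

type wordParagraph struct {
	Style wordStyle `xml:"pPr>pStyle"`
	Rows  []wordRow `xml:"r"`
}

// wordBody is a hypothetical wrapper; the body>p path is an assumption about
// where paragraphs sit in the document XML.
type wordBody struct {
	Paragraphs []wordParagraph `xml:"body>p"`
}

// sampleXML is a hand-written fragment, not output from a real DOCX file.
const sampleXML = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Open</w:t></w:r>
      <w:r><w:t>Crucible</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph</w:t></w:r>
    </w:p>
  </w:body>
</w:document>`

// parseWordXML decodes the XML into paragraphs. Tags without a namespace
// match on the element's local name, so the w: prefix needs no special care.
func parseWordXML(doc string) ([]wordParagraph, error) {
	var b wordBody
	if err := xml.Unmarshal([]byte(doc), &b); err != nil {
		return nil, err
	}
	return b.Paragraphs, nil
}

// paragraphText concatenates the text of every run in a paragraph.
func paragraphText(p wordParagraph) string {
	var sb strings.Builder
	for _, r := range p.Rows {
		sb.WriteString(r.Text)
	}
	return sb.String()
}

func main() {
	paras, err := parseWordXML(sampleXML)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, p := range paras {
		fmt.Printf("style=%q text=%q\n", p.Style.Val, paragraphText(p))
	}
}
```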
