textract

Version: v1.0.1
Published: Jan 29, 2023 License: MIT Imports: 9 Imported by: 0

README

This code is still very early in its development. I will need to find the time to work on it :)

Description

A simple library for extracting text content from popular document types such as Word, PowerPoint, Excel, and PDF.

I started developing this module because I needed it for another application I've been building and was looking for something royalty-free and high-performance. I intend to add support for additional document types over time.

This initial version only supports Word/docx documents.

Installation

After you have installed Go, run this command to install the textract package:

go get github.com/chchench/textract

Roadmap

  • 1.0.0 - Initial release; supports text extraction from post-2007 Word (.docx) files

License

The source code is made available under the MIT License.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Dump

func Dump(path, content string)

func ExtractArchiveContent

func ExtractArchiveContent(path string, filter Filter) (*[]MemberFileContent, error)

func FatalExit

func FatalExit(msg string)

func GetFileExtension

func GetFileExtension(filename string) string
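
The page does not document what GetFileExtension returns for edge cases. As a rough illustration of the general idea only, here is a standard-library sketch. The helper name fileExtension, the lower-casing, and the removal of the leading dot are all assumptions, not the package's confirmed behavior.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// fileExtension is a hypothetical helper, not part of textract.
// It returns the final extension of filename, lower-cased and
// without the leading dot, or "" when there is no extension.
func fileExtension(filename string) string {
	ext := filepath.Ext(filename) // e.g. ".DOCX", or "" if there is none
	return strings.ToLower(strings.TrimPrefix(ext, "."))
}

func main() {
	for _, name := range []string{"report.DOCX", "archive.tar.gz", "README"} {
		fmt.Printf("%q -> %q\n", name, fileExtension(name))
	}
}
```

Only the last extension is considered, so "archive.tar.gz" yields "gz" in this sketch.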

func GetTrueFileType

func GetTrueFileType(fp string) (string, error)
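
The name GetTrueFileType suggests that the type is determined from the file's content rather than its name, but the detection logic is not documented here. One common approach, shown below purely as an assumption, is to compare the first bytes of the file against known signatures. The function sniffType and its return values are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffType is a hypothetical helper, not part of textract.
// It classifies a file by the signature at the start of its content.
func sniffType(header []byte) string {
	switch {
	case bytes.HasPrefix(header, []byte("PK\x03\x04")):
		// ZIP local file header; .docx, .xlsx and .pptx are ZIP containers.
		return "zip"
	case bytes.HasPrefix(header, []byte("%PDF-")):
		return "pdf"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(sniffType([]byte("PK\x03\x04rest of archive")))
	fmt.Println(sniffType([]byte("%PDF-1.7\n")))
	fmt.Println(sniffType([]byte("plain text")))
}
```

A ZIP signature alone does not prove that a file is a Word document; a fuller check would also look for a member such as word/document.xml inside the archive.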

func RetrieveTextFromFile

func RetrieveTextFromFile(path string) (string, error)

Types

type DocumentParser

type DocumentParser interface {
	// contains filtered or unexported methods
}

type DocxParser

type DocxParser struct {
	Content []MemberFileContent
}

type Docx_Body

type Docx_Body struct {
	XMLName    xml.Name         `xml:"body"`
	Paragraphs []Docx_Paragraph `xml:"p"`
}

type Docx_Doc

type Docx_Doc struct {
	XMLName xml.Name    `xml:"document"`
	Bodies  []Docx_Body `xml:"body"`
}

type Docx_Paragraph

type Docx_Paragraph struct {
	XMLName xml.Name   `xml:"p"`
	Runs    []Docx_Run `xml:"r"`
}

type Docx_Run

type Docx_Run struct {
	XMLName xml.Name `xml:"r"`
	Text    string   `xml:"t"`
}
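
The four Docx_ types above mirror the nesting of a Word document's main XML part: document, body, paragraphs, runs, and text. The sketch below redeclares them locally, so that it runs without the package, and decodes a small hand-written sample with encoding/xml. The plainText helper and the sample string are illustrative assumptions and may differ from how the package assembles its output.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Local copies of the documented types, redeclared so the example is self-contained.
type Docx_Run struct {
	XMLName xml.Name `xml:"r"`
	Text    string   `xml:"t"`
}

type Docx_Paragraph struct {
	XMLName xml.Name   `xml:"p"`
	Runs    []Docx_Run `xml:"r"`
}

type Docx_Body struct {
	XMLName    xml.Name         `xml:"body"`
	Paragraphs []Docx_Paragraph `xml:"p"`
}

type Docx_Doc struct {
	XMLName xml.Name    `xml:"document"`
	Bodies  []Docx_Body `xml:"body"`
}

// plainText is a hypothetical helper: it concatenates the text of every
// run and ends each paragraph with a newline.
func plainText(data []byte) (string, error) {
	var doc Docx_Doc
	if err := xml.Unmarshal(data, &doc); err != nil {
		return "", err
	}
	var sb strings.Builder
	for _, body := range doc.Bodies {
		for _, p := range body.Paragraphs {
			for _, r := range p.Runs {
				sb.WriteString(r.Text)
			}
			sb.WriteString("\n")
		}
	}
	return sb.String(), nil
}

// sample is a hand-written stand-in for the main XML part of a .docx file.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>`

func main() {
	text, err := plainText([]byte(sample))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Print(text)
}
```

Because the struct tags carry no namespace, encoding/xml matches elements by local name, so w:p and w:r decode into the p and r fields. Text outside plain runs, such as table cells or hyperlinks, would not be captured by this structure.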

type Filter

type Filter func(string) bool

type MemberFileContent

type MemberFileContent struct {
	Identifier string
	Data       []byte
}
