docconv

package module
v1.1.0
Warning

This package is not in the latest version of its module.

Published: Aug 7, 2018 License: MIT Imports: 27 Imported by: 1

README

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Installation

If you haven't set up Go before, you first need to set a GOPATH (see https://golang.org/doc/code.html#GOPATH).

To fetch and build the code:

$ go get github.com/nuveo/docconv/...

This will also build the command line tool docd into $GOPATH/bin (assumed to be in your PATH already).

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example install of dependencies (not all systems):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

Optional Dependencies

To add image support to the docconv library you first need to install and build https://github.com/otiai10/gosseract. Now you can add -tags ocr to any go command when building/fetching docconv to include support for processing images:

$ go get -tags ocr github.com/nuveo/docconv/...

docd tool

The docd tool runs in one of two ways:

  1. as a service on port 8888 (by default)

    Documents can be sent as a multipart POST request, and the plain text (body) and meta information are then returned as a JSON object.

  2. from the command line

    Documents can be sent as an argument, e.g.

    docd -input document.pdf

Optional Flags
  • "addr" - the bind address for the HTTP server, default is ":8888"
  • "log-level"
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • "readability-length-low" - Sets the readability length low if the ?readability=1 parameter is set
  • "readability-length-high" - Sets the readability length high if the ?readability=1 parameter is set
  • "readability-stopwords-low" - Sets the readability stopwords low if the ?readability=1 parameter is set
  • "readability-stopwords-high" - Sets the readability stopwords high if the ?readability=1 parameter is set
  • "readability-max-link-density" - Sets the readability max link density if the ?readability=1 parameter is set
  • "readability-max-heading-distance" - Sets the readability max heading distance if the ?readability=1 parameter is set
  • "readability-use-classes" - Comma separated list of readability classes to use if the ?readability=1 parameter is set

How to start the service

docd -log-level 0 # will only log errors & critical info

docd -addr :8000 -log-level 1 # will run on port 8000 and log each request as well
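The flag descriptions above refer to a ?readability=1 parameter on the request. A minimal sketch of how a client might build the request URL follows. The /convert path is taken from the README sample below; passing readability as a query parameter and the helper name convertURL are assumptions, not part of docconv.

```go
package main

import (
	"fmt"
	"net/url"
)

// convertURL builds the request URL for a docd instance listening on addr
// (for example "localhost:8888"). The "/convert" path comes from the README
// sample; sending readability as a query parameter is an assumption.
func convertURL(addr string, readability bool) string {
	u := url.URL{Scheme: "http", Host: addr, Path: "/convert"}
	if readability {
		q := url.Values{}
		q.Set("readability", "1")
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	fmt.Println(convertURL("localhost:8888", false)) // http://localhost:8888/convert
	fmt.Println(convertURL("localhost:8000", true))  // http://localhost:8000/convert?readability=1
}
```

The second call matches a service started with -addr :8000.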

Example Usage (code)

Some basic code is shown below. Normally you would accept the file over HTTP or open it from the file system, but this should be enough to get you started.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io/ioutil"
	"mime/multipart"
	"net/http"
	"net/textproto"
)

type ConversionResponse struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	MSecs uint32            `json:"msecs"`
}

// Use the conversion service to convert data
func ConvertData(input []byte, mimeType string) ([]byte, map[string]string, error) {

	convertUrl := "http://localhost:8888/convert"
	convertParam := "input"

	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)

	h := make(textproto.MIMEHeader)
	h.Set("Content-Disposition", `form-data; name="`+convertParam+`"; filename="noname"`)
	h.Set("Content-Type", mimeType)
	part, err := writer.CreatePart(h)
	if err != nil {
		return nil, nil, err
	}
	_, err = part.Write(input)
	if err != nil {
		return nil, nil, err
	}
	err = writer.Close()
	if err != nil {
		return nil, nil, err
	}
	client := &http.Client{}

	request, err := http.NewRequest("POST", convertUrl, body)
	if err != nil {
		return nil, nil, err
	}
	request.Header.Set("Content-Type", writer.FormDataContentType())
	resp, err := client.Do(request)
	if err != nil {
		return nil, nil, err
	}
	defer resp.Body.Close()
	jsonBlob, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, nil, err
	}
	converted := new(ConversionResponse)
	err = json.Unmarshal(jsonBlob, converted) // converted is already a pointer
	if err != nil {
		return nil, nil, err
	}
	return []byte(converted.Body), converted.Meta, nil
}

func main() {
	input := []byte{}             // This would be the file contents
	mimeType := "application/pdf" // Also pass the MIME type of the file
	body, meta, err := ConvertData(input, mimeType)
	if err != nil {
		fmt.Println("Conversion failed:", err)
		return
	}
	fmt.Println("The body text is:", string(body))
	fmt.Println("The file metadata is a map:", meta)
}

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CSVtoXLSX

func CSVtoXLSX(r io.Reader) (xlsByte string, err error)

CSVtoXLSX converts CSV data to XLSX.

func ConvertDoc

func ConvertDoc(r io.Reader) (string, map[string]string, error)

Convert MS Word DOC

func ConvertDocx

func ConvertDocx(r io.Reader) (string, map[string]string, error)

Convert DOCX to text

func ConvertHTML

func ConvertHTML(r io.Reader, readability bool) (string, map[string]string, error)

Convert HTML

func ConvertImage

func ConvertImage(r io.Reader) (string, map[string]string, error)

func ConvertODT

func ConvertODT(r io.Reader) (string, map[string]string, error)

Convert ODT to text

func ConvertPDF

func ConvertPDF(r io.Reader) (string, map[string]string, error)

Convert PDF

func ConvertPages

func ConvertPages(r io.Reader) (string, map[string]string, error)

Convert PAGES to text

func ConvertPathReadability

func ConvertPathReadability(path string, readability bool) ([]byte, error)

TODO(dhowden): Refactor this. Convert a file given a path

func ConvertRTF

func ConvertRTF(r io.Reader) (string, map[string]string, error)

Convert RTF

func ConvertURL

func ConvertURL(input io.Reader, readability bool) (string, map[string]string, error)

Convert URL

func ConvertXLS

func ConvertXLS(r io.Reader) (string, map[string]string, error)

ConvertXLS converts an MS Excel (XLS) spreadsheet.

func ConvertXLSX

func ConvertXLSX(r io.Reader) (string, map[string]string, error)

ConvertXLSX converts an Excel (XLSX) spreadsheet.

func ConvertXML

func ConvertXML(r io.Reader) (string, map[string]string, error)

Convert XML input

func DocxXMLToText

func DocxXMLToText(r io.Reader) (string, error)

func HTMLReadability

func HTMLReadability(r io.Reader) []byte

Extract the readable text in an HTML document

func HTMLToText

func HTMLToText(input io.Reader) string

func MimeTypeByExtension

func MimeTypeByExtension(filename string) string

Determine the mime type by the file's extension
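As a rough illustration of extension-based lookup, the sketch below maps a file extension to a MIME type. The table, the fallback value, and the name mimeTypeByExtension are hypothetical; the mappings docconv actually uses may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeTypes is a small, hypothetical lookup table for illustration only.
var mimeTypes = map[string]string{
	".pdf":  "application/pdf",
	".doc":  "application/msword",
	".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".rtf":  "application/rtf",
	".html": "text/html",
	".xml":  "text/xml",
}

// mimeTypeByExtension returns the MIME type for the extension of filename,
// matching case-insensitively. Unknown extensions fall back to
// "application/octet-stream", which is an assumption of this sketch.
func mimeTypeByExtension(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if t, ok := mimeTypes[ext]; ok {
		return t
	}
	return "application/octet-stream"
}

func main() {
	fmt.Println(mimeTypeByExtension("report.PDF"))
	fmt.Println(mimeTypeByExtension("notes.unknown"))
}
```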

func SetImageLanguages

func SetImageLanguages(string)

func Tidy

func Tidy(r io.Reader, xmlIn bool) ([]byte, error)

Errors & warnings are deliberately suppressed as tidy throws warnings very easily

func XMLToMap

func XMLToMap(r io.Reader) (map[string]string, error)

Convert XML to a nested string map

func XMLToText

func XMLToText(r io.Reader, breaks []string, skip []string, strict bool) (string, error)

Convert XML to plain text given how to treat elements

Types

type HTMLReadabilityOptions

type HTMLReadabilityOptions struct {
	LengthLow             int
	LengthHigh            int
	StopwordsLow          float64
	StopwordsHigh         float64
	MaxLinkDensity        float64
	MaxHeadingDistance    int
	ReadabilityUseClasses string
}

HTMLReadabilityOptions is a type which defines parameters that are passed to the justext package. TODO: Improve this!

var HTMLReadabilityOptionsValues HTMLReadabilityOptions

TODO: Remove this from global state.

type LocalFile

type LocalFile struct {
	*os.File
	// contains filtered or unexported fields
}

LocalFile is a type which wraps an *os.File. See NewLocalFile for more details.

func NewLocalFile

func NewLocalFile(r io.Reader, dir, prefix string) (*LocalFile, error)

NewLocalFile ensures that there is a file which contains the data provided by r. If r is actually an instance of *os.File then this file is used, otherwise a temporary file is created (using dir and prefix) and the data from r copied into it. Callers must call Done() when the LocalFile is no longer needed to ensure all resources are cleaned up.

func (*LocalFile) Done

func (l *LocalFile) Done()

Done cleans up all resources.

type Response

type Response struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	MSecs uint32            `json:"msecs"`
}

Response payload sent back to the requestor

func Convert

func Convert(r io.Reader, mimeType string, readability bool) (*Response, error)

TODO(dhowden): Refactor this. Convert a file to plain text & meta data

func ConvertPath

func ConvertPath(path string) (*Response, error)

TODO(dhowden): Refactor this. Convert a file given a path

Directories

Path Synopsis
Package TSP is a generated protocol buffer package.
Package snappy implements the snappy block-based compression format.
