text-extraction-service

command module
v2.0.0-...-98876da Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 24, 2024 License: GPL-3.0 Imports: 27 Imported by: 0

README

text-extraction-service

A simple Go service for extracting textual content from PDF, RTF and legacy MS Word (.doc) documents.

Status

This started as an exercise in using Golang. The use case is processing binary documents for search machine indexation.

It optionally embeds NATS.io JetStream as an object store that acts as a persistent cache for extracted content.

The RegEx-based RTF parser is rather inefficient.

Apache Tika is definitively a more versatile and mature solution to be considered.

Features

  • Support for PDFs, RTFs and legacy MS Word (.doc) files
  • Dehyphenation of extracted text, specifically for German
  • NATS can be embedded or run externally (e.g. as a cluster)
  • NATS microservice interface supported

Setup

By default Poppler is used for PDFs. It is needed as dependency which can be installed on Debian via apt-get install libpoppler-glib-dev.

Build locally

# Omit a large, yet unused dependency of Gin
go build -tags nomsgpack

Use MuPDF instead of Poppler via build tag

go build -tags mupdf,nomsgpack

MuPDF or Poppler?

MuPDF is very fast for small documents. Poppler is rather slow but faster and more memory-efficient for large documents.

In my experience Poppler offers the overall best quality when performing text extraction.

Build container images

The repo includes two Containerfiles for building minimal Alpine-based images in a multi-stage style.

# Use a volume to speed up subsequent builds—remove the need to re-download and re-compile all dependencies
mkdir --mode 777 --parents /tmp/cache

# MuPDF-based:
podman build --pull . -f Containerfile.mupdf-alpine -t tes-mupdf:alpine --volume /tmp/cache:/tmp

# Poppler-based:
podman build --pull . -f Containerfile.poppler-alpine -t tes-poppler:alpine --volume /tmp/cache:/tmp

Run containers

# MuPDF based, using a volume for Nats JetStream storage
podman run --rm -it -v nats:/tmp/nats -p 8080:8080 -p 4222:4222 tes-mupdf:alpine

# poppler based, using a volume for Nats JetStream storage
podman run --rm -it -v nats:/tmp/nats -p 8080:8080 -p 4222:4222 tes-poppler:alpine

Config

Configuration happens trough enviroment Variables only.

Environment Variable Description
TES_BUCKET Name of the object store bucket in NATS to use for caching. Default: TES_PLAINTEXTS
TES_REPLICAS Replication factor for object store bucket in external NATS cluster
TES_EXPOSE_NATS Wether to allow connections to the embedded NATS server (bool)
TES_NATS_HOST Listen host/IP of embedded NATS server when TES_EXPOSE_NATSis true. Default: localhost
TES_NATS_PORT Listen Port (TCP) of embedded NATS server when TES_EXPOSE_NATSis true. Default: 4222
TES_NATS_STORE_DIR Storage path for the embedded NATS server. Default: /tmp/nats
TES_MAX_PAYLOAD Max message payload of embedded NATS server. Default: 8 MiB
TES_NATS_URL URL of external NATS server/cluster. If this is set, no embedded NATS server is started
TES_FAIL_WITHOUT_JS If enabled the service exits when JetStream support of the external NATS server/cluster can not be confirmed. Default: true
TES_NATS_TIMEOUT Connection timeout as a time.Duration string. Default: 15s
TES_NATS_CONNECT_RETRIES Number of times a connection to an external NATS server/cluster and to JetStream is being tried. Default: 10
TES_HOST_PORT Listen adress of HTTP server. Default: :8080 (same as 0.0.0.0:8080)
TES_NO_HTTP If true and TES_EXPOSE_NATS is true, too, no HTTP server is started

Usage

TBD

Documentation

The Go Gopher

There is no documentation for this package.

Directories

Path Synopsis
pkg
dehyphenator
Package dehyphenator implements a simple algorithm for de-hyphenating German text.
Package dehyphenator implements a simple algorithm for de-hyphenating German text.
docparser
Package docparser implements a parser for legacy MS Word documents (.doc).
Package docparser implements a parser for legacy MS Word documents (.doc).
rtfparser
Package rtfparser implements a parser for the Rich Text Format.
Package rtfparser implements a parser for the Rich Text Format.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL