ocrworker

Version: v1.8.2
Published: Aug 12, 2022 License: Apache-2.0 Imports: 26 Imported by: 3

README

OpenOCR makes it simple to host your own OCR REST API.

The heavy lifting OCR work is handled by Tesseract OCR.

Docker is used to containerize the various components of the service.

Features

  • Scalable message passing architecture via RabbitMQ.
  • Platform independence via Docker containers.
  • Kubernetes support: workers can run in a Kubernetes Replication Controller
  • Supports 31 languages in addition to English
  • Ability to use an image pre-processing chain. An example using Stroke Width Transform is provided.
  • PDF support via a PDF preprocessor
  • Pass arguments to Tesseract such as character whitelist and page segment mode.
  • REST API docs
  • A Go REST client is available.

Launching OpenOCR on a Docker PAAS

OpenOCR can easily run on any PAAS that supports Docker containers. Here are the instructions for a few that have already been tested:

If your preferred PAAS isn't listed, please open a GitHub issue to request instructions.

Launching OpenOCR on Ubuntu 14.04

OpenOCR can be launched on anything that supports Docker, such as Ubuntu 14.04.

Here's how to install it from scratch and verify that it's working correctly.

Install Docker

See Installing Docker on Ubuntu instructions.

Find out your host address

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:43:40:c7
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          ...

The IP address 10.0.2.15 will be used as the RABBITMQ_HOST env variable below.

Launching OpenOCR with run.sh

  • Install docker
  • Install docker-compose
  • git clone https://github.com/tleyden/open-ocr.git
  • cd open-ocr/docker-compose
  • Run ./run.sh (if the script is not executable, first run sudo chmod +x run.sh)
  • The runner will ask whether you want to delete the images (choose y or n for each)
  • The runner will ask you to choose between version 1 and version 2
    • Version 1 uses Tesseract 3.04. Memory usage is light, it is fast, and it needs little storage (a simple AWS instance with 1GB of RAM and 8GB of storage is sufficient). Results are acceptable.
    • Version 2 uses Tesseract 4.00. Memory usage is light, but it is slower than Tesseract 3 and needs more storage (a simple AWS instance with 1GB of RAM is sufficient, but with a 16GB EBS volume). Results are much better than with version 3.04.
    • For a comparison, see the official Tesseract page.

You can also use docker-compose directly, without run.sh:

# for v1
export OPEN_OCR_INSTANCE=open-ocr

# for v2
export OPEN_OCR_INSTANCE=open-ocr-2

# then up (add -d to start it as a daemon)
docker-compose up

Docker Compose will start four docker instances

You are now ready to decode images → text via your REST API.

Launching OpenOCR with Docker Compose on OSX

  • Install docker
  • Install docker toolbox
  • Checkout OpenOCR repository
  • cd docker-compose directory
  • docker-machine start default
  • docker-machine env
  • Look at the Docker host IP address
  • Run docker-compose up -d to run containers as daemons or docker-compose up to see the log in console

How to test the REST API after running docker-compose up

Here IP_ADDRESS_OF_DOCKER_HOST is the address shown by docker-machine env (e.g. 192.168.99.100), and HTTP_PORT is the port number set in the .yml file in the docker-compose directory, presumably 9292.

Request

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://IP_ADDRESS_OF_DOCKER_HOST:HTTP_PORT/ocr

Assuming the values are 192.168.99.100 and 9292 respectively:

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://192.168.99.100:9292/ocr

Response

It will return the decoded text for the test image:

< HTTP/1.1 200 OK
< Date: Tue, 13 May 2014 16:18:50 GMT
< Content-Length: 283
< Content-Type: text/plain; charset=utf-8
<
You can create local variables for the pipelines within the template by
prefixing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names.

Test the REST API

With image url

Request

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://10.0.2.15:$HTTP_PORT/ocr

Response

It will return the decoded text for the test image:

< HTTP/1.1 200 OK
< Date: Tue, 13 May 2014 16:18:50 GMT
< Content-Length: 283
< Content-Type: text/plain; charset=utf-8
<
You can create local variables for the pipelines within the template by
prefixing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names.
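
The curl request above can also be issued from Go. The following is a minimal sketch, not the official Go REST client: the names ocrPayload, requestOCR and demo are made up for illustration, and a local httptest server stands in for OpenOCR so that the program runs without a deployment. Against a real instance you would pass an endpoint such as http://10.0.2.15:9292/ocr instead of the stub URL.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// ocrPayload mirrors the two JSON fields used in the curl example.
type ocrPayload struct {
	ImgURL string `json:"img_url"`
	Engine string `json:"engine"`
}

// requestOCR posts the payload as JSON and returns the plain-text response body.
func requestOCR(client *http.Client, endpoint string, p ocrPayload) (string, error) {
	body, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	text, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ocr request failed: %s: %s", resp.Status, text)
	}
	return string(text), nil
}

// demo runs requestOCR against a local stand-in server (NOT OpenOCR itself).
func demo() (string, error) {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var p ocrPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil || p.ImgURL == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, "stub text for "+p.ImgURL)
	}))
	defer stub.Close()
	return requestOCR(stub.Client(), stub.URL+"/ocr",
		ocrPayload{ImgURL: "http://bit.ly/ocrimage", Engine: "tesseract"})
}

func main() {
	text, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```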

With image base64

Request

$ curl -X POST -H "Content-Type: application/json" -d '{"img_base64":"<YOUR BASE 64 HERE>","engine":"tesseract"}' http://10.0.2.15:$HTTP_PORT/ocr
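
To fill in the img_base64 placeholder from a program, the image bytes have to be base64-encoded before they go into the JSON body. A small sketch of that step follows; base64Payload is a hypothetical helper name, and standard (padded) base64 is assumed, since the page does not say which variant the server expects.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// base64Request mirrors the JSON fields of the curl example above.
type base64Request struct {
	ImgBase64 string `json:"img_base64"`
	Engine    string `json:"engine"`
}

// base64Payload encodes raw image bytes with standard base64 (an assumption)
// and returns the JSON request body as a string.
func base64Payload(img []byte) (string, error) {
	req := base64Request{
		ImgBase64: base64.StdEncoding.EncodeToString(img),
		Engine:    "tesseract",
	}
	out, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Placeholder bytes; in practice read the image with os.ReadFile.
	body, err := base64Payload([]byte("hi"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body) // {"img_base64":"aGk=","engine":"tesseract"}
}
```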

The REST API also supports:

  • Uploading the image content via multipart/related, rather than passing an image URL. (example client code provided in the Go REST client)
  • Tesseract config vars (e.g. the equivalent of -c arguments when using Tesseract via the command line) and Page Seg Mode
  • Ability to use an image pre-processing chain, e.g. Stroke Width Transform.
  • Non-English languages

See the REST API docs and the Go REST client for details.

Uploading local files using curl

The supplied docs/upload-local-file.sh provides an example of how to upload a local file using curl with multipart/related encoding of the json and image data:

  • usage: docs/upload-local-file.sh <urlendpoint> <file> [mimetype]
  • download the example ocr image wget http://bit.ly/ocrimage
  • example: docs/upload-local-file.sh http://10.0.2.15:$HTTP_PORT/ocr-file-upload ocrimage

Community

Client Libraries

License

OpenOCR is Open Source and available under the Apache 2 License.

Documentation

Constants

const (
	EngineTesseract = OcrEngineType(iota)
	EngineGoTesseract
	EngineSandwichTesseract
	EngineMock
)
const (
	PreprocessorIdentity             = "identity"
	PreprocessorStrokeWidthTransform = "stroke-width-transform"
	PreprocessorConvertPdf           = "convert-pdf"
)
const MockEngineResponse = "mock engine decoder response"

Variables

var (
	// AppStop and ServiceCanAccept are global. Used to set the flag for logging and stopping the application
	AppStop            bool
	ServiceCanAccept   bool
	ServiceCanAcceptMu sync.RWMutex
)
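
The pairing of ServiceCanAccept with ServiceCanAcceptMu suggests that the flag is meant to be read and written under the mutex. That is an assumption, since the page does not show how the package uses them. A self-contained sketch of the pattern, with a hypothetical acceptState type standing in for the package-level variables:

```go
package main

import (
	"fmt"
	"sync"
)

// acceptState bundles a flag with the RWMutex that guards it.
type acceptState struct {
	mu        sync.RWMutex
	canAccept bool
}

// Set takes the write lock, so it excludes all readers and writers.
func (s *acceptState) Set(v bool) {
	s.mu.Lock()
	s.canAccept = v
	s.mu.Unlock()
}

// Get takes the read lock, so many readers can proceed in parallel.
func (s *acceptState) Get() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.canAccept
}

func main() {
	var s acceptState
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = s.Get() // concurrent readers, e.g. HTTP handlers
		}()
	}
	s.Set(true) // single writer, e.g. a resource-manager loop
	wg.Wait()
	fmt.Println("can accept:", s.Get())
}
```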
var (

	// StopChan is used to gracefully stop http daemon
	StopChan = make(chan bool, 1)

	TechnicalErrorResManager bool
)
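
How StopChan is consumed is not shown on this page. One common way to stop an HTTP daemon gracefully from a buffered chan bool is sketched below; serveUntilStopped and demo are hypothetical names, and the five-second grace period is an arbitrary choice.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serveUntilStopped serves on ln until a value arrives on stop, then
// shuts the server down, letting in-flight requests finish.
func serveUntilStopped(srv *http.Server, ln net.Listener, stop <-chan bool) error {
	errc := make(chan error, 1)
	go func() { errc <- srv.Serve(ln) }()
	select {
	case err := <-errc:
		return err // the server failed on its own
	case <-stop:
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return srv.Shutdown(ctx)
	}
}

// demo starts a server on a random local port and stops it immediately.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	stop := make(chan bool, 1) // buffered, like StopChan
	stop <- true               // the sender does not block
	return serveUntilStopped(&http.Server{}, ln, stop)
}

func main() {
	fmt.Println("shutdown error:", demo())
}
```

The buffer of size one lets a signal handler or another goroutine request the stop without waiting for the server loop to be ready to receive.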
var (
	RequestsTrack      = sync.Map{}
	RequestTrackLength = uint32(0)
)

Functions

func CheckForAcceptRequest

func CheckForAcceptRequest(urlQueue, urlStat string) bool

CheckForAcceptRequest checks, by reading the RabbitMQ API, whether resources for an incoming request are available

func GenerateLandingPage

func GenerateLandingPage(appStop, technicalError bool, version string) string

GenerateLandingPage will generate a simple landing page

func InstrumentHttpStatusHandler

func InstrumentHttpStatusHandler(ocrHttpHandler *OcrHTTPStatusHandler) http.Handler

InstrumentHttpStatusHandler wraps httpHandler to provide prometheus metrics

func SetResManagerState

func SetResManagerState(ampqAPIConfig *RabbitConfig)

SetResManagerState sets the boolean state of the resource manager, depending on whether RabbitMQ's memory usage and number of messages stay within the limit

func StripPasswordFromUrl added in v1.8.0

func StripPasswordFromUrl(urlToLog *url.URL) string

StripPasswordFromUrl strips passwords from URL
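
The exact output format of StripPasswordFromUrl is not documented here. As an illustration of the idea, this sketch drops the password from a URL's userinfo with the standard library; stripPassword and stripPasswordFromString are hypothetical names, not the package's implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripPassword returns the URL as a string with any password removed.
// It works on a copy, so the caller's URL is left untouched.
func stripPassword(u *url.URL) string {
	if u == nil {
		return ""
	}
	c := *u
	if c.User != nil {
		c.User = url.User(c.User.Username()) // keep the user name only
	}
	return c.String()
}

// stripPasswordFromString parses raw and strips the password from it.
func stripPasswordFromString(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return stripPassword(u), nil
}

func main() {
	s, err := stripPasswordFromString("amqp://guest:secret@localhost:5672/")
	fmt.Println(s, err) // amqp://guest@localhost:5672/ <nil>
}
```

If masking is preferred over removal, the standard library's (*url.URL).Redacted method replaces the password with "xxxxx" instead.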

Types

type ConvertPdf

type ConvertPdf struct{}

type FlagFunction

type FlagFunction func()

func NoOpFlagFunction

func NoOpFlagFunction() FlagFunction

type FlagFunctionWorker

type FlagFunctionWorker func()

FlagFunctionWorker will be used as argument type for DefaultConfigFlagsWorkerOverride

func NoOpFlagFunctionWorker

func NoOpFlagFunctionWorker() FlagFunctionWorker

NoOpFlagFunctionWorker returns an empty set of CLI parameters. In this case the default parameters will be used

type IdentityPreprocessor

type IdentityPreprocessor struct{}

type MockEngine

type MockEngine struct{}

func (MockEngine) ProcessRequest

func (MockEngine) ProcessRequest(_ *OcrRequest, _ *WorkerConfig) (OcrResult, error)

ProcessRequest processes an incoming OCR request by routing it through the whole process chain

type OcrEngine

type OcrEngine interface {
	ProcessRequest(ocrRequest *OcrRequest, workerConfig *WorkerConfig) (OcrResult, error)
}

func NewOcrEngine

func NewOcrEngine(engineType OcrEngineType) OcrEngine

type OcrEngineType

type OcrEngineType int

func (OcrEngineType) String

func (e OcrEngineType) String() string

func (*OcrEngineType) UnmarshalJSON

func (e *OcrEngineType) UnmarshalJSON(b []byte) (err error)
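
The REST examples pass the engine as a JSON string ("engine":"tesseract") while OcrEngineType is an int, which is why the type carries String and UnmarshalJSON methods. The sketch below shows how such a mapping can be written. It uses a local engineType rather than the package's type, and the "mock" spelling is an assumption; only "tesseract" appears on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// engineType is a stand-in for an iota-based enum like OcrEngineType.
type engineType int

const (
	engineTesseract engineType = iota
	engineMock
)

// engineNames maps enum values to their JSON spelling ("mock" is assumed).
var engineNames = map[engineType]string{
	engineTesseract: "tesseract",
	engineMock:      "mock",
}

func (e engineType) String() string {
	if n, ok := engineNames[e]; ok {
		return n
	}
	return fmt.Sprintf("engineType(%d)", int(e))
}

// UnmarshalJSON accepts a JSON string and converts it to the enum value.
func (e *engineType) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	for k, n := range engineNames {
		if n == s {
			*e = k
			return nil
		}
	}
	return fmt.Errorf("unknown engine %q", s)
}

// parseEngine decodes a single JSON value such as `"tesseract"`.
func parseEngine(js string) (engineType, error) {
	var e engineType
	err := json.Unmarshal([]byte(js), &e)
	return e, err
}

func main() {
	var req struct {
		Engine engineType `json:"engine"`
	}
	err := json.Unmarshal([]byte(`{"engine":"tesseract"}`), &req)
	fmt.Println(req.Engine, err)
}
```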

type OcrHTTPStatusHandler

type OcrHTTPStatusHandler struct {
	RabbitConfig RabbitConfig
}

OcrHTTPStatusHandler is for initial handling of ocr request

func NewOcrHttpHandler

func NewOcrHttpHandler(r *RabbitConfig) *OcrHTTPStatusHandler

func (*OcrHTTPStatusHandler) ServeHTTP

func (s *OcrHTTPStatusHandler) ServeHTTP(w http.ResponseWriter, req *http.Request)

type OcrHttpMultipartHandler

type OcrHttpMultipartHandler struct {
	RabbitConfig RabbitConfig
}

func NewOcrHttpMultipartHandler

func NewOcrHttpMultipartHandler(r *RabbitConfig) *OcrHttpMultipartHandler

func (*OcrHttpMultipartHandler) ServeHTTP

func (s *OcrHttpMultipartHandler) ServeHTTP(w http.ResponseWriter, req *http.Request)

type OcrHttpStatusHandler

type OcrHttpStatusHandler struct{}

func NewOcrHttpStatusHandler

func NewOcrHttpStatusHandler() *OcrHttpStatusHandler

func (*OcrHttpStatusHandler) ServeHTTP

func (*OcrHttpStatusHandler) ServeHTTP(w http.ResponseWriter, req *http.Request)

type OcrQueueManager

type OcrQueueManager struct {
	NumMessages  uint `json:"messages"` // TODO: do not read the number of messages from API because it is slow, and the clients of this product may not behave and put too many requests in too fast.
	NumConsumers uint `json:"consumers"`
	MessageBytes uint `json:"message_bytes"`
}

OcrQueueManager is used as a main component of resource manager

type OcrRequest

type OcrRequest struct {
	ImgUrl            string                 `json:"img_url"`
	ImgBase64         string                 `json:"img_base64"`
	EngineType        OcrEngineType          `json:"engine"`
	ImgBytes          []byte                 `json:"img_bytes"`
	PreprocessorChain []string               `json:"preprocessors"`
	PreprocessorArgs  map[string]interface{} `json:"preprocessor-args"`
	EngineArgs        map[string]interface{} `json:"engine_args"`
	Deferred          bool                   `json:"deferred"`
	ReplyTo           string                 `json:"reply_to"`
	DocType           string                 `json:"doc_type"`
	RequestID         string                 `json:"req_id"`
	PageNumber        uint16                 `json:"page_number"`
	UserAgent         string                 `json:"user_agent"`
	TimeOut           uint                   `json:"time_out"`
	ReferenceID       string                 `json:"reference_id"`
	// decode ocr in http handler rather than putting in queue
	InplaceDecode bool `json:"inplace_decode"`
}
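
The JSON tags above define the wire format of a request. As a sketch of how a client might assemble one, the program below marshals a small mirror struct that uses a subset of those tags and one of the preprocessor names from the constants section. The ocrRequest type, the omitempty options and buildRequest are illustrative choices, not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ocrRequest mirrors a subset of the OcrRequest JSON tags listed above.
type ocrRequest struct {
	ImgURL        string   `json:"img_url,omitempty"`
	ImgBase64     string   `json:"img_base64,omitempty"`
	Engine        string   `json:"engine"`
	Preprocessors []string `json:"preprocessors,omitempty"`
}

// buildRequest returns the JSON body for an image URL and an optional
// preprocessor chain, applied in the order given.
func buildRequest(imgURL string, preprocessors ...string) (string, error) {
	req := ocrRequest{
		ImgURL:        imgURL,
		Engine:        "tesseract",
		Preprocessors: preprocessors,
	}
	out, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// "stroke-width-transform" is the value of PreprocessorStrokeWidthTransform.
	body, err := buildRequest("http://bit.ly/ocrimage", "stroke-width-transform")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```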

func (*OcrRequest) String

func (ocrRequest *OcrRequest) String() string

type OcrResult

type OcrResult struct {
	Text   string `json:"text"`
	Status string `json:"status"`
	ID     string `json:"id"`
}

func CheckOcrStatusByID

func CheckOcrStatusByID(requestID string) (OcrResult, bool)

CheckOcrStatusByID checks status of an ocr request based on origin of request

func HandleOcrRequest

func HandleOcrRequest(ocrRequest *OcrRequest, workerConfig *RabbitConfig) (OcrResult, int, error)

HandleOcrRequest processes an incoming OCR request by routing it through the whole process chain

type OcrRpcClient

type OcrRpcClient struct {
	// contains filtered or unexported fields
}

func NewOcrRpcClient

func NewOcrRpcClient(rc *RabbitConfig) (*OcrRpcClient, error)

func (*OcrRpcClient) DecodeImage

func (c *OcrRpcClient) DecodeImage(ocrRequest *OcrRequest) (OcrResult, int, error)

DecodeImage is the main function for running OCR on an incoming request. It handles the parameters and the whole workflow

type OcrRpcWorker

type OcrRpcWorker struct {
	Done chan error
	// contains filtered or unexported fields
}

func NewOcrRpcWorker

func NewOcrRpcWorker(wc *WorkerConfig) (*OcrRpcWorker, error)

NewOcrRpcWorker is needed to establish a connection to a message broker

func (*OcrRpcWorker) Run

func (w *OcrRpcWorker) Run() error

func (*OcrRpcWorker) Shutdown

func (w *OcrRpcWorker) Shutdown() error

type Preprocessor

type Preprocessor interface {
	// contains filtered or unexported methods
}

type PreprocessorRpcWorker

type PreprocessorRpcWorker struct {
	Done chan error
	// contains filtered or unexported fields
}

func NewPreprocessorRpcWorker

func NewPreprocessorRpcWorker(rc *RabbitConfig, preprocessor string) (*PreprocessorRpcWorker, error)

func (*PreprocessorRpcWorker) Run

func (w *PreprocessorRpcWorker) Run() error

func (*PreprocessorRpcWorker) Shutdown

func (w *PreprocessorRpcWorker) Shutdown() error

type RabbitConfig

type RabbitConfig struct {
	AmqpURI      string
	Exchange     string
	ExchangeType string
	RoutingKey   string
	Reliable     bool
	AmqpAPIURI   string
	APIPathQueue string
	APIQueueName string
	APIPathStats string
	QueuePrio    map[string]uint8
	QueuePrioArg string
	/* ResponseCacheTimeout sets the default(!!!) global timeout in seconds for a request.
	   The engine will be killed after reaching the time limit, and the user will get a timeout error. */
	ResponseCacheTimeout uint
	// MaximalResponseCacheTimeout is the upper bound: a client won't be able to set ResponseCacheTimeout higher than this value
	MaximalResponseCacheTimeout uint
	FactorForMessageAccept      uint
}
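
The comments on ResponseCacheTimeout and MaximalResponseCacheTimeout describe a default value and an upper bound. One plausible reading, sketched below, is that a request without its own timeout gets the default and a request asking for more than the maximum is capped. This is an interpretation of the comments, not the package's code; effectiveTimeout and the numbers used are hypothetical.

```go
package main

import "fmt"

// effectiveTimeout picks the timeout in seconds for one request.
// requested is the client's value (0 means "not set"), def is the
// default, and limit is the maximum a client may ask for.
func effectiveTimeout(requested, def, limit uint) uint {
	switch {
	case requested == 0:
		return def
	case requested > limit:
		return limit
	default:
		return requested
	}
}

func main() {
	// Hypothetical settings: default 30 s, maximum 120 s.
	fmt.Println(effectiveTimeout(0, 30, 120))   // 30
	fmt.Println(effectiveTimeout(60, 30, 120))  // 60
	fmt.Println(effectiveTimeout(500, 30, 120)) // 120
}
```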

func DefaultConfigFlagsOverride

func DefaultConfigFlagsOverride(flagFunction FlagFunction) RabbitConfig

func DefaultTestConfig

func DefaultTestConfig() RabbitConfig

type SandwichEngine

type SandwichEngine struct{}

SandwichEngine calls pdfsandwich via exec. This implementation returns either the PDF with the OCR layer only, or a merged variant of the PDF plus OCR layer, with the ability to optimize the output PDF file by calling the "gs" tool

func (SandwichEngine) ProcessRequest

func (t SandwichEngine) ProcessRequest(ocrRequest *OcrRequest, workerConfig *WorkerConfig) (OcrResult, error)

ProcessRequest processes an incoming OCR request by routing it through the whole process chain

type SandwichEngineArgs

type SandwichEngineArgs struct {
	// contains filtered or unexported fields
}

func NewSandwichEngineArgs

func NewSandwichEngineArgs(ocrRequest *OcrRequest, workerConfig *WorkerConfig) (*SandwichEngineArgs, error)

NewSandwichEngineArgs generates arguments for SandwichEngine which will be used to start involved tools

func (*SandwichEngineArgs) Export

func (t *SandwichEngineArgs) Export() []string

Export returns a slice that can be passed to the tesseract binary as command line args, e.g. ["-c", "tessedit_char_whitelist=0123456789", "-c", "foo=bar"]

type StrokeWidthTransformer

type StrokeWidthTransformer struct{}

type TesseractEngine

type TesseractEngine struct{}

TesseractEngine calls tesseract via exec

func (TesseractEngine) ProcessRequest

func (t TesseractEngine) ProcessRequest(ocrRequest *OcrRequest, _ *WorkerConfig) (OcrResult, error)

ProcessRequest processes an incoming OCR request by routing it through the whole process chain

type TesseractEngineArgs

type TesseractEngineArgs struct {
	// contains filtered or unexported fields
}

func NewTesseractEngineArgs

func NewTesseractEngineArgs(ocrRequest *OcrRequest) (*TesseractEngineArgs, error)

func (TesseractEngineArgs) Export

func (t TesseractEngineArgs) Export() []string

Export returns a slice that can be passed to the tesseract binary as command line args, e.g. ["-c", "tessedit_char_whitelist=0123456789", "-c", "foo=bar"]
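
The fields of TesseractEngineArgs are unexported, so the following only illustrates the shape of the slice described above: each config variable becomes a "-c" flag followed by "key=value". exportConfigVars is a hypothetical helper, and sorting the keys is a choice made here to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// exportConfigVars turns config variables into tesseract-style
// command line arguments: "-c", "key=value" for every entry.
func exportConfigVars(vars map[string]string) []string {
	keys := make([]string, 0, len(vars))
	for k := range vars {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random in Go

	args := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		args = append(args, "-c", k+"="+vars[k])
	}
	return args
}

func main() {
	args := exportConfigVars(map[string]string{
		"tessedit_char_whitelist": "0123456789",
		"foo":                     "bar",
	})
	fmt.Println(args) // [-c foo=bar -c tessedit_char_whitelist=0123456789]
}
```

A slice like this can be appended to the other arguments of exec.Command when invoking the tesseract binary.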

type WorkerConfig

type WorkerConfig struct {
	AmqpURI           string
	Exchange          string
	ExchangeType      string
	RoutingKey        string
	Reliable          bool
	AmqpAPIURI        string
	APIPathQueue      string
	APIQueueName      string
	APIPathStats      string
	SaveFiles         bool
	Debug             bool
	Tiff2pdfConverter string
	NumParallelJobs   uint
	FlgVersion        bool
}

WorkerConfig will be passed to ocr engines and is used to establish connection to a message broker

func DefaultConfigFlagsWorkerOverride

func DefaultConfigFlagsWorkerOverride(flagFunction FlagFunctionWorker) (WorkerConfig, error)

func DefaultWorkerConfig

func DefaultWorkerConfig() WorkerConfig

DefaultWorkerConfig will set the default set of worker parameters which are needed for testing and connecting to a broker
