Configuring the OCR engine

OpenKM can work with several OCR engines, like Tesseract.

OpenKM can be integrated with any OCR engine that can be executed from the command line.

Tesseract

Check the list of supported languages at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

Tesseract is an open source OCR engine adopted by Google. It is an effective option for most use cases. It can read TIFF documents natively and achieves a high recognition rate with images at 300 dpi resolution that have been converted to lineart (1-bit color).
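Converting to lineart means reducing every pixel to either black or white. The following is a minimal sketch of that idea in plain Python, assuming 8-bit grayscale input and a fixed threshold of 128. Both are assumptions made for illustration, and the function name to_lineart is hypothetical; in practice scans are usually converted with a scanner setting or an image tool.

```python
def to_lineart(gray_rows, threshold=128):
    """Convert rows of 8-bit grayscale values (0-255) to 1-bit values.

    Returns 1 for white (value >= threshold) and 0 for black.
    The default threshold is an illustrative choice, not a Tesseract setting.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]


# A tiny 2x3 "image": dark pixels become 0, light pixels become 1.
sample = [[12, 200, 255], [130, 90, 127]]
print(to_lineart(sample))  # prints [[0, 1, 1], [1, 0, 0]]
```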

The supported image formats are:

TIFF
PNG
JPG
GIF
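Before sending a file to the OCR engine, it can help to check that its extension matches one of the formats listed above. This is a small sketch, not part of OpenKM; the function name is hypothetical, and the ".tif" and ".jpeg" spellings are added as an assumption because they are common alternatives.

```python
from pathlib import Path

# Extensions for the formats listed above. ".tif" and ".jpeg" are common
# alternative spellings added here as an assumption.
SUPPORTED_EXTENSIONS = {".tiff", ".tif", ".png", ".jpg", ".jpeg", ".gif"}


def is_supported_image(filename):
    """Return True if the file extension matches a supported image format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_image("scan_001.TIFF"))  # prints True
print(is_supported_image("report.pdf"))     # prints False
```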

Installation

Ubuntu and Debian

$ sudo apt-get install tesseract-ocr tesseract-ocr-eng

Depending on your locale, you may want to install other language files.

More information at:

https://github.com/tesseract-ocr
https://code.google.com/p/tesseract-ocr/

Red Hat and CentOS

More information at:

https://code.google.com/p/tesseract-ocr/wiki/Compiling
https://code.google.com/p/python-tesseract/wiki/HowToCompileForCentos
http://www.vicchiam.com/blog/?p=168

Windows

Download from https://code.google.com/p/tesseract-ocr/ and follow the installation wizard.

Check installation

From the command line execute (where "image.jpg" should be an existing image file):

$ tesseract image.jpg text

If all goes well, a file named text.txt containing the extracted text will be created.
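The same check can be scripted. The sketch below assumes the tesseract executable is on the PATH and that it writes its result to the output base name plus ".txt", as described above. The function names are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_check_command(image_path, output_base):
    """Build the command shown above: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def check_tesseract(image_path, output_base="text"):
    """Run tesseract on image_path and return the extracted text.

    Raises RuntimeError if the executable is not on PATH or if the
    expected output file is not produced.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_check_command(image_path, output_base), check=True)
    output_file = Path(f"{output_base}.txt")  # tesseract appends ".txt"
    if not output_file.exists():
        raise RuntimeError(f"expected output file {output_file} was not created")
    return output_file.read_text(encoding="utf-8")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_ocr.py image.jpg
    print(check_tesseract(sys.argv[1]))
```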

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr:

Linux

/usr/bin/tesseract ${fileIn} ${fileOut}

or, for Spanish language support:

/usr/bin/tesseract ${fileIn} ${fileOut} -l spa

The -l spa option selects Spanish for OCR. The Spanish language files must be installed.

To see all tesseract options execute:

$ tesseract

Windows

c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}

or, for Spanish language support:

c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa

To see all tesseract options execute:

c:\Tesseract-ocr-3.0.2> tesseract.exe

The ${fileIn} parameter is replaced internally by the image document to be processed.

The ${fileOut} parameter is replaced internally by the temporary text file.

You can add other parameters, such as -l spa.

Separate each parameter with a single space.
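To illustrate how such a command template could be expanded, here is a minimal sketch. It is not OpenKM's actual implementation: the function name and the file paths are hypothetical, and splitting before substitution is a design choice made here so that a path containing spaces stays in a single argument.

```python
def expand_ocr_command(template, file_in, file_out):
    """Expand an OCR command template into an argument list.

    The template is split on white space first, so each parameter becomes
    one argument. The ${fileIn} and ${fileOut} placeholders are replaced
    after splitting, which keeps paths with spaces in one argument.
    """
    replacements = {"${fileIn}": file_in, "${fileOut}": file_out}
    return [replacements.get(token, token) for token in template.split()]


command = expand_ocr_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut} -l spa",
    "/tmp/scan 01.tif",   # hypothetical input image
    "/tmp/ocr-output",    # hypothetical temporary output base name
)
print(command)
```

The resulting list can be passed to a process launcher directly, without going through a shell.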

Additional information

Dictionary

You can also use an OpenOffice.org dictionary to enhance the OCR process. These language-specific dictionaries are available in the OpenOffice.org Dictionary Repository.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.openoffice.dictionary:

OS	Example value
Linux	/home/openkm/dictionary/en-GB.zip
Windows	c:\tomcat-7.0.61\dictionary\en-GB.zip

Dictionaries come in .zip or .oxt format. Either can be used.
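Both formats are ZIP archives, so a downloaded dictionary can be inspected before it is configured. The sketch below assumes the archive holds spell-checking files with .dic and .aff extensions, as OpenOffice.org dictionaries normally do. Whether OpenKM performs a similar check is not stated here, and the function name is hypothetical.

```python
import zipfile
from pathlib import Path


def list_dictionary_files(archive_path):
    """Return the .dic and .aff entries found in a .zip or .oxt archive."""
    path = Path(archive_path)
    if path.suffix.lower() not in {".zip", ".oxt"}:
        raise ValueError(f"unexpected dictionary extension: {path.suffix}")
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith((".dic", ".aff"))
        )


# Example call (the path is the Linux value from the table above):
# list_dictionary_files("/home/openkm/dictionary/en-GB.zip")
```

An empty result suggests the archive is not a usable dictionary package.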

OCR rotate configuration

A configuration parameter forces OCR to be attempted on rotated documents.

This is useful, for example, for pages scanned upside down, where OCR in the standard orientation fails.

When the feature is enabled, the operation is applied to every OCR extraction, which can have undesired effects on the OCR templates feature.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr.rotate:

Property	Description
system.ocr.rotate	A list of rotation angles in degrees, separated by the ";" character. Example: 0;90;180;270;
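As an illustration of the value format, the following sketch parses such a string into a list of integers. It is not OpenKM's own parser: the function name is hypothetical, and both the tolerance for a trailing ";" and the 0 to 359 range check are assumptions made for this example.

```python
def parse_rotation_angles(value):
    """Parse a ';'-separated list of degrees such as '0;90;180;270;'.

    Empty items (for example after a trailing ';') are ignored.
    The range check is an assumption of this sketch.
    """
    angles = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        angle = int(item)
        if not 0 <= angle < 360:
            raise ValueError(f"rotation out of range: {angle}")
        angles.append(angle)
    return angles


print(parse_rotation_angles("0;90;180;270;"))  # prints [0, 90, 180, 270]
```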



OpenKM 6.4
