Configuring OCR engine

OpenKM can work with several OCR engines, like Tesseract.

OpenKM can be integrated with any OCR engine that can be executed from the command line.

Tesseract

Check the list of supported languages at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

More information about installation on different OSes and Linux distros is available at https://tesseract-ocr.github.io/tessdoc/Installation.html

Tesseract is an open-source OCR engine adopted by Google. It can natively read TIFF documents and achieves a high recognition rate on images scanned at 300 dpi and converted to line art (1-bit color).

The supported image formats are:

TIFF
PNG
JPG
GIF
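
A script that hands files to the OCR engine could filter them by extension first. The following is only a sketch: the function name is_supported_image is hypothetical, and treating .tif and .jpeg as alternative spellings of TIFF and JPG is an assumption, not something stated above.

```shell
# Hypothetical helper: succeed if the file name ends in one of the
# image extensions listed above (TIFF, PNG, JPG, GIF).
is_supported_image() {
    # A name without a dot has no extension at all.
    case $1 in
        *.*) ;;
        *) return 1 ;;
    esac
    # Lower-case the extension so SCAN.TIF and scan.tif are treated alike.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case $ext in
        tif|tiff|png|jpg|jpeg|gif) return 0 ;;   # .tif/.jpeg are assumed aliases
        *) return 1 ;;
    esac
}

# Example: report which of these names would be passed to OCR.
for name in scan.TIF invoice.pdf photo.jpg; do
    if is_supported_image "$name"; then
        echo "$name: supported"
    else
        echo "$name: skipped"
    fi
done
```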

Installation

Ubuntu and Debian

$ apt install tesseract-ocr

The default installed language is English, so depending on your locale you may want to install another language file.

For example, to install the Spanish language pack:

$ apt install tesseract-ocr-spa

You can check the installed languages this way:

$ tesseract --list-langs
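
The language check can also be scripted. The sketch below assumes that the first line of the --list-langs output is a header and that each following line is one language code; the function name has_language and the sample output are illustrative only.

```shell
# Hypothetical helper: succeed if language code $1 appears in
# `tesseract --list-langs` output read from standard input.
# Assumption: the first line is a header, so it is skipped.
has_language() {
    tail -n +2 | grep -qxF -- "$1"
}

# Illustrative sample of the output format (not captured from a real system).
sample='List of available languages (3):
eng
osd
spa'

if printf '%s\n' "$sample" | has_language spa; then
    echo "Spanish language data is installed"
fi

# On a machine with Tesseract installed, the real check would be:
#   tesseract --list-langs 2>&1 | has_language spa
# (2>&1 merges stderr in case the list is printed there.)
```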

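More information at:

https://github.com/tesseract-ocr/tesseract
https://tesseract-ocr.github.io/tessdoc/Installation.html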

Red Hat and CentOS

On CentOS, you can install Tesseract from the EPEL repository:

$ yum install epel-release
$ yum install tesseract

The default installed language is English, so depending on your locale you may want to install another language file.

For example, to install the Spanish language pack:

$ yum install tesseract-langpack-spa

You can check the installed languages this way:

$ tesseract --list-langs

On Red Hat, we recommend contacting Red Hat support for more information.

More information at:

https://tesseract-ocr.github.io/tessdoc/Installation.html

Windows

Download from https://tesseract-ocr.github.io/tessdoc/Installation.html and follow the installation wizard.

Check installation

From the command line, execute (where image.jpg should be an existing image file):

$ tesseract image.jpg text

If all goes well, Tesseract creates a file named text.txt containing the extracted text.
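
The same check can be wrapped in a small script, for example to run it after each installation. This is a minimal sketch: the function names have_command and check_ocr and the return codes are hypothetical; only the tesseract invocation itself comes from the text above.

```shell
# Succeed if the named command is available on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical smoke test: OCR an image and confirm that <base>.txt appears.
# Arguments: $1 = existing image file, $2 = output base name
# (Tesseract appends .txt to it).
check_ocr() {
    image=$1
    base=$2
    if ! have_command tesseract; then
        echo "tesseract not found on PATH" >&2
        return 2
    fi
    if [ ! -f "$image" ]; then
        echo "no such image: $image" >&2
        return 3
    fi
    tesseract "$image" "$base" || return 1
    # The run only counts as successful if the text file was created.
    [ -f "$base.txt" ]
}

# Usage, equivalent to the manual check above:
#   check_ocr image.jpg text && echo "OCR works"
```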

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr:

Linux

/usr/bin/tesseract ${fileIn} ${fileOut}

or

/usr/bin/tesseract ${fileIn} ${fileOut} -l spa

The parameter -l spa indicates that Spanish language support must be used while executing OCR.

To see all Tesseract options, execute:

$ tesseract

Windows

c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}

or

c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa

The parameter -l spa indicates that Spanish language support must be used while executing OCR.

To see all Tesseract options, execute:

c:\Tesseract-ocr-3.0.2> tesseract.exe

The parameter ${fileIn} will be replaced internally by the image document to be processed.

The parameter ${fileOut} will be replaced internally by a temporary text file.

You can set other additional parameters like the -l spa parameter in the example.

Parameters are separated from each other by a single space.
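
To see what the final command line looks like, the placeholder substitution can be imitated in a shell. OpenKM performs the real substitution internally; the sketch below only mimics it, and the function name replace_token and the paths /tmp/scan.png and /tmp/ocr-out are made up for illustration.

```shell
# Hypothetical helper: replace the first occurrence of a literal token
# in a string and print the result. Written for POSIX sh.
replace_token() {
    text=$1
    token=$2
    value=$3
    case $text in
        *"$token"*)
            # Part before the token, the new value, part after the token.
            printf '%s%s%s\n' "${text%%"$token"*}" "$value" "${text#*"$token"}"
            ;;
        *)
            printf '%s\n' "$text"
            ;;
    esac
}

# Single quotes keep ${fileIn} and ${fileOut} literal, as in system.ocr.
template='/usr/bin/tesseract ${fileIn} ${fileOut} -l spa'
step=$(replace_token "$template" '${fileIn}' '/tmp/scan.png')
replace_token "$step" '${fileOut}' '/tmp/ocr-out'
# prints: /usr/bin/tesseract /tmp/scan.png /tmp/ocr-out -l spa
```

Running the printed command by hand is a quick way to test a system.ocr value before saving it.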

Once configured, please go to Administration > Utilities > Plugins, choose com.openkm.plugin.extractor.TextExtractor from the select box and enable the TesseractTextExtractor plugin.

Additional information

Dictionary

You can also use an OpenOffice.org dictionary to enhance the OCR process. Language-specific dictionaries are available from the OpenOffice.org Dictionary Repository.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.openoffice.dictionary:

OS	Example value
Linux	/home/openkm/dictionary/en-GB.zip
Windows	c:\tomcat-8.5.69\dictionary\en-GB.zip

Dictionaries come in .zip or .oxt format. You can use either.

OCR rotate configuration

There's a configuration parameter that forces OCR on rotated documents.

The feature is useful, for example, for upside-down scanned pages, where the standard OCR direction will fail.

When the feature is enabled, the operation will be performed on all OCR extractions, which can have undesired effects on the OCR templates feature.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr.rotate:

Property	Description
system.ocr.rotate	A list of rotation angles in degrees, separated by the character ";". For example: 90;180;270;
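
Before saving a value for system.ocr.rotate, its format can be checked with a few lines of shell. This sketch only illustrates the ";"-separated format; how OpenKM parses the value internally is not described here, and the function name parse_rotations is hypothetical.

```shell
# Hypothetical helper: split a ";"-separated list of degrees and print
# one value per line. Fails if any field is not a whole number.
parse_rotations() {
    value=$1
    while [ -n "$value" ]; do
        deg=${value%%";"*}                  # text before the first ";"
        case $value in
            *";"*) value=${value#*";"} ;;   # drop that field and its ";"
            *)     value= ;;                # last field, nothing left
        esac
        case $deg in
            '') ;;                          # ignore empty fields
            *[!0-9]*)
                echo "not a number: $deg" >&2
                return 1
                ;;
            *) printf '%s\n' "$deg" ;;
        esac
    done
}

parse_rotations '90;180;270;'
# prints 90, 180 and 270 on separate lines
```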

Additional information:

OpenKM 8.1
