Configuring OCR engine

OpenKM can work with several OCR engines, for example Tesseract 2.x, Tesseract 3.x, Cuneiform or ABBYY, among others.

OpenKM can be integrated with any OCR engine that can be executed from the command line.

OpenKM comes with a TextExtractor plugin for each OCR engine. Only one TextExtractor should be enabled at a time, matching the OCR engine you use.

List of OCR TextExtractors:

  • com.openkm.extractor.Tesseract3TextExtractor (can be used for Tesseract versions 3.x and 4.x)
  • com.openkm.extractor.Tesseract2TextExtractor
  • com.openkm.extractor.AbbyTextExtractor
  • com.openkm.extractor.CuneiformTextExtractor

The most common configuration is to enable only the com.openkm.extractor.Tesseract3TextExtractor plugin.

More information about plugins is available at Administration > Utilities > Plugins.

Tesseract

Check the list of supported languages at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

Tesseract is an open source OCR engine adopted by Google. It reads TIFF documents natively and achieves a high recognition rate on images with a resolution of 300 dpi that have been converted to line art (1-bit color).

The supported image formats are:

  • TIFF
  • PNG
  • JPG
  • GIF

Installation

OS | Description

Ubuntu and Debian

$ apt-get install tesseract-ocr tesseract-ocr-eng

Depending on your locale, you may want to install additional language files.
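As a rough sketch of installing an additional language: on Debian and Ubuntu, the Tesseract language packages follow the tesseract-ocr-<code> naming pattern, where <code> is the three-letter Tesseract language code. The helper name lang_package is hypothetical and used only for illustration.

```shell
# Hypothetical helper: build the Debian/Ubuntu package name for a
# Tesseract language code (for example "spa" for Spanish).
lang_package() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Example usage (not run here; requires root privileges):
#   apt-get install "$(lang_package spa)"
# List the languages Tesseract can currently use:
#   tesseract --list-langs
```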

More information at:

Red Hat and CentOS

More information at:

Windows

Download from https://code.google.com/p/tesseract-ocr/ and follow the installation wizard.

Check installation

From the command line, execute (where image.jpg is an existing image file):

$ tesseract image.jpg text

The file image.jpg is an image file.

If all goes well, a file named text.txt containing the extracted text is created.
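The check above can be wrapped in a small script. This is a minimal sketch, assuming a POSIX shell; the function names ocr_engine_available and check_tesseract are hypothetical.

```shell
# Return 0 if the given command is found on the PATH, non-zero otherwise.
ocr_engine_available() {
    command -v "$1" >/dev/null 2>&1
}

# Run Tesseract on the image given as the first argument and confirm that
# a non-empty text.txt was produced in the current directory.
check_tesseract() {
    ocr_engine_available tesseract || { echo "tesseract not found" >&2; return 1; }
    tesseract "$1" text || return 1
    [ -s text.txt ]
}

# Example usage (requires Tesseract and an existing image.jpg):
#   check_tesseract image.jpg && echo "OCR works"
```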

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr:

OS 

Linux

/usr/bin/tesseract ${fileIn} ${fileOut}

or

/usr/bin/tesseract ${fileIn} ${fileOut} -l spa

The -l spa parameter tells Tesseract to use Spanish language support when running OCR.

To see all Tesseract options, execute:

$ tesseract

 

Windows

c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}

or

c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa

 

The -l spa parameter tells Tesseract to use Spanish language support when running OCR.

To see all Tesseract options, execute:

c:\Tesseract-ocr-3.0.2> tesseract.exe

The ${fileIn} parameter is replaced internally with the image document to be processed.

The ${fileOut} parameter is replaced internally with a temporary text file.

You can add other parameters, such as the -l spa parameter in the example.

Parameters are separated by a single space.
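To illustrate the placeholder substitution, here is a minimal sketch in POSIX shell. OpenKM performs this step internally and its actual implementation may differ; the function names replace_all and expand_ocr_command and the /tmp paths are hypothetical.

```shell
# Replace every occurrence of a literal search string ($2) in a string ($1)
# with a replacement ($3). Variables are global because POSIX sh has no
# "local"; the names are chosen to be unlikely to clash.
replace_all() {
    _rest=$1
    _out=
    while :; do
        case $_rest in
            *"$2"*)
                # Keep the text before the first match, then the replacement.
                _out=$_out${_rest%%"$2"*}$3
                # Continue with the text after the first match.
                _rest=${_rest#*"$2"}
                ;;
            *) break ;;
        esac
    done
    printf '%s\n' "$_out$_rest"
}

# Expand ${fileIn} and ${fileOut} in a system.ocr style template.
expand_ocr_command() {
    replace_all "$(replace_all "$1" '${fileIn}' "$2")" '${fileOut}' "$3"
}

# Example usage with made-up paths:
#   expand_ocr_command '/usr/bin/tesseract ${fileIn} ${fileOut} -l spa' /tmp/in.png /tmp/out
```

The single quotes around the template keep the shell from expanding ${fileIn} and ${fileOut} itself, so the placeholders reach the function unchanged.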

Additional information

Cuneiform

We recommend using Tesseract rather than Cuneiform.

The supported image formats are:

  • TIFF
  • PNG
  • JPG
  • GIF

Installation

OS | Description

Ubuntu and Debian

$ apt-get install cuneiform

More information at:

Red Hat and CentOS

More information at:

Windows

Cannot be executed from the command line.

More information at:

Check installation

From the command line, execute (where image.jpg is an existing image file):

$ cuneiform image.jpg -o text.txt

The file image.jpg is an image file.

If all goes well, a file named text.txt containing the extracted text is created.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr:

OS 

Linux

/usr/bin/cuneiform ${fileIn} -o ${fileOut}

or

/usr/bin/cuneiform ${fileIn} -o ${fileOut} -l spa

The -l spa parameter tells Cuneiform to use Spanish language support when running OCR.

To see all Cuneiform options, execute:

$ cuneiform

Windows

Not available.

The ${fileIn} parameter is replaced internally with the image document to be processed.

The ${fileOut} parameter is replaced internally with a temporary text file.

You can add other parameters, such as the -l spa parameter in the example.

The -o parameter must be placed immediately before ${fileOut}.

Parameters are separated by a single space.

ABBYY OCR for Linux

OCR4 Linux is a commercial OCR engine. It is the slowest of all tested tools, but keep in mind that it also reads nearly any image format, while you probably need to convert your images for the other tools first.

Installation

OS | Description

Ubuntu, Debian, Red Hat and CentOS

More information at:

Windows

 Not available.

Check installation

From the command line, execute (where image.jpg is an existing image file):

$ /usr/local/bin/abbyyocr11 -ii -fm -if image.jpg -tet UTF8 -of text.txt

The file image.jpg is an image file.

If all goes well, a file named text.txt containing the extracted text is created.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr:

OS 

Linux

/usr/local/bin/abbyyocr11 -ii -fm -if ${fileIn} -tet UTF8 -of ${fileOut}

There are many parameters. To see all ocr4linux options, execute:

$ abbyyocr11 -?

Windows

Not available.

The ${fileIn} parameter is replaced internally with the image document to be processed.

The ${fileOut} parameter is replaced internally with a temporary text file.

You can also set other additional parameters.

Parameters are separated by a single space.

Dictionary

You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at the OpenOffice.org Dictionary Repository.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.openoffice.dictionary:

OS 

Linux

/home/openkm/dictionary/en-GB.zip

Windows

c:\tomcat-7.0.61\dictionary\en-GB.zip

Dictionaries come in .zip or .oxt format. Both can be used.

OCR rotate configuration

There is a configuration parameter that forces OCR to be attempted on rotated versions of the document.

This feature is useful, for example, for pages scanned upside down, where OCR in the standard orientation fails.

Configuration

Go to Administration > Configuration parameters and edit the parameter named system.ocr.rotate:

Property | Description

system.ocr.rotate

The value is a list of rotation angles in degrees, separated by the ";" character.

0;90;180;270;
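To make the value format concrete, the following sketch splits a system.ocr.rotate value into individual angles. It only illustrates the format described above and is not how OpenKM parses the parameter; the function name parse_rotate_angles is hypothetical.

```shell
# Split a ";"-separated list of angles and print one angle per line.
# Empty fields, such as the one left by a trailing ";", are skipped.
parse_rotate_angles() {
    printf '%s\n' "$1" | tr ';' '\n' | while IFS= read -r angle; do
        if [ -n "$angle" ]; then
            printf '%s\n' "$angle"
        fi
    done
}

# Example usage:
#   parse_rotate_angles '0;90;180;270;'
```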

Additional information: