Configuring OCR engine
OpenKM can work with several OCR engines, like Tesseract.
OpenKM can be integrated with any OCR engine that can be executed from the command line.
Tesseract
Check the list of supported languages at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
More info about installation on different OSs and Linux distros at https://tesseract-ocr.github.io/tessdoc/Installation.html
Tesseract is an open-source OCR engine adopted by Google. It reads TIFF documents natively and achieves a high recognition rate on 300 dpi images converted to line art (1-bit color).
The supported image formats are:
- TIFF
- PNG
- JPG
- GIF
Installation
| OS | Description |
|---|---|
| Ubuntu and Debian | Install the package: `$ apt install tesseract-ocr`. The default installed language is English, so depending on your locale you may want to install another language file. For example, to install the Spanish language pack: `$ apt install tesseract-ocr-spa`. To check the installed languages: `$ tesseract --list-langs` |
| Red Hat and CentOS | On CentOS you can install Tesseract from the EPEL repository: `$ yum install epel-release`. The default installed language is English, so depending on your locale you may want to install another language file. For example, to install the Spanish language pack: `$ yum install tesseract-langpack-spa`. To check the installed languages: `$ tesseract --list-langs`. On Red Hat, we recommend contacting Red Hat support for more information. |
| Windows | Download from https://tesseract-ocr.github.io/tessdoc/Installation.html and follow the installation wizard. |
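
The language check described in the table can be scripted. The following is a minimal sketch, assuming a POSIX shell and that `tesseract` is on the PATH; `has_tesseract_lang` is a hypothetical helper name, not something shipped with Tesseract or OpenKM.

```shell
# has_tesseract_lang: hypothetical helper, not part of Tesseract or OpenKM.
# Succeeds (exit status 0) when the given language code is listed by Tesseract.
has_tesseract_lang() {
    # Some Tesseract releases print the language list to stderr,
    # so both streams are merged before filtering.
    # grep -x matches whole lines only, so "spa" does not match "spa_old".
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# Example call:
#   has_tesseract_lang spa && echo "Spanish language data is installed"
```

If the function fails for the language you need, install the matching language pack as shown in the table.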
Check installation
From the command line, execute (where image.jpg should be an existing image file):
$ tesseract image.jpg text
If all goes well, Tesseract creates a file named text.txt containing the extracted text.
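The manual check above could be wrapped in a small script. This is a sketch under stated assumptions: `ocr_smoke_test` is a hypothetical helper, and the image path and output base name are whatever you pass in.

```shell
# ocr_smoke_test: hypothetical helper, not part of Tesseract or OpenKM.
# Usage: ocr_smoke_test image.jpg text
ocr_smoke_test() {
    image="$1"
    out_base="$2"
    # Fail early if the tesseract command cannot be found.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 127
    fi
    if [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        return 1
    fi
    # Tesseract appends .txt to the output base name.
    tesseract "$image" "$out_base" || return 1
    # -s is true only when the file exists and is not empty.
    [ -s "$out_base.txt" ]
}
```

A zero exit status means that Tesseract ran and produced a non-empty text file; it says nothing about recognition quality.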
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr:
| OS | Value |
|---|---|
| Linux | `/usr/bin/tesseract ${fileIn} ${fileOut}` or `/usr/bin/tesseract ${fileIn} ${fileOut} -l spa`. The parameter -l spa indicates that Spanish language support must be used when executing OCR. To see all Tesseract options, execute: `$ tesseract` |
| Windows | `c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}` or `c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa`. The parameter -l spa indicates that Spanish language support must be used when executing OCR. To see all Tesseract options, execute: `c:\Tesseract-ocr-3.0.2> tesseract.exe` |

Parameter ${fileIn} will be replaced internally with the image document to be processed. Parameter ${fileOut} will be replaced internally with a temporary text file. You can set additional parameters, such as the -l spa parameter in the example. Parameters are separated by a single space.
Once configured, please go to Administration > Utilities > Plugins, choose com.openkm.plugin.extractor.TextExtractor from the list and enable the TesseractTextExtractor plugin.
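To preview what a system.ocr value looks like once the ${fileIn} and ${fileOut} placeholders are filled in, the substitution can be imitated in the shell. This is only an illustration: `expand_ocr_command` is a hypothetical helper, the file paths are made up, and the code does not reflect how OpenKM performs the replacement internally.

```shell
# expand_ocr_command: hypothetical helper that imitates the placeholder
# substitution for illustration; it is not OpenKM code.
# Usage: expand_ocr_command TEMPLATE INPUT_FILE OUTPUT_FILE
expand_ocr_command() {
    template="$1"
    file_in="$2"
    file_out="$3"
    # [$] matches a literal dollar sign in the template.
    # Limitation: paths containing "|", "&" or "\" would need escaping for sed.
    printf '%s\n' "$template" |
        sed -e "s|[\$]{fileIn}|$file_in|g" -e "s|[\$]{fileOut}|$file_out|g"
}

# Single quotes stop the shell from expanding ${fileIn} and ${fileOut} itself:
#   expand_ocr_command '/usr/bin/tesseract ${fileIn} ${fileOut} -l spa' /tmp/scan.tif /tmp/scan
# prints: /usr/bin/tesseract /tmp/scan.tif /tmp/scan -l spa
```

Running the printed command by hand is a quick way to confirm that the configured value works before enabling the plugin.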
Additional information
- http://code.google.com/p/tesseract-ocr/
- Tesseract - Summary & first experiences.
- Tesseract OCR Google Groups
- First Interactions with Tesseract OCR on Ubuntu Linux
- http://code.google.com/p/tesseract-ocr/wiki/ReadMe
- http://code.google.com/p/tesseract-ocr/wiki/FAQ
Dictionary
You can also use an OpenOffice.org dictionary to enhance the OCR process. Language-specific dictionaries are available from the OpenOffice.org Dictionary Repository.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.openoffice.dictionary:
| OS | Value |
|---|---|
| Linux | `/home/openkm/dictionary/en-GB.zip` |
| Windows | `c:\tomcat-8.5.69\dictionary\en-GB.zip` |

Dictionaries come in .zip or .oxt format; you can use either.
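Before setting system.openoffice.dictionary, it can help to confirm that the file has one of the two accepted extensions and is readable. The sketch below assumes a POSIX shell; `check_dictionary` is a hypothetical helper, not part of OpenKM.

```shell
# check_dictionary: hypothetical helper, not part of OpenKM.
# Succeeds when the path ends in .zip or .oxt and the file is readable.
check_dictionary() {
    path="$1"
    case "$path" in
        *.zip|*.oxt) ;;
        *) echo "unsupported extension: $path" >&2; return 1 ;;
    esac
    if [ ! -r "$path" ]; then
        echo "dictionary not readable: $path" >&2
        return 1
    fi
    return 0
}

# Example call:
#   check_dictionary /home/openkm/dictionary/en-GB.zip
```

Run the check as the operating system user that runs OpenKM, since readability depends on that user's permissions.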
OCR rotate configuration
There's a configuration parameter that forces OCR on rotated documents.
The feature is useful, for example, for pages scanned upside down, where OCR in the standard orientation fails.
When the feature is enabled, it is applied to every OCR extraction, which can have undesired effects on the OCR templates feature.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr.rotate:
| Property | Description |
|---|---|
| system.ocr.rotate | A list of rotation angles in degrees, separated by the character ";". For example: 90;180;270; |
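
A value for system.ocr.rotate can be sanity-checked before saving it. The following sketch splits the value on ";" and rejects entries that are not whole numbers; `parse_rotations` is a hypothetical helper, and the validation rule is an assumption rather than OpenKM's documented behavior.

```shell
# parse_rotations: hypothetical helper, not part of OpenKM.
# Prints one angle per line; fails if any entry contains a non-digit character.
parse_rotations() {
    # Turn ";" into newlines and drop empty entries,
    # such as the one left by a trailing ";".
    angles=$(printf '%s' "$1" | tr ';' '\n' | sed '/^$/d')
    # An empty value means no angles are configured.
    [ -n "$angles" ] || return 0
    if printf '%s\n' "$angles" | grep -q '[^0-9]'; then
        echo "invalid value: $1" >&2
        return 1
    fi
    printf '%s\n' "$angles"
}

# Example call:
#   parse_rotations '90;180;270;'
```

For the example value 90;180;270; the function prints the three angles on separate lines.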