Configuring OCR engine
OpenKM can work with several OCR engines, like Tesseract.
OpenKM can be integrated with any OCR engine that can be executed from the command line.
Tesseract
Check the list of supported languages at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
For installation instructions on different operating systems and Linux distributions, see https://tesseract-ocr.github.io/tessdoc/Installation.html
Tesseract is an open source OCR engine adopted by Google. It reads TIFF documents natively and achieves a high recognition rate on images with a resolution of 300 dpi that have been converted to line art (1-bit color).
The supported image formats are:
- TIFF
- PNG
- JPG
- GIF
Installation
OS | Description |
---|---|
Ubuntu and Debian | Install Tesseract with `apt install tesseract-ocr`. English is the only language installed by default, so depending on your locale you may want to install an additional language file. For example, to install Spanish: `apt install tesseract-ocr-spa`. To list the installed languages, run `tesseract --list-langs`. |
Red Hat and CentOS | On CentOS, Tesseract can be installed from the EPEL repository: `yum install epel-release`. English is the only language installed by default, so depending on your locale you may want to install an additional language file. For example, to install Spanish: `yum install tesseract-langpack-spa`. To list the installed languages, run `tesseract --list-langs`. On Red Hat, we recommend contacting Red Hat support for more information. |
Windows | Download the installer from https://tesseract-ocr.github.io/tessdoc/Installation.html and follow the installation wizard. |
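To confirm that a language pack such as spa is actually installed, the output of tesseract --list-langs can be checked from a script. The following is a minimal sketch, not part of OpenKM: the function names parse_langs and installed_langs are hypothetical, and it assumes the output consists of a "List of available languages" header followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some older releases print the list to stderr, so both streams are read.
    # Assumption: no other diagnostics are mixed into the output.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("spa installed:", "spa" in installed_langs())
```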
Check installation
From the command line, run the following, where image.jpg is an existing image file:
$ tesseract image.jpg text
If all goes well, a file named text.txt containing the extracted text is created.
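The same check can be automated. The sketch below is illustrative only: the helper names are hypothetical, and it assumes tesseract is on the PATH and that the output file is written to the current directory with the .txt suffix appended, as described above.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(image: str, out_base: str) -> list[str]:
    """Build the same command line as `tesseract image.jpg text`."""
    return ["tesseract", image, out_base]


def expected_output(out_base: str) -> Path:
    """Tesseract appends .txt to the output base name."""
    return Path(out_base + ".txt")


def check_installation(image: str = "image.jpg", out_base: str = "text") -> bool:
    """Run OCR on one image and report whether the text file was produced."""
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH")
        return False
    if not Path(image).is_file():
        print(f"{image} does not exist")
        return False
    result = subprocess.run(ocr_command(image, out_base),
                            capture_output=True, text=True)
    ok = result.returncode == 0 and expected_output(out_base).is_file()
    if not ok:
        print(result.stderr)
    return ok


if __name__ == "__main__":
    print("OCR check passed:", check_installation())
```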
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr:
OS | |
---|---|
Linux | `/usr/bin/tesseract ${fileIn} ${fileOut}` or `/usr/bin/tesseract ${fileIn} ${fileOut} -l spa`. The option -l spa tells Tesseract to use Spanish language support when running OCR. To see all Tesseract options, run `tesseract` without arguments. |
Windows | `c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}` or `c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa`. The option -l spa tells Tesseract to use Spanish language support when running OCR. To see all Tesseract options, run `tesseract.exe` without arguments from c:\Tesseract-ocr-3.0.2. |
The ${fileIn} placeholder is replaced internally with the image document to be processed, and ${fileOut} with a temporary text file. You can add further options, such as -l spa in the example. Separate each parameter with a single space.
Once configured, go to Administration > Utilities > Plugins, choose com.openkm.plugin.extractor.TextExtractor from the select list, and enable the TesseractTextExtractor plugin.
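To make the role of the ${fileIn} and ${fileOut} placeholders in system.ocr more concrete, here is a small sketch of how such a template could be expanded into a command line. It is not OpenKM's actual implementation, which is not shown in this page; build_ocr_command is a hypothetical helper, and the file paths are made up for illustration.

```python
from string import Template


def build_ocr_command(setting: str, file_in: str, file_out: str) -> list[str]:
    """Expand a system.ocr style template into an argument list."""
    # Split on single spaces first, then substitute inside each token,
    # so a file path that contains spaces remains one argument.
    tokens = [t for t in setting.split(" ") if t]
    return [
        Template(t).safe_substitute(fileIn=file_in, fileOut=file_out)
        for t in tokens
    ]


if __name__ == "__main__":
    linux = "/usr/bin/tesseract ${fileIn} ${fileOut} -l spa"
    windows = r"c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa"
    # Hypothetical input and output paths.
    print(build_ocr_command(linux, "/tmp/scan.tif", "/tmp/scan-out"))
    print(build_ocr_command(windows, r"c:\temp\scan.tif", r"c:\temp\scan-out"))
```

Splitting before substituting is a deliberate choice in this sketch: it mirrors the rule that parameters are separated by a single space while keeping substituted paths intact.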
Additional information
- http://code.google.com/p/tesseract-ocr/
- Tesseract - Summary & first experiences.
- Tesseract OCR Google Groups
- First Interactions with Tesseract OCR on Ubuntu Linux
- http://code.google.com/p/tesseract-ocr/wiki/ReadMe
- http://code.google.com/p/tesseract-ocr/wiki/FAQ
Dictionary
You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at the OpenOffice.org Dictionary Repository.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.openoffice.dictionary:
OS | |
---|---|
Linux | /home/openkm/dictionary/en-GB.zip |
Windows | c:\tomcat-8.5.69\dictionary\en-GB.zip |
Dictionaries come in .zip or .oxt format. Both can be used.
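Before pointing system.openoffice.dictionary at a downloaded archive, it can help to confirm that the file opens as a zip archive and contains dictionary data. The sketch below rests on two assumptions that are not stated in this page: that an .oxt file is a zip archive, and that the dictionary ships Hunspell-style .dic and .aff files. The function name dictionary_files is hypothetical.

```python
import zipfile


def dictionary_files(archive_path: str) -> list[str]:
    """List the .dic and .aff entries inside a .zip or .oxt archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith((".dic", ".aff"))
        )


if __name__ == "__main__":
    # Path taken from the Linux example above; adjust to your installation.
    print(dictionary_files("/home/openkm/dictionary/en-GB.zip"))
```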
OCR rotate configuration
There is a configuration parameter that forces OCR to be run on rotated versions of the document.
This feature is useful, for example, for pages scanned upside down, where OCR in the standard orientation fails.
When the feature is enabled, it applies to all OCR extractions, which can have undesired effects on the OCR templates feature.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr.rotate:
Property | Description |
---|---|
system.ocr.rotate | A list of rotation angles in degrees, separated by the ";" character, for example: 90;180;270; |
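The value format can be illustrated with a short parser. This is only a sketch of how a string such as 90;180;270; maps to a list of angles; parse_rotations is a hypothetical name, and how OpenKM reads the parameter internally is not documented here.

```python
def parse_rotations(value: str) -> list[int]:
    """Turn a value like "90;180;270;" into a list of integer degrees."""
    degrees = []
    for part in value.split(";"):
        part = part.strip()
        if not part:
            # The trailing ";" yields an empty item, which is skipped.
            continue
        degrees.append(int(part))
    return degrees


if __name__ == "__main__":
    print(parse_rotations("90;180;270;"))
```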