Configuring the OCR engine
OpenKM can work with several OCR engines, for example Tesseract 2.x, Tesseract 3.x, Cuneiform or ABBYY.
OpenKM can be integrated with any OCR engine that can be executed from the command line.
Tesseract
Tesseract is an open source OCR engine adopted by Google. It is an effective option for most use cases. It reads TIFF documents natively and achieves a high recognition rate on images with a resolution of 300 dpi that have been converted to line art (1-bit color).
The supported image formats are:
- TIFF
- PNG
- JPG
- GIF
Installation
OS | Description |
---|---|
Ubuntu and Debian | $ apt-get install tesseract-ocr tesseract-ocr-eng Depending on your locale you may like to install other language files. More information at: |
Red Hat and CentOS | More information at: |
Windows | Download from https://code.google.com/p/tesseract-ocr/ and follow the installation wizard. |
Check installation
From the command line, run the following command, where image.jpg is an existing image file:
$ tesseract image.jpg text
If all goes well, a file named text.txt containing the extracted text is created.
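The manual check above can also be scripted. The following Python sketch runs the same `tesseract <image> <output base>` command and reports whether the expected .txt file appeared. The function name check_tesseract and its default argument are assumptions made for this example, not part of OpenKM or Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def check_tesseract(image, output_base="text"):
    """Run `tesseract <image> <output_base>` and return True on success.

    Hypothetical helper for illustration only. Success here means the
    command exited with status 0 and <output_base>.txt exists afterwards.
    """
    if shutil.which("tesseract") is None:
        return False  # tesseract is not on the PATH
    result = subprocess.run(
        ["tesseract", str(image), output_base],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and Path(output_base + ".txt").is_file()


print(check_tesseract("image.jpg"))
```

The call returns False both when Tesseract is not installed and when the image could not be processed, so a False result only tells you that something in the chain needs attention.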
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr:
OS | |
---|---|
Linux | /usr/bin/tesseract ${fileIn} ${fileOut} or /usr/bin/tesseract ${fileIn} ${fileOut} -l spa The -l spa parameter enables Spanish language support during OCR. To see all tesseract options execute: $ tesseract |
Windows | c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} or c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut} -l spa The -l spa parameter enables Spanish language support during OCR. To see all tesseract options execute: c:\Tesseract-ocr-3.0.2> tesseract.exe |
${fileIn} is replaced internally by the image document to be processed, and ${fileOut} by the temporary text file. You can add other parameters, such as -l spa. Parameters are separated by a single space. |
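To make the placeholder substitution concrete, here is a minimal Python sketch of how a system.ocr value could be expanded into a command line. It only mirrors the behaviour described above (split on single spaces, then replace ${fileIn} and ${fileOut}); it is not OpenKM's actual implementation, and the function name expand_ocr_command and the sample paths are assumptions.

```python
def expand_ocr_command(template, file_in, file_out):
    """Expand a system.ocr template into an argument list.

    Hypothetical helper: follows the rules described in the text,
    not OpenKM's real code.
    """
    args = []
    for token in template.split(" "):  # parameters are separated by one space
        token = token.replace("${fileIn}", file_in)
        token = token.replace("${fileOut}", file_out)
        args.append(token)
    return args


template = "/usr/bin/tesseract ${fileIn} ${fileOut} -l spa"
print(expand_ocr_command(template, "/tmp/scan.tif", "/tmp/scan"))
# ['/usr/bin/tesseract', '/tmp/scan.tif', '/tmp/scan', '-l', 'spa']
```

Because the sketch splits the template before substituting, a file path that contains spaces still ends up as a single argument.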
Additional information
- http://code.google.com/p/tesseract-ocr/
- Tesseract - Summary & first experiences.
- Tesseract OCR Google Groups
- First Interactions with Tesseract OCR on Ubuntu Linux
- http://code.google.com/p/tesseract-ocr/wiki/ReadMe
- http://code.google.com/p/tesseract-ocr/wiki/FAQ
Cuneiform
We recommend Tesseract over Cuneiform.
The supported image formats are:
- TIFF
- PNG
- JPG
- GIF
Installation
OS | Description |
---|---|
Ubuntu and Debian | $ apt-get install cuneiform More information at: |
Red Hat and CentOS | More information at: |
Windows | Not available to be executed from the command line. More information at: |
Check installation
From the command line, run the following command, where image.jpg is an existing image file:
$ cuneiform image.jpg -o text.txt
If all goes well, a file named text.txt containing the extracted text is created.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr:
OS | |
---|---|
Linux | /usr/bin/cuneiform ${fileIn} -o ${fileOut} or /usr/bin/cuneiform ${fileIn} -o ${fileOut} -l spa The -l spa parameter enables Spanish language support during OCR. To see all cuneiform options execute: $ cuneiform |
Windows | Not available. |
${fileIn} is replaced internally by the image document to be processed, and ${fileOut} by the temporary text file. You can add other parameters, such as -l spa. The -o parameter must come before the ${fileOut} parameter. Parameters are separated by a single space. |
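The rules in the paragraph above are easy to get wrong when editing the parameter by hand. The following Python sketch checks a Cuneiform system.ocr value against them before you save it. The function name validate_cuneiform_template is hypothetical, and reading "before" as "immediately before" is an assumption of this example.

```python
def validate_cuneiform_template(template):
    """Return a list of problems found in a Cuneiform system.ocr value.

    Hypothetical checker for illustration; an empty list means the
    template follows the rules described in the text.
    """
    tokens = template.split(" ")
    problems = []
    if "" in tokens:
        # An empty token means two spaces in a row (or a leading/trailing space).
        problems.append("parameters must be separated by a single space")
    if "${fileIn}" not in tokens:
        problems.append("missing ${fileIn}")
    if "${fileOut}" not in tokens:
        problems.append("missing ${fileOut}")
    else:
        position = tokens.index("${fileOut}")
        if position == 0 or tokens[position - 1] != "-o":
            problems.append("-o must come immediately before ${fileOut}")
    return problems


print(validate_cuneiform_template("/usr/bin/cuneiform ${fileIn} -o ${fileOut} -l spa"))
# []
```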
ABBYY OCR for Linux
ABBYY OCR for Linux (ocr4linux) is a commercial OCR engine. It is the slowest of the tools tested, but it reads nearly any image format, whereas the other tools may require you to convert your images first.
Installation
OS | Description |
---|---|
Ubuntu, Debian, Red Hat and CentOS | More information at: |
Windows | Not available. |
Check installation
From the command line, run the following command, where image.jpg is an existing image file:
$ /usr/local/bin/abbyyocr11 -ii -fm -if image.jpg -tet UTF8 -of text.txt
If all goes well, a file named text.txt containing the extracted text is created.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr:
OS | |
---|---|
Linux | /usr/local/bin/abbyyocr11 -ii -fm -if ${fileIn} -tet UTF8 -of ${fileOut} There are many parameters. To see all ocr4linux options execute: $ abbyyocr11 -? |
Windows | Not available. |
${fileIn} is replaced internally by the image document to be processed, and ${fileOut} by the temporary text file. You can add other parameters as needed. Parameters are separated by a single space. |
Dictionary
You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at the OpenOffice.org Dictionary Repository.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.openoffice.dictionary:
OS | |
---|---|
Linux | /home/openkm/dictionary/en-GB.zip |
Windows | c:\tomcat-7.0.61\dictionary\en-GB.zip |
Dictionaries come in .zip or .oxt format. Either can be used.
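Before pointing system.openoffice.dictionary at a downloaded file, it can help to confirm that the archive really contains dictionary data. The Python sketch below lists the .dic and .aff entries of an archive; it relies on the fact that .oxt extensions are ZIP archives and that OpenOffice.org spelling dictionaries ship as .dic/.aff pairs. The function name dictionary_entries is hypothetical, and whether OpenKM needs both files is an assumption of this example.

```python
import zipfile


def dictionary_entries(archive_path):
    """Return the .dic and .aff entries found in a .zip or .oxt archive.

    Hypothetical helper for illustration; raises zipfile.BadZipFile
    if the file is not a ZIP archive at all.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.endswith((".dic", ".aff"))
        )
```

For example, dictionary_entries("/home/openkm/dictionary/en-GB.zip") returns the matching entry names, and an empty list suggests the download is not a spelling dictionary.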
OCR rotate configuration
A configuration parameter forces OCR to be run on rotated versions of the document.
This is useful, for example, for pages scanned upside down, where OCR in the standard orientation fails.
When the feature is enabled, the rotation is applied to every OCR extraction, which can have undesired effects on the OCR templates feature.
Configuration
Go to Administration > Configuration parameters and edit the parameter named system.ocr.rotate:
Property | Description |
---|---|
system.ocr.rotate | A list of rotation angles in degrees, separated by the ";" character, for example: 0;90;180;270; |
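As a sketch of the value format, the following Python function splits a system.ocr.rotate value into a list of angles. It only illustrates the syntax described above (degrees separated by ";", with an optional trailing separator); the function name parse_rotations is an assumption, and OpenKM's own parsing may differ.

```python
def parse_rotations(value):
    """Parse a system.ocr.rotate value such as "0;90;180;270;" into integers.

    Hypothetical helper: empty segments (for example after a trailing ";")
    are ignored.
    """
    return [int(part) for part in value.split(";") if part.strip()]


print(parse_rotations("0;90;180;270;"))
# [0, 90, 180, 270]
```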