Tesseract 4: disable page segmentation
Symptoms
Tesseract 4 performs automatic page segmentation by default (page segmentation mode 3). In many cases you want the OCR step to treat the document as a single uniform block of text instead.
Cause
Unless a page segmentation mode is given on the command line, Tesseract 4 extracts text using automatic page segmentation.
Solution
The Tesseract command-line options are documented at:
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
Add the --psm parameter to the system.ocr configuration parameter. Mode 6 makes Tesseract assume a single uniform block of text:
system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6
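To check the effect of --psm 6 outside the application, the same command can be run by hand. The following is a minimal Python sketch, not part of the product: the helper names build_tesseract_command and run_ocr and the file names scan.png and scan are hypothetical, and it assumes the tesseract executable is on the PATH.

```python
import shutil
import subprocess

# Page segmentation mode 6: assume a single uniform block of text.
PSM_SINGLE_BLOCK = 6


def build_tesseract_command(file_in, file_out, lang="eng", psm=PSM_SINGLE_BLOCK):
    """Return the argument list matching the system.ocr line above.

    file_in and file_out stand in for ${fileIn} and ${fileOut}
    (hypothetical helper, for illustration only).
    """
    return ["tesseract", str(file_in), str(file_out), "-l", lang, "--psm", str(psm)]


def run_ocr(file_in, file_out, lang="eng", psm=PSM_SINGLE_BLOCK):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(file_in, file_out, lang, psm), check=True)
    # Tesseract appends ".txt" to the output base name.
    return f"{file_out}.txt"


if __name__ == "__main__":
    # Print the command only; call run_ocr("scan.png", "scan") to execute it.
    print(" ".join(build_tesseract_command("scan.png", "scan")))
```

Running the same image with and without --psm 6 and comparing the two text files is a quick way to confirm that the setting gives the expected result before changing system.ocr.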
Properties
Properties | |
---|---|
Date | 2018-12-21 |
Applies to | |
Keywords
- AllVersions