Tesseract 4 disable page segmentation
Symptoms
Tesseract version 4 runs automatic page segmentation (layout analysis) by default. In many cases you would rather have the OCR engine process the document as a single uniform block of text instead of splitting the page into segments.
Cause
Tesseract version 4 extracts text using automatic page segmentation by default (page segmentation mode 3).
Solution
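The available options are described in the Tesseract command-line usage documentation:
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
In short, add the --psm parameter to the system.ocr configuration parameter. The value 6 tells Tesseract to assume a single uniform block of text: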
system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6
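To check the effect of the flag outside the application, the same command line can be assembled by hand. The following is a minimal sketch, not part of the product: the helper name build_ocr_command and the sample file names are hypothetical, and it assumes the binary lives at /usr/bin/tesseract as in the configuration line above. It only builds and prints the command; running it is left to the reader.

```python
import shlex


def build_ocr_command(file_in, file_out, lang="eng", psm=6,
                      binary="/usr/bin/tesseract"):
    """Build the argument list that mirrors the system.ocr line.

    file_in and file_out stand in for the ${fileIn} and ${fileOut}
    placeholders. The function name and defaults are illustrative
    assumptions, not an official API.
    """
    return [binary, file_in, file_out, "-l", lang, "--psm", str(psm)]


# Hypothetical sample files, used only to show the resulting command.
command = build_ocr_command("page.png", "page")
print(shlex.join(command))
```

The list can be passed to subprocess.run(command, check=True) on a machine where Tesseract 4 is installed. Tesseract treats the second argument as an output base name and appends .txt to it for plain-text output.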
Properties
Properties | |
---|---|
Date | 2018-12-21 |
Applies to | |
Keywords
- AllVersions