Tesseract 4: disable page segmentation

Symptoms

By default, Tesseract 4 runs automatic page segmentation, which analyzes the page layout before recognizing text. In many cases you want the OCR document processed as a single uniform block of text instead.

Cause

Tesseract 4 extracts text using automatic page segmentation (page segmentation mode 3) unless another mode is selected with the --psm parameter.

Solution

The available page segmentation modes are described in the Tesseract command-line documentation:

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

Add the --psm parameter to the system.ocr configuration parameter. The value 6 tells Tesseract to assume a single uniform block of text:

system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6
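To check what command the setting above produces, the following minimal sketch expands the template and prints the resulting argument list. It assumes the ${fileIn} and ${fileOut} placeholders are replaced by plain file paths; the function name build_ocr_command and the paths under /tmp are hypothetical and used only for illustration. The command is executed only if both the Tesseract binary and the input file exist.

```python
import os
import shlex
import shutil
import subprocess
from string import Template

# Value of system.ocr exactly as given in this article.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6"


def build_ocr_command(file_in, file_out, template=SYSTEM_OCR):
    """Expand ${fileIn} and ${fileOut} and split the line into an argv list.

    Assumption: the placeholders are substituted with plain file paths.
    """
    # Quote the paths so that spaces survive the shlex.split call below.
    line = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(line)


if __name__ == "__main__":
    # Hypothetical paths; Tesseract appends ".txt" to the output base name.
    source_image = "/tmp/scan.png"
    argv = build_ocr_command(source_image, "/tmp/scan")
    print(argv)
    # Run the command only when the binary and the input image are present.
    if shutil.which(argv[0]) and os.path.isfile(source_image):
        subprocess.run(argv, check=True)
```

Running the printed command by hand on a sample image, once with --psm 6 and once without it, is a quick way to compare the two outputs before changing the configuration.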

Properties

Date	2018-12-21
Applies to	Core.

Keywords

AllVersions

