Tesseract 4: disable page segmentation

Symptoms

Tesseract 4 performs automatic page segmentation by default. In many cases you would rather have the OCR process the document as a single uniform block of text.

Cause

By default, Tesseract 4 extracts text using automatic page segmentation (page segmentation mode 3).

Solution

The page segmentation modes are described in the Tesseract command-line documentation:

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

Add the --psm parameter to the system.ocr configuration parameter. Mode 6 tells Tesseract to assume a single uniform block of text:

system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6
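To check the resulting command line before changing the configuration, the placeholder expansion can be sketched in Python. This is only an illustration: the function name build_ocr_command and the sample paths are hypothetical, and the way the application itself expands ${fileIn} and ${fileOut} is assumed, not taken from its source.

```python
import shlex
from string import Template

# Value of system.ocr as configured above.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6"


def build_ocr_command(template, file_in, file_out):
    """Expand the placeholders and return the command as an argument list.

    Hypothetical helper: the real application may substitute differently.
    """
    # Split first, then substitute per token, so that a path containing
    # spaces stays a single argument.
    tokens = shlex.split(template)
    return [Template(t).substitute(fileIn=file_in, fileOut=file_out) for t in tokens]


if __name__ == "__main__":
    # Sample paths are placeholders for illustration only.
    print(build_ocr_command(SYSTEM_OCR, "/tmp/scan.png", "/tmp/scan"))
```

The returned list can be passed to subprocess.run on a machine where Tesseract is installed. Tesseract treats the second argument as an output base name and appends .txt to it.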

Properties

Date

2018-12-21

Applies to

    Core.

Keywords

  • AllVersions