Tesseract 4: disable page segmentation

Symptoms

Tesseract 4 performs automatic page segmentation by default: it analyses the page layout and splits it into regions before recognising the text. In many cases you would rather have the OCR document processed as a single uniform block of text.

Cause

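By default Tesseract extracts text using page segmentation mode 3 (fully automatic page segmentation, without orientation and script detection). In Tesseract 4 the mode is selected with the --psm option; earlier releases used -psm.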

Solution

The available page segmentation modes are described in the Tesseract command-line documentation:

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

Add the --psm option to the system.ocr configuration parameter. Mode 6 tells Tesseract to assume a single uniform block of text:

system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6
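
To check the resulting command outside the application, the placeholder expansion can be approximated with a short script. This is only a sketch: the helper build_ocr_command and the file names are hypothetical, and it assumes that ${fileIn} and ${fileOut} are replaced by plain file paths, which may differ from what the application actually does.

```python
import shlex
from string import Template

# Same value as the system.ocr parameter shown in this article.
OCR_TEMPLATE = "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng --psm 6"


def build_ocr_command(template, file_in, file_out):
    """Hypothetical helper: expand the placeholders into an argument list."""
    values = {"fileIn": file_in, "fileOut": file_out}
    # Split before substituting so that paths containing spaces
    # remain a single argument.
    return [Template(part).substitute(values) for part in shlex.split(template)]


if __name__ == "__main__":
    # "scan.png" and "scan" are example names, not values from the article.
    print(build_ocr_command(OCR_TEMPLATE, "scan.png", "scan"))
```

The printed list can be passed to subprocess.run on a machine where Tesseract is installed, which makes it easy to compare the output of --psm 6 with that of the default mode for the same image.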

Properties

Date

2018-12-21

Applies to

    Core.

Keywords

  • AllVersions