OCR templates

The OCR templates feature lets you create zonal OCR templates for recognising scanned documents and extracting structured text from them.

For good text recognition from the OCR engine, scan images at a resolution of at least 200 dpi.

Understanding the stages of OCR template execution:

  • Recognise
  • Data extraction
  • Save data

In the recognition stage, the application tries to identify the document based on the OCR template's definition. A document is considered recognised only if an OCR template matches 100%.

During data extraction, the values are extracted from the document.

In the save stage, the dynamic values obtained in the data extraction stage are saved into metadata groups. Optionally, static values defined on the OCR template (normally used to identify the document type) are saved as well.
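The three stages can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `OcrTemplate` class, the zone names, and the property names are hypothetical and are not the application's API, and the OCR text of each zone is taken as already available.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OcrTemplate:
    """Hypothetical in-memory model of an OCR template (not the product's API)."""
    name: str
    validations: dict  # zone name -> regex the zone text must fully match
    captures: dict     # zone name -> metadata property that receives the text
    static_values: dict = field(default_factory=dict)  # e.g. document type


def recognise(template, zones):
    """Stage 1: the template matches only if every validation passes (100%)."""
    return all(
        re.fullmatch(pattern, zones.get(zone, "")) is not None
        for zone, pattern in template.validations.items()
    )


def extract(template, zones):
    """Stage 2: read the dynamic values from the capture zones."""
    return {prop: zones.get(zone, "") for zone, prop in template.captures.items()}


def save(template, extracted):
    """Stage 3: combine the optional static values with the dynamic values."""
    metadata = dict(template.static_values)
    metadata.update(extracted)
    return metadata


def process(templates, zones):
    """Run the three stages with the first template that recognises the document."""
    for template in templates:
        if recognise(template, zones):
            return save(template, extract(template, zones))
    return None  # no template matched: the document is not recognised


# Hypothetical template and OCR output, for illustration only.
invoice = OcrTemplate(
    name="invoice",
    validations={"header": r"INVOICE"},
    captures={"number_zone": "invoice.number"},
    static_values={"invoice.type": "purchase"},
)
zones = {"header": "INVOICE", "number_zone": "2024-001"}
result = process([invoice], zones)
```

A document whose header zone does not read `INVOICE` fails the first stage, so nothing is extracted or saved for it.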

There are two field types in OCR templates:

  • Validation Control fields.
  • Data Capture fields.

Validation Control fields are used only in the recognition stage.

Data Capture fields are used in the data extraction stage and can optionally also be used in the recognition stage.

Allowed MIME types for the recognition stage are:

  • PNG
  • JPG
  • GIF
  • BMP
  • PDF
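As an illustration, a pre-upload check against this list might look like the following Python sketch. Deciding by file extension, and accepting `.jpeg` alongside `.jpg`, are assumptions of the sketch; the application may inspect the file content instead.

```python
from pathlib import Path

# Formats from the list above, mapped to their standard MIME types.
# Keying the check on the file extension is an assumption of this sketch.
ALLOWED_MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".bmp": "image/bmp",
    ".pdf": "application/pdf",
}


def allowed_mime_type(filename):
    """Return the MIME type if the file may be used for recognition, else None."""
    return ALLOWED_MIME_TYPES.get(Path(filename).suffix.lower())
```

For example, `allowed_mime_type("scan.PNG")` yields `image/png`, while a `.tiff` file yields `None` and would be rejected.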

Validation Control field description

Validation Control fields are used only in the recognition stage, where they help identify the document by checking the pattern of a document section.

Each Validation Control field is composed of:

| Field | Description | Required |
|---|---|---|
| Name | Name of the field. | true |
| Order | Order of evaluation. | true |
| Type | List of OCR Validation Controls. Each validator is a condition that must be met. | true |
| Value | A value. | * |
| Pattern | A Java regex pattern. | * |
| Rotation | Indicates the image rotation that must be applied. Available image rotations: -270, -180, -90, 0, 90, 180, 270. | true |
| Custom | Sets commands to process images before the OCR. Several commands can be set, one per line. For example, to highlight the red color: /usr/bin/convert ${fileIn} -matte ( +clone -fuzz 50% -transparent red ) -compose DstOut -composite -negate ${fileOut} | false |
| OCR | OCR command line. When present, it is executed in place of the system.ocr configuration parameter. For more information, see Configuration parameters. For example, to force digits capture: /usr/bin/tesseract ${fileIn} ${fileOut} digits | false |
| Empty value allowed | When checked, indicates that an empty OCR text value is also valid in the document identification process. | false |

* These parameters are used in the Validation Control. They can be required or not depending on which Validation Control is used.
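The Custom and OCR parameters use the `${fileIn}` and `${fileOut}` placeholders. How the application expands them is not described here, so the following Python sketch is only one plausible reading: it splits each command line into arguments first and substitutes the placeholders afterwards, so that file paths containing spaces stay in one argument. The function names are hypothetical.

```python
import shlex
from string import Template


def expand_command(command_line, file_in, file_out):
    """Split a command line into arguments, then fill ${fileIn} and ${fileOut}."""
    values = {"fileIn": file_in, "fileOut": file_out}
    return [Template(token).substitute(values) for token in shlex.split(command_line)]


def expand_commands(custom_value, file_in, file_out):
    """Expand a multi-line Custom value, one command per non-empty line."""
    return [
        expand_command(line, file_in, file_out)
        for line in custom_value.splitlines()
        if line.strip()
    ]


# The tesseract example from the table, with hypothetical file paths.
ocr_argv = expand_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut} digits", "/tmp/in.png", "/tmp/out"
)
```

The resulting argument list could then be passed to a process launcher such as `subprocess.run` without going through a shell.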

 

Data Capture field description

Data Capture fields are used to retrieve data in the data extraction stage. They can also be used in the recognition stage to help identify a document by checking the pattern of a document section.

Each Data Capture field is composed of:

| Field | Description | Required |
|---|---|---|
| Name | Name of the field. | true |
| Order | Order of evaluation. | true |
| Type | List of OCR Data Capture. | true |
| Property | Metadata field where the captured data will be stored. | true |
| Pattern | A Java regex pattern. | * |
| Rotation | Indicates the image rotation that must be applied. Available image rotations: -270, -180, -90, 0, 90, 180, 270. | true |
| Custom | Sets commands to process images before the OCR. Several commands can be set, one per line. For example, to highlight the red color: /usr/bin/convert ${fileIn} -matte ( +clone -fuzz 50% -transparent red ) -compose DstOut -composite -negate ${fileOut} | false |
| OCR | OCR command line. When present, it is executed in place of the system.ocr configuration parameter. For more information, see Configuration parameters. For example, to force digits capture: /usr/bin/tesseract ${fileIn} ${fileOut} digits | false |
| Use to recognise | When set to true, the field is also used in the recognition stage. | false |

* This parameter is used in the Data Capture. It can be required or not depending on which Data Capture is used.
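To illustrate how the Pattern and Property parameters might work together, here is a small Python sketch. The field expects a Java regex; the sketch uses Python's `re` module, whose syntax coincides with Java's for a simple pattern like this one. The `capture` function, the property name, and the choice of keeping the first capturing group are assumptions, not the application's behaviour.

```python
import re


def capture(ocr_text, pattern, prop, metadata):
    """Store the first match of `pattern` in `metadata[prop]`; return True on success."""
    match = re.search(pattern, ocr_text.strip())
    if match is None:
        return False
    # Prefer the first capturing group when the pattern defines one.
    metadata[prop] = match.group(1) if match.groups() else match.group(0)
    return True


# Hypothetical OCR output of the selected zone and a hypothetical property name.
metadata = {}
ok = capture("Invoice No: 2024-0042\n", r"(\d{4}-\d{4})", "invoice.number", metadata)
```

When the zone text does not contain the pattern, the function returns `False` and leaves the metadata untouched.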

 

Understanding OCR Validation Control and OCR Data Capture

The Plugins view lists all available OCR Validation Control and OCR Data Capture plugins. The application's plugin architecture lets administrators add new features to the existing application.

By default, the application comes with built-in plugins. You can also create your own plugins.

For more information, see Register a new plugin.

Create a new OCR template

  • At the top right, click on the New OCR template icon.
  • Fill in the fields.
    • Metadata field (static metadata values, normally used to identify the document type).
  • Choose the file (allowed MIME type).
  • Click on the Create button.

  • Add Validation Control fields as needed.
    • Validation Controls
      • Click on Fields.
        • To create a new Validation Control:
          • Click on the New Validation Control Field icon.
          • Fill in the fields.
          • Select the area with the mouse on the image on the left.
          • Click on the Create button.
        • To edit a Validation Control:
          • Click on the Edit a Validation Control icon.
          • Modify the fields.
          • Click on the Edit button.
        • To delete a Validation Control:
          • Click on the Delete a Validation Control icon.
          • Click on the Delete button.

  • Add Data Capture fields as needed.
    • Data Capture
      • Click on Fields.
      • To create a new Data Capture:
        • Click on the New Data Capture Field icon.
        • Fill in the fields.
        • Select the area with the mouse on the image on the left.
        • Click on the Create button.
      • To edit a Data Capture:
        • Click on the Edit a Data Capture icon.
        • Modify the fields.
        • Click on the Edit button.
      • To delete a Data Capture:
        • Click on the Delete a Data Capture icon.
        • Click on the Delete button.

Edit an OCR template

  • Click on the Edit OCR template icon.
  • Modify the fields.
  • Click on the Edit button.

Edit OCR template fields

Edit an OCR Validation Control

Click on Fields.

  • Click on the Edit a Validation Control icon.
  • Modify the fields.
  • Click on the Edit button.

Edit an OCR Data Capture

Click on Fields.

  • Click on the Edit a Data Capture icon.
  • Modify the fields.
  • Click on the Edit button.

Delete OCR template fields

Delete an OCR Validation Control

Click on Fields.

  • Click on the Delete a Validation Control icon.
  • Click on the Delete button.

Delete an OCR Data Capture

Click on Fields.

  • Click on the Delete a Data Capture icon.
  • Click on the Delete button.

Check OCR template

  • Click on the Check icon and the OCR template will be checked.

Each Data Capture field shows the text that the OCR extracted.

Each Validation Control field shows the result of its validation (true or false).

Check document recognition

  • At the top right, click on the Recognise icon.
  • Choose the file to be recognised (allowed MIME type).
  • Choose the recognition level.
  • Click on the Recognise button.

The recognition level determines how thoroughly, and therefore how fast, a document is identified. By default, the application uses the slow level.

  • In the fast level, it considers only the Validation Control fields.
  • In the medium level, it also requires that the Data Capture fields are not empty.
  • In the slow level, it also verifies that the Data Capture fields are correct (comply with patterns, etc.).
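The three levels can be read as cumulative checks. The Python sketch below models them under stated assumptions: the `Level` enum and the `recognised` function are hypothetical, a pattern is treated as satisfied only when it matches the whole trimmed text, and fields without a pattern are skipped at the slow level.

```python
import re
from enum import IntEnum


class Level(IntEnum):
    FAST = 1    # Validation Control fields only
    MEDIUM = 2  # also: Data Capture fields must not be empty
    SLOW = 3    # also: Data Capture fields must comply with their patterns


def recognised(validation_results, captures, level=Level.SLOW):
    """Decide whether a document is recognised at the given level.

    validation_results: list of bool, one per Validation Control field.
    captures: list of (ocr_text, pattern) pairs, one per Data Capture field;
              pattern may be None when the field defines no pattern.
    """
    if not all(validation_results):
        return False
    if level >= Level.MEDIUM and any(not text.strip() for text, _ in captures):
        return False
    if level >= Level.SLOW and any(
        pattern and re.fullmatch(pattern, text.strip()) is None
        for text, pattern in captures
    ):
        return False
    return True


# Hypothetical OCR results: the second field is non-empty but breaks its pattern.
captures = [("2024-0042", r"\d{4}-\d{4}"), ("abc", r"\d+")]
fast = recognised([True, True], captures, Level.FAST)
medium = recognised([True, True], captures, Level.MEDIUM)
slow = recognised([True, True], captures, Level.SLOW)
```

In this example the document passes at the fast and medium levels but fails at the slow level, because `abc` is not empty yet does not comply with `\d+`.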

Delete an OCR template

  • Click on the Delete OCR template icon.
  • Click on the Delete button.

Enable OCR Data Capture

| Feature | Description |
|---|---|
| Profile General tab | When OCR data capture is enabled from the Profile General tab, the OCR Data Capture process is executed automatically for each newly uploaded document. If an automation definition is present, these parameters have no effect. With the automation feature, all these cases and more are covered. |
| Automation | Validations: Is OCRDataCaptureFile. For more information, see Built-in Automation Validator plugins. Actions: OCR DataCapture, Add OCRDataCaptureToWizard. For more information, see Built-in Automation Action plugins. |