OCR templates

OCR templates let you define zonal OCR templates, which recognise and extract structured text from scanned images.

For best results, scan images at a resolution of at least 200 dpi so that the OCR engine can recognise the text reliably.

An OCR template is executed in three stages:

  • Recognise
  • Extraction data
  • Save data

In the Recognise stage, the application tries to identify the document based on the OCR template definitions. A document is considered recognised only if it passes 100% of an OCR template's checks.

In the Extraction data stage, the values are extracted from the document.

In the Save data stage, the dynamic values obtained in the Extraction data stage are saved into metadata groups. Optionally, static values defined on the OCR template are saved as well (normally used to identify the document type).
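The three stages can be pictured with the following minimal Java sketch. It is not the application's implementation: the class, method and property names (OcrStagesSketch, invoice.number, document.type) are hypothetical, and two behaviours are assumptions made for the example: a template without checks never recognises a document, and an extracted value wins over a static value with the same name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OcrStagesSketch {

    /** Recognise: the template matches only if every check passed (100%). */
    static boolean recognised(List<Boolean> checkResults) {
        // Assumption: a template without any checks never recognises a document.
        return !checkResults.isEmpty() && !checkResults.contains(Boolean.FALSE);
    }

    /** Save data: combine the template's static values with the extracted dynamic values. */
    static Map<String, String> valuesToSave(Map<String, String> staticValues,
                                            Map<String, String> extractedValues) {
        Map<String, String> result = new LinkedHashMap<>(staticValues);
        // Assumption: on a name clash the extracted value wins.
        result.putAll(extractedValues);
        return result;
    }

    public static void main(String[] args) {
        // Recognise: both checks passed, so the document is recognised.
        if (recognised(Arrays.asList(true, true))) {
            // Extraction data: a value as it might come back from the OCR (hypothetical).
            Map<String, String> extracted = new LinkedHashMap<>();
            extracted.put("invoice.number", "2024-0157");
            // Save data: a static value identifying the document type plus the dynamic values.
            Map<String, String> fixed = new LinkedHashMap<>();
            fixed.put("document.type", "invoice");
            System.out.println(valuesToSave(fixed, extracted));
        }
    }
}
```

A single failed check is enough for recognised to return false, which mirrors the 100% rule described above.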

OCR templates have two field types:

  • Validation Control fields.
  • Data Capture fields.

Validation Control fields are used only in the Recognise stage.

Data Capture fields are used in the Extraction data stage and, optionally, also in the Recognise stage.

The MIME types allowed in the Recognise stage are:

  • PNG
  • JPG
  • GIF
  • BMP
  • PDF

Validation Control field description

Validation Control fields are used only in the Recognise stage. They help identify the document by checking a section of it against a pattern.

Each Validation Control field is composed of:

Field | Description | Required
Name | Name of the field. | true
Order | Order of evaluation. | true
Type | List of OCR Validation Controls to choose from. Each validator is a condition that must be met. | true
Value | A value. | *
Pattern | A Java regex pattern. | *
Rotation | Image rotation to apply. Available image rotations: -270, -180, -90, 0, 90, 180, 270. | true
Custom | Commands used to process the image for the OCR. Several commands can be set, one per line. For example, to highlight the red color: /usr/bin/convert ${fileIn} -matte ( +clone -fuzz 50% -transparent red ) -compose DstOut -composite -negate ${fileOut} | false
OCR | OCR command line. When present, it is executed in place of the system.ocr configuration parameter (for more information, see Configuration parameters). For example, to force digits capture: /usr/bin/tesseract ${fileIn} ${fileOut} digits | false
Empty value allowed | When checked, an empty OCR text value is also accepted as valid when identifying the document. | false

* These parameters are passed to the Validation Control. Whether they are required depends on which Validation Control is used.
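To illustrate how the Pattern and Empty value allowed parameters might interact, the following Java sketch checks the OCR text of one zone against a regex. The class and method names are hypothetical. Trimming the OCR text and requiring the whole text to match are assumptions; the actual Validation Control plugins may behave differently.

```java
import java.util.regex.Pattern;

final class ValidationControlSketch {

    /**
     * Hypothetical check for one zone: returns true when the OCR text
     * satisfies the regex, or when the text is empty and empty values are allowed.
     */
    static boolean complies(String ocrText, String regex, boolean emptyValueAllowed) {
        // OCR output often carries stray whitespace and newlines; trimming is an assumption.
        String text = ocrText == null ? "" : ocrText.trim();
        if (text.isEmpty()) {
            return emptyValueAllowed;
        }
        // Assumption: the whole zone text must match the pattern.
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(complies(" INVOICE \n", "INVOICE|FACTURA", false));
        System.out.println(complies("", "INVOICE|FACTURA", true));
        System.out.println(complies("RECEIPT", "INVOICE|FACTURA", false));
    }
}
```

With a full match, a zone that contains extra text around the expected word fails the check, so a pattern for such a zone would need to allow for it explicitly (for example with a leading and trailing .*).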

 

Data Capture field description

Data Capture fields are used to retrieve data in the Extraction data stage. They can also be used in the Recognise stage to help identify a document by checking a section of it against a pattern.

Each Data Capture field is composed of:

Field | Description | Required
Name | Name of the field. | true
Order | Order of evaluation. | true
Type | List of OCR Data Captures to choose from. | true
Property | Metadata field where the captured data will be stored. | true
Pattern | A Java regex pattern. | *
Rotation | Image rotation to apply. Available image rotations: -270, -180, -90, 0, 90, 180, 270. | true
Custom | Commands used to process the image for the OCR. Several commands can be set, one per line. For example, to highlight the red color: /usr/bin/convert ${fileIn} -matte ( +clone -fuzz 50% -transparent red ) -compose DstOut -composite -negate ${fileOut} | false
OCR | OCR command line. When present, it is executed in place of the system.ocr configuration parameter (for more information, see Configuration parameters). For example, to force digits capture: /usr/bin/tesseract ${fileIn} ${fileOut} digits | false
Use to recognize | When set to true, the field is also used in the Recognise stage. | false

* This parameter is passed to the Data Capture. Whether it is required depends on which Data Capture is used.
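As an illustration of how a Pattern could be used to pull a value out of the OCR text of a zone, here is a minimal Java sketch. The names are hypothetical, and the rule it applies (take the first capturing group if the pattern defines one, otherwise the whole match) is an assumption made for the example, not the documented behaviour of any Data Capture plugin.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DataCaptureSketch {

    /**
     * Hypothetical extraction for one zone: returns the first capturing group
     * if the pattern has one, otherwise the whole match; empty when nothing matches.
     */
    static Optional<String> capture(String ocrText, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(ocrText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String value = matcher.groupCount() >= 1 ? matcher.group(1) : matcher.group();
        // A group that did not take part in the match is null, hence ofNullable.
        return Optional.ofNullable(value).map(String::trim);
    }

    public static void main(String[] args) {
        String zoneText = "Invoice No: 2024-0157\n";
        // In Java source the backslashes are doubled; in a Pattern field they would be single.
        System.out.println(capture(zoneText, "Invoice No:\\s*(\\d{4}-\\d{4})").orElse("<no match>"));
    }
}
```

An empty result is what a field with Use to recognize set to true could treat as a failed check, although how the application combines the two is not specified here.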

 

Understanding OCR Validation Control and OCR Data Capture

The Plugins view lists all available OCR Validation Control and OCR Data Capture plugins. The application's plugin architecture lets administrators add new features to the existing application.

By default, the application comes with built-in plugins.

You can also create your own plugins.

For more information, see Register a new plugin.

Create a new OCR template

  • At the top right, click the New OCR template icon.
  • Fill in the fields.
    • Metadata field (static metadata values, normally used to identify the document type).
  • Choose the file (it must be of an allowed MIME type).
  • Click the Create button.

  • Add Validation Control fields as needed.
    • Validation Controls
      • Click on Fields.
      • To create a new Validation Control:
        • Click the New Validation Control Field icon.
        • Fill in the fields.
        • Select the area with the mouse on the image on the left.
        • Click the Create button.
      • To edit a Validation Control:
        • Click the Edit a Validation Control icon.
        • Modify the fields.
        • Click the Edit button.
      • To delete a Validation Control:
        • Click the Delete a Validation Control icon.
        • Click the Delete button.

  • Add Data Capture fields as needed.
    • Data Capture
      • Click on Fields.
      • To create a new Data Capture:
        • Click the New Data Capture Field icon.
        • Fill in the fields.
        • Select the area with the mouse on the image on the left.
        • Click the Create button.
      • To edit a Data Capture:
        • Click the Edit a Data Capture icon.
        • Modify the fields.
        • Click the Edit button.
      • To delete a Data Capture:
        • Click the Delete a Data Capture icon.
        • Click the Delete button.

Edit an OCR template

  • Click the Edit OCR template icon.
  • Modify the fields.
  • Click the Edit button.

Edit OCR template fields

Edit an OCR Validation Control

Click on Fields.

  • Click the Edit a Validation Control icon.
  • Modify the fields.
  • Click the Edit button.

Edit an OCR Data Capture

Click on Fields.

  • Click the Edit a Data Capture icon.
  • Modify the fields.
  • Click the Edit button.

Delete OCR template fields

Delete an OCR Validation Control

Click on Fields.

  • Click the Delete a Validation Control icon.
  • Click the Delete button.

Delete an OCR Data Capture

Click on Fields.

  • Click the Delete a Data Capture icon.
  • Click the Delete button.

Check an OCR template

  • Click the Check icon to check the OCR template.

Each Data Capture field shows the text extracted by the OCR.

Each Validation Control field shows the result of the compliance validation (true or false).

Check the recognition of a document

  • At the top right, click the Recognise icon.
  • Choose the file to be recognised (it must be of an allowed MIME type).
  • Choose the recognise level.
  • Click the Recognise button.

The recognise level determines how thoroughly, and therefore how quickly, a document is identified. By default, the application uses slow.

  • In the fast level, only the Validation Control fields are taken into account.
  • In the medium level, the Data Capture fields must also be non-empty.
  • In the slow level, the Data Capture fields must also be valid (comply with their patterns, etc.).
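The cumulative nature of the three levels can be sketched in Java as follows. This is an illustration only: the Level enum, the Captured holder and the recognised method are hypothetical names, and the sketch assumes that each Data Capture field taking part in recognition has a regex that its whole trimmed text must match in the slow level.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class RecogniseLevelSketch {

    enum Level { FAST, MEDIUM, SLOW }

    /** Hypothetical holder for the OCR text of a Data Capture zone and its regex. */
    static final class Captured {
        final String text;
        final String regex;

        Captured(String text, String regex) {
            this.text = text;
            this.regex = regex;
        }
    }

    static boolean recognised(Level level, List<Boolean> validationResults, List<Captured> captured) {
        // Every level: all Validation Control fields must comply.
        if (validationResults.contains(Boolean.FALSE)) {
            return false;
        }
        if (level == Level.FAST) {
            return true;
        }
        for (Captured field : captured) {
            String text = field.text.trim();
            // Medium and slow: Data Capture fields must not be empty.
            if (text.isEmpty()) {
                return false;
            }
            // Slow only: the captured text must also comply with its pattern (assumed full match).
            if (level == Level.SLOW && !Pattern.matches(field.regex, text)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Boolean> validations = Arrays.asList(true, true);
        List<Captured> captured =
                Collections.singletonList(new Captured("ABC-12", "\\d{4}-\\d{4}"));
        for (Level level : Level.values()) {
            System.out.println(level + ": " + recognised(level, validations, captured));
        }
    }
}
```

In this example the captured text is non-empty but does not fit its pattern, so the document passes the fast and medium levels and is rejected only by the slow one.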

Delete an OCR template

  • Click the Delete OCR template icon.
  • Click the Delete button.

Enable OCR Data Capture

Feature | Description
Profile General tab | When OCR data capture is enabled from the Profile General tab, the OCR Data Capture process is executed automatically on each newly uploaded document. If an automation definition is present, these parameters have no effect. The automation feature covers all these cases and more.
Automation | Validations: Is OCRDataCaptureFile (for more information, see Built-in Automation Validator plugins). Actions: OCR DataCapture, Add OCRDataCaptureToWizard (for more information, see Built-in Automation Action plugins).