Performance configuration parameters

Text extractor queue

There are several configuration parameters to adapt the document text extraction to your needs. Text extraction is one of the processes that can use a lot of hardware resources (memory and CPUs). An aggressive text extraction policy can decrease the performance of the application and affect the end user's perception of the user interface.

When documents are uploaded, they are automatically added to the Text extraction queue. You can see the Text extraction queue at Administration > Statistics > Text extraction queue.

There's a crontab task named "Text extractor worker" at Administration > Crontab that controls the indexing cycle.

When a document is uploaded, it will not be searchable by content until it is processed by the Text extraction queue.

If the application needs to index a large number of files per day, or you have imported a considerable number of files, consider creating two crontab tasks to change these parameters between day (less aggressive) and night (more aggressive).
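The day/night split can be sketched as a small lookup. This is only an illustration in Python: the profile values, the 07:00 to 20:00 working window, and the function name `profile_for` are assumptions, and the way a crontab task actually writes the parameters is not shown here.

```python
from datetime import time

# Hypothetical profiles: the numbers are placeholders, not recommended values.
DAY_PROFILE = {"text.extraction.threads": 1, "text.extraction.batch": 10}
NIGHT_PROFILE = {"text.extraction.threads": 4, "text.extraction.batch": 40}


def profile_for(now, day_start=time(7, 0), day_end=time(20, 0)):
    """Return the less aggressive profile during the assumed working hours."""
    if day_start <= now < day_end:
        return DAY_PROFILE
    return NIGHT_PROFILE
```

In both profiles the batch size is a multiple of the thread count, so every thread receives the same share of work in each cycle.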

Some MIME types, like PDF or images (which require an OCR engine), can use a lot of CPU (up to 100%).

Field / Property | Type | Description
text.extraction.batch Integer

Indicates the number of documents to be processed on each indexing cycle.

When the parameter "text.extraction.concurrent" is enabled, it is good practice to set this value to a multiple of the "text.extraction.threads" value.

20

text.extraction.concurrent

Boolean

Enable or disable the concurrent text extraction threads feature.

True

text.extraction.threads Integer

The number of concurrent threads used for the text extraction feature.

This number should be less than or equal to the number of hardware cores.

2
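The two sizing rules in this table (threads no greater than the number of cores, batch a multiple of threads) can be checked with a few lines of Python. This is a minimal sketch: the function name and its arguments are hypothetical, and it only computes candidate values; it does not read or write the application configuration.

```python
import os


def suggest_extraction_settings(desired_batch, cores=None, max_threads=None):
    """Cap threads at the core count and round the batch up to a multiple of threads."""
    cores = cores or os.cpu_count() or 1
    threads = cores if max_threads is None else max(1, min(max_threads, cores))
    # Round up to the next multiple of the thread count.
    batch = (desired_batch + threads - 1) // threads * threads
    return {"text.extraction.threads": threads, "text.extraction.batch": batch}
```

For example, asking for a batch of 20 with 3 threads on an 8-core server yields a batch of 21, the next multiple of 3.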


Mail extractor queue

There's a crontab task named "Mail extractor worker" at Administration > Crontab that controls the indexing cycle.

When an email is uploaded, it will not be searchable by content until it is processed by the Mail extraction queue. Furthermore, its attachments won't appear until they are processed.

Field / Property | Type | Description
mail.extraction.batch Integer

Indicates the number of emails to be processed on each indexing cycle.

When the parameter "mail.extraction.concurrent" is enabled, it is good practice to set this value to a multiple of the "mail.extraction.threads" value.

20

mail.extraction.concurrent

Boolean

Enable or disable the concurrent mail extraction threads feature.

True

mail.extraction.threads Integer

The number of concurrent threads used for the mail extraction feature.

This number should be less than or equal to the number of hardware cores.

2

Download files and folders as a ZIP file

It is not good practice to download high volumes of files as ZIP files. In most cases, consider using the application export tool at Administration > Utilities > Repository export.

Downloading files and folders as a ZIP file can take a lot of time and hardware resources. First, the application prepares a temporary folder on the server, which is used to create the ZIP file; the ZIP file is then forwarded to the user interface. To avoid blocking the user interface while waiting for the process to complete, two parameters set limits on the number of files and on their total size. When a download exceeds either limit, the user is notified and the process continues in background mode.

Field / Property | Type | Description
download.zip.live.max.files Integer

Set the maximum number of files allowed to be downloaded in live mode.

100

download.zip.live.max.file.size String

Set the maximum size of all files allowed to be downloaded in live mode.

For more information about these parameters, read User interface configuration parameters.
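The live versus background decision described in this section can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the size limit is passed as a plain number of bytes (the format of the "download.zip.live.max.file.size" string is not described here), and a request exactly at a limit is treated as still within it.

```python
def zip_download_mode(file_count, total_bytes, max_files, max_total_bytes):
    """Return "live" while both limits are respected, otherwise "background"."""
    if file_count > max_files or total_bytes > max_total_bytes:
        return "background"
    return "live"
```

With a limit of 100 files, a selection of 100 small files stays in live mode, while a selection of 101 files goes to background mode.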

Antivirus checker

There are several configuration parameters to adapt the antivirus checker to your needs. The antivirus checker is one of the processes that can use a lot of hardware resources (memory and CPUs). An aggressive antivirus checker policy can decrease the performance of the application and affect the end user's perception of the user interface.

The antivirus checker can work in live mode or in background mode. In live mode, the check is part of the upload process: the antivirus checks the document and, if a virus is detected, an error is raised immediately and the document is not added to the repository.

Running the antivirus checker in live mode can noticeably degrade the responsiveness of the user interface and the end user's experience. Analysing some documents can take the antivirus 3 seconds or more, and that time is added to the upload time.

If the application needs to analyse a large number of files per day, or you have imported a considerable number of files, consider creating two crontab tasks to change these parameters between day (less aggressive) and night (more aggressive).

Field / Property | Type | Description

system.antivir

String

Set the antivirus path.

background.antivirus.check

Boolean

Enable or disable the background antivirus checker. By default, it is set to false.

There's a crontab task (Administration > Crontab) named Antivirus Checker Worker that controls the background antivirus checker.

When the background check is disabled, the antivirus check runs during document upload. That adds extra time to the upload (in some cases, the antivirus may take more than 3 seconds to check a document). When you perform a large import of files with this option disabled, the total time to complete the operation can increase threefold or more, because the antivirus check runs as part of the upload process. We suggest enabling it.

true

background.antivirus.check.threads

Integer

The number of concurrent threads used for antivirus.

This number should be less than or equal to the number of hardware cores.

2

background.antivirus.check.batch

Integer

Indicates the number of documents to be processed on each antivirus checking cycle.

When the parameter "background.antivirus.check" is enabled, it is good practice to set this value to a multiple of the "background.antivirus.check.threads" value.

20

For more information, read Antivirus configuration parameters.
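To get a feel for the cost of live mode during a large import, the extra time can be estimated with simple arithmetic. The sketch below assumes the 3 seconds per document mentioned above as a rough figure; the document count of 10,000 and the function names are illustrative only.

```python
def added_upload_seconds(documents, seconds_per_check=3.0):
    """Extra upload time when every document is checked during upload."""
    return documents * seconds_per_check


def format_duration(seconds):
    """Format a number of seconds as whole hours and minutes."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours} h {remainder // 60} min"
```

Under these assumptions, importing 10,000 documents adds 10,000 × 3 s = 30,000 s, that is, 8 h 20 min of antivirus time to the import.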

Re-indexing the whole Lucene repository

The application can use two different ways to rebuild Lucene indexes: sequential or parallel. By default, sequential re-indexing is enabled, but you can select the mode using the "hibernate.indexer.rebuild" configuration parameter.

Field / Property | Type | Description

hibernate.indexer.rebuild

String

By default, it is set to "flush". Possible values are "mass", "flush", and "delayed". Take a look at the table below to see the details of each value.

You can use the "hibernate.indexer.batch.size.load.objects" configuration parameter to indicate to Hibernate how many objects should be handled at a time. To avoid OutOfMemory problems, the repository needs to be re-indexed in batches. If the value of this property is too low, performance will be poor, but if it is too high, you can have OutOfMemory problems.

Mode | Description

Mass Indexer

This mode will re-index the repository in parallel, using several threads. This may be faster, but it is also problematic in terms of resources. The repository will be in read-only mode until completed.

You have several configuration properties to tune its performance:

  • hibernate.indexer.batch.size.load.objects: batch size used to load the entities.

30

  • hibernate.indexer.threads.load.objects: number of threads used to load the entities.

4

Flush to Indexes

This mode will rebuild the index sequentially in batches. It's less resource-hungry but may be slower. The repository will be in read-only mode until completed.

You have several configuration properties to tune its performance:

  • hibernate.indexer.batch.size.load.objects: batch size used to load the entities.

30

Delayed Indexer

There's a cron task named "Lucene Delayed Indexer" which processes the queue of nodes to be indexed in the background.

This queue is filled up when you go to Administration > Tools > Rebuild indexes and choose "Lucene indexes". When the delayed indexer is enabled, the nodes are included in the batch queue to be processed with this background process.

This mode will rebuild the index sequentially in batches. It's less resource-hungry and may be slower, but you have more control because you can choose which nodes need to be re-indexed, and the repository won't be in read-only mode while it is working.

You have several configuration properties to tune its performance:

  • hibernate.indexer.delayed.max.objects: how many nodes will be processed in a batch.

1000

  • hibernate.indexer.batch.size.load.objects: number of nodes to read from the database at the same time.

4
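The trade-off behind "hibernate.indexer.batch.size.load.objects" can be illustrated with a generic batching helper. This is a plain Python sketch of the idea, not the way Hibernate loads entities: at most `batch_size` objects are held in memory at once, so a small value means many round trips and a large value means more memory per step.

```python
from itertools import islice


def batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk
```

For instance, seven nodes processed with a batch size of three are handled in three steps of 3, 3 and 1 nodes.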

Execution timeout

The application executes external programs to process documents; for example, it extracts text with an OCR engine, analyses documents with antivirus software, and transforms documents to other formats, among other actions. Sometimes these processes can take a long time or, for some reason, do not finish correctly, and the process continues running on the OS consuming resources. To prevent this, the parameter "system.execution.timeout" sets the maximum allowed execution time for external applications.

Field / Property | Type | Description

system.execution.timeout

Integer

Set the maximum allowed execution time for external applications.

The units are minutes. By default, the configuration is set to 5 minutes.

5
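The effect of an execution timeout can be reproduced with the Python standard library. This is a sketch of the general technique, not the application's own code: the function name is hypothetical, and the timeout is expressed in minutes to mirror "system.execution.timeout".

```python
import subprocess


def run_with_timeout(command, timeout_minutes=5):
    """Run an external program; return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            timeout=timeout_minutes * 60,  # subprocess expects seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception,
        # so it does not keep running and consuming resources.
        return None
    return completed.returncode
```

A caller can treat a `None` result as a failed conversion or extraction and move on to the next document.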

Activity log actions

OpenKM can log a lot of information related to user activity, but some of these actions do not need to be logged and can needlessly fill your activity log table.

The table where the activity log is stored, OKM_ACTIVITY, may grow quickly and store millions of records.

There is a configuration property named "activity.log.actions" where you can set which actions to log. By default, this is set to the most common or interesting actions. You can use regular expressions to define these actions. Read Java Regex Tutorial for more information about Java regular expressions.

Field / Property | Type | Description

activity.log.actions

List

LOGIN
LOGOUT
CREATE_.*
DELETE_.*
PURGE_.*
MOVE_.*
COPY_.*
CHECKOUT_DOCUMENT
CHECKIN_DOCUMENT
GET_DOCUMENT_CONTENT.*
MISC_TEXT_EXTRACTION_FAILURE

For a complete list of actions see Activity log.
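To preview which actions a list of patterns would capture, the default values above can be tried against sample action names. The sketch uses Python's `re` module, which behaves the same as Java regular expressions for these simple patterns. It assumes that a pattern must match the whole action name, which may differ from how the application applies them; `should_log` is a hypothetical helper.

```python
import re

# Default patterns as listed in the table above.
DEFAULT_ACTIONS = [
    "LOGIN", "LOGOUT", "CREATE_.*", "DELETE_.*", "PURGE_.*", "MOVE_.*",
    "COPY_.*", "CHECKOUT_DOCUMENT", "CHECKIN_DOCUMENT",
    "GET_DOCUMENT_CONTENT.*", "MISC_TEXT_EXTRACTION_FAILURE",
]


def should_log(action, patterns=DEFAULT_ACTIONS):
    """Return True when the action name fully matches at least one pattern."""
    return any(re.fullmatch(pattern, action) for pattern in patterns)
```

Under that assumption, an action such as CREATE_DOCUMENT matches "CREATE_.*" and is logged, while an action whose name matches none of the patterns is skipped.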

Dashboard activity

The OpenKM dashboard displays a global view of application activity. To provide this, the application needs to process a large amount of information, which can decrease overall performance (for example, to get the most recently uploaded documents, OpenKM might execute slow database queries). Therefore, we suggest that very large repositories disable dashboard activity.

Field / Property | Type | Description

dashboard.user.activity

Boolean

Set to false to disable the feature.

Lucene

The Lucene search engine can be disabled or configured in several ways. Depending on the configuration, the behavior of the application and overall performance might change.

These configuration parameters go into the "openkm.properties" configuration file.

Field / Property | Type | Description

spring.jpa.properties.hibernate.search.enabled

String

Enable or disable the Lucene search engine.

Allowed values:

  • true
  • false

true

spring.jpa.properties.hibernate.search.automatic_indexing.synchronization.strategy

String

Set the Lucene worker execution mode. By default it is set to "write-sync", which means that an entity modification will not complete until the Lucene index has been properly updated. In the case of "async", entity modifications will be faster because the Lucene part is added to a queue and processed in a background thread.

Allowed values:

  • sync
  • async

sync
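The difference between the two strategies can be pictured with a toy model. This is a conceptual Python sketch only, with hypothetical class names; it does not reflect how Hibernate Search is implemented. In the synchronous case the caller waits for the index update, and in the asynchronous case the update is queued and applied by a background thread.

```python
import queue
import threading


class Index:
    """Stand-in for the search index: records the ids it has indexed."""

    def __init__(self):
        self.indexed = []

    def update(self, entity_id):
        self.indexed.append(entity_id)


def save_sync(entity_id, index):
    """Synchronous style: return only after the index has been updated."""
    index.update(entity_id)


class AsyncIndexer:
    """Asynchronous style: queue the update and return immediately."""

    def __init__(self, index):
        self._index = index
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, entity_id):
        self._queue.put(entity_id)

    def _run(self):
        while True:
            entity_id = self._queue.get()
            self._index.update(entity_id)
            self._queue.task_done()

    def wait_until_idle(self):
        """Block until every queued update has been applied."""
        self._queue.join()
```

The trade-off is visible in the model: with the asynchronous indexer a change may not be searchable for a short time after `submit` returns.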

Other configuration properties

Field / Property | Type | Description
background.parent.node.purge Boolean

Performs node purge as a background task so the user can continue working without waiting for the process to end.