Performance configuration parameters

Text extractor queue

There are several configuration parameters to adapt the document text extraction to your needs. Text extraction is one of the processes that can use a lot of hardware resources (memory and CPUs). An aggressive text extraction policy can decrease the performance of the application and affect the end user's experience with the user interface.

When documents are uploaded, they are automatically added to the Text extraction queue. You can see the queue at Administration > Statistics > Text extraction queue.

A crontab task named "Text extractor worker" at Administration > Cron tab controls the indexing cycle.

When a document is uploaded, it will not be searchable by content until it is processed by the Text extraction queue.

If the application needs to index many files per day, or you have imported a large number of files, consider creating two crontab tasks that switch these parameters between a less aggressive daytime setting and a more aggressive night-time setting.

Some MIME types, such as PDF or images (which need an OCR engine), can use a lot of CPU (up to 100%).

Field / Property	Type	Description	Value
text.extraction.batch	Integer	Number of documents to be processed during each indexing cycle. When parameter "text.extraction.concurrent" is enabled, it is good practice to make it a multiple of the "text.extraction.threads" value.	20
text.extraction.concurrent	Boolean	Enable or disable the concurrent text extraction threads feature.	True
text.extraction.threads	Integer	Number of concurrent threads that will be used for the text extraction feature. This number should be less than or equal to the number of hardware cores.	2
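
The two sizing rules in the table (threads no greater than the number of cores, batch a multiple of the thread count) can be sketched as a small calculation. This is only an illustration: the function name and its arguments are hypothetical and are not part of OpenKM.

```python
def extraction_settings(target_batch, requested_threads, cores):
    """Return illustrative values for the text extraction parameters.

    Hypothetical helper, not an OpenKM API: it only applies the two
    rules of thumb stated in the table above.
    """
    # Keep the thread count between 1 and the number of hardware cores.
    threads = max(1, min(requested_threads, cores))
    # Round the batch size up to the next multiple of the thread count.
    batch = -(-target_batch // threads) * threads
    return {
        "text.extraction.concurrent": True,
        "text.extraction.threads": threads,
        "text.extraction.batch": batch,
    }


if __name__ == "__main__":
    print(extraction_settings(target_batch=25, requested_threads=4, cores=8))
```

With a target of 25 documents and 4 threads, the batch is rounded up to 28 so that each thread receives the same number of documents per cycle.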


A crontab task named "Mailextractor worker" at Administration > Cron tab controls the indexing cycle.

When a mail is uploaded, it will not be searchable by content until it is processed by the Mail extraction queue. Furthermore, attachments won't appear until they are processed.

Field / Property	Type	Description	Value
mail.extraction.batch	Integer	Number of emails to be processed during each indexing cycle. When parameter "mail.extraction.concurrent" is enabled, it is good practice to make it a multiple of the "mail.extraction.threads" value.	20
mail.extraction.concurrent	Boolean	Enable or disable the concurrent mail extraction threads feature.	True
mail.extraction.threads	Integer	Number of concurrent threads that will be used for the mail extraction feature. This number should be less than or equal to the number of hardware cores.	2

OpenOffice LibreOffice conversion service

As part of application startup, OpenKM executes an OpenOffice or LibreOffice service. This is used internally for conversion purposes, for example, to convert DOC files to PDF.

The OpenOffice or LibreOffice service can use a lot of hardware resources (CPU up to 100%), which can decrease the performance of the application. A good practice is to move the OpenOffice or LibreOffice conversion service to another server.

Field / Property	Type	Description	Value
system.openoffice.server	String	URL to the OpenOffice or LibreOffice service.	http://192.168.1.34:8080/converter/convert
system.openoffice.tasks	String	Restart the service after x conversions to prevent memory leaks.	5

For more information read:

Download files and folders as ZIP file

It is not a good practice to download a high volume of files as ZIP files. In most cases, consider using the application export tool at Administration > Utilities > Repository export.

Downloading files and folders as a ZIP file can take a lot of time and hardware resources. The application first prepares a temporary folder on the server, uses it to create the ZIP file, and finally sends the file to the user interface. To avoid blocking the user interface while the process completes, two parameters set limits on the number of files and the total size. When a download exceeds these limits, the user is notified and the process continues in background mode.

Field / Property	Type	Description	Value
download.zip.live.max.files	Integer	Set the maximum number of files allowed to download in live mode.	100
download.zip.live.max.file.size	String	Set the maximum total size of all files allowed to be downloaded in live mode.
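
The choice between live and background mode can be pictured as a simple threshold check. The sketch below is an assumption about the logic, not OpenKM's code: the function name is hypothetical, the size limit is expressed in bytes for simplicity (the real parameter is a String whose format is not described here), and it assumes that exceeding either limit is enough to switch to background mode.

```python
def zip_download_mode(file_sizes, max_files, max_total_bytes):
    """Decide how a ZIP download would be served (illustrative only).

    file_sizes      -- sizes in bytes of the files selected for download
    max_files       -- stands in for download.zip.live.max.files
    max_total_bytes -- stands in for download.zip.live.max.file.size,
                       assumed here to be a plain byte count
    """
    too_many_files = len(file_sizes) > max_files
    too_large = sum(file_sizes) > max_total_bytes
    # Assumption: exceeding either limit moves the job to the background.
    return "background" if too_many_files or too_large else "live"


if __name__ == "__main__":
    selection = [2_000_000] * 150  # 150 files of about 2 MB each
    print(zip_download_mode(selection, max_files=100, max_total_bytes=500_000_000))
```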

For more information about these parameters read User interface configuration parameters.

Antivirus checker

Several configuration parameters let you adapt the antivirus checker to your needs. The antivirus checker is one of the processes that can use a lot of hardware resources (memory and CPU). An aggressive antivirus checker policy can decrease the performance of the application and affect the end user's experience with the user interface. The antivirus checker can work in live mode or in background mode. In live mode, each document is checked as it is uploaded; if a virus is detected, a warning is raised immediately and the document is not added to the repository.

The antivirus checker in live mode can dramatically affect the user interface and the end user's experience. Bear in mind that the antivirus can take 3 seconds or more to analyse some documents, and this time is added to the upload time.

If the application needs to analyse many files per day, or you have imported a large number of files, consider creating two crontab tasks that switch these parameters between a less aggressive daytime setting and a more aggressive night-time setting.
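
To get a feeling for what a per-document scan time means during a large import, the added time can be estimated with simple arithmetic. The helper below is hypothetical, and the 3 seconds per document is just the figure quoted above used as a worst-case assumption for every document.

```python
from datetime import timedelta


def live_scan_overhead(documents, seconds_per_document=3.0):
    """Estimate the extra upload time added by live antivirus checks.

    Assumes every document costs the same scan time, which is a
    simplification: real scan times vary by file type and size.
    """
    return timedelta(seconds=documents * seconds_per_document)


if __name__ == "__main__":
    # 10,000 documents at 3 seconds each
    print(live_scan_overhead(10_000))
```

For 10,000 documents this comes to 30,000 seconds, or 8 hours and 20 minutes of scan time added to the import.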

Field / Property	Type	Description	Value
system.antivir	String	Set the antivirus path.
background.antivirus.check	Boolean	Enable or disable the background antivirus checker. By default it is set to false. A crontab task named "Antivirus Checker Worker" (Administration > Crontab) controls the background antivirus checker. When the background check is disabled, each document is scanned while it is being uploaded, which adds time to the upload (in some cases more than 3 seconds per document). During a huge import of files, the total time to complete the operation can be 3 times longer or more with this option disabled. We suggest enabling it.	true
background.antivirus.check.threads	Integer	Number of concurrent threads that will be used for antivirus checks. This number should be less than or equal to the number of hardware cores.	2
background.antivirus.check.batch	Integer	Number of documents to be processed during each cycle. When parameter "background.antivirus.check" is enabled, it is good practice to make it a multiple of the "background.antivirus.check.threads" value.	20
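
The day and night idea can be illustrated as two parameter profiles selected by the hour. Everything in this sketch is an assumption made for illustration: the profile values, the 08:00 to 20:00 day window, and the function name are hypothetical, and the sketch does not show how a crontab task would actually apply the values in OpenKM.

```python
# Hypothetical profiles: a gentle one for working hours, a heavier one at night.
DAY_PROFILE = {
    "background.antivirus.check.threads": 1,
    "background.antivirus.check.batch": 10,
}
NIGHT_PROFILE = {
    "background.antivirus.check.threads": 4,
    "background.antivirus.check.batch": 40,
}


def profile_for_hour(hour, day_start=8, day_end=20):
    """Return the profile to use at the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    # Daytime is the half-open interval [day_start, day_end).
    return DAY_PROFILE if day_start <= hour < day_end else NIGHT_PROFILE


if __name__ == "__main__":
    for hour in (3, 9, 22):
        print(hour, profile_for_hour(hour))
```

In both profiles the batch size stays a multiple of the thread count, following the recommendation in the table.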

For more information read Antivirus configuration parameters.

Re-indexing the whole Lucene repository

The application can use two different ways of rebuilding Lucene indexes: sequential or parallel. By default, sequential re-indexing is enabled, but you can select the mode using the "hibernate.indexer.rebuild" configuration parameter.

Field / Property	Type	Description
hibernate.indexer.rebuild	String	By default set to "flush". Possible values are "mass", "flush" and "delayed". See the table below for details on each value.

You can use the "hibernate.indexer.batch.size.load.objects" configuration parameter to indicate to Hibernate how many objects should be handled each time. To avoid OutOfMemory problems, the repository needs to be re-indexed in batches. If the value of this property is too low, performance will be poor, but if it is too high you can have OutOfMemory problems.
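
The trade-off behind the batch size can be seen in a generic batching loop: only one batch of loaded objects is held in memory at a time, so a larger batch means fewer round trips but more memory. This is a plain Python sketch of the general technique, not Hibernate's implementation; the function names and the `index_batch` callback are hypothetical.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def reindex(node_ids, batch_size, index_batch):
    """Re-index node_ids in batches and return how many were processed.

    index_batch is a caller-supplied function that loads and indexes
    one batch; only that batch needs to be in memory at a time.
    """
    processed = 0
    for batch in batches(node_ids, batch_size):
        index_batch(batch)
        processed += len(batch)
    return processed


if __name__ == "__main__":
    total = reindex(list(range(7)), batch_size=3, index_batch=print)
    print("processed", total)
```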

Mode	Description
Mass Indexer	This mode reindexes the repository in parallel, using several threads. This may be faster but can also be problematic in terms of resources. The repository will be in read-only mode until completion. Configuration properties to tune its performance: hibernate.indexer.batch.size.load.objects, batch size used to load the entities (value: 30); hibernate.indexer.threads.load.objects, number of threads used to load the entities (value: 4).
Flush to Indexes	This mode rebuilds the index sequentially in batches. It's less resource-intensive but may be slower. The repository will be in read-only mode until completion. Configuration properties to tune its performance: hibernate.indexer.batch.size.load.objects, batch size used to load the entities (value: 30).
Delayed Indexer	A cron task named "Lucene Delayed Indexer" processes the queue of nodes to be indexed in the background. This queue is filled when you go to Administration > Tools > Rebuild indexes and choose "Lucene indexes". When the delayed indexer is enabled, the nodes are included in the batch queue to be processed by this background process. This mode rebuilds the index sequentially in batches. It's less resource-intensive and may be slower, but you have more control because you can choose which nodes need to be re-indexed, and the repository won't be in read-only mode while it's working. Configuration properties to tune its performance: hibernate.indexer.delayed.max.objects, how many nodes will be processed in a batch (value: 1000); hibernate.indexer.batch.size.load.objects, number of nodes to read from the database at the same time (value: 4).


Execution timeout

The application runs external applications to process documents, for example to extract text with an OCR engine, scan with antivirus software, or convert documents to other formats. Sometimes these processes take a long time or, for some reason, do not finish correctly and remain running on the OS, consuming resources. To prevent this, the parameter "system.execution.timeout" sets the maximum allowed execution time for external applications.

Field / Property	Type	Description	Value
system.execution.timeout	Integer	Set the maximum allowed execution time for external applications. The units are minutes. By default the configuration is set to 5 minutes.	5
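
The general technique of bounding an external process with a timeout looks like the following in Python. This is a sketch of the idea only and does not reproduce how OpenKM launches its external tools; the function name is hypothetical, and the timeout is given in minutes to mirror the unit of "system.execution.timeout".

```python
import subprocess
import sys


def run_with_timeout(command, timeout_minutes):
    """Run an external command, stopping it if it exceeds the time limit.

    Returns the exit code, or None when the process was stopped because
    it ran longer than timeout_minutes.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,           # collect stdout and stderr
            timeout=timeout_minutes * 60,  # subprocess expects seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    return completed.returncode


if __name__ == "__main__":
    # A stand-in for a slow external tool: sleep for 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout_minutes=0.01))
```

Because the child process is killed when the limit is reached, it does not remain on the OS consuming resources.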


Upload bandwidth

Sometimes you want to restrict the total bandwidth used by each user while uploading files.

Field / Property	Type	Description
upload.throttle.filter	Boolean	Limit the total upload bandwidth to 10 KB/sec per user.
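
A quick calculation shows what a 10 KB/sec cap means for upload times. The helper is hypothetical and assumes 1 KB = 1024 bytes and a perfectly steady transfer at the limit.

```python
def min_upload_seconds(size_bytes, limit_kb_per_sec=10):
    """Shortest possible upload time under a bandwidth cap.

    Assumes 1 KB = 1024 bytes and that the upload runs at exactly the
    capped rate for its whole duration.
    """
    return size_bytes / (limit_kb_per_sec * 1024)


if __name__ == "__main__":
    five_mb = 5 * 1024 * 1024
    print(min_upload_seconds(five_mb))  # seconds needed for a 5 MB file
```

A 5 MB file needs at least 512 seconds, roughly eight and a half minutes, so enable the filter only when this kind of delay is acceptable.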

Remote conversion service

On a huge repository with a lot of concurrent users, it is a good idea to configure a server only for document transformation.

To enable this feature you need to install a specific OpenKM service application. Contact OpenKM technical staff for more information.

Field / Property	Type	Description	Value
remote.conversion.server	String	Set the URL of the conversion service.	http://192.168.2.1:8080/converter/convert

Microsoft Office conversion service

On a huge repository with a lot of concurrent users, it is a good idea to configure a server only for document transformation.

To enable this feature you need to install a specific OpenKM service application. Contact OpenKM technical staff for more information.

Field / Property	Type	Description	Value
remote.msoffice.conversion.service	String	Set the URL of the conversion service.	https://localhost:5400/rest/conversion/pdf

Activity log actions

OpenKM can log a lot of information about user activity, but not every action needs to be logged, and unneeded entries can fill your activity log table.

The table where the activity log is stored, OKM_ACTIVITY, may grow quickly and store millions of records.

There is a configuration property named "activity.log.actions" where you can set which actions to log. By default this is set to the most common or interesting actions. You can use regular expressions to define these actions. Read Java Regex Tutorial for more information about Java regular expressions.

Field / Property	Type	Description
activity.log.actions	List	LOGIN LOGOUT CREATE_.* DELETE_.* PURGE_.* MOVE_.* COPY_.* CHECKOUT_DOCUMENT CHECKIN_DOCUMENT GET_DOCUMENT_CONTENT.* MISC_TEXT_EXTRACTION_FAILURE
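
To see which action names the listed patterns would select, they can be tried out with a small script. This sketch uses Python's `re` module rather than Java's, which behaves the same for patterns this simple, and it assumes each pattern must match the whole action name. The function name and the sample action names passed to it are only illustrative.

```python
import re

# The patterns from the table above.
ACTION_PATTERNS = [
    "LOGIN", "LOGOUT", "CREATE_.*", "DELETE_.*", "PURGE_.*", "MOVE_.*",
    "COPY_.*", "CHECKOUT_DOCUMENT", "CHECKIN_DOCUMENT",
    "GET_DOCUMENT_CONTENT.*", "MISC_TEXT_EXTRACTION_FAILURE",
]


def is_logged(action, patterns=ACTION_PATTERNS):
    """Return True if the action name fully matches any pattern."""
    # Assumption: a pattern must match the entire name, not just a prefix.
    return any(re.fullmatch(pattern, action) for pattern in patterns)


if __name__ == "__main__":
    # Sample names for illustration; see the Activity log page for real ones.
    for name in ("LOGIN", "CREATE_DOCUMENT", "GET_PROPERTY_GROUPS"):
        print(name, is_logged(name))
```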

For a complete list of actions see Activity log.

Dashboard activity

The OpenKM dashboard displays a global view of application activity. For this, the application needs to process a huge amount of information which can decrease the overall performance of the application (for example, to get the last documents uploaded into the system, OpenKM might execute slow database queries). Therefore, we suggest disabling dashboard activity on huge repositories.

Field / Property	Type	Description
dashboard.user.activity	Boolean	Set to false to disable the feature.

Repository stats

Repository statistics are available at Administration > Stats and show graphs of repository size and the number of documents, folders, emails and records by context. On huge repositories these queries may take too long to complete, so disabling them is recommended:

Field / Property	Type	Description
repository.stats.enabled	Boolean	True by default. Set to false to disable repository stats.

Lucene

The Lucene search engine can be disabled or configured in several ways. Depending on this, the behavior of the application and the overall performance might change.

These configuration parameters go into the "openkm.properties" configuration file.

Field / Property	Type	Description	Value
spring.jpa.properties.hibernate.search.autoregister_listeners	String	Enable or disable the Lucene search engine. Allowed values: on, off.	on
spring.jpa.properties.hibernate.search.default.worker.execution	String	Set the Lucene worker execution mode. By default, it is set to "sync", which means that an entity modification does not complete until the Lucene index has been properly updated. With "async", entity modifications are faster because the Lucene part is added to a queue and processed in a background thread. Allowed values: sync, async.	sync
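
The difference between the two worker modes can be illustrated with a toy indexer. This is a conceptual sketch only, not how Hibernate Search is implemented: the class and its methods are hypothetical, and the "index" is just a Python list.

```python
import queue
import threading


class ToyIndexer:
    """Toy model of sync and async index updates (illustration only)."""

    def __init__(self, mode):
        self.mode = mode
        self.index = []
        self._pending = queue.Queue()
        if mode == "async":
            # Background thread that applies queued index updates.
            threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            entity = self._pending.get()
            self.index.append(entity)
            self._pending.task_done()

    def save(self, entity):
        if self.mode == "sync":
            # The caller waits until the index is updated.
            self.index.append(entity)
        else:
            # The caller returns at once; indexing happens later.
            self._pending.put(entity)

    def wait_until_indexed(self):
        """Block until every queued update has been applied."""
        self._pending.join()


if __name__ == "__main__":
    indexer = ToyIndexer("async")
    indexer.save("doc-1")
    indexer.save("doc-2")
    indexer.wait_until_indexed()
    print(indexer.index)
```

In the async case, a document saved a moment ago may not be searchable yet, which is the price paid for the faster save.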


Other configuration properties

Field / Property	Type	Description
background.parent.node.purge	Boolean	Performs the node purge as a background task so the user can keep working without waiting for the process to finish.

OpenKM 7.1
