Performance configuration parameters

Text extractor queue

There are several configuration parameters to adapt document text extraction to your needs. Text extraction is one of the processes that can use a lot of hardware resources (memory and CPU). An aggressive text extraction policy can decrease application performance and degrade the end-user experience of the user interface.

When documents are uploaded, they are automatically added to the text extraction queue. You can view the queue at Administration > Statistics > Text extraction queue.

There is a crontab task named "Text extractor worker" at Administration > Cron tab that controls the indexing cycle.

After a document is uploaded, it cannot be searched by content until it has been processed by the text extraction queue.

If the application needs to index a lot of files per day, or you have imported a considerable number of files, consider creating two crontab tasks that switch these parameters between day (less aggressive) and night (more aggressive) settings.

Some MIME types, such as PDF or images (which need the OCR engine), can use a lot of CPU (up to 100%).

Field / Property | Type | Description
text.extraction.batch | Integer | Number of documents to process on each indexing cycle. When "text.extraction.concurrent" is enabled, it is good practice to make this value a multiple of "text.extraction.threads". Example value: 20
text.extraction.concurrent | Boolean | Enables or disables concurrent text extraction threads. Example value: True
text.extraction.threads | Integer | Number of concurrent threads used for text extraction. This number should be less than or equal to the number of hardware cores. Example value: 2
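The two sizing rules above (threads no higher than the number of cores, batch a multiple of threads) can be sketched as a small helper. This is only an illustration: the function name and its arguments are hypothetical and not part of OpenKM, and the resulting values would still have to be entered by hand in the configuration.

```python
def align_extraction_settings(desired_batch, desired_threads, cores):
    """Return settings that follow the two rules of thumb described above.

    Hypothetical helper, not an OpenKM API.
    """
    # Rule 1: never use more threads than hardware cores (and at least one).
    threads = max(1, min(desired_threads, cores))
    # Rule 2: round the batch size up to the next multiple of the thread count.
    batch = -(-desired_batch // threads) * threads
    return {
        "text.extraction.threads": threads,
        "text.extraction.batch": batch,
    }
```

For example, asking for a batch of 25 with 4 threads on a 2-core server would yield 2 threads and a batch of 26.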


There is a crontab task named "Mailextractor worker" at Administration > Cron tab that controls the indexing cycle.

After a mail is uploaded, it cannot be searched by content until it has been processed by the mail extraction queue. Furthermore, its attachments won't appear until it has been processed.

Field / Property | Type | Description
mail.extraction.batch | Integer | Number of mails to process on each indexing cycle. When "mail.extraction.concurrent" is enabled, it is good practice to make this value a multiple of "mail.extraction.threads". Example value: 20
mail.extraction.concurrent | Boolean | Enables or disables concurrent mail extraction threads. Example value: True
mail.extraction.threads | Integer | Number of concurrent threads used for mail extraction. This number should be less than or equal to the number of hardware cores. Example value: 2

OpenOffice LibreOffice conversion service

As part of application startup, OpenKM starts an OpenOffice or LibreOffice service. It is used internally for conversion purposes, for example to convert DOC files to PDF.

The OpenOffice or LibreOffice service can use a lot of hardware resources (CPU up to 100%), which can decrease application performance. A good practice is to move the OpenOffice or LibreOffice conversion service to another server.

Field / Property | Type | Description
system.openoffice.server | String | URL of the OpenOffice or LibreOffice service. Example value: http://192.168.1.34:8080/converter/convert
system.openoffice.tasks | String | Restarts the service after this number of conversions to prevent memory leaks. Example value: 5

For more information read:

Download files and folders as ZIP file

Downloading a high volume of files as a ZIP file is not good practice. In most cases, consider using the application's export tool at Administration > Utilities > Repository export instead.

Downloading files and folders as a ZIP file can take a lot of time and hardware resources. First, the application prepares a temporary folder on the server in which the ZIP file is created; finally, the file is sent to the user interface. To avoid blocking the user interface while waiting for the process to complete, two parameters set limits (number of files and total size). When a download exceeds either limit, the user is notified and the process continues in background mode.

Field / Property | Type | Description
download.zip.live.max.files | Integer | Sets the maximum number of files that can be downloaded in live mode. Example value: 100
download.zip.live.max.file.size | String | Sets the maximum total size of the files that can be downloaded in live mode.
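The choice between live and background mode described above can be pictured with the following sketch. It is an illustration, not OpenKM's actual implementation: the function name is hypothetical, and it assumes the size limit has already been converted to a number of bytes (the format of the "download.zip.live.max.file.size" string is not shown here).

```python
def zip_download_mode(file_sizes, max_files, max_total_bytes):
    """Pick "live" or "background" for a ZIP download (hypothetical helper).

    file_sizes: size in bytes of every file selected for the download.
    max_files: stands in for download.zip.live.max.files.
    max_total_bytes: stands in for download.zip.live.max.file.size,
        assumed to be already parsed into bytes.
    """
    too_many_files = len(file_sizes) > max_files
    too_large = sum(file_sizes) > max_total_bytes
    # Exceeding either limit moves the download to background mode.
    return "background" if too_many_files or too_large else "live"
```

With a limit of 100 files, a selection of 101 small files would go to background mode even when its total size is well under the size limit.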

For more information about these parameters, read User interface configuration parameters.

Antivirus checker

There are several configuration parameters to adapt the antivirus checker to your needs. The antivirus checker is one of the processes that can use a lot of hardware resources (memory and CPU). An aggressive antivirus policy can decrease application performance and degrade the end-user experience of the user interface. The antivirus checker can work in live mode (the document is checked during upload; if a virus is detected, an error is raised immediately and the document is not added to the repository) or in background mode.

In live mode, the antivirus checker can dramatically slow down the user interface and degrade the end-user experience. Bear in mind that analysing some documents can take the antivirus 3 seconds or more, which is added to the upload time.

If the application needs to analyse a lot of files per day, or you have imported a considerable number of files, consider creating two crontab tasks that switch these parameters between day (less aggressive) and night (more aggressive) settings.

Field / Property | Type | Description
system.antivir | String | Sets the antivirus path.
background.antivirus.check | Boolean | Enables or disables the background antivirus checker. By default it is set to false. A crontab task (Administration > Crontab) named Antivirus Checker Worker controls the background antivirus checker. When the background check is disabled, the antivirus runs while each document is uploaded, which can add extra time to the upload (in some cases the antivirus needs more than 3 seconds to check a document). During a huge import of files, the total time to complete the operation can be multiplied by 3 or more with this option disabled, because each document is checked as part of its upload. We suggest enabling it. Example value: true
background.antivirus.check.threads | Integer | Number of concurrent threads used by the antivirus. This number should be less than or equal to the number of hardware cores. Example value: 2
background.antivirus.check.batch | Integer | Number of documents to process on each cycle. When "background.antivirus.check" is enabled, it is good practice to make this value a multiple of "background.antivirus.check.threads". Example value: 20

For more information read Antivirus configuration parameters.

Re-indexing the whole Lucene repository

The application can rebuild Lucene indexes in two ways: sequentially or in parallel. Sequential re-indexing is enabled by default, but you can select the mode with the "hibernate.indexer.rebuild" configuration parameter.

Field / Property | Type | Description
hibernate.indexer.rebuild | String | By default set to "flush". Possible values are "mass", "flush" and "delayed". See the table below for details of each value.

You can use the "hibernate.indexer.batch.size.load.objects" configuration parameter to tell Hibernate how many objects to handle at a time. To avoid OutOfMemory problems, the repository must be re-indexed in batches. If the value of this property is too low, performance will be poor; if it is too high, you may run into OutOfMemory problems.
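The batch trade-off above can be illustrated with a generic batching loop. This is a sketch of the general idea only, not of how Hibernate loads entities: the function name is hypothetical, and the batch size plays the role of "hibernate.indexer.batch.size.load.objects".

```python
from itertools import islice

def in_batches(items, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper).

    Only one batch is held in memory at a time, which is why a smaller
    batch size lowers memory use at the cost of more round trips.
    """
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Seven items with a batch size of 3 are processed as three batches of 3, 3 and 1 items.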

Mode | Description

Mass Indexer

This mode re-indexes the repository in parallel, using several threads. It may be faster, but it can also be more demanding on resources. The repository will be in read-only mode until the process completes.

You have several configuration properties to tune its performance:

  • hibernate.indexer.batch.size.load.objects: batch size used to load the entities. Example value: 30
  • hibernate.indexer.threads.load.objects: number of threads used to load the entities. Example value: 4

Flush to Indexes

This mode rebuilds the index sequentially in batches. It is less resource hungry but may be slower. The repository will be in read-only mode until the process completes.

The following configuration property tunes its performance:

  • hibernate.indexer.batch.size.load.objects: batch size used to load the entities. Example value: 30

Delayed Indexer

There is a cron task named "Lucene Delayed Indexer" which processes the queue of nodes to be indexed in the background.

This queue is filled when you go to Administration > Tools > Rebuild indexes and choose "Lucene indexes". When the delayed indexer is enabled, the nodes are added to the batch queue to be processed by this background process.

This mode rebuilds the index sequentially in batches. It is less resource hungry and may be slower, but it gives you more control, because you can choose which nodes need to be re-indexed, and the repository won't be in read-only mode while it works.

You have several configuration properties to tune its performance:

  • hibernate.indexer.delayed.max.objects: how many nodes will be processed in a batch. Example value: 1000
  • hibernate.indexer.batch.size.load.objects: number of nodes to read from the database at the same time. Example value: 4

Execution timeout

The application runs external programs to process documents: for example, it extracts text with an OCR engine, scans documents with antivirus software, and converts documents to other formats. Sometimes these processes take a long time or, for some reason, do not finish correctly and remain on the OS consuming resources. To prevent this, the "system.execution.timeout" parameter sets the maximum execution time allowed for external applications.

Field / Property | Type | Description
system.execution.timeout | Integer | Sets the maximum execution time allowed for external applications. The unit is minutes. By default it is set to 5 minutes. Example value: 5
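The general technique behind an execution timeout can be sketched as follows. This is not OpenKM's own code: it is a minimal illustration using Python's standard subprocess module, and the function name is hypothetical. The timeout is given in minutes to mirror the unit of "system.execution.timeout".

```python
import subprocess
import sys

def run_with_timeout(command, timeout_minutes):
    """Run an external command with a time limit (hypothetical helper).

    Returns the exit code, or None when the process exceeded the limit.
    subprocess.run kills the child process itself when the timeout expires.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_minutes * 60)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode

if __name__ == "__main__":
    # A command that finishes immediately, well within a 5-minute limit.
    print(run_with_timeout([sys.executable, "-c", "pass"], 5))
```

The point of the design is that a stuck process is terminated rather than left running on the OS.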

Upload bandwidth

Sometimes you want to restrict the total bandwidth used by each user while uploading files.

Field / Property | Type | Description
upload.throttle.filter | Boolean | Limits the total upload bandwidth to 10kb/sec per user.

Remote conversion service

On a huge repository with a lot of concurrent users, it is a good idea to dedicate a server to document transformation.

Enabling this feature requires installing a specific OpenKM service application. Contact OpenKM technical staff for more information.

Field / Property | Type | Description
remote.conversion.server | String | Sets the URL of the conversion service. Example value: http://192.168.2.1:8080/converter/convert

Microsoft Office conversion service

On a huge repository with a lot of concurrent users, it is a good idea to dedicate a server to document transformation.

Enabling this feature requires installing a specific OpenKM service application. Contact OpenKM technical staff for more information.

Field / Property | Type | Description
remote.msoffice.conversion.service | String | Sets the URL of the conversion service. Example value: https://localhost:5400/rest/conversion/pdf

Activity log actions

OpenKM can log a lot of information about user activity, but some of these actions don't need to be logged and only fill up your activity log table.

The table where the activity log is stored, OKM_ACTIVITY, may grow quickly and hold millions of records.

There is a configuration property named "activity.log.actions" where you can set which actions to log. By default it is set to the most common or interesting actions. You can use regular expressions to define these actions. Read the Java Regex Tutorial for more information about Java regular expressions.

Field / Property | Type | Description
activity.log.actions | List | Actions to log. Example value:

LOGIN
LOGOUT
CREATE_.*
DELETE_.*
PURGE_.*
MOVE_.*
COPY_.*
CHECKOUT_DOCUMENT
CHECKIN_DOCUMENT
GET_DOCUMENT_CONTENT.*
MISC_TEXT_EXTRACTION_FAILURE
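To see how patterns like these select actions, the sketch below checks action names against the list above. It makes two assumptions: it uses Python's re module rather than Java's regex engine (these simple patterns behave the same in both), and it assumes a pattern must match the whole action name. Action names other than those listed above are hypothetical and used only for illustration.

```python
import re

# Patterns copied from the example value above.
ACTIVITY_LOG_ACTIONS = [
    "LOGIN", "LOGOUT", "CREATE_.*", "DELETE_.*", "PURGE_.*", "MOVE_.*",
    "COPY_.*", "CHECKOUT_DOCUMENT", "CHECKIN_DOCUMENT",
    "GET_DOCUMENT_CONTENT.*", "MISC_TEXT_EXTRACTION_FAILURE",
]

def is_logged(action, patterns=ACTIVITY_LOG_ACTIONS):
    """Return True when the action name fully matches any pattern.

    Whole-name matching is an assumption of this sketch.
    """
    return any(re.fullmatch(pattern, action) for pattern in patterns)
```

Under these assumptions, "CREATE_.*" covers every action whose name starts with "CREATE_", while an action that matches no pattern is left out of the log table.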

For a complete list of actions, see Activity log.

Dashboard activity

The OpenKM dashboard displays a global view of application activity. To do so, the application needs to process a huge amount of information, which can decrease the overall performance of the application (for example, to get the last documents uploaded into the system, OpenKM might execute slow database queries). For this reason, we suggest disabling dashboard activity on huge repositories.

Field / Property | Type | Description
dashboard.user.activity | Boolean | Set to false to disable the feature.

Repository stats

This information is available at Administration > Stats and shows graphs with information about repository size and the number of documents, folders, mails and records by context. On huge repositories these queries may take too long to complete, so it is recommended to disable them:

Field / Property | Type | Description
repository.stats.enabled | Boolean | By default it is true. Set to false to disable repository stats.

Lucene

The Lucene search engine can be disabled or configured in several ways. Depending on the configuration, the behaviour and overall performance of the application may change.

These configuration parameters go in the "openkm.properties" configuration file.

Field / Property | Type | Description
spring.jpa.properties.hibernate.search.autoregister_listeners | String | Enables or disables the Lucene search engine. Allowed values: on, off. Example value: on
spring.jpa.properties.hibernate.search.default.worker.execution | String | Sets the Lucene worker execution mode. By default it is set to "sync", which means that an entity modification does not complete until the Lucene index has been properly updated. With "async", entity modifications are faster because the Lucene update is added to a queue and processed in a background thread. Allowed values: sync, async. Example value: sync
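The difference between the two worker modes can be pictured with a small queue-and-thread sketch. It illustrates the general pattern only, not Hibernate Search internals; the class and method names are hypothetical, and appending to a list stands in for the real index update.

```python
import queue
import threading

class AsyncIndexer:
    """Toy model of "async" mode: saves return at once, indexing happens later."""

    def __init__(self):
        self.indexed = []  # stands in for the Lucene index
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        # Background thread: take queued entities and "index" them in order.
        while True:
            entity = self._pending.get()
            self.indexed.append(entity)
            self._pending.task_done()

    def save(self, entity):
        # Returns immediately; in "sync" mode the caller would wait here
        # until the index update had finished.
        self._pending.put(entity)

    def wait_until_indexed(self):
        # Block until every queued entity has been processed.
        self._pending.join()
```

The trade-off shown here is that a caller of save() gets control back sooner, but an entity may not be searchable until the background thread has caught up.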

Other configuration properties

Field / Property | Type | Description
background.parent.node.purge | Boolean | Performs the node purge as a background task, so the user can keep working without waiting for the process to finish.