Performance configuration parameters
Text extractor queue
Several configuration parameters let you adapt document text extraction to your needs. Text extraction is one of the processes that can use a lot of hardware resources (memory and CPU). An aggressive text extraction policy can decrease the performance of the application and degrade the end user's experience of the user interface.
When documents are uploaded, they are automatically added to the text extraction queue. You can see the queue at Administration > Statistics > Text extraction queue.
There's a crontab task named "Text extractor worker" at Administration > Cron tab that controls the indexing cycle.
After a document is uploaded, it cannot be searched by content until the text extraction queue has processed it.
If the application needs to index a lot of files per day, or you have imported a considerable number of files, consider creating two crontab tasks that switch these parameters between day (less aggressive) and night (more aggressive).
Some MIME types, such as PDF or images (which need an OCR engine), can use a lot of CPU (up to 100%).
Field / Property | Type | Description |
---|---|---|
text.extraction.batch | Integer | Number of documents to process on each indexing cycle. When "text.extraction.concurrent" is enabled, it is good practice to make this a multiple of the "text.extraction.threads" value. Default: 20. |
text.extraction.concurrent | Boolean | Enable or disable concurrent text extraction threads. Default: true. |
text.extraction.threads | Integer | Number of concurrent threads used for text extraction. This number should be less than or equal to the number of hardware cores. Default: 2. |
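The sizing guidance for the batch and thread parameters can be checked with a small helper. This is a minimal sketch and not part of OpenKM: the function name `check_extraction_settings` is hypothetical, and when `cores` is not given the core count is read from the machine running the script, which may not be the OpenKM server.

```python
import os
from typing import List, Optional


def check_extraction_settings(batch: int, threads: int,
                              concurrent: bool = True,
                              cores: Optional[int] = None) -> List[str]:
    """Return a warning for each rule of thumb the settings break."""
    if batch < 1 or threads < 1:
        raise ValueError("batch and threads must be positive")
    if cores is None:
        # Assumption: this script runs on the same hardware as the server.
        cores = os.cpu_count() or 1
    warnings = []
    if threads > cores:
        warnings.append(f"threads ({threads}) exceeds hardware cores ({cores})")
    if concurrent and batch % threads != 0:
        warnings.append(f"batch ({batch}) is not a multiple of threads ({threads})")
    return warnings
```

With a batch of 20 and 2 threads on a 4-core server, the helper returns an empty list; a batch of 21 with 2 threads produces one warning.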
There's a crontab task named "Mailextractor worker" at Administration > Cron tab that controls the indexing cycle.
After a mail is uploaded, it cannot be searched by content until the mail extraction queue has processed it. Its attachments also won't appear until the mail has been processed.
Field / Property | Type | Description |
---|---|---|
mail.extraction.batch | Integer | Number of emails to process on each indexing cycle. When "mail.extraction.concurrent" is enabled, it is good practice to make this a multiple of the "mail.extraction.threads" value. Default: 20. |
mail.extraction.concurrent | Boolean | Enable or disable concurrent mail extraction threads. Default: true. |
mail.extraction.threads | Integer | Number of concurrent threads used for mail extraction. This number should be less than or equal to the number of hardware cores. Default: 2. |
Download files and folders as ZIP file
Downloading a high volume of files as a ZIP file is not good practice. For large downloads, consider the application export tool at Administration > Utilities > Repository export instead.
Downloading files and folders as a ZIP file can take a lot of time and hardware resources. First, the application prepares a temporary folder on the server, which it uses to create the ZIP file; the ZIP file is then sent to the user interface. To avoid blocking the user interface while the process completes, two parameters set limits on the number of files and on their size. When a download exceeds either limit, the user is notified and the process continues in background mode.
Field / Property | Type | Description |
---|---|---|
download.zip.live.max.files | Integer | Set the maximum number of files allowed to download in live mode. Default: 100. |
download.zip.live.max.file.size | String | Set the maximum size of all files allowed to download in live mode. |
For more information about these parameters, read User interface configuration parameters.
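The live versus background decision can be pictured as a check against both limits. The following is a sketch of the rule as described on this page, not OpenKM's actual code: the function name `zip_download_mode` is hypothetical, and expressing the size limit in bytes is an assumption, because the format of the "download.zip.live.max.file.size" string is not specified here.

```python
from typing import Iterable


def zip_download_mode(file_sizes: Iterable[int], max_files: int,
                      max_total_bytes: int) -> str:
    """Return "live" if the download stays within both limits, else "background"."""
    sizes = list(file_sizes)
    if len(sizes) > max_files:
        return "background"  # too many files for live mode
    if sum(sizes) > max_total_bytes:
        return "background"  # combined size too large for live mode
    return "live"
```

For example, 101 small files with a file limit of 100 would go to background mode even though their combined size is well under the size limit.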
Antivirus checker
Several configuration parameters let you adapt the antivirus checker to your needs. The antivirus checker is one of the processes that can use a lot of hardware resources (memory and CPU). An aggressive antivirus checker policy can decrease the performance of the application and degrade the end user's experience of the user interface. The antivirus checker can work in live mode or in background mode. In live mode, the antivirus checks each document as part of the upload; if a virus is detected, an error is raised immediately and the document is not added to the repository.
The antivirus checker in live mode can dramatically degrade the responsiveness of the user interface and the end user's experience. Keep in mind that the antivirus can need 3 seconds or more to analyse some documents, and this time is added to the upload time.
If the application needs to analyse a lot of files per day, or you have imported a considerable number of files, consider creating two crontab tasks that switch these parameters between day (less aggressive) and night (more aggressive).
Field / Property | Type | Description |
---|---|---|
system.antivir | String | Set the antivirus path. |
background.antivirus.check | Boolean | Enable or disable the background antivirus checker. By default it is set to false. A crontab task named "Antivirus Checker Worker" (Administration > Crontab) controls the background antivirus checker. When the background check is disabled, the antivirus runs while each document is uploaded, which adds extra time to the upload (in some cases an antivirus needs more than 3 seconds to check a document). During a huge import of files, the total time to complete the operation can be multiplied by 3 or more with this option disabled, because the antivirus check runs as part of each upload. We suggest enabling it (true). |
background.antivirus.check.threads | Integer | Number of concurrent threads used for the antivirus. This number should be less than or equal to the number of hardware cores. Default: 2. |
background.antivirus.check.batch | Integer | Number of documents to process on each cycle. When "background.antivirus.check" is enabled, it is good practice to make this a multiple of the "background.antivirus.check.threads" value. Default: 20. |
For more information, read Antivirus configuration parameters.
Re-indexing the whole Lucene repository
The application can use two different ways of rebuilding Lucene indexes: sequential or parallel. By default, sequential re-indexing is enabled, but you can select the mode using the "hibernate.indexer.rebuild" configuration parameter.
Field / Property | Type | Description |
---|---|---|
hibernate.indexer.rebuild | String | By default, set to "flush". Possible values are "mass", "flush", and "delayed". See the table below for details of each value. |
You can use the "hibernate.indexer.batch.size.load.objects" configuration parameter to tell Hibernate how many objects to load in each batch. To avoid OutOfMemory problems, the repository needs to be re-indexed in batches. If the value of this property is too low, performance will be poor; if it is too high, you can run into OutOfMemory problems.
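The trade-off behind the batch size can be illustrated with a generic batching loop. This is a sketch of the general technique only, not of how Hibernate or OpenKM load objects; the name `in_batches` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items.

    Only one batch is held in memory at a time, so memory use grows with
    batch_size while the number of round trips shrinks.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Seven items with a batch size of 3 are processed as three batches of 3, 3, and 1 items; a batch size of 1 would need seven round trips, and a batch size of 7 would hold everything in memory at once.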
Mode | Description |
---|---|
Mass Indexer | This mode reindexes the repository in parallel, using several threads. This may be faster, but it is also more demanding in terms of resources. The repository will be in read-only mode until completed. You have several configuration properties to tune its performance: 30, 4 |
Flush to Indexes | This mode rebuilds the index sequentially in batches. It's less resource hungry but may be slower. The repository will be in read-only mode until completed. You have several configuration properties to tune its performance: 30 |
Delayed Indexer | There's a cron task named "Lucene Delayed Indexer" which processes the queue of nodes to be indexed in the background. This queue is filled when you go to Administration > Tools > Rebuild indexes and choose "Lucene indexes". When the delayed indexer is enabled, the nodes are added to the batch queue to be processed by this background process. This mode rebuilds the index sequentially in batches. It's less resource hungry and may be slower, but you have more control because you can choose which nodes need to be re-indexed, and the repository won't be in read-only mode while it works. You have several configuration properties to tune its performance: 1000, 4 |
Execution timeout
The application runs external applications to process documents, for example to extract text with an OCR engine, scan with antivirus software, or convert documents to other formats. Sometimes these processes take a long time or, for some reason, do not finish correctly, and the process stays alive in the OS consuming resources. To prevent this, the "system.execution.timeout" parameter sets the maximum allowed execution time for external applications.
Field / Property | Type | Description |
---|---|---|
system.execution.timeout | Integer | Set the maximum allowed execution time for external applications. The unit is minutes. Default: 5. |
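The effect of such a timeout can be sketched with a standard process call. This is a generic illustration of killing an external process that overruns its time limit, not OpenKM's implementation; the function name `run_with_timeout` is hypothetical, and only the minutes unit is taken from the table above.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(command: Sequence[str], timeout_minutes: float) -> bool:
    """Run an external command; return False if it had to be killed.

    subprocess.run kills the child and waits for it when the timeout
    expires, so no orphan process is left consuming resources.
    """
    try:
        subprocess.run(list(command), timeout=timeout_minutes * 60, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in for an OCR or antivirus call: a child that finishes at once.
    finished = run_with_timeout([sys.executable, "-c", "pass"], timeout_minutes=5)
    print("finished in time:", finished)
```

A limit that is too low kills legitimate long-running jobs such as OCR on large scans, while a limit that is too high lets stuck processes hold CPU and memory for longer.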
Activity log actions
OpenKM can log a lot of information about user activity, but some of these actions do not need to be logged and only fill up your activity log table.
The table where the activity log is stored - OKM_ACTIVITY - may grow quickly, storing millions of records.
There is a configuration property named "activity.log.actions" where you can set which actions to log. By default, this is set to the most common or interesting actions. You can use regular expressions to define these actions. Read Java Regex Tutorial for more info about Java regular expressions.
Field / Property | Type | Description |
---|---|---|
activity.log.actions | List | LOGIN |
For a complete list of actions, see Activity log.
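How a list of regular expressions selects actions can be sketched as follows. This is an illustration under stated assumptions, not OpenKM's matching code: OpenKM uses Java regular expressions, and the sketch uses Python's `re` module, which behaves the same way for simple patterns like these. `LOGIN` comes from the table above; the pattern `CREATE_.*` and the other action names are hypothetical, and whether OpenKM matches the whole action name (as `fullmatch` does here) is also an assumption.

```python
import re
from typing import Iterable


def should_log(action: str, patterns: Iterable[str]) -> bool:
    """Return True if the whole action name matches any configured pattern."""
    return any(re.fullmatch(pattern, action) for pattern in patterns)


# LOGIN is taken from the table; CREATE_.* is a hypothetical pattern.
patterns = ["LOGIN", "CREATE_.*"]

logged = [a for a in ["LOGIN", "CREATE_DOCUMENT", "GET_DOCUMENT_CONTENT"]
          if should_log(a, patterns)]
print(logged)
```

A narrow pattern list keeps frequent, low-value actions out of the OKM_ACTIVITY table, which slows its growth.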
Dashboard activity
The OpenKM dashboard displays a global view of application activity. To build it, the application needs to process a huge amount of information, which can decrease the overall performance of the application (for example, to get the last documents uploaded into the system, OpenKM might execute slow database queries). For this reason, we suggest disabling dashboard activity on huge repositories.
Field / Property | Type | Description |
---|---|---|
dashboard.user.activity | Boolean | Set to false to disable the feature. |
Lucene
The Lucene search engine can be disabled or configured in several ways. Depending on the configuration, the behavior and overall performance of the application might change.
These configuration parameters go in the "openkm.properties" configuration file.
Field / Property | Type | Description |
---|---|---|
spring.jpa.properties.hibernate.search.enabled | String | Enable or disable the Lucene search engine. Allowed values: true |
spring.jpa.properties.hibernate.search.automatic_indexing.synchronization.strategy | String | Set the Lucene worker execution mode. By default it is set to "write-sync", which means that an entity modification does not finish until the Lucene index has been properly updated. With "async", entity modifications are faster because the Lucene part is added to a queue and processed in a background thread. Allowed values: sync |
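The difference between waiting for the index update and queueing it can be sketched with a worker thread. This is a conceptual illustration only, not how Hibernate Search or OpenKM are implemented; the class `Indexer` and its method names are hypothetical.

```python
import queue
import threading


class Indexer:
    """Toy index that can be updated synchronously or via a background queue."""

    def __init__(self) -> None:
        self.indexed = []
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def _drain(self) -> None:
        # Background thread: apply queued index updates one by one.
        while True:
            entity = self._pending.get()
            self.indexed.append(entity)
            self._pending.task_done()

    def save_sync(self, entity: str) -> None:
        # The caller returns only after the index has been updated.
        self.indexed.append(entity)

    def save_async(self, entity: str) -> None:
        # The caller returns at once; the index is updated later.
        self._pending.put(entity)

    def wait_until_idle(self) -> None:
        # Block until every queued update has been applied.
        self._pending.join()
```

In the asynchronous case a search issued right after a save may not see the new entity yet, which is the price paid for the faster modification.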
Other configuration properties
Field / Property | Type | Description |
---|---|---|
background.parent.node.purge | Boolean | Performs the node purge as a background task so the user can keep working without waiting for the process to end. |