Performance Tuning
The following parameters are expert settings intended for advanced users. Improper configuration can lead to performance degradation instead of improvement.
Key considerations
Basic principle
When tuning performance, consider the impact of settings within the context of the entire system. High values for parallelism, typically exceeding 8-12, can exhaust resources and reduce performance.
Resource allocation
It’s essential to monitor hardware utilization (CPU, memory, disk operations, and temporary storage space usage) for all services, especially the application and database servers, when adjusting these settings.
Ensure that the hardware used for MDM is not shared with other services that might degrade performance and stability.
MDM transformation plans
When using MDM transformation plans, avoid steps that interact directly with the MDM repository (for example, SQL Select, SQL Execute) or MDM read and write steps, as well as external calls (for example, Web Lookup, Json Call).
These actions typically cause performance issues that are outside product support and need to be addressed on a project basis. Using such steps in load and export interfaces can also cause localized performance issues.
Parallelism setup
MDM processing parallelism can be influenced in several ways. Individual tasks can run in parallel, and operations and entities within the processing can run concurrently.
Task parallelism (processing parallelism)
This setting determines the number of jobs of a given type that can run concurrently. By default, parallelism is disabled, which is suitable for batch-oriented, analytical solutions where no modifications are made using MDM Web App.
For most other scenarios, we recommend using a single batch parallel strategy nme.parallel.strategy=single_batch
.
This approach allows for processing service requests from MDM Web App and APIs while also handling streams, which are considered a type of online processing.
Currently, only one batch load can run.
Reprocessing is considered a type of batch processing.
We recommend scheduling these batches outside of business hours or dividing them into smaller chunks (microbatches) to minimize possible conflicts and restarts of the batch load, or using streams instead.
A full parallelism strategy is recommended only when the majority of traffic is through streams and APIs, and there are no long-running batch loads. Combining this strategy with large batches can lead to unpredictable job conflicts, subsequent restarts, and extended processing times.
Operation parallelism
Operation parallelism within each processing phase needs to be managed independently. There are three main areas that are always sequential:
-
Acquisition
-
Master Data Consolidation
-
Committing
The next phase can start only after the previous one is completed. Each phase has a distinct method for setting up parallelism.
Acquisition parallelism (delta detection or change detection phase)
This phase applies to batch and streaming data loads.
It controls the number of entities for which the system detects changes (insert, update, delete, or no change) and generates a delta load for further processing.
You can set the level of parallelism using the nme.delta.parallel
runtime parameter, see Runtime Parameters, section Performance.
Keep in mind that this operation is typically database-intensive. Increasing parallelism will result in a higher load on the database.
Master data consolidation parallelism
Start with consolidation parallelism, allowing individual operations (subtasks) to run in parallel if constraints allow.
Consolidation parallelism can be configured using the nme.consolidation.parallel
runtime parameter.
Keep in mind that the described parallelisms multiply with the consolidation parallelism.
You can define a threshold to avoid parallelism for small transactions using the nme.consolidation.parallel.threshold
parameter.
For more information, see Runtime Parameters, section Performance.
Cleanse and master validation operation
For a low number of entities or underutilized hardware, increase cleanse or validate parallelism.
To configure it, use the nme.clean.parallelism
and nme.validate.parallelism
runtime parameters.
See Runtime Parameters, section Performance.
Typically, plan step parallelism is unnecessary despite being available as a standard option.
Matching
Typically, a central matching entity (for example, Party, Customer, Vendor, Patient) is processed as a single entity. Configure matching parallelism only for this central entity, because it usually runs standalone. Related or secondary matching plans generally run in parallel due to consolidation parallelism and do not require additional internal parallelism unless hardware is underutilized.
For more information, see Matching step performance tuning.
Operational plan optimization
This fine-tuning option optimizes the order of individual operations, their interference, and shortens processing time.
For optimal performance, order the operations or subtasks by:
-
Operation type: cleansing, matching of the gold (central) matching entity, related matchings, aggregate, merge, master validation.
-
Data volume and complexity, prioritizing the most time-consuming operations.
Consider operation plan constraints. If the data model has more independent parts, start with the largest one. Place the cleansing of smaller parts around the matching of the largest part.
For more information, see configuration:operational-plan.adoc.
Committing parallelism
For a low number of entities or underutilized hardware, increase the number of workers per entity. If there are more workers than entities, use elastic assignment of workers based on data volumes.
Committing parallelism can be configured using the nme.commit.parallel runtime
parameter, see Runtime Parameters, section Performance.
Keep in mind that this operation is typically database-intensive. Increasing parallelism will result in a higher load on the database.
Overall parallelism settings
Align parallelism settings with the number of available CPU cores, typically not exceeding 8-12. Be aware that during consolidation and committing phases, enforcing parallelism among entities and within entities is highly resource-demanding (especially the application server and database for committing), particularly when parallelism on each entity is enabled (not depicted in the schema below).
JVM parameters
The most relevant Java Virtual Machine (JVM) settings are described in JVM Configuration.
XMX memory settings
MDM uses both heap space (XMX memory) and non-heap space. The size of XMX memory should not exceed 70% of physical memory. Do not use |
MDM Server performance tuning
For the list of various parameters that influence the performance of MDM Server, see Runtime Parameters.
To use some of these parameters, define them in the runtime properties file, as in the following example.
For settings that are rarely modified, such as SMTP SSL flags or global JVM flags, you can add them as a Java parameter (for example, -Dnme.copyColumns.skipUnchanged=true
).
The values of runtime parameters set during the engine startup are available in the Admin Center under Runtime Parameters (see MD Process Monitoring, section Runtime parameters).
...
nme.parallel.strategy=single_batch
nme.delta.parallel=4
nme.consolidation.parallel=4
nme.commit.trimTooLongStrings=true
...
Runtime and steps performance tuning
It is possible to pass various runtime parameters in bulk in a runtime properties file or create a performance file for any plan.
In case of MDM, take note of the specific parameters used to configure parallelism of the Matching step, as described in the following section.
Matching step performance tuning
The Matching step can have the parallelism set on three levels—partition, key, and matching rule—with the following parameters:
-
mduPartitionParallelism
: Partitions defined in the Matching step are processed in parallel. -
mduKeyParallelism
: Key rules defined within each partition are processed in parallel. -
mduMatchParallelism
: Matching rules defined within each key rule are processed in parallel.
Allowed values:
-
0 - Parallelism disabled (default).
-
1 - Parallelism enabled.
...
mduPartitionParallelism=1
mduKeyParallelism=1
mduMatchParallelism=1
...
Once the parallelism is enabled, it is the actual number of Partitions, Key Rules, and Match Rules in the Matching step configuration that determines the resulting level of parallelism. For example, if there are two partitions, five keys, and ten matching rules configured, and the parallelism is set on all 3 levels, as in the example settings (with all 3 parameters set to 1), the overall parallelism will require 100 threads. |
Setting parallelism on step level
While it is possible to specify a global parallelism value that will be used by any Matching step, it is also possible to specify each parallelism setting separately for each Matching step with a specific notation.
For example, the following setting will be applied only on the Matching step whose ID is Party Matching
.
Party\ Matching.mduPartitionParallelism=1
Spaces must be escaped using a backward slash ( To avoid escaping, you can use underscores in the |
It is also possible to combine both global and step-specific settings.
...
### Global Matching Parallelism Settings ###
mduPartitionParallelism=1
mduKeyParallelism=0
mduMatchParallelism=0
...
### Party Matching ###
Party\ Matching.mduPartitionParallelism=1
Party\ Matching.mduKeyParallelism=1
Party\ Matching.mduMatchParallelism=1
MDM database performance tuning
We recommend using PostgreSQL database as the storage for MDM. Oracle and MS SQL databases are supported, however, no further enhancements will be made for these platforms starting from version 16. |
PostgreSQL
Certain parameters for PostgreSQL databases can be configured in the MDM Server application.properties
file, see MDM Server Application Properties, section Database Performance.
Oracle and MS SQL
MDM allows hinting for all SQL queries used internally.
SQL queries are available in the <ONE_HOME>/vldb-commands_<DB>.xml
and the file is referenced in the MDM configuration file.
Currently, only Oracle hints are pre-prepared: see the vldb-commands_oracle.xml
file.
The ability to hint database queries was previously implemented using the runtime parameters. This method still works without any manual interventions. |
MDM database settings
Amazon Aurora PostgreSQL configuration
If the apg_plan_mgmt extension
is enabled, set the apg_plan_mgmt.use_plan_baselines
to false
.
Any other configuration should be confirmed by your database administrator.
Persistence layer debug logging
Logging for persistence layer SELECT operations can be set up in the MDM Server application.properties
file (see MDM Server Application Properties, section Logging):
logging.level.com.ataccama.mdm.persistence=INFO
ataccama.one.mdm.db.logging.enabled=true
ataccama.one.mdm.db.logging.long-running-operation=10s
ataccama.one.mdm.db.logging.show-parameters=false
If you want to log all operations, set the value of the ataccama.one.mdm.db.logging.long-running-operation to -1s .
|
Was this page useful?