Lookups Management in MDM
This article covers how to manage lookup files in MDM. Learn about two storage approaches—static (Git-based) and dynamic (object storage with VFS)—and when to use each based on file size, update frequency, and uptime requirements.
What are lookups?
Lookups are specialized data files that store lists of values for efficient use during data processing. They enable quick data validation, reference checks, and fuzzy matching operations by mapping data directly to machine memory for fast access.
Lookups are commonly used to:
-
Validate reference data, such as email domains or country codes, against known lists.
-
Check customer data against reference databases.
-
Perform data cleansing operations.
-
Standardize values during data processing.
Using large or numerous lookup files can result in excessive RAM consumption.
In that case, you might need to increase the available RAM or reduce the -Xmx JVM parameter for the MDM Server to ensure stable performance.
In MDM, lookups are stored as .lkp files.
You can think of them as technical snapshots of your reference data at a specific point in time.
Once an update is made to your reference data, a new version of the lookup file must be created as well.
In the context of lookup files, versioning refers to updating the lookup file to reflect the latest changes in the underlying reference data. We do not recommend using lookup files as a means of versioning reference data; use a dedicated RDM solution instead.
How lookup files are updated
Lookup files (.lkp) are created as the output of the Lookup Builder step in ONE plans, which reads values from source data and processes them into a structured format.
They are then used throughout the data processing pipeline in ONE Desktop using the Lookup step for reading lookup files.
For more information about the Lookup Builder and Lookup steps, refer to the ONE Desktop help (Help > Help Contents).
The .lkp format doesn't support incremental updates, so any change to your reference data requires building a completely new lookup file.
This is why, when files are stored only locally, updating a lookup file used in a ONE plan requires rebuilding the file and restarting the MDM Server.
Choosing your approach
To handle lookup updates more efficiently, MDM supports two approaches for managing lookup files, each suited to different use cases.
Both approaches work with cloud and self-managed, on-premises deployments. The dynamic approach supports the following storage options:
-
Ataccama Cloud: MinIO is the only supported object storage solution for dynamic lookup management.
-
Self-managed: You can use any compatible object storage solution, such as Amazon S3, Azure Data Lake Storage 2, or a mounted hard drive.
In addition, both approaches support using complex folder structures, with as many subfolders as needed, allowing you to organize lookup files by category or purpose.
Keep in mind that using a storage option other than MinIO might require additional configuration steps and cause issues that are not covered by this article.
Static approach
With the static approach, lookup files are stored within your Git configuration project and no additional storage configuration is required. This also allows you to leverage your existing Git setup used for MDM cloud deployments (see MDM Ataccama Cloud Deployment and MDM Custom Ataccama Cloud Deployment).
However, changes to lookup files take effect only after the MDM Server is restarted.
As such, the static approach is best suited for:
-
Small lookup files (under 5 MB).
-
Reference data that doesn’t change very often.
-
Simple setups (such as for testing or development purposes) where occasional server downtime for updates is acceptable.
Keep in mind that larger files can also significantly increase the size of your project, leading to longer server startup times and slower performance in ONE Desktop.
Dynamic approach
The dynamic approach relies on MinIO for storing lookup files and on the Versioned File System (VFS) component for managing updates. Because lookup files can be updated without restarting the MDM Server, this approach offers more flexibility and scalability.
While this approach can be used for any lookup file, it is particularly beneficial for:
-
Large lookup files (over 5 MB).
-
Reference data that changes frequently.
-
Production environments where uptime is critical.
In addition, MinIO provides built-in backup and recovery capabilities, enhancing data integrity and resilience.
Using both approaches
You can use both static and dynamic approaches within the same MDM environment.
To combine these approaches effectively, take into account the following:
-
Keep Git and MinIO lookup files in separate folders. This facilitates maintenance by reducing user error and might also speed up VFS reloads as only some folders have to be reloaded.
To determine which files should go where, compare Static approach and Dynamic approach.
-
Do not store the same file (same path and filename) in both Git and MinIO. This can lead to MDM startup issues and goes against performance optimization best practices.
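The rule about not keeping the same path and filename in both locations can be checked before deployment. The following is a minimal sketch, assuming that both the Git lookup folder and a copy of the MinIO lookup folder are available as local directories; the function name and the directory layout are hypothetical and not part of MDM.

```python
from pathlib import Path


def find_duplicate_lookups(git_root: Path, minio_root: Path) -> list[str]:
    """Return relative paths of .lkp files that exist under both roots."""

    def relative_lkp_paths(root: Path) -> set[str]:
        # Collect every .lkp file, expressed relative to its root folder.
        return {p.relative_to(root).as_posix() for p in root.rglob("*.lkp")}

    return sorted(relative_lkp_paths(git_root) & relative_lkp_paths(minio_root))
```

An empty result means no lookup file is stored under the same relative path in both places.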
Working with the static approach
Folder structure and path variables
Lookup files are typically stored in the mdm/Files/data/ext/lkp folder within your project.
For easy reference, the path to lookup files is defined by the EXT path variable (the absolute path being /srv/conf/server/mdm/Files/data/ext).
Path variable | Folder | ONE Desktop project relative path |
---|---|---|
|
|
|
For example, to reference a lookup file in the Lookup step of your ONE plans using the EXT path variable, use the following syntax:
pathvar://EXT/lkp/your_lookup_file.lkp
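To illustrate how such a reference maps to a location on disk, here is a small sketch that expands a pathvar:// reference using a dictionary of path variables. It is an illustration only, not how the MDM Server resolves paths internally; the function name and the dictionary are assumptions, and the EXT value is the absolute path mentioned above.

```python
# Path variable values; EXT uses the absolute path given in this article.
PATH_VARIABLES = {
    "EXT": "/srv/conf/server/mdm/Files/data/ext",
}


def resolve_pathvar(reference: str, variables: dict[str, str]) -> str:
    """Expand a pathvar://NAME/rest reference into an absolute path."""
    prefix = "pathvar://"
    if not reference.startswith(prefix):
        raise ValueError(f"Not a pathvar reference: {reference}")
    # Split "NAME/rest/of/path" into the variable name and the remainder.
    name, _, rest = reference[len(prefix):].partition("/")
    if name not in variables:
        raise KeyError(f"Unknown path variable: {name}")
    root = variables[name].rstrip("/")
    return f"{root}/{rest}" if rest else root
```

For the reference shown above, the sketch returns /srv/conf/server/mdm/Files/data/ext/lkp/your_lookup_file.lkp.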
Update lookup files in Git
After you create a new version of your lookup file using the Lookup Builder step, commit your changes to the Git repository. Then restart the MDM Server to load the updated lookup file.
Since updates require server restarts, schedule lookup file updates during maintenance windows when possible.
Working with the dynamic approach
Folder structure and path variables
With the dynamic approach, lookup files are typically stored in the following locations.
The variables point to the root folder of the respective lkp folders where the actual lookup data and files are stored: /srv/data/ext for EXT_DYNAMIC and /minio-in/ext for MINIO_EXT_IN.
The root folder of EXT_DYNAMIC must be defined in the VFS component as well.
Path variable | Folder | ONE Desktop project relative path |
---|---|---|
|
|
|
|
|
|
|
|
|
Keep in mind that ONE plans for building lookups should still be kept in the project's Git repository and not in MinIO.
Dynamically managed lookup files can be referenced in plans in the following ways:
-
To reference the local version of the lookup files, downloaded from MinIO:
pathvar://EXT_DYNAMIC/lkp/your_lookup_file.lkp
-
To directly reference the MinIO version of lookup files:
pathvar://MINIO_EXT_IN/lkp/your_lookup_file.lkp
Lookup initialization process
To make files stored in MinIO available to the local MDM Server file system, the MDM Server goes over all files located in the source location (a MinIO bucket) and its subfolders and attempts to copy them to the target folder, which is a local directory. We refer to this process as initializing and building lookup files.
At startup, all lookup files used in the solution must be available. If MinIO cannot be accessed or retrieving files from MinIO fails, the MDM Server fails to start.
The only exception is lookup files used solely in workflow-initiated batch job plans.
If a file already exists in the target location, the file is not copied.
To have all files copied regardless of whether they already exist in the target location, configure override-existing-files=true in MDM Server Application Properties > Lookups management.
If you restart the MDM Server from the MDM Web App Admin Center, you also need to run the Synchronize Lookups option manually. Otherwise, lookup files are not initialized.
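The copy behavior described above (skip files that already exist unless overriding is enabled) can be sketched as follows. This is a simplified model, not the server's implementation: a local directory stands in for the MinIO bucket, and the function name and the override_existing_files parameter are hypothetical, named after the override-existing-files property.

```python
import shutil
from pathlib import Path


def initialize_lookups(source: Path, target: Path,
                       override_existing_files: bool = False) -> list[str]:
    """Copy files from source (and its subfolders) to target.

    Existing target files are skipped unless override_existing_files is True.
    Returns the relative paths of the files that were actually copied.
    """
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src.relative_to(source)
        dst = target / relative
        if dst.exists() and not override_existing_files:
            continue  # Default behavior: keep the file already in the target.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(relative.as_posix())
    return copied
```

With the default setting, a stale or corrupted file in the target folder is never replaced, which is why enabling the override can help with recurring corruption issues.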
The source and target locations are defined using these properties:
-
Source: The path is constructed by concatenating the values of the following properties:
-
ataccama.one.mdm.lookup.storage.minio.bucket-in
-
ataccama.one.mdm.lookup.storage.remote-prefix
-
-
Target: The path is specified in the following property:
-
ataccama.one.mdm.lookup.storage.local-directory
-
Lookup files can be synchronized from the remote to the local storage via REST API. See REST API > Lookups management.
Versioned File System configuration
The Versioned File System (VFS) component enables reloading new versions of lookup files without a server restart.
In the component definition, you specify one or more local folders where VFS files (in our case, lookup files) are stored. After you run the Reload Versioned File System (RVSF) task, the VFS picks up any changes made to the lookup files and refreshes them.
This does not happen automatically: you need to rerun the task manually as needed.
Use the vfs_dynamic_reload.ewf workflow in the General MDM project - CDI example as a template for reloading VFS files.
Using multiple VFS folders
You can define multiple folders as a VFS source.
However, keep in mind that the Reload VFS task operates only on lookup files referenced in your plans. Lookup files present in the VFS folders but not used in any plans are skipped.
For example, if you have two VFS folders, lookups and ext_dynamic, with a gender.lkp file in each folder, and a plan referencing the lookups/gender.lkp lookup file, only the lookups/gender.lkp file is reloaded after the VFS is refreshed.
To have the ext_dynamic/gender.lkp file reloaded as well, you would need to reference it in a plan and restart the server.
As folders are scanned recursively for lookup files, we recommend configuring the VFS component to point only to folders used exclusively for lookup files for better performance.
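The selection rule from the example above can be expressed as a set intersection. This sketch only models which files would be reloaded; the function name and the way files and plan references are represented (as relative path strings) are assumptions made for illustration.

```python
def files_to_reload(vfs_files: set[str], referenced_in_plans: set[str]) -> list[str]:
    """Return the VFS files that a reload would refresh.

    Only files that are both present in a VFS folder and referenced
    in at least one plan qualify; everything else is skipped.
    """
    return sorted(vfs_files & referenced_in_plans)


# Two VFS folders, each containing gender.lkp; only one is used in a plan.
vfs_files = {"lookups/gender.lkp", "ext_dynamic/gender.lkp"}
referenced = {"lookups/gender.lkp"}
reloaded = files_to_reload(vfs_files, referenced)
```

Here, reloaded contains only lookups/gender.lkp, matching the behavior described for the two-folder example.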
Dynamic approach best practices
Corrupted or missing lookup files can cause issues during MDM Server startup. To avoid such problems, follow these best practices:
- Long-running operations for building lookups should output their results into a temporary folder.
-
This way, you can copy or move the files once they are built, keeping the MDM Server running smoothly in the meantime.
- Keep a working backup of lookup files on the MDM Server.
-
Having a working, local version of lookup files makes it easier to recover from missing or corrupted lookup files. Use the same workflow as for reloading the VFS (vfs_dynamic_reload.ewf) to copy valid lookup files into a local backup folder.
- Back up lookup files in MinIO.
-
We recommend keeping your lookup files in a MinIO instance with hourly or daily backups.
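The first practice, building into a temporary folder and moving the result only once it is complete, might look like the sketch below. It assumes the staging folder and the final folder are on the same file system, so that os.replace can swap the file in a single step; the function name and the build callback are hypothetical.

```python
import os
from pathlib import Path
from typing import Callable


def build_then_publish(build: Callable[[Path], None],
                       staging_dir: Path, final_path: Path) -> None:
    """Run a (possibly long) build in staging_dir, then move the result.

    The file at final_path is replaced only after the build succeeds,
    so readers never see a partially written lookup file.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / final_path.name
    try:
        build(staged)                    # Write the new file in staging.
        os.replace(staged, final_path)   # Swap it into place in one step.
    finally:
        # Remove leftovers if the build failed before the swap.
        staged.unlink(missing_ok=True)
```

If the build raises an error, the previously published file stays untouched and the partial output is removed from the staging folder.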
Managing lookup files
Retrieve lookup files from MinIO
Before you start, prepare your lookup source data or lookup files and place them in the corresponding MinIO bucket.
Typically, this is minio-in/ext (referenced by the MINIO_EXT_IN path variable).
To download the files from MinIO to the MDM Server, use the minio_ext_to_mdm workflow in the General MDM project - CDI example as a template and edit it as needed.
Run the workflow for reloading the VFS (vfs_dynamic_reload.ewf) to initialize and build the downloaded lookup files.
You can now proceed with running plans.
Upload updated lookup files to MinIO
After you have created a new version of a lookup file in MDM, you can upload it to MinIO using the minio_mdm_to_ext workflow in the General MDM project - CDI example.
Typically, the files are uploaded to the following location:
MinIO upload location |
---|
|
|
Migrating to the dynamic approach
To successfully migrate from the static to the dynamic approach for managing lookup files, follow these steps as a general guideline.
-
Move lookup files or source data from Git to MinIO. Refer to Working with the dynamic approach for details about folder structure and other best practices.
-
Update path variables in all steps using lookups (Lookup Builder, Lookup). Typically, this means replacing EXT path variables with EXT_DYNAMIC.
-
Rebuild lookups (optional but recommended) and reload the Versioned File System (vfs_dynamic_reload.ewf workflow).
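When many plans reference lookups, the path variable replacement in the second step can be scripted. The sketch below rewrites pathvar://EXT/ references to pathvar://EXT_DYNAMIC/ in a piece of text. It is an illustration under the assumption that references appear as plain strings in your plan files; the function name is hypothetical, so review the result in ONE Desktop before committing.

```python
import re


def migrate_references(text: str, old_var: str = "EXT",
                       new_var: str = "EXT_DYNAMIC") -> str:
    """Replace pathvar://<old_var>/ references with pathvar://<new_var>/."""
    # The lookahead requires a "/" right after the variable name, so
    # variables that merely start with old_var (such as EXT_DYNAMIC)
    # are left unchanged.
    pattern = re.compile(r"pathvar://" + re.escape(old_var) + r"(?=/)")
    return pattern.sub(lambda _match: "pathvar://" + new_var, text)
```

Running the function twice is safe: references that already use EXT_DYNAMIC do not match the pattern.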
Detailed example
1. In your Git repository, find the lookup files you want to migrate. For the purpose of this example, our static lookup file is pathvar://EXT/lkp/___email_tld.lkp.
2. Locate the plan for building this lookup file (for example, ___email_tld.plan).
3. Verify that the MDM Server application properties relevant for lookup storage are correctly set, namely the ataccama.one.mdm.lookup.storage.local-directory property. For details, refer to MDM Server Application Properties > Lookups management.
4. Verify that the folder specified in the ataccama.one.mdm.lookup.storage.local-directory property is also correctly defined in the Versioned File System (VFS) component configuration.
5. If you are using a self-managed deployment, configure a connection to S3 or your storage of choice in the server runtime configuration.
Keep in mind that the connection name must match the name used in path variables pointing to MinIO. For example, if your S3 connection is named minioIn, the path variables should be defined as follows: <pathVariable name="MINIO_EXT_IN" value="resource://minioIn/ext"/>.
Skip this step if you are using Ataccama Cloud, as the MinIO connection is preconfigured.
6. Update the path variables accordingly in your Lookup Builder step.
For example, if the previous location was pathvar://DATA/ext/lkp/___email_tld.lkp, the new one would be pathvar://EXT_DYNAMIC/lkp/___email_tld.lkp.
7. Update all other references to the lookup file in the Lookup step of your plans.
8. Prepare a workflow for retrieving lookup files and reloading the VFS. For example, use the vfs_dynamic_reload.ewf workflow in the General MDM project - CDI example as a template.
9. Copy the original lookup file from Git to MinIO.
The MinIO location must be a combination of the values specified in ataccama.one.mdm.lookup.storage.minio.bucket-in and ataccama.one.mdm.lookup.storage.remote-prefix. Typically, this is mdm-in/ext/lkp.
10. Remove the original lookup file from Git and commit the changes.
11. Restart the MDM Server.
If everything is set up correctly, the MDM Server should start without errors about missing lookup files. To double-check, look for a confirmation in the log that the FetchLookups action finished successfully, for example: actionId=mdm-maintenance-task status=FINISH task-name=FetchLookup.
12. To back up the file to a local folder and MinIO and synchronize the file system, rerun the workflow from step 8 (vfs_dynamic_reload.ewf).
HUB_RD_LKP path variable
The HUB_RD_LKP
path variable is used in MDM load and cleansing plans for referencing internal lookup files.
By default, it points to a single folder that serves a dual purpose:
-
A source folder where lookup files are built.
-
An output folder for working with lookups.
Files stored in this folder are managed using the static approach.
If you want to manage lookup files dynamically, use the following two variables instead:
-
HUB_RD_LKP_BUILD (alternative: EXT_DYNAMIC_HUB_RD_LKP_BUILD)
-
HUB_RD_LKP_USAGE (alternative: EXT_DYNAMIC_HUB_RD_LKP_USAGE)
The variables are defined as Referenced Data Dictionaries. Depending on which folder you want to adjust (for build or usage), provide a new path for the variable and update the relevant plans accordingly.
Troubleshooting
The following section covers common issues related to corrupted or missing lookup files and how to resolve them.
In general, you can use the following checklist to verify your setup:
-
You have created the relevant MinIO buckets and they are correctly referenced from relevant path variables and workflows.
-
Path variables for lookups are correctly configured.
-
You have prepared a workflow for reloading the VFS before starting the server.
-
You have configured plans for building new lookup files.
-
Your VFS is correctly configured, as described in Versioned File System configuration.
Corrupted or missing MinIO lookup files on server startup
The following issues might occur during MDM Server startup.
- Problem
-
A lookup file is missing and the MDM Server logs contain a similar error message:
[ERROR] File 'pathvar://DATA/ext/lkp/___email_tld.lkp' must exist.(Validate Email 3)
- Solution
-
Try following these steps:
-
Prepare a valid lookup file from a backup, another environment, or a local lookup file (empty or populated). You can also recreate the file manually in ONE Desktop.
If you are using the static approach, check older Git commits, too.
If you are using Git, commit the corrected file to the repository.
If you are using MinIO, place the new lookup file inside the path defined by the MINIO_EXT_IN path variable (typically minio-in/ext) in MinIO storage.
-
Restart the MDM Server and check the logs again.
When the dynamic approach is used, such issues can occur if the file is not downloaded before starting the MDM Server (see Lookup initialization process). This is particularly important when adding a new plan with a Lookup step that reads lookup files stored in MinIO.
-
- Problem
-
A lookup file is corrupted and the MDM Server logs contain a similar error message:
[ERROR] Lookup file pathvar://DATA/ext/lkp/___email_tld.lkp has invalid format.(Validate Email 3)
- Solution
-
The following options are available:
-
Manually delete the problematic lookup file from the MDM versioned file system and replace it as described in the solution for a missing lookup file.
-
Run Synchronize Lookups in the MDM Web App Admin Center.
-
If it’s a recurrent problem, turn on overwriting existing files during the lookup initialization (see Lookup initialization process).
-
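To find affected files quickly, you can scan the server log for the two startup errors shown above. The following sketch assumes the messages match the examples in this section; since actual messages might differ slightly between versions, treat the patterns and the function name as assumptions to adapt.

```python
import re

# Patterns modeled on the example error messages in this section.
MISSING = re.compile(r"File '(?P<path>[^']+)' must exist")
CORRUPTED = re.compile(r"Lookup file (?P<path>\S+) has invalid format")


def classify_lookup_errors(log_lines):
    """Map each affected lookup path to 'missing' or 'corrupted'."""
    problems = {}
    for line in log_lines:
        match = MISSING.search(line)
        if match:
            problems[match.group("path")] = "missing"
            continue
        match = CORRUPTED.search(line)
        if match:
            problems[match.group("path")] = "corrupted"
    return problems
```

The result tells you which solution to follow for each file: replace a missing file, or delete and replace a corrupted one.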
Corrupted lookup files during runtime
- Problem
-
The Reload VFS task fails with the FINISHED_FAILURE state or errors in the log.
In this case, the MDM Server continues using the original (non-corrupted) lookup file until the server is restarted. If no action is taken, this might lead to the corrupted lookup file issue after the next restart.
- Solution
-
We recommend fixing the issue as soon as possible using one of the following approaches:
-
If you have a backup version of the lookup file, restore it to the working folder (from pathvar://EXT_DYNAMIC/lkp_backup to pathvar://EXT_DYNAMIC/lkp).
To easily restore a local backup, use the predefined workflow lookups_load_local_backup.ewf in the General MDM project - CDI example as a template.
If your backup is stored in MinIO, copy the file to the appropriate location in MinIO (as defined by the path variable MINIO_EXT_IN, by default minio-in/ext) and run the minio_ext_to_mdm.ewf workflow from the CDI example to retrieve the file. Then reload the VFS again.
This approach also works in case you need to build a new lookup file from scratch using valid source data.
-
Build a new lookup using the related plan and reload the VFS again. While simpler in theory, this approach can be complicated in practice, as the same lookup file might be used in many different plans or you might need to run the MDM Server as soon as possible.
-