Melissa Integration
Melissa is an external address standardization and verification service provider. The vendor offers online API and batch access to the system with a pay-per-request business model. Integration with Melissa can be used for Data Verification and Data Enrichment. A detailed description of the system interface is available here.
MDM supports several options for utilizing the standardized address values from Melissa. Two approaches are described below: Asynchronous and Synchronous.
Asynchronous
The first approach leverages the Ataccama proprietary binary Lookup format. There are two lookups: one that stores all the original address attributes (in the form of a hash) and a reference to the Melissa record PK, and another that stores the Melissa PK and the cleansed, enriched values. The lookups can be used during data processing in MDM (online, batch, stream) or even to build custom address suggestions or a Data Quality Firewall.
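For illustration, the following Python sketch shows how the two lookups chain together during cleansing. Plain dictionaries stand in for the binary .lkp files; all names are hypothetical and not Ataccama APIs.

# Illustrative sketch of the two-lookup chain; dicts stand in for .lkp files.
from typing import Optional

# melissa_source_address.lkp: address hash -> Melissa Address Key (MAK)
source_lookup: dict[str, str] = {}
# melissa_clean_address.lkp: MAK -> cleansed, enriched address attributes
clean_lookup: dict[str, dict] = {}
# Misses collected here are later sent to Melissa by the rebuild process.
missed: list[dict] = []

def cleanse(address: dict, address_hash: str) -> Optional[dict]:
    """Return the cached clean address, or None on a cache miss."""
    mak = source_lookup.get(address_hash)
    if mak is None:
        missed.append(address)  # defer to the asynchronous rebuild
        return None
    return clean_lookup.get(mak)

Splitting the cache into two lookups means many different source spellings can point to one clean record, and the clean data is stored only once per MAK.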
The Address Lookup Cache needs to be rebuilt asynchronously in order to update the missed records (new addresses that were not found in the lookup cache). The rebuild process can either call the Melissa API synchronously (via REST) or use a batch interface (potentially via CSV file) to get the missed records and update the lookup cache.
Rebuilding should ideally be carried out on a daily basis.
To apply the cleansed values to the missed records, a partial reprocess operation is needed.
Below is a sample schema showing the standard operations and the rebuilding process.
Advantages
- The MDM system is decoupled from the Melissa system; loose dependency.
- The Address Lookup Cache can be further utilized to expose a custom service (e.g. a client-specific database of “known” addresses).
- Good performance (throughput) of the address cleansing solution.
- The same address can be reused without the need to call Melissa (cost saving, data reusability).
- The asynchronous nature of this pattern allows communication with Melissa to be shifted to off-peak hours (e.g. overnight synchronization).
Disadvantages
- The MDM system needs to reprocess “missed” records.
- Missed records represent temporarily unverified data of potentially low quality, yet these records are available for consumption by downstream systems and used in their processes.
- Higher design and orchestration complexity compared to alternative integration options.
Detailed integration design
The following is an example design description based on the Asynchronous approach. The solution uses several workflows, components, files, and database tables that roughly map to the activity diagram of the proposed approach.
Each component below is annotated with the steps in the activity diagram to which it relates.
=== Workflows
==== Rebuild all reference data
This is a wrapper workflow that allows flexibility for potential future orchestration of other MDM reference data.
The logic here triggers refreshing of a given logical group or set of lookups defined in the solution; e.g. hub_reference_data.comp can be considered one group of lookups (core validation and translation lookups in MDM) and the Melissa lookup cache another.
The workflow directly (synchronously) triggers the Rebuild Melissa Lookup Cache workflow.
==== Rebuild Melissa lookup cache
This is the main workflow responsible for orchestrating the background rebuild of the Melissa Lookup Cache for the missed records. Only one running instance is allowed at a time to avoid conflicts in data processing. The workflow always clears the table with missed records before it calls the delta export, in order to work with a clean (and smaller) data set.
At the end it directly triggers the Reprocess Melissa Missed Records workflow (asynchronously), as sketched below.
Relates to Step: Reprocess Missed Records
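A minimal Python sketch of this orchestration follows; the step functions and the workflow name are hypothetical stand-ins for the actual MDM workflow tasks.

# Sketch of the "Rebuild Melissa lookup cache" orchestration (illustrative).
import threading

_rebuild_lock = threading.Lock()  # enforce a single running instance

def rebuild_melissa_lookup_cache() -> None:
    if not _rebuild_lock.acquire(blocking=False):
        return  # another rebuild is already running; skip this run
    try:
        truncate_missed_table()      # clear MCACHE_MISSED for a clean data set
        export_changed_addresses()   # delta export of changed instance records
        update_lookup_cache()        # call Melissa, fill MCACHE_SOURCE/CLEAN
        build_lookup_files()         # rebuild the two .lkp files and swap them
    finally:
        _rebuild_lock.release()
    trigger_async("reprocess_melissa_missed_records")  # fire-and-forget

# Hypothetical stand-ins for the individual workflow tasks.
def truncate_missed_table() -> None: ...
def export_changed_addresses() -> None: ...
def update_lookup_cache() -> None: ...
def build_lookup_files() -> None: ...
def trigger_async(workflow: str) -> None: ...

The non-blocking lock acquisition mirrors the "only one running instance" rule: a second trigger simply returns instead of queuing up behind the first.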
=== Components
==== Melissa lookup address delta
The component reads all instance Address records changed since the last export (using the delta export operation). It calculates an address hash column that serves as an identifier for a given address, including its state of completeness. The hash algorithm, e.g. SHA-512, should ideally ignore minor, non-relevant data characteristics such as letter case, repeated spaces, and accented letters, to avoid repeated calls for a semantically matching address.
coding.HMacSHA512("hash",
    squeezeSpaces(
        upper(
            removeAccents(
                src_address_line_1 + '☃' +
                src_city + '☃' +
                src_zip
            )
        )
    )
)
Relates to Step: Select Missed Records
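For reference, the following Python sketch mirrors the same normalization and hashing steps, assuming the first argument of the expression above acts as the HMAC key; everything else here is illustrative.

# Python equivalent of the normalization + hash expression (illustrative).
import hashlib
import hmac
import re
import unicodedata

def address_hash(line1: str, city: str, zip_code: str) -> str:
    raw = "\u2603".join((line1, city, zip_code))  # '☃' as field delimiter
    # Strip accents, upper-case, and squeeze repeated whitespace so that
    # semantically equal addresses produce the same hash.
    no_accents = "".join(
        c for c in unicodedata.normalize("NFKD", raw)
        if not unicodedata.combining(c)
    )
    normalized = re.sub(r"\s+", " ", no_accents.upper()).strip()
    return hmac.new(b"hash", normalized.encode("utf-8"),
                    hashlib.sha512).hexdigest()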
==== Melissa lookup updater
The component reads the prepared missed records from the MCACHE_MISSED table and builds a JSON request using the SRC columns and the API mapping. It then calls the Melissa address validation API.
For identified addresses, the component parses the Melissa Address Key (MAK, see the Address Object Reference Guide) and uses it to check for new vs. existing records against the melissa_clean_address.lkp file.
The component finally updates the MCACHE_SOURCE records by their address hash and populates the MAK attribute for further processing. It also populates the MCACHE_CLEAN table with records that are recognized as new additions to the cache.
Relates to Step: Call API for each Missed Record, Process Response & Prepare Data
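A hedged Python sketch of this loop follows. The endpoint URL, payload fields, and response shape are placeholders and not the actual Melissa API contract (see the Address Object Reference Guide for the real mapping); the db handle is likewise hypothetical.

# Illustrative updater loop; endpoint and field names are placeholders.
import requests

MELISSA_URL = "https://example.invalid/melissa/address/verify"  # placeholder

def update_cache(missed_records, known_maks, db):
    for rec in missed_records:           # rows from MCACHE_MISSED
        payload = {                      # SRC columns -> API mapping (assumed)
            "AddressLine1": rec["src_address_line_1"],
            "City": rec["src_city"],
            "PostalCode": rec["src_zip"],
        }
        resp = requests.post(MELISSA_URL, json=payload, timeout=30).json()
        mak = resp.get("MAK")
        if not mak:
            continue                     # address was not identified
        # Link the source hash to the MAK in MCACHE_SOURCE.
        db.upsert("MCACHE_SOURCE", hash=rec["address_hash"], mak=mak)
        # Only genuinely new MAKs are added to MCACHE_CLEAN.
        if mak not in known_maks:
            db.insert("MCACHE_CLEAN", mak=mak, **resp["cleansed"])
            known_maks.add(mak)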
==== Build Melissa lookup
The component reads lookup source data from the MCACHE_SOURCE and MCACHE_CLEAN tables and fully rebuilds the target lookup files, melissa_clean_address.lkp and melissa_source_address.lkp.
The lookups need to be rebuilt in a different location (using Path Variable redirection) than the monitored folder under the Version File System (VFS).
It is up to the Rebuild Melissa Lookup Cache workflow to properly move the complete lookups to the correct location and refresh the VFS, as sketched below.
Relates to Step: Rebuild Lookup Cache File
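The following Python sketch illustrates the build-and-swap idea under assumed paths; build_lookup_file is a hypothetical callback standing in for the actual lookup-building component.

# Illustrative build-and-swap of the lookup files; paths are assumptions.
import shutil
from pathlib import Path

STAGING = Path("/tmp/melissa_lkp_build")  # outside the monitored VFS folder
TARGET = Path("Files/data/ext/lkp")       # monitored MDM lookup folder

def publish_lookups(build_lookup_file) -> None:
    STAGING.mkdir(parents=True, exist_ok=True)
    for name in ("melissa_clean_address.lkp", "melissa_source_address.lkp"):
        build_lookup_file(STAGING / name)  # full rebuild in the staging area
        # Move the complete file so readers never see a half-written lookup;
        # the workflow then refreshes the VFS.
        shutil.move(str(STAGING / name), str(TARGET / name))

Building in a staging location and moving the finished files is what keeps the monitored folder from ever exposing a partially written lookup.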
=== Tables
The Melissa lookup cache uses the following tables as the data store for the lookup building process.
MCACHE_MISSED
The structure of the table mostly corresponds to the instance Address entity in the MDM hub model: all the SRC columns should be present; additionally, it contains an address hash column and a column for the Melissa Address Key identifier. The respective data types are an implementation detail driven by the specific MDM model and the Melissa reference documentation.
The table is periodically truncated at the beginning of each rebuild process to work with current data.
MCACHE_SOURCE
The table holds the address hash column and a corresponding reference in the MAK column. The table is never truncated and contains unique hash values. Optionally, the table can include additional audit columns such as the created date or a list of Melissa cleansing result codes.
MCACHE_CLEAN
The table holds the validated, clean address as received from the Melissa API call. It contains all the cleansed data attributes and the MAK identifier. The table is never truncated and contains unique records keyed by the MAK identifier.
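To make the table structures concrete, here is a hypothetical sqlite3 sketch; the column names and types are illustrative only, since the real definitions follow the specific MDM model and the Melissa reference documentation.

# Hypothetical DDL for the three cache tables (illustrative types only).
import sqlite3

DDL = """
CREATE TABLE MCACHE_MISSED (          -- truncated on each rebuild
    src_address_line_1 TEXT,
    src_city           TEXT,
    src_zip            TEXT,
    address_hash       TEXT,
    mak                TEXT           -- filled in by the updater
);
CREATE TABLE MCACHE_SOURCE (          -- never truncated
    address_hash TEXT PRIMARY KEY,    -- unique hash values
    mak          TEXT,
    created_date TEXT                 -- optional audit column
);
CREATE TABLE MCACHE_CLEAN (           -- never truncated
    mak          TEXT PRIMARY KEY,    -- unique per MAK
    clean_line_1 TEXT,
    clean_city   TEXT,
    clean_zip    TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)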
=== Files
The following Ataccama binary lookup files are produced for fast checks against the cache.
Melissa Clean Address Lookup (melissa_clean_address.lkp)
The lookup structure corresponds to the Melissa Address Object, excluding the cleansing result codes. It contains clean, validated addresses and is stored in the MDM lookup folder (typically Files/data/ext/lkp). All references to the lookup file should go through path variables, either EXT or LKP.
Melissa Source Address Lookup (melissa_source_address.lkp)
The lookup structure is very simple, containing only the two identifiers: the address hash and the MAK. It is also stored in the MDM lookup folder (typically Files/data/ext/lkp). All references to the lookup file should go through path variables, either EXT or LKP.
Synchronous
In this scenario, the MDM system calls Melissa’s API directly whenever a record is being cleansed, without any local record caching. The response is processed immediately, and the attributes are mapped to the record on the fly, as sketched below.
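A minimal Python sketch of the synchronous pattern follows, with a placeholder endpoint and illustrative field names rather than the real Melissa API contract.

# One Melissa call per record, mapped back on the fly (illustrative).
import requests

MELISSA_URL = "https://example.invalid/melissa/address/verify"  # placeholder

def cleanse_record(record: dict) -> dict:
    resp = requests.post(MELISSA_URL, json={
        "AddressLine1": record["src_address_line_1"],
        "City": record["src_city"],
        "PostalCode": record["src_zip"],
    }, timeout=10).json()
    # Map the cleansed attributes straight back onto the record.
    record.update({
        "cln_address_line_1": resp.get("AddressLine1"),
        "cln_city": resp.get("City"),
        "cln_zip": resp.get("PostalCode"),
        "mak": resp.get("MAK"),
    })
    return record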