AI Matching Overview
The AI Matching combines user input with AI capabilities to improve results of rule-based record matching, leading to quicker and more efficient elimination of duplicate records in the data catalog. User feedback is provided during model training and when resolving matching proposals.
The AI Matching is used for record matching.
The AI model compares record matchings that were done in ONE MDM with AI-based matchings.
These results are then used to help improve existing matching results in the form of MERGE and SPLIT proposals for records in ONE MDM.
In the current version, the feature is experimental. As a result, it might not scale appropriately to large numbers of records.
For more information about how to configure the AI Matching feature, see AI Matching Configuration.
How Ataccama AI Matching works
The AI Matching feature supports data stewards in their decisions by providing automatically generated proposals for additional matches or splits on top of selected master data domains. These proposals describe the potential changes in matched groups and are resolved as individual tasks available to assigned users or user groups.
On a high level, the process can be split into two phases: training and evaluation. The underlying machine learning (ML) model is trained using all previous user decisions (that is, manual merges and splits done by data stewards) but it can also be trained actively via a dedicated user interface where selected pairs of records are presented to the user who confirms if those pairs should be matched or not.
The training uses Chvatal’s greedy set-cover algorithm and regularized logistic regression with a disagreement-based active learning approach.
Once the ML model gathers enough information (that is, a sufficient number of user decisions or pairs from active training), it can be evaluated to provide proposals for additional matches or splits. As new information comes in, including resolutions of previously generated proposals, the training of the model continues so that it provides more precise suggestions based on the actual data and related actions of data stewards.
The evaluation also uses Chvatal’s greedy set-cover algorithm and regularized logistic regression together with hierarchical clustering with centroid linkage.
Manager and Worker Microservices Overview
The Manager microservice has a WSGI server for reporting its liveness and readiness, a gRPC client for fetching data from ONE MDM, as well as a gRPC server for processing all commands.
The Manager can process multiple matching instances at once and each matching instance is identified by its matching identifier, which consists of two strings: entity name and layer name.
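As a rough illustration of how several matching instances could be told apart by their identifier, consider the following minimal Python sketch. The `MatchingId` class, the `instances` registry, and the entity, layer, and status values are all hypothetical and are not taken from the actual Manager implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchingId:
    """Hypothetical matching identifier: an entity name plus a layer name."""
    entity_name: str
    layer_name: str


# Hypothetical registry of matching instances keyed by their identifier.
# A frozen dataclass is hashable, so it can be used directly as a dict key.
instances = {
    MatchingId("party", "masters"): "TRAINING",
    MatchingId("address", "masters"): "INITIALIZED",
}

# Two identifiers with the same entity and layer refer to the same instance.
print(instances[MatchingId("party", "masters")])
```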
Similarly to other ONE microservices, the Manager reports that it is ready after all of its internal dependencies have started and are available for processing. In case the gRPC server sends a request while the microservice is still waiting on a dependency, an exception is reported using the gRPC channel.
The Worker microservice has a WSGI server for reporting its liveness and readiness, and a gRPC client for fetching data from ONE MDM. The Manager uses the Worker to initialize the AI model, compute the results, and generate proposals or extraction rules.
Sizing Guidelines
Due to the nature of implemented AI algorithms, the resource consumption of AI Matching depends on the volume and complexity of data that is processed. The more complicated the matching process is, the more resources are required.
The Worker microservice can use multiple threads or processes to improve performance. These options are provided through several machine learning libraries, namely OpenMP and BLAS, that can each be individually configured. The number of threads can be set to one of the following values:
- null: No value is provided. In this case, machine learning libraries use their default settings.
- 0: All physical CPU cores are used, without hyper-threads.
- n: The n number of CPU cores is used. Hyper-threading is applied as needed.
- -n: In this case, the number of CPU cores used is reduced by the value of n. For example, if your machine has 6 CPUs, setting this option to -2 means that 4 CPU cores are employed.
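The meaning of the four kinds of values can be sketched in Python as follows. This is only an illustration of the rules listed above: the function name `resolve_core_count` and its parameters are hypothetical, and clamping the result to at least one core is an assumption that the documentation does not state.

```python
from typing import Optional


def resolve_core_count(setting: Optional[int],
                       total_cores: int,
                       physical_cores: int) -> Optional[int]:
    """Translate a parallelism setting into a number of CPU cores.

    Returns None when the libraries should keep their default settings.
    """
    if setting is None:
        # null: no value provided, the libraries use their defaults
        return None
    if setting == 0:
        # 0: all physical cores, without hyper-threads
        return physical_cores
    if setting > 0:
        # n: exactly n cores
        return setting
    # -n: all cores reduced by n (never fewer than one; this clamp is an assumption)
    return max(1, total_cores + setting)


# The example from the text: 6 CPUs and a setting of -2 give 4 cores.
print(resolve_core_count(-2, total_cores=6, physical_cores=6))
```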
For some of the libraries used, it is possible to specify the preferred number of parallel calculations using the property ataccama.one.apyc.parallelism.jobs.
In addition, the libraries use parallelism in low-level operations as well, which can be controlled through the property ataccama.one.apyc.parallelism.omp.
In cases where more specific tuning is necessary, the following options are available for lower-level parallel processing:
- ataccama.one.apyc.parallelism.blas: Has a higher priority than OpenMP but can be ignored by some libraries depending on the compilation options of the OpenBLAS dependency.
- ataccama.one.apyc.parallelism.threads: Has a higher priority than OpenMP and BLAS but comes with a higher overhead than these two since it uses the dynamic API.
We strongly caution against adjusting the values of The same applies to other parallelism properties. Ataccama’s performance tests clearly show an added benefit of using parallel processing; however, as it consumes more resources, the option is disabled by default. Therefore, enable this option only after carefully considering users' needs.
Record Matching
As mentioned in the introduction, AI Matching is used for record matching. For record matching, the AI Matching creates a closed-box model that computes possible matchings for input data, which are then compared to the matchings done based on manually configured rules in ONE MDM. The disagreement between the two computation types is presented as follows:
- MERGE suggestions: The AI Matching would merge two records that are not merged in ONE MDM.
- SPLIT suggestions: The AI Matching would not merge two records that are merged in ONE MDM.
Understanding the Matching Process
The matching process is done in stages, each of which requires user interaction before the process can proceed to the next one.
On the other hand, the Manager and Worker microservices work in phases. Some phases, such as initialization, training, and generating proposals, coincide with the stages and require user input. Others, such as those described in the Computing Results stage, are automatically triggered once the previous one finishes and are always completed together.
- Initializing the AI Model
In this stage, the data is prepared for further training and computation. This includes selecting which attributes are taken into account when matching records. The names of these attributes are prefixed with mat_, although users can choose which attributes are used themselves.
In the current version, all data types are treated as strings and share the same matching metrics, which could affect the model accuracy depending on the actual data type.
The Worker then retrieves a sample of records from ONE MDM. By default, the sample size is set to 1,000,000 records, although it can be configured (see AI Matching Configuration). The sample is used to initialize the underlying active learner algorithm that preselects appropriate blocking rules (see Computing Results, section Blocking Records). If there are any previously evaluated training pairs, they are loaded into the model as well.
- Training the AI Model
The underlying AI model is trained using previous user decisions and sets of labeled record pairs that are chosen by the user or suggested by the active learner algorithm. The algorithm selects pairs that would provide it with the most information. If any filters were previously used on master layers, only the records corresponding to the filter are proposed.
Each training pair is a pair of potentially matching records for which users need to decide whether the records match and should therefore be merged, or not, in which case they should remain separate. As a minimum, users are required to provide three positive and three negative responses during model training, although resolving a higher number of training pairs improves the accuracy of the model. This information is then used to train the AI Matching model.
The feature uses active learning to interactively request labeled data points, as opposed to passive learning, where the labeled data is received at once. Once users give their expert feedback on each pair, the model uses it not only to fine-tune its matching suggestions but also to select which pair should be labeled next. The latter is done based on the amount of information the model expects to obtain from a pair: the more information a pair can produce, the higher its labeling priority.
The training consists of the following steps:
- Preselected blocking rules are applied to all records in the sample and training blocked groups are formed.
- For each pair of records in each blocked group, the model calculates a numeric representation for each attribute provided in the initialization stage. This matrix of features for all record pairs is fed to the model for training.
- The model outputs classification results for each pair of records within each blocked group.
- The results provided by the matching model are compared to those obtained by applying blocking rules.
- In cases where the matching model and the blocking rules disagree on whether the records match, user feedback is requested.
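The second step, turning each record pair into one number per attribute, can be sketched as follows. The similarity measure used here (`difflib.SequenceMatcher.ratio` from the Python standard library) is only a stand-in, since the actual matching metrics are not described on this page. The function names and the sample records are hypothetical. All values are compared as strings, in line with the note in the initialization stage.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_features(record_a, record_b, attributes):
    """Return one similarity score in [0, 1] per attribute for a record pair."""
    features = []
    for name in attributes:
        a = str(record_a.get(name, ""))  # every value is treated as a string
        b = str(record_b.get(name, ""))
        features.append(SequenceMatcher(None, a, b).ratio())
    return features


def feature_matrix(blocked_group, attributes):
    """Build one feature row for every pair of records in a blocked group."""
    return [pair_features(a, b, attributes)
            for a, b in combinations(blocked_group, 2)]


# Hypothetical blocked group of three records with two matching attributes.
group = [
    {"mat_name": "John Smith", "mat_city": "Prague"},
    {"mat_name": "Jon Smith", "mat_city": "Prague"},
    {"mat_name": "Jane Doe", "mat_city": "Boston"},
]
for row in feature_matrix(group, ["mat_name", "mat_city"]):
    print(row)
```

Three records produce three pairs, so the matrix has three rows with one column per attribute.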
- Model Quality Estimation
The goal of this stage is to evaluate the current model and predict how useful the final results are likely to be after the full computation runs in the next steps.
The model quality estimation stage is triggered by any of the following:
- Selecting Finish training in the AI training window.
- Selecting Update when the quality indicator in the AI Matching overview shows Outdated.
- Model evaluation, of which the estimation is a part.
Model quality is estimated in the following steps:
- If there are fewer than 3 labeled pairs in either category (match and non-match), the quality is estimated as Insufficient.
- A confusion matrix is determined using cross-validation on the model training set.
- If there are any false positives, the model quality is estimated as Low.
- If there are no false positives, the likelihood of the model sensitivity is estimated based on the confusion matrix observed in the cross-validation.
- The lower bound of the real sensitivity is estimated from the 90% confidence interval of the sensitivity likelihood.
- The estimated lower bound of the real sensitivity is then translated to a Low, Medium, or High model quality:
  - Low: < 50%
  - Medium: 50-80%
  - High: > 80%
Once the evaluation is completed, the model quality is indicated in the user interface. The quality can be improved by evaluating more training pairs. The process can be repeated until the model is sufficiently accurate.
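The estimation steps can be sketched in Python as follows. The page does not say how the confidence interval is computed, so this sketch substitutes the Wilson score interval. That choice, the constant `Z_90` (a two-sided 90% interval), and the function names are assumptions made for illustration only. The confusion-matrix counts are expected to come from cross-validation on the training set.

```python
import math

Z_90 = 1.645  # assumption: z-value of a two-sided 90% confidence interval


def wilson_lower_bound(successes, trials, z=Z_90):
    """Lower end of the Wilson score interval for a proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / (1 + z * z / trials)


def estimate_quality(tp, fp, tn, fn):
    """Map a cross-validated confusion matrix to a model quality label."""
    # Labeled matches are tp + fn, labeled non-matches are tn + fp.
    if tp + fn < 3 or tn + fp < 3:
        return "Insufficient"
    if fp > 0:
        return "Low"
    # Sensitivity is tp / (tp + fn); use its estimated lower bound.
    lower = wilson_lower_bound(tp, tp + fn)
    if lower < 0.5:
        return "Low"
    if lower <= 0.8:
        return "Medium"
    return "High"


print(estimate_quality(tp=3, fp=0, tn=3, fn=0))
print(estimate_quality(tp=50, fp=0, tn=50, fn=0))
```

Under these assumptions, a model that classifies only three labeled matches correctly cannot reach High quality, because the interval around its sensitivity is still wide. This is consistent with the statement that evaluating more training pairs improves the estimated quality.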
- Computing Results
In this stage, the trained model is used to evaluate all data retrieved from ONE MDM. The computation is done asynchronously as it might take hours or even days depending on the number of records. In the meantime, the status of computation is indicated in the ONE MDM web application.
Once a computation has been scheduled or is in progress, it can be canceled, for example, if changes were made to the data in ONE MDM or to the training pairs used in the training process. The Worker microservice progresses through the following phases:
- Model Quality Estimation
Model quality is re-estimated based on a set of training pairs valid at the time the model evaluation was scheduled. If the quality is insufficient, evaluation does not continue, and further model training is required.
- Fetching Records
All available data is fetched from ONE MDM and subsequently used for blocking and matching.
- Blocking Records
The data retrieved from ONE MDM is blocked based on a set of blocking rules. These rules might differ from the ones preselected for training as they need to cover the whole data set instead of a sample.
If every record were compared with every other record, the process would quickly become impractical in terms of resource consumption and computation time. For this reason, records are first blocked into groups based on the similarity of their traits using blocking rules, which are analogous to ONE MDM key rules, and the matching is then done only within these blocks. This means that two records that are not blocked together by at least one blocking rule are not mutually compared in the following steps and are therefore considered distinct.
The AI model uses its own set of blocking rules that might not correspond to the ones stored and used in ONE MDM.
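To see why blocking matters, it helps to count the comparisons involved. The following Python sketch compares the number of pairs in an unblocked data set with the number of pairs inside blocks. The block sizes are made up for illustration, and the blocks are assumed to be disjoint, which real blocking rules do not guarantee.

```python
from math import comb


def all_pairs(record_count):
    """Number of comparisons when every record is compared with every other."""
    return comb(record_count, 2)


def blocked_pairs(block_sizes):
    """Number of comparisons when records are compared only within their block.

    Assumes disjoint blocks; overlapping blocks would share some pairs.
    """
    return sum(comb(size, 2) for size in block_sizes)


# One million records, either unblocked or split into 1,000 blocks of 1,000.
print(all_pairs(1_000_000))          # 499999500000
print(blocked_pairs([1_000] * 1_000))  # 499500000
```

In this made-up scenario, blocking reduces the number of compared pairs roughly a thousandfold.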
Overview of Blocking Rules
As explained previously, blocking rules help determine which records are selected for further matching. Two records are considered to belong to the same block if at least one blocking rule groups them together. Each blocking rule consists of one or two predicates connected by AND, meaning that both conditions need to be fulfilled in order for the blocking rule to apply. A predicate resolves to true if the sets of values that it computed for two records share at least one item.
As a preparation step before blocking records, records are stripped of punctuation and all whitespace characters are replaced by a single space character, which helps identify words. The following table displays some of the commonly used predicates:
| Predicate | Output |
| --- | --- |
| wholeFieldPredicate | The whole record field |
| alphaNumericPredicate | A set of consecutive strings of alphanumeric characters |
| tokenFieldPredicate | A set of all words in the field |
| firstTokenPredicate | The first word in the field |
| commonIntegerPredicate | A set of consecutive integers |
| firstIntegerPredicate | The first integer in the field |
| nearIntegersPredicate | A set of integers augmented by N-1 and N+1 |
| hundredIntegersOddPredicate | A set of integers rounded to the nearest hundred, that is, last two digits |
| hundredIntegerPredicate | A set of integers rounded to the nearest hundred, that is, last two digits |
| commonTwoTokens | A set of pairs of subsequent words |
| commonThreeTokens | A set of triplets of subsequent words |
| fingerprint | A string of all characters without whitespaces, sorted in alphabetical order |
| oneGramFingerprint | A set of all characters without whitespaces, sorted in alphabetical order |
| twoGramFingerprint | A set of all bigrams, that is, subsequent pairs of characters, without whitespaces, sorted in alphabetical order |
| commonFourGram | A set of all 4-grams, that is, substrings of 4 characters, without whitespaces |
| commonSixGram | A set of all 6-grams, that is, substrings of 6 characters, without whitespaces |
| sameThreeCharStartPredicate | The first three characters without whitespaces |
| sameFiveCharStartPredicate | The first five characters without whitespaces |
| sameSevenCharStartPredicate | The first seven characters without whitespaces |
| suffixArray | A set of suffixes of N-4 characters or shorter, where N is the length of the field |
| sortedAcronym | A string containing the sorted first letters of each word |
| doubleMetaphone | A string encoding the field based on its pronunciation |
| metaphoneToken | A set of pronunciation tokens for each word in the field |
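The way predicates and blocking rules combine can be sketched in Python as follows. The two predicate functions are simplified re-implementations written for this example, based only on the descriptions in the table. All function names, the rule representation, and the sample records are hypothetical and do not reflect the actual AI Matching code.

```python
import re
import string


def normalize(value):
    """Strip punctuation and collapse whitespace runs into single spaces."""
    no_punctuation = value.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punctuation).strip()


def token_field(value):
    """Set of all words in the field (compare tokenFieldPredicate)."""
    return set(normalize(value).split())


def same_three_char_start(value):
    """First three characters without whitespace (compare sameThreeCharStartPredicate)."""
    compact = normalize(value).replace(" ", "")
    return {compact[:3]} if compact else set()


def predicate_is_true(predicate, value_a, value_b):
    """A predicate resolves to true if both value sets share at least one item."""
    return bool(predicate(value_a) & predicate(value_b))


def rule_applies(rule, record_a, record_b):
    """A rule is a list of (predicate, field) parts connected by AND."""
    return all(predicate_is_true(p, record_a[f], record_b[f]) for p, f in rule)


def same_block(rules, record_a, record_b):
    """Two records share a block if at least one rule groups them together."""
    return any(rule_applies(rule, record_a, record_b) for rule in rules)


rules = [[(token_field, "name"), (same_three_char_start, "city")]]
a = {"name": "Acme Corp.", "city": "New  York"}
b = {"name": "Acme Holdings", "city": "Newark"}
c = {"name": "Zenith Ltd", "city": "Boston"}
print(same_block(rules, a, b), same_block(rules, a, c))
```

Records `a` and `b` share the word "Acme" and the city prefix "New", so the rule groups them together. Record `c` shares nothing with `a` and is never compared with it.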
- Scoring Pairs
The model is applied to pairs of records in each blocked group. The goal is to calculate a matching probability for each record pair, which is then used to represent the level of confidence for that pair. The probability, provided by the classification model, that a proposal is a MERGE or a SPLIT is called the confidence score.
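Since the model is based on logistic regression, the scoring of a single pair can be illustrated with the logistic function applied to that pair's feature vector. This is a minimal sketch, not the actual model: the function name, the weights, and the bias below are invented values standing in for parameters that a trained model would supply.

```python
import math


def match_probability(features, weights, bias):
    """Logistic regression score: probability that a record pair matches."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable form of 1 / (1 + exp(-z))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical weights for two attributes (for example, name and city similarity).
weights = [4.0, 2.0]
bias = -3.0

similar_pair = [0.95, 1.0]
different_pair = [0.10, 0.0]
print(match_probability(similar_pair, weights, bias))
print(match_probability(different_pair, weights, bias))
```

With these made-up parameters, the similar pair scores well above 0.5 and the different pair well below it.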
- Clustering Records
Using the matching probabilities of record pairs from the previous phase, similar records are clustered so that they form groups of records that are considered to be duplicates of the same record. This information forms the basis for the suggestions in the following stage.
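The idea of turning pairwise probabilities into groups of duplicates can be illustrated with a deliberately simplified sketch. Note that this is not the hierarchical clustering with centroid linkage that the feature uses. It is a simpler stand-in that links every pair whose probability reaches a threshold and returns the resulting connected groups. The function name, the threshold of 0.5, and the sample scores are assumptions.

```python
def cluster_records(record_ids, scored_pairs, threshold=0.5):
    """Group records connected by pairs scoring at or above the threshold."""
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), probability in scored_pairs.items():
        if probability >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for record_id in record_ids:
        clusters.setdefault(find(record_id), set()).add(record_id)
    return list(clusters.values())


scores = {(1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.1}
print(cluster_records([1, 2, 3, 4], scores))
```

Records 1, 2, and 3 end up in one cluster because they are linked through high-scoring pairs, while record 4 stays on its own.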
- Generating Proposals
To generate suggestions, the clusters with the highest confidence score from the Worker are compared with the groups from ONE MDM. The MDM groups are selected based on manually created rules. When there is a disagreement between an AI Matching cluster and an MDM group, one of the following proposal types is created:
- MERGE proposals: If a pair of records is matched in the Worker and not matched in ONE MDM, the pair of records should be merged into one record.
- SPLIT proposals: If a pair that is matched in ONE MDM is not matched in the Worker, the two records should remain separate.
The suggestions made by the AI Matching can then be reviewed by users and either accepted and applied, thus overriding the current state of the record pair in ONE MDM, or rejected, which means that the current state of the record pair is preserved. The correct information can be fed back into the AI model so that it learns from it and adapts its predictions to the actual ground truth.
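The comparison between AI Matching clusters and MDM groups can be sketched as a set difference over record pairs. The following Python example is an illustration only: the function names and the tuple-based proposal format are hypothetical, and it ignores confidence scores and the selection of the highest-scoring clusters.

```python
from itertools import combinations


def matched_pairs(groups):
    """All record pairs that are matched together within any group."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


def generate_proposals(ai_clusters, mdm_groups):
    """Create MERGE and SPLIT proposals where the AI and ONE MDM disagree."""
    ai = matched_pairs(ai_clusters)
    mdm = matched_pairs(mdm_groups)
    # Matched by the AI but not in ONE MDM: propose a merge.
    proposals = [("MERGE", pair) for pair in sorted(ai - mdm)]
    # Matched in ONE MDM but not by the AI: propose a split.
    proposals += [("SPLIT", pair) for pair in sorted(mdm - ai)]
    return proposals


ai_clusters = [{1, 2, 3}, {4}]
mdm_groups = [{1, 2}, {3, 4}]
for proposal in generate_proposals(ai_clusters, mdm_groups):
    print(proposal)
```

Pairs on which both sides agree, such as records 1 and 2 here, produce no proposal.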
Security and permissions
Public access to the AI Matching component must not be allowed. MDM data permissions are not applied to AI Matching; therefore, every authenticated user can see all AI Matching data.
Users that train AI or review training must have full view permissions on the training entity both for columns and rows. The absence of column permissions can cause inadequate matching decisions, and the absence of row permissions can result in an error.
Telemetry and Monitoring
It is possible to download the AI Matching metadata to share it with Ataccama for further analysis. For more details, see MDM AI Matching Telemetry and Monitoring.