
AI Matching Overview

The AI Matching combines user input with AI capabilities to improve the results of rule-based record matching, leading to quicker and more efficient elimination of duplicate records in the data catalog. User feedback is provided during model training and when resolving matching proposals.

The AI Matching covers the following use cases:

  • Record Matching. The AI model compares the record matching done in ONE MDM with its own AI-based matching. These results are then used to help improve the existing matching results in the form of MERGE and SPLIT proposals for records in ONE MDM.

  • Rule Suggestions. The AI model is capable of suggesting a set of rules that could be used for matching in ONE MDM. This is typically applied in cases when there are no manually created rules in ONE MDM.

In the current version, the feature is experimental. As a result, it might not scale appropriately to large numbers of records.
For more information about how to configure the AI Matching feature, see AI Matching Configuration.

Manager and Worker Microservices Overview

The Manager microservice has a WSGI server for reporting its liveness and readiness, a gRPC client for fetching data from ONE MDM, as well as a gRPC server for processing all commands. The Manager can process multiple matching instances at once and each matching instance is identified by its matching identifier, which consists of two strings: entity name and layer name.

Similarly to other ONE microservices, the Manager reports that it is ready after all of its internal dependencies have started and are available for processing. In case the gRPC server receives a request while the microservice is still waiting on a dependency, an exception is reported over the gRPC channel.

The Worker microservice has a WSGI server for reporting its liveness and readiness, and a gRPC client for fetching data from ONE MDM. The Manager uses the Worker to initialize the AI model, compute the results, and generate proposals or extraction rules.
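For illustration, a minimal WSGI liveness/readiness endpoint might look as follows. The /ready path and the ready_check callback are hypothetical, not the microservices' actual API; this is only a sketch of the pattern described above.

    def make_health_app(ready_check):
        """WSGI application sketch: liveness means the process responds at all;
        readiness means internal dependencies are available."""
        def app(environ, start_response):
            if environ.get("PATH_INFO") == "/ready" and not ready_check():
                start_response("503 Service Unavailable",
                               [("Content-Type", "text/plain")])
                return [b"waiting for dependencies"]
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"ok"]
        return app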

Sizing Guidelines

Due to the nature of implemented AI algorithms, the resource consumption of AI Matching depends on the volume and complexity of data that is processed. The more complicated the matching process is, the more resources are required.

The Worker microservice can leverage multiple threads or processes to improve performance. These options are provided through several machine learning libraries, namely OpenMP and BLAS, each of which can be configured individually. The number of threads can be set to one of the following values (a sketch of how these values resolve follows the list):

  • null: No value is provided. In this case, machine learning libraries use their default settings.

  • 0: All physical CPU cores are used, without hyper-threads.

  • n: n CPU cores are used. Hyper-threading is applied as needed.

  • -n: The number of available CPU cores is reduced by n. For example, if your machine has 6 CPUs, setting this option to -2 means that 4 CPU cores are employed.
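A minimal sketch of how these values could map to an effective thread count. The resolve_thread_count helper is hypothetical, not part of the product:

    def resolve_thread_count(configured, physical_cores):
        # None (null): defer to the machine learning library defaults.
        if configured is None:
            return None
        # 0: use all physical CPU cores, without hyper-threads.
        if configured == 0:
            return physical_cores
        # n > 0: use exactly n cores, hyper-threading as needed.
        if configured > 0:
            return configured
        # -n: reduce the number of available cores by n, keeping at least one.
        return max(1, physical_cores + configured)

    resolve_thread_count(-2, physical_cores=6)  # -> 4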

For some of the libraries used, it is possible to specify the preferred number of parallel calculations using the property ataccama.one.apyc.parallelism.jobs. In addition, the libraries use parallelism in low-level operations as well, which can be controlled through the property ataccama.one.apyc.parallelism.omp. In cases where more specific tuning is necessary, the following options are available for lower-level parallel processing (an example configuration follows the list):

  • ataccama.one.apyc.parallelism.blas: Has a higher priority than OpenMP but can be ignored by some libraries depending on the compilation options of the OpenBLAS dependency.

  • ataccama.one.apyc.parallelism.threads: Has a higher priority than OpenMP and BLAS but comes with a higher overhead than these two since it uses the dynamic API.
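For example, a hypothetical properties excerpt using only the two higher-level properties. The values are illustrative examples, not recommendations:

    # Preferred number of parallel jobs for libraries that support it.
    ataccama.one.apyc.parallelism.jobs=4
    # Thread count for low-level OpenMP operations (0 = all physical cores).
    ataccama.one.apyc.parallelism.omp=0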

We strongly caution against adjusting the values of the ataccama.one.apyc.parallelism.blas and ataccama.one.apyc.parallelism.threads properties without prior consultation with Ataccama, as their use case is specific and changes might lead to unexpected results.

The same applies to other parallelism properties. Ataccama's performance tests clearly show an added benefit of using parallel processing; however, as it consumes more resources, the option is disabled by default. Therefore, enable this option only after carefully considering users' needs.

Comparing AI Matching Use Cases

As mentioned in the introduction, AI Matching is used for record matching. For record matching, the AI Matching creates a closed-box model that computes possible matchings for input data, which are then compared to the matchings done based on manually configured rules in ONE MDM. The disagreement between the two computation types is presented as follows (see the sketch after this list):

  • MERGE suggestions: The AI Matching would merge two records that are not merged in ONE MDM.

  • SPLIT suggestions: The AI Matching would not merge two records that are merged in ONE MDM.
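A minimal sketch of this comparison, assuming each matching is represented as a set of record ID pairs (frozensets); the representation is an illustrative simplification:

    def propose(ai_matched, mdm_matched):
        """Derive proposals from the disagreement between the two matchings."""
        merge_proposals = ai_matched - mdm_matched  # AI merges, ONE MDM does not
        split_proposals = mdm_matched - ai_matched  # ONE MDM merges, AI does not
        return merge_proposals, split_proposals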

This model can be more accurate compared to rule suggestions, although it is difficult to interpret as the internal decision process is complex and fuzzy.

In order to provide rule suggestions, the AI Matching uses the model applied in record matching and attempts to approximate it through matching rules that are more interpretable than the model. However, this comes at a potential cost of lower quality. The process includes the following steps (a sketch of the rule extraction follows the list):

  1. A closed-box model is trained on the labeled user data.

  2. The model is used to evaluate all the pairs. These first two steps are applied in record matching as well.

  3. The high-confidence pairs are identified, both for matching and distinct pairs. The confidence thresholds are defined using the parameters min_match_confidence and min_distinct_confidence. Extracted pairs are called positive (matching) and negative (distinct) pairs.

  4. A sequence of matching rules is extracted such that they fulfill the following conditions:

    1. Rules do not match any negative pairs.

    2. Rules match as many positive pairs as possible from those positive pairs that were not matched by the rules suggested earlier in the sequence.

  5. The extracted matching rules, along with the used blocking rules (that is, those proposed by the closed-box portion of the algorithm), are returned to the user as rule suggestions.

    It is likely that not all MERGE/SPLIT proposals can be explained appropriately using rules. Some of the difficult-to-explain proposals might not have associated rules, which reduces the total coverage of the rules. This is indicated as part of the model output (the percentage of positive pairs that the rules were able to cover).
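A greedy sketch of steps 3 to 5, assuming rules are callables over record pairs and that min_match_confidence and min_distinct_confidence have already been applied to produce the positive and negative pair sets:

    def extract_rules(candidate_rules, positive_pairs, negative_pairs):
        uncovered = set(positive_pairs)
        sequence = []
        while uncovered:
            best_rule, best_gain = None, 0
            for rule in candidate_rules:
                # Condition a: the rule must not match any negative pair.
                if any(rule(a, b) for a, b in negative_pairs):
                    continue
                # Condition b: prefer the rule covering the most uncovered positives.
                gain = sum(1 for a, b in uncovered if rule(a, b))
                if gain > best_gain:
                    best_rule, best_gain = rule, gain
            if best_rule is None:
                break  # remaining positives cannot be explained by any rule
            sequence.append(best_rule)
            uncovered = {(a, b) for a, b in uncovered if not best_rule(a, b)}
        # Coverage is reported to the user alongside the suggested rules.
        coverage = 1.0 if not positive_pairs else 1 - len(uncovered) / len(positive_pairs)
        return sequence, coverage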

Understanding the Matching Process

The matching process is done in stages, each of which requires user interaction in order to proceed to the next one.

On the other hand, the Manager and Worker microservices work in phases. Some phases, such as initialization, training, and generating proposals, coincide with the stages and require user input. Others, such as those described in the Computing Results stage, are triggered automatically once the previous phase finishes and are always completed together.

  1. Initializing the AI Model

    In this stage, the data is prepared for the subsequent training and computation. This includes selecting which attributes are taken into account when matching records. By convention, the names of these attributes are prefixed with mat_, although users can choose themselves which attributes are used.

    In the current version, all data types are treated as strings and share the same matching metrics, which could affect the model accuracy depending on the actual data type.

    The Worker then retrieves a sample of records from ONE MDM. By default, the sample size is set to 1,000,000 records, although it can be configured (see AI Matching Configuration). The sample is used to initialize the underlying active learner algorithm that preselects appropriate blocking rules (see Computing Results, section Blocking Records). If there are any previously evaluated training pairs, they are loaded into the model as well.
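    A simplified sketch of this stage, assuming records are dictionaries keyed by record ID; the initialize helper and its shape are illustrative only:

      import random

      def initialize(records, sample_size=1_000_000):
          # By convention, matching attributes are prefixed with mat_.
          first = next(iter(records.values()))
          attributes = [name for name in first if name.startswith("mat_")]
          # Draw the record sample used to initialize the active learner.
          ids = random.sample(list(records), min(sample_size, len(records)))
          sample = {i: records[i] for i in ids}
          return attributes, sample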

  2. Training the AI Model

    The underlying AI model is trained using previous user decisions and sets of labeled record pairs that are chosen by the user or suggested by the active learner algorithm. The algorithm selects pairs that would provide it with the most information. If any filters were previously used on master layers, only the records corresponding to the filter are proposed.

    Each training pair is a pair of potentially matching records for which users need to decide whether the records match and should therefore be merged, or not, in which case they should remain separate. As a minimum, users are required to provide three positive and three negative responses during model training, although resolving a higher number of training pairs improves the accuracy of the model. This information is then used to train the AI Matching model.

    The feature uses active learning to interactively request labeled data points, as opposed to passive learning, where all the labeled data is received at once. Once users give their expert feedback on a pair, the model uses it not only to further fine-tune its matching suggestions but also to select which pair should be labeled next. The latter is based on the amount of information the model expects to obtain from a pair: the more information a pair can produce, the higher its labeling priority.
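    One common way to pick the most informative pair is uncertainty sampling, sketched below. The model interface is assumed, and the product's actual selection criterion is internal and may differ:

      def next_training_pair(model, candidate_pairs):
          # The closer the predicted match probability is to 0.5,
          # the less certain the model is and the more it can learn.
          return min(candidate_pairs,
                     key=lambda pair: abs(model.predict_proba(pair) - 0.5))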

    The training consists of the following steps (a sketch of step 2 follows the list):

    1. Preselected blocking rules are applied to all records in the sample and training blocked groups are formed.

    2. For each pair of records in each blocked group, the model calculates a numeric representation for each attribute provided in the initialization stage. This matrix of features for all record pairs is fed to the model for training.

    3. The model outputs classification results for each pair of records within each blocked group.

    4. The results provided by the matching model are compared to those obtained by applying blocking rules.

    5. In cases where the matching model and the blocking rules disagree on whether the records match, user feedback is requested.

    6. Once the training has been completed, the model quality is indicated in the user interface. If the quality is too low, users can add and evaluate more training pairs. The process can be repeated until the model is sufficiently accurate.
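    A sketch of step 2, using difflib.SequenceMatcher as a stand-in for the internal per-attribute similarity metrics, which are not documented here:

      import numpy as np
      from difflib import SequenceMatcher

      def feature_matrix(pairs, attributes):
          def similarity(a, b):
              return SequenceMatcher(None, a or "", b or "").ratio()
          # One row per record pair, one similarity feature per attribute.
          return np.array([[similarity(r1[attr], r2[attr]) for attr in attributes]
                           for r1, r2 in pairs])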

  3. Computing Results

    In this stage, the trained model is used to evaluate all data retrieved from ONE MDM. The computation is done asynchronously as it might take hours or even days depending on the number of records. In the meantime, the status of computation is indicated in the ONE MDM web application.

    Once a computation has been scheduled or is in progress, it can be canceled, for example, if changes were made to the data in ONE MDM or to the training pairs used in the training process.

    The Worker microservice progresses through the following phases:

    1. Fetching Records

      All available data is fetched from ONE MDM and subsequently used for blocking and matching.

    2. Blocking Records

      The data retrieved from ONE MDM is blocked based on a set of blocking rules. These rules might differ from the ones preselected for training as they need to cover the whole data set instead of a sample.

      If every record were compared to every other record, the process would quickly become impractical in terms of resource consumption and computation time. For this reason, records are first blocked into groups based on the similarity of their traits using blocking rules, which are analogous to ONE MDM key rules, and the matching is then done only within these blocks. This means that two records that are not blocked together by at least one blocking rule are not mutually compared in the following steps and are therefore considered distinct.

      The AI model uses its own set of blocking rules that might not correspond to the ones stored and used in ONE MDM.
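      A sketch of blocking as an inverted index, assuming each blocking rule maps a record to a set of keys (as the predicates listed below do); the block helper is illustrative only:

        from collections import defaultdict

        def block(records, blocking_rules):
            blocks = defaultdict(set)
            for rec_id, rec in records.items():
                for rule in blocking_rules:
                    for key in rule(rec):
                        # Records sharing a key under the same rule share a block.
                        blocks[(rule.__name__, key)].add(rec_id)
            # Only blocks with at least two records produce candidate pairs.
            return [ids for ids in blocks.values() if len(ids) > 1]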

      Overview of Blocking Rules

      As explained previously, blocking rules help determine which records are selected for further matching. Two records are considered to belong to the same block if at least one blocking rule groups them together. Each blocking rule consists of one or two predicates connected by AND, meaning that every predicate needs to be fulfilled in order for the blocking rule to apply. A predicate resolves to true if the sets of values that it computes for two records share at least one item.

      As a preparation step before blocking, records are stripped of punctuation and all whitespace characters are replaced by a single space character, which helps identify words. The following table lists some of the commonly used predicates:

      • wholeFieldPredicate: The whole record field.

      • alphaNumericPredicate: A set of consecutive strings of alphanumeric characters.

      • tokenFieldPredicate: A set of all words in the field.

      • firstTokenPredicate: The first word in the field.

      • commonIntegerPredicate: A set of consecutive integers.

      • firstIntegerPredicate: The first integer in the field.

      • nearIntegersPredicate: A set of integers augmented by N-1 and N+1.

      • hundredIntegersOddPredicate: A set of integers rounded to the nearest hundred, that is, with the last two digits dropped.

      • hundredIntegerPredicate: A set of integers rounded to the nearest hundred, that is, with the last two digits dropped.

      • commonTwoTokens: A set of pairs of subsequent words.

      • commonThreeTokens: A set of triplets of subsequent words.

      • fingerprint: A string of all characters without whitespaces, sorted in alphabetical order.

      • oneGramFingerprint: A set of all characters without whitespaces, sorted in alphabetical order.

      • twoGramFingerprint: A set of all bigrams, that is, subsequent pairs of characters, without whitespaces, sorted in alphabetical order.

      • commonFourGram: A set of all 4-grams, that is, substrings of 4 characters, without whitespaces.

      • commonSixGram: A set of all 6-grams, that is, substrings of 6 characters, without whitespaces.

      • sameThreeCharStartPredicate: The first three characters without whitespaces.

      • sameFiveCharStartPredicate: The first five characters without whitespaces.

      • sameSevenCharStartPredicate: The first seven characters without whitespaces.

      • suffixArray: A set of suffixes of N-4 characters or shorter, where N is the length of the field.

      • sortedAcronym: A string containing the sorted first letters of each word.

      • doubleMetaphone: A string encoding the field based on its pronunciation.

      • metaphoneToken: A set of pronunciation tokens for each word in the field.
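      For illustration, simplified re-implementations of the preparation step and two of the predicates above. These are sketches, not the product's internal code:

        import re

        def normalize(field):
            # Strip punctuation, collapse whitespace to single spaces.
            return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", field)).strip()

        def tokenFieldPredicate(field):
            # A set of all words in the field.
            return set(normalize(field).split())

        def sameThreeCharStartPredicate(field):
            # The first three characters, without whitespaces.
            return {normalize(field).replace(" ", "")[:3]}

        # Two records can be blocked together if the predicate outputs intersect:
        assert tokenFieldPredicate("Jane A. Doe") & tokenFieldPredicate("Doe, Jane")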

    3. Scoring Pairs

      The model is applied to the pairs of records in each blocked group. The goal is to calculate a matching probability for each record pair, which represents the level of confidence for that pair. This probability of a proposal being a MERGE or a SPLIT, provided by the classification model, is called the confidence score.

    4. Clustering Records

      Using the matching probabilities of record pairs from the previous phase, similar records are clustered so that they form groups of records that are considered to be duplicates of the same record. This information forms the basis for the suggestions in the following stage.
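      A minimal sketch of phases 3 and 4, treating clustering as connected components over pairs whose confidence score passes a threshold. The networkx dependency, the model interface, and the threshold are assumptions; the product's actual clustering algorithm is not described here:

        import networkx as nx

        def cluster_records(blocked_pairs, model, threshold=0.9):
            graph = nx.Graph()
            for a, b in blocked_pairs:
                confidence = model.predict_proba((a, b))  # phase 3: scoring pairs
                if confidence >= threshold:
                    graph.add_edge(a, b, weight=confidence)
            # Phase 4: connected records form groups of presumed duplicates.
            return [set(component) for component in nx.connected_components(graph)]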

  4. Generating Proposals

    To generate suggestions, the clusters with the highest confidence score from the Worker are compared with the groups from ONE MDM. The MDM groups are selected based on manually created rules. When there is a disagreement between an AI Matching cluster and an MDM group, one of the following proposal types is created:

    • MERGE proposals: If a pair of records is matched in the Worker and not matched in ONE MDM, the pair of records should be merged into one record.

    • SPLIT proposals: If a pair that is matched in ONE MDM is not matched in the Worker, the two records should remain separate.

      The suggestions made by the AI Matching can then be reviewed by users and either accepted and applied, thus overriding the current state of the record pair in ONE MDM, or rejected, which means that the current state of the record pair is preserved. The correct information can be fed back into the AI model so that it learns from it and adapts its predictions to the actual ground truth.
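      Continuing the earlier propose sketch, the comparison between AI clusters and MDM groups can be reduced to pairwise form as follows; the helper is illustrative and assumes sortable record IDs:

        from itertools import combinations

        def matched_pairs(groups):
            # Every two records in the same group form a matched pair.
            return {frozenset(pair)
                    for group in groups
                    for pair in combinations(sorted(group), 2)}

        # merge_proposals = matched_pairs(ai_clusters) - matched_pairs(mdm_groups)
        # split_proposals = matched_pairs(mdm_groups) - matched_pairs(ai_clusters)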
  5. Extracting Rules

    In this case, the model applies a greedy algorithm to the likely matches (positive pairs) obtained during clustering in the computation stage, with the aim of finding a set of rules that covers as many of the matching pairs as possible while excluding all distinct pairs above a certain confidence threshold. The suggested rules are applied in the order in which they are proposed, based on the assumption that each subsequent rule needs to cover only the pairs not accounted for by the previously applied rules. The sequence of rules with the best overall coverage is then suggested to the user.

Matching rules are applied to the candidate pairs that have been preselected by at least one blocking rule.

Overview of Matching Rules

The following matching rules are used:

  • AlwaysMatch: Matches every pair regardless of the column or the field. This means either that the blocking rules successfully matched all pairs on their own, so no matching rules are necessary, or that the blocking rules were too specific and did not leave any distinct pairs for the rule extraction algorithm.

  • Equality(column1, match_empty): Matches all pairs of records with identical values in column1. If match_empty is set to true, it also matches pairs of records where at least one record is empty.

  • Distance(column1, match_empty, threshold): Matches all pairs of records whose distance in column1 does not exceed the predefined threshold. If match_empty is set to true, it also matches pairs of records where at least one record is empty.

  • Composition(rule1, rule2, …​ ruleN): Matches all records that are matched by every listed rule, that is, all records that are matched by rule1 AND rule2 AND …​ ruleN.
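A sketch of how such rules can be modeled as callables over record pairs, matching the signatures above; the representation is illustrative, not the product's implementation:

    def Equality(column, match_empty=False):
        def rule(r1, r2):
            a, b = r1.get(column), r2.get(column)
            if not a or not b:
                return match_empty  # at least one side is empty
            return a == b
        return rule

    def Composition(*rules):
        # Matches only if every listed rule matches (rule1 AND ... AND ruleN).
        return lambda r1, r2: all(rule(r1, r2) for rule in rules)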

Overview of distance functions

These matching rules use distance as a measure of similarity between two strings, which is obtained using distance functions that compare two strings and calculate the distance between them. The lower this value is, the more similar the strings are.

  • Damerau-Levenshtein distance. Expressed as a number between zero and infinity. The value reflects the minimum number of operations needed to transform one string into the other. These operations include insertion, deletion, or substitution of a single character, and transposition (swapping) of two adjacent characters. The following table contains some examples of the Damerau-Levenshtein metric applied to pairs of strings.

String 1 | String 2 | Damerau-Levenshtein distance | Explanation
"" | "" | 0 | When comparing two empty strings, the distance is 0.
1234 | 1234 | 0 | When comparing identical strings, the distance is 0.
AB__CD | AB_1_CD | 1 | Since one insertion or deletion is needed, the distance is 1.
AB_12_CD | AB__CD | 2 | Since two insertions or deletions are needed, the distance is 2.
AB_1_CD | AB_X_CD | 1 | Since one substitution is needed, the distance is 1.
AB_12_CD | AB_21_CD | 1 | Since one transposition is needed, the distance is 1.
ABCDEFG | BADEFGHH | 4 | Since four operations (1 transposition, 1 deletion, 2 insertions) are needed, the distance is 4.
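A reference sketch of the restricted Damerau-Levenshtein distance (optimal string alignment), which reproduces the values in the table above; the product's own distance implementation may differ in details:

    def damerau_levenshtein(s1, s2):
        # d[i][j] = distance between the first i chars of s1 and first j of s2.
        d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
        for i in range(len(s1) + 1):
            d[i][0] = i
        for j in range(len(s2) + 1):
            d[0][j] = j
        for i in range(1, len(s1) + 1):
            for j in range(1, len(s2) + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                        and s1[i - 2] == s2[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[-1][-1]

    assert damerau_levenshtein("AB_12_CD", "AB_21_CD") == 1
    assert damerau_levenshtein("ABCDEFG", "BADEFGHH") == 4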

