User Community Service Desk Downloads
If you can't find the product or version you're looking for, visit support.ataccama.com/downloads

Data Quality Scoring

Data quality scoring means evaluating data with attribute-specific rules and providing the result in terms of a numeric score with the purpose of separating good quality records from bad quality ones.

Scoring provides a single measure (indicator) for both evaluating the data quality of a particular attribute (or record) and encoding types of errors (cleansing operations) together with their severity.

Data quality scoring can be performed in almost all steps, including the specifically designed Simple Scoring and Scoring steps.

Best practices

Follow these recommendations to make the most out of data quality scoring:

  1. Categorize the cleansing score into four scoring (quantification) levels:

    Scoring level Scoring description Scoring result

    0

    No modification was done to the input value.

    VALID

    < 10 000

    A small modification was done to the input value (such as trim, squeeze spaces, uppercase, remove unsupported characters, safe replacements).

    VALID/CORRECTED

    < 10 000 000

    An important modification was done to the input value (for example, not found in the dictionary, does not meet official rules, major error, unsafe replacements).

    UNSAFE/UNKNOWN

    > 10 000 000

    Input value or precleansed value (after a small modification) is null.

    NOT VALID

    Each cleansing operation (like those in the Scoring Description column) has a corresponding score value and explanation code. For example, the following table contains rules, scores, and explanation codes for scoring a Social Security number in the < 10 000 scoring level:

    Rule Score Explanation code

    There is a no-digit-comment before and/or after SIN.

    1000

    SIN_COMMENT

    There is a digit in a comment before and/or after SIN.

    1000

    SIN_COMMENT_WITH_DIGIT

    There are spaces at unusual positions inside the SIN number.

    300

    SIN_SPACES_AT_UNUSUAL_POSITIONS

    Separators between the 3rd-4th and 6th-7th digit are not consistent.

    200

    SIN_DIFFERENT_SEPARATORS

    Input SIN is in the usual format with dashes or spaces as separators.

    100

    SIN_REMOVED_SEPARATORS

  2. Store scores and explanation codes in separate columns for each attribute: sco_[attribute], exp_[attribute]. The explanation field can aggregate multiple codes separated by a whitespace. To learn more about naming conventions, see column-naming-conventions.adoc.

    The following table illustrates scoring results for several Social Security numbers might look like the following ([attribute]=sin):

    src_sin out_sin exp_sin sco_sin Scoring result

    573896836

    573896836

    SIN_INVALID_CHECK

    10000

    NOT VALID

    417 208 303

    417208303

    SIN_REMOVED_SEPARATORS SIN_VALID

    100

    VALID/CORRECTED

    417-208303

    417208303

    SIN_DIFFERENT_SEPARATORS SIN_VALID

    200

    VALID/CORRECTED

    sin 417 - 208 -303

    417208303

    SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID

    1200

    VALID/CORRECTED

    4 172 083 0 3

    417208303

    SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID

    300

    VALID/CORRECTED

    aaabbbccc

    aaabbbccc

    SIN_NOT_PARSED

    1000000

    NOT VALID

  3. Specify scoring rules in such a way that within a given range, the sum of scores for their corresponding operations does not exceed the upper bound, no matter how many operations are performed.

    Scoring level Scoring result Explanation codes Scoring and description Sum of operation scores

    > 10 000 000

    NOT VALID

    SIN_EMPTY

    Scored 10000000. Input SIN is empty.

    10 000 000

    < 10 000 000

    UNSAFE/UNKNOWN

    SIN_NOT_PARSED

    Scored 1000000. Input string is not in the supported format.

    1 110 000

    SIN_INVALID_DIGIT_COUNT

    Scored 100000. Input contains only digits, but the digit count is not 9.

    SIN_INVALID_CHECK

    Scored 10000. Parsed SIN did not pass the validity test (Luhn algorithm).

    < 10 000

    VALID/CORRECTED

    SIN_COMMENT

    Scored 1000. There is a no-digit-comment before and/or after SIN.

    2 500

    SIN_COMMENT_WITH_DIGIT

    Scored 1000. There is a digit in the comment before and/or after SIN.

    SIN_SPACES_AT_UNUSUAL_POSITIONS

    Scored 300. There are spaces at unusual positions inside the SIN number.

    SIN_DIFFERENT_SEPARATORS

    Scored 200. Separators between the 3rd-4th and 6th-7th digit are not consistent.

    SIN_REMOVED_SEPARATORS

    Scored 100. Input SIN is in the usual format with dashes or spaces as separators.

    0

    VALID

    SIN_VALID

    Scored 0. Input SIN is valid.

    0

  4. Use a factor of 10 to gradate rules by importance, for example, 1 000 vs. 10 000 (like some of the rules in the previous table).

Attribute scoring vs. record scoring

Scoring rules are applied to individual attributes; however, it is also possible to aggregate scoring results on the record level, for example, for the purposes of matching and merging.

When it comes to record scoring, try to apply the following:

  1. Score individual attributes using Best practices.

  2. Optionally, divide attributes into two groups:

    1. Important attributes (mandatory attributes).

    2. Less important attributes.

      For example, we do not want to count score instance using gender and phone number quality score because of low quality level and minor importance for matching purposes.

      Attribute Value Importance

      out_first_name

      John

      IMPORTANT

      out_last_name

      Rasmusen

      IMPORTANT

      out_gender

      1

      NOT IMPORTANT

      out_birthdate_day

      16.11.1978

      IMPORTANT

      out_credit_card

      720412410

      IMPORTANT

      out_email

      j.rasmusen@email.com

      IMPORTANT

      out_phone_number

      12 012 004 548

      NOT IMPORTANT

      out_sin

      720412410

      IMPORTANT

  3. Use two columns for summed up record scores (or use sco_instance at least):

    1. sco_instance - Sum of all attribute scores.

    2. sco_instance_imp_att - Sum of all important attributes scores.

Was this page useful?