Data Quality Scoring
Data quality scoring means evaluating data with attribute-specific rules and providing the result in terms of a numeric score with the purpose of separating good quality records from bad quality ones.
Scoring provides a single measure (indicator) for both evaluating the data quality of a particular attribute (or record) and encoding types of errors (cleansing operations) together with their severity.
Data quality scoring can be performed in almost all steps, including the specifically designed Simple Scoring and Scoring steps.
Best practices
Follow these recommendations to make the most out of data quality scoring:
-
Categorize the cleansing score into four scoring (quantification) levels:
Scoring level Scoring description Scoring result 0
No modification was done to the input value.
VALID
< 10 000
A small modification was done to the input value (such as trim, squeeze spaces, uppercase, remove unsupported characters, safe replacements).
VALID/CORRECTED
< 10 000 000
An important modification was done to the input value (for example, not found in the dictionary, does not meet official rules, major error, unsafe replacements).
UNSAFE/UNKNOWN
> 10 000 000
Input value or precleansed value (after a small modification) is
null
.NOT VALID
Each cleansing operation (like those in the Scoring Description column) has a corresponding score value and explanation code. For example, the following table contains rules, scores, and explanation codes for scoring a Social Security number in the < 10 000 scoring level:
Rule Score Explanation code There is a no-digit-comment before and/or after SIN.
1000
SIN_COMMENT
There is a digit in a comment before and/or after SIN.
1000
SIN_COMMENT_WITH_DIGIT
There are spaces at unusual positions inside the SIN number.
300
SIN_SPACES_AT_UNUSUAL_POSITIONS
Separators between the 3rd-4th and 6th-7th digit are not consistent.
200
SIN_DIFFERENT_SEPARATORS
Input SIN is in the usual format with dashes or spaces as separators.
100
SIN_REMOVED_SEPARATORS
-
Store scores and explanation codes in separate columns for each attribute: sco_[attribute], exp_[attribute]. The explanation field can aggregate multiple codes separated by a whitespace. To learn more about naming conventions, see column-naming-conventions.adoc.
The following table illustrates scoring results for several Social Security numbers might look like the following ([attribute]=sin):
src_sin out_sin exp_sin sco_sin Scoring result 573896836
573896836
SIN_INVALID_CHECK
10000
NOT VALID
417 208 303
417208303
SIN_REMOVED_SEPARATORS SIN_VALID
100
VALID/CORRECTED
417-208303
417208303
SIN_DIFFERENT_SEPARATORS SIN_VALID
200
VALID/CORRECTED
sin 417 - 208 -303
417208303
SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID
1200
VALID/CORRECTED
4 172 083 0 3
417208303
SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID
300
VALID/CORRECTED
aaabbbccc
aaabbbccc
SIN_NOT_PARSED
1000000
NOT VALID
-
Specify scoring rules in such a way that within a given range, the sum of scores for their corresponding operations does not exceed the upper bound, no matter how many operations are performed.
Scoring level Scoring result Explanation codes Scoring and description Sum of operation scores > 10 000 000
NOT VALID
SIN_EMPTY
Scored 10000000. Input SIN is empty.
10 000 000
< 10 000 000
UNSAFE/UNKNOWN
SIN_NOT_PARSED
Scored 1000000. Input string is not in the supported format.
1 110 000
SIN_INVALID_DIGIT_COUNT
Scored 100000. Input contains only digits, but the digit count is not 9.
SIN_INVALID_CHECK
Scored 10000. Parsed SIN did not pass the validity test (Luhn algorithm).
< 10 000
VALID/CORRECTED
SIN_COMMENT
Scored 1000. There is a no-digit-comment before and/or after SIN.
2 500
SIN_COMMENT_WITH_DIGIT
Scored 1000. There is a digit in the comment before and/or after SIN.
SIN_SPACES_AT_UNUSUAL_POSITIONS
Scored 300. There are spaces at unusual positions inside the SIN number.
SIN_DIFFERENT_SEPARATORS
Scored 200. Separators between the 3rd-4th and 6th-7th digit are not consistent.
SIN_REMOVED_SEPARATORS
Scored 100. Input SIN is in the usual format with dashes or spaces as separators.
0
VALID
SIN_VALID
Scored 0. Input SIN is valid.
0
-
Use a factor of 10 to gradate rules by importance, for example, 1 000 vs. 10 000 (like some of the rules in the previous table).
Attribute scoring vs. record scoring
Scoring rules are applied to individual attributes; however, it is also possible to aggregate scoring results on the record level, for example, for the purposes of matching and merging.
When it comes to record scoring, try to apply the following:
-
Score individual attributes using Best practices.
-
Optionally, divide attributes into two groups:
-
Important attributes (mandatory attributes).
-
Less important attributes.
For example, we do not want to count score instance using gender and phone number quality score because of low quality level and minor importance for matching purposes.
Attribute Value Importance out_first_name
John
IMPORTANT
out_last_name
Rasmusen
IMPORTANT
out_gender
1
NOT IMPORTANT
out_birthdate_day
16.11.1978
IMPORTANT
out_credit_card
720412410
IMPORTANT
out_email
j.rasmusen@email.com
IMPORTANT
out_phone_number
12 012 004 548
NOT IMPORTANT
out_sin
720412410
IMPORTANT
-
-
Use two columns for summed up record scores (or use sco_instance at least):
-
sco_instance - Sum of all attribute scores.
-
sco_instance_imp_att - Sum of all important attributes scores.
-
Was this page useful?