Data Quality Scoring
Data quality scoring means evaluating data with attribute-specific rules and providing the result in terms of a numeric score with the purpose of separating good quality records from bad quality ones.
Scoring provides a single measure (indicator) for both evaluating the data quality of a particular attribute (or record) and encoding types of errors (cleansing operations) together with their severity.
Data quality scoring can be performed in almost all steps, including the specifically designed Simple Scoring and Scoring steps.
Best practices
Follow these recommendations to make the most out of data quality scoring:
- 
Categorize the cleansing score into four scoring (quantification) levels: Scoring level Scoring description Scoring result 0 No modification was done to the input value. VALID < 10 000 A small modification was done to the input value (such as trim, squeeze spaces, uppercase, remove unsupported characters, safe replacements). VALID/CORRECTED < 10 000 000 An important modification was done to the input value (for example, not found in the dictionary, does not meet official rules, major error, unsafe replacements). UNSAFE/UNKNOWN > 10 000 000 Input value or precleansed value (after a small modification) is null.NOT VALID Each cleansing operation (like those in the Scoring Description column) has a corresponding score value and explanation code. For example, the following table contains rules, scores, and explanation codes for scoring a Social Security number in the < 10 000 scoring level: Rule Score Explanation code There is a no-digit-comment before and/or after SIN. 1000 SIN_COMMENT There is a digit in a comment before and/or after SIN. 1000 SIN_COMMENT_WITH_DIGIT There are spaces at unusual positions inside the SIN number. 300 SIN_SPACES_AT_UNUSUAL_POSITIONS Separators between the 3rd-4th and 6th-7th digit are not consistent. 200 SIN_DIFFERENT_SEPARATORS Input SIN is in the usual format with dashes or spaces as separators. 100 SIN_REMOVED_SEPARATORS 
- 
Store scores and explanation codes in separate columns for each attribute: sco_[attribute], exp_[attribute]. The explanation field can aggregate multiple codes separated by a whitespace. To learn more about naming conventions, see column-naming-conventions.adoc. The following table illustrates scoring results for several Social Security numbers might look like the following ([attribute]=sin): src_sin out_sin exp_sin sco_sin Scoring result 573896836 573896836 SIN_INVALID_CHECK 10000 NOT VALID 417 208 303 417208303 SIN_REMOVED_SEPARATORS SIN_VALID 100 VALID/CORRECTED 417-208303 417208303 SIN_DIFFERENT_SEPARATORS SIN_VALID 200 VALID/CORRECTED sin 417 - 208 -303 417208303 SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID 1200 VALID/CORRECTED 4 172 083 0 3 417208303 SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID 300 VALID/CORRECTED aaabbbccc aaabbbccc SIN_NOT_PARSED 1000000 NOT VALID 
- 
Specify scoring rules in such a way that within a given range, the sum of scores for their corresponding operations does not exceed the upper bound, no matter how many operations are performed. Scoring level Scoring result Explanation codes Scoring and description Sum of operation scores > 10 000 000 NOT VALID SIN_EMPTY Scored 10000000. Input SIN is empty. 10 000 000 < 10 000 000 UNSAFE/UNKNOWN SIN_NOT_PARSED Scored 1000000. Input string is not in the supported format. 1 110 000 SIN_INVALID_DIGIT_COUNT Scored 100000. Input contains only digits, but the digit count is not 9. SIN_INVALID_CHECK Scored 10000. Parsed SIN did not pass the validity test (Luhn algorithm). < 10 000 VALID/CORRECTED SIN_COMMENT Scored 1000. There is a no-digit-comment before and/or after SIN. 2 500 SIN_COMMENT_WITH_DIGIT Scored 1000. There is a digit in the comment before and/or after SIN. SIN_SPACES_AT_UNUSUAL_POSITIONS Scored 300. There are spaces at unusual positions inside the SIN number. SIN_DIFFERENT_SEPARATORS Scored 200. Separators between the 3rd-4th and 6th-7th digit are not consistent. SIN_REMOVED_SEPARATORS Scored 100. Input SIN is in the usual format with dashes or spaces as separators. 0 VALID SIN_VALID Scored 0. Input SIN is valid. 0 
- 
Use a factor of 10 to gradate rules by importance, for example, 1 000 vs. 10 000 (like some of the rules in the previous table). 
Attribute scoring vs. record scoring
Scoring rules are applied to individual attributes; however, it is also possible to aggregate scoring results on the record level, for example, for the purposes of matching and merging.
When it comes to record scoring, try to apply the following:
- 
Score individual attributes using Best practices. 
- 
Optionally, divide attributes into two groups: - 
Important attributes (mandatory attributes). 
- 
Less important attributes. For example, we do not want to count score instance using gender and phone number quality score because of low quality level and minor importance for matching purposes. Attribute Value Importance out_first_name John IMPORTANT out_last_name Rasmusen IMPORTANT out_gender 1 NOT IMPORTANT out_birthdate_day 16.11.1978 IMPORTANT out_credit_card 720412410 IMPORTANT out_email j.rasmusen@email.com IMPORTANT out_phone_number 12 012 004 548 NOT IMPORTANT out_sin 720412410 IMPORTANT 
 
- 
- 
Use two columns for summed up record scores (or use sco_instance at least): - 
sco_instance - Sum of all attribute scores. 
- 
sco_instance_imp_att - Sum of all important attributes scores. 
 
- 
Was this page useful?