Data Quality Scoring

Data quality scoring means evaluating data with attribute-specific rules and providing the result in terms of a numeric score with the purpose of separating good quality records from bad quality ones.

Scoring provides a single measure (indicator) for both evaluating the data quality of a particular attribute (or record) and encoding types of errors (cleansing operations) together with their severity.

Data quality scoring can be performed in almost all steps, including the specifically designed Simple Scoring and Scoring steps.

Best practices

Follow these recommendations to make the most out of data quality scoring:

Categorize the cleansing score into four scoring (quantification) levels:

Scoring level Scoring description Scoring result

Scoring level	Scoring description	Scoring result
0	No modification was done to the input value.	VALID
< 10 000	A small modification was done to the input value (such as trim, squeeze spaces, uppercase, remove unsupported characters, safe replacements).	VALID/CORRECTED
< 10 000 000	An important modification was done to the input value (for example, not found in the dictionary, does not meet official rules, major error, unsafe replacements).	UNSAFE/UNKNOWN
> 10 000 000	Input value or precleansed value (after a small modification) is `null`.	NOT VALID

No modification was done to the input value.

VALID

< 10 000

A small modification was done to the input value (such as trim, squeeze spaces, uppercase, remove unsupported characters, safe replacements).

VALID/CORRECTED

< 10 000 000

An important modification was done to the input value (for example, not found in the dictionary, does not meet official rules, major error, unsafe replacements).

UNSAFE/UNKNOWN

> 10 000 000

Input value or precleansed value (after a small modification) is null.

NOT VALID

Each cleansing operation (like those in the Scoring Description column) has a corresponding score value and explanation code. For example, the following table contains rules, scores, and explanation codes for scoring a Social Security number in the < 10 000 scoring level:

Rule	Score	Explanation code
There is a no-digit-comment before and/or after SIN.	1000	SIN_COMMENT
There is a digit in a comment before and/or after SIN.	1000	SIN_COMMENT_WITH_DIGIT
There are spaces at unusual positions inside the SIN number.	300	SIN_SPACES_AT_UNUSUAL_POSITIONS
Separators between the 3rd-4th and 6th-7th digit are not consistent.	200	SIN_DIFFERENT_SEPARATORS
Input SIN is in the usual format with dashes or spaces as separators.	100	SIN_REMOVED_SEPARATORS

Rule

Score

Explanation code

There is a no-digit-comment before and/or after SIN.

1000

SIN_COMMENT

There is a digit in a comment before and/or after SIN.

1000

SIN_COMMENT_WITH_DIGIT

There are spaces at unusual positions inside the SIN number.

300

SIN_SPACES_AT_UNUSUAL_POSITIONS

Separators between the 3rd-4th and 6th-7th digit are not consistent.

200

SIN_DIFFERENT_SEPARATORS

Input SIN is in the usual format with dashes or spaces as separators.

100

SIN_REMOVED_SEPARATORS

Store scores and explanation codes in separate columns for each attribute: sco_[attribute], exp_[attribute]. The explanation field can aggregate multiple codes separated by a whitespace. To learn more about naming conventions, see column-naming-conventions.adoc.

The following table illustrates scoring results for several Social Security numbers might look like the following ([attribute]=sin):

src_sin	out_sin	exp_sin	sco_sin	Scoring result
573896836	573896836	SIN_INVALID_CHECK	10000	NOT VALID
417 208 303	417208303	SIN_REMOVED_SEPARATORS SIN_VALID	100	VALID/CORRECTED
417-208303	417208303	SIN_DIFFERENT_SEPARATORS SIN_VALID	200	VALID/CORRECTED
sin 417 - 208 -303	417208303	SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID	1200	VALID/CORRECTED
4 172 083 0 3	417208303	SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID	300	VALID/CORRECTED
aaabbbccc	aaabbbccc	SIN_NOT_PARSED	1000000	NOT VALID

Specify scoring rules in such a way that within a given range, the sum of scores for their corresponding operations does not exceed the upper bound, no matter how many operations are performed.

Scoring level	Scoring result	Explanation codes	Scoring and description	Sum of operation scores
> 10 000 000	NOT VALID	SIN_EMPTY	Scored 10000000. Input SIN is empty.	10 000 000
< 10 000 000	UNSAFE/UNKNOWN	SIN_NOT_PARSED	Scored 1000000. Input string is not in the supported format.	1 110 000
SIN_INVALID_DIGIT_COUNT	Scored 100000. Input contains only digits, but the digit count is not 9.
SIN_INVALID_CHECK	Scored 10000. Parsed SIN did not pass the validity test (Luhn algorithm).
< 10 000	VALID/CORRECTED	SIN_COMMENT	Scored 1000. There is a no-digit-comment before and/or after SIN.	2 500
SIN_COMMENT_WITH_DIGIT	Scored 1000. There is a digit in the comment before and/or after SIN.
SIN_SPACES_AT_UNUSUAL_POSITIONS	Scored 300. There are spaces at unusual positions inside the SIN number.
SIN_DIFFERENT_SEPARATORS	Scored 200. Separators between the 3rd-4th and 6th-7th digit are not consistent.
SIN_REMOVED_SEPARATORS	Scored 100. Input SIN is in the usual format with dashes or spaces as separators.
0	VALID	SIN_VALID	Scored 0. Input SIN is valid.	0

Scoring level

Scoring result

Explanation codes

Scoring and description

Sum of operation scores

> 10 000 000

NOT VALID

SIN_EMPTY

Scored 10000000. Input SIN is empty.

10 000 000

< 10 000 000

UNSAFE/UNKNOWN

SIN_NOT_PARSED

Scored 1000000. Input string is not in the supported format.

1 110 000

SIN_INVALID_DIGIT_COUNT

Scored 100000. Input contains only digits, but the digit count is not 9.

SIN_INVALID_CHECK

Scored 10000. Parsed SIN did not pass the validity test (Luhn algorithm).

< 10 000

VALID/CORRECTED

SIN_COMMENT

Scored 1000. There is a no-digit-comment before and/or after SIN.

2 500

SIN_COMMENT_WITH_DIGIT

Scored 1000. There is a digit in the comment before and/or after SIN.

SIN_SPACES_AT_UNUSUAL_POSITIONS

Scored 300. There are spaces at unusual positions inside the SIN number.

SIN_DIFFERENT_SEPARATORS

Scored 200. Separators between the 3rd-4th and 6th-7th digit are not consistent.

SIN_REMOVED_SEPARATORS

Scored 100. Input SIN is in the usual format with dashes or spaces as separators.

VALID

SIN_VALID

Scored 0. Input SIN is valid.

Use a factor of 10 to gradate rules by importance, for example, 1 000 vs. 10 000 (like some of the rules in the previous table).

Attribute scoring vs. record scoring

Scoring rules are applied to individual attributes; however, it is also possible to aggregate scoring results on the record level, for example, for the purposes of matching and merging.

When it comes to record scoring, try to apply the following:

Score individual attributes using Best practices.

Optionally, divide attributes into two groups:

Important attributes (mandatory attributes).

Less important attributes.

For example, we do not want to count score instance using gender and phone number quality score because of low quality level and minor importance for matching purposes.

Attribute	Value	Importance
out_first_name	John	IMPORTANT
out_last_name	Rasmusen	IMPORTANT
out_gender	1	NOT IMPORTANT
out_birthdate_day	16.11.1978	IMPORTANT
out_credit_card	720412410	IMPORTANT
out_email	j.rasmusen@email.com	IMPORTANT
out_phone_number	12 012 004 548	NOT IMPORTANT
out_sin	720412410	IMPORTANT

Attribute

Value

Importance

out_first_name

John

IMPORTANT

out_last_name

Rasmusen

IMPORTANT

out_gender

NOT IMPORTANT

out_birthdate_day

16.11.1978

IMPORTANT

out_credit_card

720412410

IMPORTANT

out_email

j.rasmusen@email.com

IMPORTANT

out_phone_number

12 012 004 548

NOT IMPORTANT

out_sin

720412410

IMPORTANT

Use two columns for summed up record scores (or use sco_instance at least):
1. sco_instance - Sum of all attribute scores.
2. sco_instance_imp_att - Sum of all important attributes scores.

Was this page useful?