Prepare Clean Reference Data for Lookups
Clean and load data into reference data tables using a standalone transformation plan — filtering out obsolete records, standardizing formats, removing duplicates, and scheduling recurring refreshes so your reference data stays trustworthy. Centralized cleansing logic means the same standards apply on every load, and updates flow through to all future refreshes without rebuilding the pipeline.
Why clean reference data matters
Reference data tables provide the standardized values your organization uses for lookups and validations, such as country codes, product categories, and status codes. When source data has quality issues, loading it directly into reference tables spreads those problems across your organization:
-
Inconsistent formats: Codes appear in different cases and with extra spaces, so lookups can’t match values reliably.
-
Missing required fields: Records without key values can’t be used for validation and break referential checks downstream.
-
Duplicates from multiple sources: The same code appears multiple times under slightly different forms, splitting lookups across rows.
-
Invalid or inactive records: Outdated data that shouldn’t be used still gets matched against incoming data.
-
Manual cleansing doesn’t scale: Errors multiply as data volumes grow, and the same problems resurface every time the source refreshes.
Everyone who uses those reference tables inherits the quality problems.
How standalone plans prepare reference data
Use standalone transformation plans to prepare reference data:
-
Read data from source catalog items.
-
Filter out invalid or incomplete records.
-
Standardize formats: uppercase codes, trim whitespace, normalize dates.
-
Remove duplicates using group aggregation.
-
Load clean data to reference data tables.
-
Schedule recurring refreshes to keep reference data current.
Example: Prepare product category reference data
The source table product_categories_raw has the following structure:
| category_code | category_name | effective_date | status | last_modified |
|---|---|---|---|---|
ELEC |
Electronics |
2020-01-01 |
ACTIVE |
2024-01-15 |
elec |
electronics |
2020-01-01 |
ACTIVE |
2023-06-10 |
FURN |
Furniture |
2020-01-01 |
ACTIVE |
2024-02-01 |
LEGCY |
Legacy Products |
2015-03-01 |
|
2022-11-30 |
SOFT |
Software |
2021-06-01 |
ACTIVE |
2024-03-01 |
Unnamed |
2023-01-01 |
ACTIVE |
2024-01-01 |
|
DISC |
Discontinued Line |
2018-01-01 |
|
2021-05-01 |
Issues:
-
category_codeis missing for one record and mixed case across others (ELECandelecare the same category from two source systems). -
statusvaluesLEGCYandDISCareRETIREDand should not be loaded into reference data tables. -
category_namehas inconsistent casing.
Expected output:
| category_code | category_name | effective_date |
|---|---|---|
ELEC |
Electronics |
2020-01-01 |
FURN |
Furniture |
2020-01-01 |
SOFT |
Software |
2021-06-01 |
The plan handles this in four stages: filter — standardize — deduplicate — load.
Step 1: Create the transformation plan
-
Go to Data Quality > Data transformations
-
Select Create transformation plan > Standalone plan. A standalone plan is the right choice here because it runs on a schedule, reads from source catalog items, and writes output to a destination such as a reference data table.
-
Name your plan — for example,
Prepare Product Categories -
Add a description explaining what the plan does and which source it reads from
-
Select Create
The plan opens in Draft state with an empty canvas.
Step 2: Add input from source catalog items
Add a Catalog item input step and select your source catalog item. If you have multiple sources to combine, add additional input steps and use Union streams to merge them into a single flow before filtering.
Step 3: Filter out invalid records
Add a Filter step to remove records that should not be in reference data tables:
status = "ACTIVE" AND category_code is not NULL AND category_code <> ""
| In any expression field, you can use AI assistance — describe what you want in plain language and it will generate the syntax. For example: "Keep only records where status is ACTIVE and category_code is not empty." |
This removes RETIRED records (LEGCY, DISC) and the record with no category_code.
Records with formatting issues but a valid code and ACTIVE status should not be filtered here — the next step will fix those.
| Only use the Filter step to remove records that are intentionally out of scope (retired, discontinued). Do not filter out records just because they have quality issues — those records would be excluded from the output even though they could be fixed. Fix quality issues in the standardization step instead. |
Step 4: Standardize formats
Add a Transform data step to normalize inconsistent values:
| Attribute | Expression | Purpose |
|---|---|---|
|
|
Normalize to uppercase, remove extra spaces |
|
|
Remove leading and trailing spaces |
After this step, ELEC and elec both become ELEC and are ready for deduplication.
Step 5: Remove duplicates
Add a Group aggregator step to keep one record per category code.
Group by category_code and configure aggregations for the remaining attributes:
| Aggregation expression | Output attribute | Why |
|---|---|---|
|
|
Duplicates from casing inconsistencies will have identical names after standardization — |
|
|
Same reasoning — same source record, same date |
Step 6: Load to reference data table
Add a Reference data output step and configure:
-
Target table: Select your reference data table or create a new one.
-
Update method: Choose one of the following.
-
Upsert: Updates existing records and inserts new ones. Existing reference data codes no longer in the source are left unchanged. Recommended for recurring loads.
-
Replace: Deletes all existing data and loads fresh. Use when the source is fully authoritative.
-
Append: Adds all records as new. Use only for initial one-time loads.
-
-
Matching attributes: For Upsert, use
category_code. -
Workflow stage: Draft for review before publishing, or Published to make records immediately available.
Step 7: Preview and validate
-
Select Compute data preview and check results at each step to confirm the filter, standardization, and deduplication worked as expected:
-
After Filter: Verify invalid records are removed.
-
After Transform data: Check that formats are standardized.
-
After Group aggregator: Confirm duplicates are eliminated.
-
-
Select Validate plan and fix any errors before proceeding.
Make it repeatable
Schedule recurring updates
Open the plan, go to the three-dot menu, select Schedule, and configure a recurrence — for example, daily at 2 AM after the source system refreshes. Schedule reference data loads during off-peak hours and after source system updates complete.
Handle incremental updates
For large reference tables that change infrequently, add this condition to your Filter step to process only recently changed records:
last_modified >= dateAdd(now(), -1, "DAY")
This improves performance by processing only new and updated records instead of the full table on every run.
Was this page useful?