Prepare Clean Reference Data for Lookups

Clean and load data into reference data tables using a standalone transformation plan — filtering out obsolete records, standardizing formats, removing duplicates, and scheduling recurring refreshes so your reference data stays trustworthy. Centralized cleansing logic means the same standards apply on every load, and updates flow through to all future refreshes without rebuilding the pipeline.

Why clean reference data matters

Reference data tables provide the standardized values your organization uses for lookups and validations, such as country codes, product categories, and status codes. When source data has quality issues, loading it directly into reference tables spreads those problems across your organization:

Inconsistent formats: Codes appear in different cases and with extra spaces, so lookups can’t match values reliably.
Missing required fields: Records without key values can’t be used for validation and break referential checks downstream.
Duplicates from multiple sources: The same code appears multiple times under slightly different forms, splitting lookups across rows.
Invalid or inactive records: Outdated data that shouldn’t be used still gets matched against incoming data.
Manual cleansing doesn’t scale: Errors multiply as data volumes grow, and the same problems resurface every time the source refreshes.

Everyone who uses those reference tables inherits the quality problems.

How standalone plans prepare reference data

Use standalone transformation plans to prepare reference data:

Read data from source catalog items.
Filter out invalid or incomplete records.
Standardize formats: uppercase codes, trim whitespace, normalize dates.
Remove duplicates using group aggregation.
Load clean data to reference data tables.
Schedule recurring refreshes to keep reference data current.

Example: Prepare product category reference data

The source table product_categories_raw has the following structure:

category_code	category_name	effective_date	status	last_modified
ELEC	Electronics	2020-01-01	ACTIVE	2024-01-15
elec	electronics	2020-01-01	ACTIVE	2023-06-10
FURN	Furniture	2020-01-01	ACTIVE	2024-02-01
LEGCY	Legacy Products	2015-03-01	`RETIRED`	2022-11-30
SOFT	Software	2021-06-01	ACTIVE	2024-03-01
	Unnamed	2023-01-01	ACTIVE	2024-01-01
DISC	Discontinued Line	2018-01-01	`RETIRED`	2021-05-01

Issues:

category_code is missing for one record and mixed case across others (ELEC and elec are the same category from two source systems).
status values LEGCY and DISC are RETIRED and should not be loaded into reference data tables.
category_name has inconsistent casing.

Expected output:

category_code	category_name	effective_date
ELEC	Electronics	2020-01-01
FURN	Furniture	2020-01-01
SOFT	Software	2021-06-01

category_code

category_name

effective_date

ELEC

Electronics

2020-01-01

FURN

Furniture

2020-01-01

SOFT

Software

2021-06-01

The plan handles this in four stages: filter — standardize — deduplicate — load.

Step 1: Create the transformation plan

Go to Data quality > Data transformations
Select Create transformation plan > Standalone plan. A standalone plan is the right choice here because it runs on a schedule, reads from source catalog items, and writes output to a destination such as a reference data table.
Name your plan — for example, Prepare Product Categories
Add a description explaining what the plan does and which source it reads from
Select Create

The plan opens in Draft state with an empty canvas.

Step 2: Add input from source catalog items

Add a Catalog item input step and select your source catalog item. If you have multiple sources to combine, add additional input steps and use Union streams to merge them into a single flow before filtering.

Step 3: Filter out invalid records

Add a Filter step to remove records that should not be in reference data tables:

status = "ACTIVE" AND category_code is not NULL AND category_code <> ""

In any expression field, you can use AI assistance — describe what you want in plain language and it will generate the syntax. For example: "Keep only records where status is ACTIVE and category_code is not empty."

This removes RETIRED records (LEGCY, DISC) and the record with no category_code. Records with formatting issues but a valid code and ACTIVE status should not be filtered here — the next step will fix those.

Only use the Filter step to remove records that are intentionally out of scope (retired, discontinued). Do not filter out records just because they have quality issues — those records would be excluded from the output even though they could be fixed. Fix quality issues in the standardization step instead.

Step 4: Standardize formats

Add a Transform data step to normalize inconsistent values:

Attribute Expression Purpose

Attribute	Expression	Purpose
`category_code`	`upper(trim(category_code))`	Normalize to uppercase, remove extra spaces
`category_name`	`trim(category_name)`	Remove leading and trailing spaces

category_code

upper(trim(category_code))

Normalize to uppercase, remove extra spaces

category_name

trim(category_name)

Remove leading and trailing spaces

After this step, ELEC and elec both become ELEC and are ready for deduplication.

Step 5: Remove duplicates

Add a Group aggregator step to keep one record per category code. Group by category_code and configure aggregations for the remaining attributes:

Aggregation expression Output attribute Why

Aggregation expression	Output attribute	Why
`first(category_name)`	`category_name`	Duplicates from casing inconsistencies will have identical names after standardization — `first()` is safe here
`first(effective_date)`	`effective_date`	Same reasoning — same source record, same date

first(category_name)

category_name

Duplicates from casing inconsistencies will have identical names after standardization — first() is safe here

first(effective_date)

effective_date

Same reasoning — same source record, same date

Step 6: Load to reference data table

Add a Reference data output step and configure:

Target table: Select your reference data table or create a new one.
Update method: Choose one of the following.
- Upsert: Updates existing records and inserts new ones. Existing reference data codes no longer in the source are left unchanged. Recommended for recurring loads.
- Replace: Deletes all existing data and loads fresh. Use when the source is fully authoritative.
- Append: Adds all records as new. Use only for initial one-time loads.
Matching attributes: For Upsert, use category_code.
Workflow stage: Draft for review before publishing, or Published to make records immediately available.

Step 7: Preview and validate

Select Compute data preview and check results at each step to confirm the filter, standardization, and deduplication worked as expected:
- After Filter: Verify invalid records are removed.
- After Transform data: Check that formats are standardized.
- After Group aggregator: Confirm duplicates are eliminated.
Select Validate plan and fix any errors before proceeding.

Step 8: Run the transformation

From the three-dot menu, select Run plan. Monitor execution in the Processing Center and verify record counts match the preview. If you loaded to Draft stage, review and publish the records in the reference data table.

Make it repeatable

Schedule recurring updates

Open the plan, go to the three-dot menu, select Schedule, and configure a recurrence — for example, daily at 2 AM after the source system refreshes. Schedule reference data loads during off-peak hours and after source system updates complete.

Handle incremental updates

For large reference tables that change infrequently, add this condition to your Filter step to process only recently changed records:

last_modified >= dateAdd(now(), -1, "DAY")

This improves performance by processing only new and updated records instead of the full table on every run.

Next steps

Was this page useful?

Prepare Clean Reference Data for Lookups

Why clean reference data matters

How standalone plans prepare reference data

Example: Prepare product category reference data

Step 1: Create the transformation plan

Step 2: Add input from source catalog items

Step 3: Filter out invalid records

Step 4: Standardize formats

Step 5: Remove duplicates

Step 6: Load to reference data table

Step 7: Preview and validate

Step 8: Run the transformation

Make it repeatable

Schedule recurring updates

Handle incremental updates

Next steps

Related scenarios