User Community Service Desk Downloads

Prepare Clean Reference Data for Lookups

Clean and load data into reference data tables using a standalone transformation plan — filtering out obsolete records, standardizing formats, removing duplicates, and scheduling recurring refreshes so your reference data stays trustworthy. Centralized cleansing logic means the same standards apply on every load, and updates flow through to all future refreshes without rebuilding the pipeline.

Why clean reference data matters

Reference data tables provide the standardized values your organization uses for lookups and validations, such as country codes, product categories, and status codes. When source data has quality issues, loading it directly into reference tables spreads those problems across your organization:

  • Inconsistent formats: Codes appear in different cases and with extra spaces, so lookups can’t match values reliably.

  • Missing required fields: Records without key values can’t be used for validation and break referential checks downstream.

  • Duplicates from multiple sources: The same code appears multiple times under slightly different forms, splitting lookups across rows.

  • Invalid or inactive records: Outdated data that shouldn’t be used still gets matched against incoming data.

  • Manual cleansing doesn’t scale: Errors multiply as data volumes grow, and the same problems resurface every time the source refreshes.

Everyone who uses those reference tables inherits the quality problems.

How standalone plans prepare reference data

Use standalone transformation plans to prepare reference data:

  • Read data from source catalog items.

  • Filter out invalid or incomplete records.

  • Standardize formats: uppercase codes, trim whitespace, normalize dates.

  • Remove duplicates using group aggregation.

  • Load clean data to reference data tables.

  • Schedule recurring refreshes to keep reference data current.

Example: Prepare product category reference data

The source table product_categories_raw has the following structure:

category_code category_name effective_date status last_modified

ELEC

Electronics

2020-01-01

ACTIVE

2024-01-15

elec

electronics

2020-01-01

ACTIVE

2023-06-10

FURN

Furniture

2020-01-01

ACTIVE

2024-02-01

LEGCY

Legacy Products

2015-03-01

RETIRED

2022-11-30

SOFT

Software

2021-06-01

ACTIVE

2024-03-01

Unnamed

2023-01-01

ACTIVE

2024-01-01

DISC

Discontinued Line

2018-01-01

RETIRED

2021-05-01

Issues:

  • category_code is missing for one record and mixed case across others (ELEC and elec are the same category from two source systems).

  • status values LEGCY and DISC are RETIRED and should not be loaded into reference data tables.

  • category_name has inconsistent casing.

Expected output:

category_code category_name effective_date

ELEC

Electronics

2020-01-01

FURN

Furniture

2020-01-01

SOFT

Software

2021-06-01

The plan handles this in four stages: filter — standardize — deduplicate — load.

Step 1: Create the transformation plan

  1. Go to Data Quality > Data transformations

  2. Select Create transformation plan > Standalone plan. A standalone plan is the right choice here because it runs on a schedule, reads from source catalog items, and writes output to a destination such as a reference data table.

  3. Name your plan — for example, Prepare Product Categories

  4. Add a description explaining what the plan does and which source it reads from

  5. Select Create

The plan opens in Draft state with an empty canvas.

Step 2: Add input from source catalog items

Add a Catalog item input step and select your source catalog item. If you have multiple sources to combine, add additional input steps and use Union streams to merge them into a single flow before filtering.

Step 3: Filter out invalid records

Add a Filter step to remove records that should not be in reference data tables:

status = "ACTIVE" AND category_code is not NULL AND category_code <> ""
In any expression field, you can use AI assistance — describe what you want in plain language and it will generate the syntax. For example: "Keep only records where status is ACTIVE and category_code is not empty."

This removes RETIRED records (LEGCY, DISC) and the record with no category_code. Records with formatting issues but a valid code and ACTIVE status should not be filtered here — the next step will fix those.

Only use the Filter step to remove records that are intentionally out of scope (retired, discontinued). Do not filter out records just because they have quality issues — those records would be excluded from the output even though they could be fixed. Fix quality issues in the standardization step instead.

Step 4: Standardize formats

Add a Transform data step to normalize inconsistent values:

Attribute Expression Purpose

category_code

upper(trim(category_code))

Normalize to uppercase, remove extra spaces

category_name

trim(category_name)

Remove leading and trailing spaces

After this step, ELEC and elec both become ELEC and are ready for deduplication.

Step 5: Remove duplicates

Add a Group aggregator step to keep one record per category code. Group by category_code and configure aggregations for the remaining attributes:

Aggregation expression Output attribute Why

first(category_name)

category_name

Duplicates from casing inconsistencies will have identical names after standardization — first() is safe here

first(effective_date)

effective_date

Same reasoning — same source record, same date

Step 6: Load to reference data table

Add a Reference data output step and configure:

  • Target table: Select your reference data table or create a new one.

  • Update method: Choose one of the following.

    • Upsert: Updates existing records and inserts new ones. Existing reference data codes no longer in the source are left unchanged. Recommended for recurring loads.

    • Replace: Deletes all existing data and loads fresh. Use when the source is fully authoritative.

    • Append: Adds all records as new. Use only for initial one-time loads.

  • Matching attributes: For Upsert, use category_code.

  • Workflow stage: Draft for review before publishing, or Published to make records immediately available.

Step 7: Preview and validate

  1. Select Compute data preview and check results at each step to confirm the filter, standardization, and deduplication worked as expected:

    • After Filter: Verify invalid records are removed.

    • After Transform data: Check that formats are standardized.

    • After Group aggregator: Confirm duplicates are eliminated.

  2. Select Validate plan and fix any errors before proceeding.

Step 8: Run the transformation

From the three-dot menu, select Run plan. Monitor execution in the Processing Center and verify record counts match the preview. If you loaded to Draft stage, review and publish the records in the reference data table.

Make it repeatable

Schedule recurring updates

Open the plan, go to the three-dot menu, select Schedule, and configure a recurrence — for example, daily at 2 AM after the source system refreshes. Schedule reference data loads during off-peak hours and after source system updates complete.

Handle incremental updates

For large reference tables that change infrequently, add this condition to your Filter step to process only recently changed records:

last_modified >= dateAdd(now(), -1, "DAY")

This improves performance by processing only new and updated records instead of the full table on every run.

Was this page useful?