Tutorial: Create reference data from deduplicated catalog columns
This tutorial shows how to extract and deduplicate values from existing catalog items to create new reference data tables.
Scenario overview
You have a catalog item with data that contains duplicates and variations in key attributes. You want to create a clean reference data table with deduplicated values that can be used for standardization and validation.
What you’ll learn
- How to create reference data directly from existing catalog items.
- How to use data transformations to deduplicate data during reference data creation.
- How to configure the transformation pipeline with input and reference data output.
- How to use the Group Aggregator step for deduplication.
What you’ll need
For this tutorial, prepare:
- Source catalog item: A catalog item containing data with duplicate or inconsistent values (for example, department names, product categories, or location codes).
- Target attribute: Identify the specific attribute that contains values suitable for reference data.
- Access to catalog item: Ensure you have permissions to create reference data from the catalog item.
Prerequisites
Before starting this tutorial, ensure you have:
- Appropriate role on reference data tables:
  - Owner or Editor role to create and modify reference data tables.
  - Approver role to publish changes (if using approval workflows).
  - At minimum Viewer role to access table data.
- Working knowledge of data transformation plans.
- Sample data or catalog items to work with.
For information about roles and permissions, see User roles. For an introduction to data transformations, see Data Transformations Overview.
Step-by-step instructions
- Analyze the source data
  - Open your source catalog item.
  - Display the record preview.
  - Identify the attribute that contains duplicate or inconsistent values.
  - Note the variations and inconsistencies in the data.
- Create reference data from catalog item
  - Go to your source catalog item in the data catalog.
  - Select the three dots menu.
  - Select Create reference data.
  - Configure the data loading settings as usual.
  - Check Open the dataset in data transformations to prepare it before loading.
- Configure deduplication steps
  - The transformation opens with the input (your catalog item) and Reference data output already added.
  - In the transformation canvas:
    - Add a Group Aggregator step before any delete operations.
    - In the Group Aggregator configuration:
      - Set the Group by field to the target attribute (for example, department).
      - Set Input attribute to your target attribute (for example, department).
      - Set the Aggregation function to ANY_VALUE.
      - Set Output attribute to a cleaned name (for example, department_deduplicated).
    - (Optional) Add a Delete Attribute step after the Group Aggregator.
    - (Optional) Configure it to remove the original attribute.
- Validate and execute
  - Validate the transformation plan configuration.
  - Check the data preview to ensure deduplication works correctly.
  - Publish the transformation plan.
  - Run the transformation.
- Once the transformation completes, the output contains deduplicated values.
- The reference data table is automatically created from the transformation output. Search for it or go to Manage reference data > Tables to find it.
- Review the clean, deduplicated values in the newly created reference data table.
- Verify the results
  - Check that duplicate values have been removed.
  - Verify that the reference data table contains only unique values.
  - Confirm the data quality and consistency of the results.
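To make the grouping logic of the steps above concrete, here is a minimal Python sketch of the same idea outside the product. It assumes that grouping matches values exactly and that an ANY_VALUE-style aggregation returns one arbitrary value per group; the function name `group_any_value` and the sample rows are hypothetical, and this is an illustration rather than how the Group Aggregator step is implemented.

```python
from collections import Counter

# Hypothetical sample rows standing in for the catalog item's records.
rows = [
    {"id": 1, "department": "Sales"},
    {"id": 2, "department": "Finance"},
    {"id": 3, "department": "Sales"},
    {"id": 4, "department": "sales"},
    {"id": 5, "department": "Finance"},
]


def group_any_value(records, group_by, input_attr, output_attr):
    """Group records by `group_by` and keep one `input_attr` value per group.

    Assumption: the first value seen in each group is kept. Any value
    from the group would be an equally valid ANY_VALUE-style result.
    """
    groups = {}
    for record in records:
        key = record[group_by]
        # setdefault stores a value only the first time a key is seen.
        groups.setdefault(key, record[input_attr])
    return [{output_attr: value} for value in groups.values()]


# Analysis: count occurrences per value to spot duplicates.
counts = Counter(row["department"] for row in rows)

# Deduplication: one output row per distinct value.
deduplicated = group_any_value(
    rows, "department", "department", "department_deduplicated"
)

# Verification: the output contains each value exactly once.
values = [row["department_deduplicated"] for row in deduplicated]
is_unique = len(values) == len(set(values))
```

In this sketch, exact-match grouping removes repeated values but keeps spelling or case variants such as Sales and sales as separate entries, so variations noted during analysis may still need to be standardized separately.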
Expected outcome
You now have a clean reference data table with deduplicated values that can be used for validation, standardization, and other data quality processes.
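As an illustration of the validation use case, the following Python sketch checks incoming values against a set of reference values. The reference values, the incoming values, and the function name `find_invalid` are all hypothetical; in practice, validation would be configured in the product rather than written by hand.

```python
# Hypothetical reference values, as they might appear in the new table.
reference_values = {"Sales", "Finance", "Engineering"}


def find_invalid(values, reference):
    """Return values missing from the reference set, in first-seen order, without repeats."""
    seen = set()
    invalid = []
    for value in values:
        if value not in reference and value not in seen:
            seen.add(value)
            invalid.append(value)
    return invalid


# Hypothetical incoming data to validate.
incoming = ["Sales", "sales", "Finance", "HR", "sales"]
invalid = find_invalid(incoming, reference_values)
```

Here the check is an exact, case-sensitive match, so a variant such as sales is reported as invalid even though Sales is a valid reference value.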
Next steps
After completing this tutorial:
- Export to database: Export to Database.
- Validate data quality: Create Validation Rules.
- Learn best practices: Best Practices.
- Use published data: Work with Published Reference Data.
- Troubleshoot issues: Troubleshooting.