Tutorial: Create reference data from deduplicated catalog columns
This tutorial shows how to extract and deduplicate values from existing catalog items to create new reference data tables.
Scenario overview
You have a catalog item with data that contains duplicates and variations in key attributes. You want to create a clean reference data table with deduplicated values that can be used for standardization and validation.
What you’ll learn
- How to create reference data directly from existing catalog items.
- How to use data transformations to deduplicate data during reference data creation.
- How to configure the transformation pipeline with an input and a reference data output.
- How to use the Group Aggregator step for deduplication.
What you’ll need
For this tutorial, prepare:
- Source catalog item: A catalog item containing data with duplicate or inconsistent values (for example, department names, product categories, or location codes).
- Target attribute: Identify the specific attribute that contains values suitable for reference data.
- Access to catalog item: Ensure you have permissions to create reference data from the catalog item.
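To judge whether an attribute is a good candidate, it can help to profile its values outside the application first. The following is a minimal sketch in Python, assuming you have copied the values of one attribute from the record preview into a list. The sample values, the function name `profile_values`, and the trim-and-lowercase normalization are illustrative assumptions, not part of the product.

```python
from collections import Counter, defaultdict

# Hypothetical sample of one attribute (for example, department),
# copied from the record preview of the source catalog item.
values = ["Sales", "sales", "Sales ", "Marketing", "Sales", "HR", "hr"]


def profile_values(values):
    """Count exact duplicates and group spelling variants of the same value."""
    exact_counts = Counter(values)
    variants = defaultdict(set)
    for value in values:
        # Trimming and lowercasing is an illustrative normalization only.
        variants[value.strip().lower()].add(value)
    return exact_counts, dict(variants)


exact_counts, variants = profile_values(values)

# Values that occur more than once are exact duplicates.
duplicates = {v: n for v, n in exact_counts.items() if n > 1}

# Normalized keys with more than one spelling indicate inconsistent values.
inconsistent = {k: sorted(v) for k, v in variants.items() if len(v) > 1}

print(duplicates)
print(inconsistent)
```

An attribute with many exact duplicates and few spelling variants is usually a straightforward candidate. Many spelling variants suggest the values may need cleanup in addition to deduplication.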
 
Prerequisites
Before starting this tutorial, ensure you have:
- Appropriate role on reference data tables:
  - Owner or Editor role to create and modify reference data tables.
  - Approver role to publish changes (if using approval workflows).
  - At minimum, Viewer role to access table data.
- Working knowledge of data transformation plans.
- Sample data or catalog items to work with.
For information about roles and permissions, see User roles. For an introduction to data transformations, see Data Transformations Overview.
Step-by-step instructions
- Analyze the source data
  - Open your source catalog item.
  - Display the record preview.
  - Identify the attribute that contains duplicate or inconsistent values.
  - Note the variations and inconsistencies in the data.
- Create reference data from catalog item
  - Go to your source catalog item in the data catalog.
  - Select the three dots menu.
  - Select Create reference data.
  - Configure the data loading settings as usual.
  - Select the option Open the dataset in data transformations to prepare it before loading.
- Configure deduplication steps
  - The transformation opens with the input (your catalog item) and the Reference data output already added.
  - In the transformation canvas:
    - Add a Group Aggregator step before any delete operations.
    - In the Group Aggregator configuration:
      - Set the Group by field to the target attribute (for example, department).
      - Set Input attribute to your target attribute (for example, department).
      - Set the Aggregation function to ANY_VALUE.
      - Set Output attribute to a cleaned name (for example, department_deduplicated).
    - (Optional) Add a Delete Attribute step after the Group Aggregator.
    - (Optional) Configure the Delete Attribute step to remove the original attribute.
- Validate and execute
  - Validate the transformation plan configuration.
  - Check the data preview to ensure deduplication works correctly.
  - Publish the transformation plan.
  - Run the transformation.
  - Once the transformation completes, the output contains deduplicated values.
  - The reference data table is automatically created from the transformation output. To find it, search for it or go to Manage reference data > Tables.
  - Review the clean, deduplicated values in the newly created reference data table.
- Verify the results
  - Check that duplicate values have been removed.
  - Verify that the reference data table contains only unique values.
  - Confirm the data quality and consistency of the results.
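The effect of the Group Aggregator configuration and the uniqueness check can be pictured with a small Python sketch. It is only a model of the idea, not the product's implementation: the function names and sample rows are hypothetical, the ANY_VALUE aggregation is imitated by keeping the first value seen in each group, and the sketch assumes that the group-by attribute stays in the output until it is explicitly deleted.

```python
def group_aggregate_any_value(rows, group_by, input_attr, output_attr):
    """Group rows by one attribute and keep one arbitrary value per group.

    An ANY_VALUE-style aggregation is imitated by keeping the first value
    seen in each group, which is one valid choice of "any" value.
    """
    first_seen = {}
    for row in rows:
        key = row[group_by]
        if key not in first_seen:
            first_seen[key] = row[input_attr]
    return [{group_by: key, output_attr: value} for key, value in first_seen.items()]


def delete_attribute(rows, attr):
    """Drop one attribute from every row (the optional cleanup step)."""
    return [{k: v for k, v in row.items() if k != attr} for row in rows]


# Hypothetical input rows standing in for the source catalog item.
rows = [
    {"id": 1, "department": "Sales"},
    {"id": 2, "department": "Marketing"},
    {"id": 3, "department": "Sales"},
    {"id": 4, "department": "sales"},
]

grouped = group_aggregate_any_value(
    rows, "department", "department", "department_deduplicated"
)
result = delete_attribute(grouped, "department")

# Verification: every value in the output should appear exactly once.
values = [row["department_deduplicated"] for row in result]
is_unique = len(values) == len(set(values))

print(values)
print(is_unique)
```

In this sketch, "Sales" and "sales" remain separate entries because grouping compares exact values. If the Group Aggregator step behaves the same way, spelling variants need to be normalized before grouping, or reviewed manually in the resulting table.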
 
Expected outcome
You now have a clean reference data table with deduplicated values that can be used for validation, standardization, and other data quality processes.
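As an illustration of the validation use case, the sketch below checks incoming values against a set of reference values. It assumes the deduplicated values have been copied out of the reference data table into a Python set. The values and the `validate` function are hypothetical and do not represent a product API.

```python
# Hypothetical values copied from the deduplicated reference data table.
reference_values = {"Sales", "Marketing", "HR"}


def validate(values, reference_values):
    """Split incoming values into those found in the reference set and those not."""
    valid = [v for v in values if v in reference_values]
    invalid = [v for v in values if v not in reference_values]
    return valid, invalid


incoming = ["Sales", "Finance", "HR", "sales"]
valid, invalid = validate(incoming, reference_values)

print(valid)
print(invalid)
```

Values that fail the lookup, such as an unknown department or a differently cased spelling, are candidates for correction or for adding to the reference data table.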
Next steps
After completing this tutorial:
- Export to database: see Export to Database.
- Validate data quality: see Create Validation Rules.
- Learn best practices: see Best Practices.
- Use published data: see Work with Published Reference Data.
- Troubleshoot issues: see Troubleshooting.