User Community Service Desk Downloads
If you can't find the product or version you're looking for, visit support.ataccama.com/downloads

Data Transformation Plans

Spark pushdown is not supported.

Transform data in ONE webapp using Data Transformation Plans. Plans are visual representations of data processing job definitions. Once you have configured your plan to process the input data as required and defined the data flow and output format, you can run the plan to execute the jobs. Each step represents basic algorithms, and data is processed record by record.

This facilitates two key use cases:

  1. Transformation of data quality monitoring project results. In this way, data transformation plans supersede post processing plans.

    When creating plans to transform monitoring project results, initiate the plan from the monitoring project itself to ensure correct processing. For more information, see Create plan to transform data quality monitoring results.
    high level visualization of plan to transform data quality monitoring project results
  2. Standalone data transformation of catalog items. See, Create plan to transform a catalog item.

    high level visualization of plan to transform catalog item
In addition to transformation plans, data remediation plans are available. Using data remediation plans, users can load data, metadata, and DQ results from monitoring projects into ONE Data for data remediation. The remediation step also stores DQ results in ONE Data, meaning you can view DQ results and also filter by DQ results. Although transformation plans can also write to ONE Data, only the data remediation output step writes the term and rules metadata from catalog items to ONE Data. The data remediation output is designed for the specific use case of import of issues for manual resolution.

Once you have configured your data input (catalog item or monitoring project results) you can then add a number of steps, such as filter, split, join or transform, and add or delete attributes. You can also embed plans within plans using the Embedded transformation plan step. Each step represents a data processing job. After transformation, data can be exported to ONE Data or to files in ONE Object Storage or be used in another plan. You can preview data and validate expressions in real-time when working with data transformation plans.

high level visualization of end-to-end transformation plan

Click anywhere on the blue banner to minimize a step. You can then access the three dots menu.

minimize transformation step
show three dots menu

For easier editing you can switch to Full screen mode

switch to full screen mode

Zoom in or out as needed, and use the compass to recenter when editing.

zoom or recenter plan

How is data transformed

To transform data using transformation plans:

  • Select your input data.

  • Build your plans using steps which can, for example, filter, transform or join the data. These steps are similar to those in ONE Desktop, which leverage ONE Expressions. Each step represents a data processing job.

  • Draw connections between steps to control the data flow.

  • Select your data output.

  • Run the transformation to trigger the flow of data through the plan.

  • For the full list of steps available, and how to use them, see Available steps.

Prerequisites

The data source connection of the catalog item you want to use in your transformation plan must have data export enabled. To verify this, go to Data Catalog > Sources > [your data source] > Connections. In the three dots menu ( ⋮ ) of your connection, select Show details. In Connection details, Enable exporting and loading of data should be enabled. Enable load and export

If not, use the three dots menu and select Edit.

Create plan to transform a catalog item

In this example, we will create a plan which inputs the catalog item Customers, adds attributes, removes other attributes, filters out selected records, and then writes the output to a ONE Data table. In ONE Data you can create & maintain reference data, as well as fix and remediate errors.

To create a plan for standalone transformation of a catalog item, in Data Quality > Transformation plans, select Create.

menu location

Alternatively, from the required catalog item, use the three dots menu and select Add transformation.

create transformation plan from catalog item

Provide a name for the new plan and select Save. In this example, we are creating a transformation plan for the catalog item customers, so let’s call the plan Customers transformation plan. After selecting Save you are redirected to the plan canvas.

new plan canvas

Select Add data input and then Catalog item input.

catalog item input step

Repeat with Add data output. For the purpose of this example, we are using the ONE Data writer step.

Draw a connection between the two steps.

example input and output step and connection

To create step connections, hover your mouse over the circle on the edge of the step. When a cross appears, click, and draw a connection to another step. image::data-transformation-plans-draw-connection.png[draw connection between steps, 400]

Configure input step

In Catalog item, select the catalog item you would like to use as an input. In our case, this is customers.

To see how to configure alternative input steps, see Available steps.
configure catalog item input step

Add new steps

To add more steps, use the plus icon in the connection, or select Add new step.

add new step

For the purpose of this example, we need to add the following steps:

  • Add attributes, to create a new attribute in the output, full name, which is a concatenation of the existing input attributes first name and last name.

  • Delete attributes, to remove surplus attributes, leaving us with just the key attributes in the output.

  • Filter, to filter out all records which are not from a selected country, in our case, this will be everything not from the United States.

steps needed for catalog item transformation example

Click on each step in turn to configure it, and follow the instructions below.

Configure transformation steps

First we want to create a new attribute, full_name, which is a concatenation of the existing input attributes contact first name and contact last name. For this, we use the Add attributes step. It is configured as follows:

  1. Name: A custom name for the step (optional).

  2. Name and Type: The attribute name and data type that will be used for the output. In our case this is full_name and String.

  3. Expression: A ONE expression which defined the new attribute. For our use case, we can use the expression trim (contactfirstname+' '+contactlastname). This also trims the inputs before joining them.

configure add attributes step

There are some attributes in our catalog item, customers, that we do not need in the output. To remove these we can use the Delete attributes step. It is configured as follows:

  1. Name: A custom name for the step (optional).

  2. Deleted attributes: Use Add and select attributes from the options available. In our case, this is creditlimit, customername, email, phone, postalcode, salesemployeenumber, src_sys_id.

configure delete attributes step

To filter out all records which are not from the United States, we can use the Filter step. It is configured as follows:

  1. Name: A custom name for the step (optional).

  2. Condition: The filter condition, defined using ONE Expressions. In our case this is going to be attribute name = 'value', so country = 'USA'.

configure filter data step
To see how to configure alternative transformation steps, see Available steps.

Preview data

At any point, you can preview the data that will be output as a result of the selected step, and verify it is functioning as you planned. Use the dropdown to switch between steps.

preview data

Validate plan

You can also validate the plan as you go, by selecting Validate plan in the bottom right corner of the canvas. You are alerted of any errors, for example, incorrect expression conditions. If validation issues are present, when you select Run transformation, the data processing job fails.

validate plan

Configure output step

We are writing the data to a new table in ONE Data, using the ONE Data writer step. It is configured as follows:

  1. Name: A custom name for the step (optional).

  2. ONE Data table: The name of the table you want to write into. Select an existing table, or create a new one:

    1. To create a new table, type the name you want to use for the new table, then select Add {table name}.

      Write new table
      If a catalog item with this name already exists, and you have sufficient permissions, the existing table is overridden by the outputs of the plan.
    2. To write into an existing table, select the table from the list available.

      Select existing table
Select Write all attributes to include all attributes that are inputs to the step, without having to add them manually.
configure one data writer step
Use Write all attributes or add attributes manually, these two methods can’t be used in combination.

Run transformation plan

Use the three dots menu and select Run transformation.

run transformation plan

A new job runs for the transformation plan. Data from the data source is read and processed on Ataccama DPE, with results stored in the configured plan output. You are alerted when the job is finished.

Information about transformation plan jobs can be found by going to Processing center > Base jobs > Transformation plan jobs, or by searching for job Type DQC.
check processing center

Create plan to transform data quality monitoring results

When creating a transformation that processes not only the source data but also needs results of data quality (passed rules, failed rules, etc.), the start of your transformation plan creation is the desired monitoring project. This will then ensure this plan is run with each run of the monitoring project, and that the output is available in the monitoring project Export tab.

In this example, we will create a plan which inputs results from a catalog item in a monitoring project, filters the results, and writes them to a .csv file.

Create new plan

Create transformation plans from monitoring project items to automatically apply transformations after data quality has been evaluated.

To create a new transformation plan from a monitoring project item, select the required project and navigate to the Configuration & Results tab. On the relevant catalog item, select the three dots menu and then Add post-processing transformation.

Add post processing transformation plan from monitoring project catalog item

Provide a name for your transformation plan.

new plan name

This opens the plan editor with pre-configured inputs and outputs.

monitoring project input and file output

By default, the output is a csv file named after the catalog item from which the project has been created. In this example, we will filter out all results which are not from the US, and all those without data quality issues, so we can rename the output file accordingly.

file output step

Add and configure transformation steps

To filter out all data without data quality issues and those that are not from the US, we can use the Filter step. To add a new step, use the plus icon on the connection, or select Add new step.

add new step

Select Filter. For our use case, we need the expression invalid_rules is not null and country = 'USA'.

configure filter data step

Run monitoring

Running the monitoring project will automatically trigger the transformation plan. After the jobs have finished, the results of the transformation plans are found in the monitoring project Export tab.

finished post processing transformation plan

Available steps

In the examples above, you can see how to configure a few selected steps. The full list of steps available for you to use and information on how to configure them can be seen below.

Catalog item input

Reads data from the selected catalog item.

  1. In Name, provide a custom name for the step (optional).

  2. In Catalog item, select the catalog item from the list available.

catalog item input step

Input

A generic input step which can be used to embed the plan into another.

  1. In Name, provide a custom name for the step (optional).

  2. Use Add attribute to add input attribute names and data types.

Typically, these would correspond with the output endpoints of another plan.
generic input step

ONE Data reader

Use ONE Data to create & maintain reference data, as well as fix and remediate errors. For more information, see Data Remediation with ONE Data.

Reads data from the selected ONE Data catalog item.

  1. In Name, provide a custom name for the step (optional).

  2. In ONE Data table, select the item from the list provided.

ONE data reader step

File output

Writes data into a text file in ONE object storage.

  1. In Name, provide a custom name for the step (optional).

  2. In File name, provide the name for the output file.

When writing to a text file, columns will be separated by delimiters.
file output step
Text files are only available to end-users when they are generated as part of a DQ monitoring project, i.e. as part of plans using the Monitoring project post-processing input step which are initialized from the monitoring project. You can find them in the Export tab of the relevant monitoring project.

ONE Data writer

Writes data into new or existing ONE Data catalog items.

  1. In Name, provide a custom name for the step (optional).

  2. In ONE Data table, select a table to write into, or create a new one:

    1. To create a new table, type the name you want to use for the new table, then select Add {table name}.

      Write new table
      If a catalog item with this name already exists, and you have sufficient permissions, the existing table is overridden by the outputs of the plan.
    2. To write into an existing table, select the table from the list available.

      Select existing table
  3. Use Add attribute and add all the attributes you want to be present in your ONE Data table. Map the attributes to your input data. Attribute names can match the source data, or you can provide alternative names.

    Select Write all attributes to include all attributes which are inputs to the step, without having to add them manually.
ONE data writer step

Output

A generic output step which can be used to embed the plan into another.

  1. In Name, provide a custom name for the step (optional).

  2. Select Enforce format to limit the visible fields in the output to the attributes defined. Otherwise, the format is defined by the inputs to this step.

  3. Using Add attribute, add input attribute names and data types.

    Typically, these would correspond with the input endpoints of another plan.
generic output step

Data remediation output

The data remediation output step should be used in data remediation plans. For more information, see Data Remediation plans.

What is the difference between the data remediation output step and the ONE Data writer step?

The data remediation output step writes the term and rules metadata from catalog items to ONE Data: ONE Data writer step does not copy information about applied terms or rules.

Due to this, there are fewer data transformation options in data remediation plans (which use the data remediation output step) than in general transformation plans, as it is not possible to change attribute structure.

The remediation step also stores DQ results in ONE Data, meaning you can view DQ results and also filter by DQ results.

The data remediation output is designed for the specific use case of import of issues for manual resolution.

Add attributes

Adds new attributes.

  1. In Name, provide a custom name for the step (optional).

  2. Provide the attribute name and data type, and define the attribute using ONE expressions. For example, to create a new attribute, full name, which concatenates two existing attributes, first name and last name, and also trims the value, you would use the expression trim(contactfirstname+' '+contactlastname).

    add attributes step
  3. Select Add Expression attribute to add additional attributes, and repeat step 2 as necessary.

Condition

Splits the data flow into two streams based on the ONE expression condition.

  1. In Name, provide a custom name for the step (optional).

  2. In Condition, define a condition to separate the data using ONE expressions. Data that satisfies the condition will be sent to the output out_true, data that does not satisfy the condition will be sent to the output out_false. From these outputs, create connections to further steps as required.

condition step

Delete attributes

Deletes selected attributes.

  1. In Name, provide a custom name for the step (optional).

  2. Select Add to add attributes to be deleted from the list provided.

    You will only see attributes from steps which have already been connected as inputs to the Delete attributes step.
delete attributes step

Filter

Filters out all records which don’t pass the ONE expression condition. To use:

  1. In Name, provide a custom name for the step (optional).

  2. Add a condition using ONE expressions. Only data that satisfies this condition will be output to the next step of the plan.

filter step

Embedded component

Inserts a component from ONE Desktop.

  1. In Name, provide a custom name for the step (optional).

  2. In Component, select the component from the list available.

    The components available here are those which have been created in ONE Desktop. They can be seen in ONE by selecting Data Quality > Components.

    components list ONE

Join

Joins two separate data flows into a single data flow (vertical merge).

vertical merge example
  1. In Name, provide a custom name for the step (optional).

  2. From other steps, create a connection to in_left and in_right, as required.

  3. Use leftKey and rightKey values to govern how the data records are paired up across the data input flows.

    1. In Left key add the column that contains key values (primary keys) of the in_left input flow

    2. In Right key add the column that contains key values (primary keys) of the in_right input flow.

  4. Select the required Join type:

    • Inner join: Only data that has a key that was found in both input flows are sent to output (records will be joined when the leftKey value matches the rightKey value).

    • Outer join: All records from both input flows are sent to output (even those with a key that was not found in both flows).

    • Left join: All records from the left input flow are sent to output (even those with a key that was not found in the right flow). Records in the right data frame which lack a match in the left data flow are dropped.

    • Right join: All records from the right input flow are sent to output (even those with a key that was not found in the left flow). Records in the left data flow which lack a match in the right data flow are dropped.

  5. Specify the columns that should be present after the join.

Embedded transformation plan

Inserts an existing transformation plan created in ONE.

  1. In Name, provide a custom name for the step (optional).

  2. In Transformation plan, select the plan from the list available.

embedded transformation plan step

Split

Splits the data flow into three streams based on the ONE expression condition.

  1. In Name, provide a custom name for the step (optional).

  2. In Condition, define a condition to split the data using ONE expressions. Data that satisfies the condition will be sent to the output out_true, data that does not satisfy the condition will be sent to the output out_false. All input data will be output to out_all. From these outputs, create connections to further steps as required.

split step

Transform data

Transforms attribute values using ONE expression language.

  1. In Name, provide a custom name for the step (optional).

  2. Provide the attribute you want to transform and define the transformation using ONE expressions.

    transform data step

Union streams

Merges input formats of two defined inputs into a single output format (horizontal merge). Data is read from the individual inputs (in_a or in_b) and written to the output. Therefore, in the output there are merged-format data records, where each row contains data read from either in_a or in_b input.

horizontal merge example
  1. In Name, provide a custom name for the step (optional).

  2. From other steps, create a connection to in_a and in_b, as required.

  3. Select Add Union mapping and specify:

    1. Destination: destination column name.

    2. In a: data to be input from in_a.

    3. In b: data to be input from in_b.

  4. Repeat as necessary.

union streams step

If the structure or data type of the inputs are not altered by intermediate steps, it is also possible to mimic Union stream by mapping two compatible inputs to your output, for example the ONE Data writer step. This means you can make use of the Write all attributes feature, rather than having to manually define your outputs in the Union stream step. This functions like the Union Same step in ONE Desktop.

union same mimic

Monitoring project post processing input

Inputs the data and DQ evaluation results from a monitoring project.

When using the monitoring project post processing input step, initiate the plan from the monitoring project itself to ensure correct processing: don’t add the step in the plan editor using Add data input or Add step.
  1. Locate the required catalog item in the Configuration & Results tab of a monitoring project.

  2. Select the three dots menu and then Add post-processing transformation.

  3. Provide a name for the plan and select Confirm.

  4. Your input step is ready to use.

Data preview

select step for preview

When transformation plans are run, the data gets stored in internal tables in ONE Data (these are not visible in the application). Select a step to see a preview of the resultant data. If the step has multiple outputs, you can further select for which output you want to view the data, for example, out_true or out_false.

The preview will be retained if you close the plan and re-open it. Likewise, after editing a plan, the preview is not updated unless you manually recompute it: to do this, select Recompute preview.

recompute preview

You can also view two different previews side-by-side, using the Show secondary preview toggle.

recompute preview

Preview is a separate job. If the job fails, check the Processing center for more details. Transformation plan preview jobs can be found by going to Processing center > Base jobs > Transformation plan preview jobs, or by searching for job Type DQC_PREVIEW.

Was this page useful?