The Data Processing Module (DPM) is the interface between the Data Processing Engine (DPE) and the rest of the Ataccama ONE Platform. While DPE connects to one or more data sources, DPM distributes data processing jobs between the available DPE instances. To allocate the workload efficiently, DPM keeps track of which data sources each DPE has access to and how many jobs each DPE can process simultaneously.

Since multiple DPEs can work with the same data source, DPM functions as a form of load balancer, allowing you to scale DPEs horizontally. Once a DPE instance has completed a job, the results are sent to DPM, which in turn makes the data available to other ONE modules, namely the Metadata Management Module (MMM).
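The load-balancing idea can be illustrated with a minimal sketch. This is not the actual DPM scheduler; the `Engine` class, the `dispatch` function, and the least-loaded selection policy are all illustrative assumptions showing how jobs could be routed only to DPEs that can reach the job's data source and still have free capacity.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical stand-in for a registered DPE instance."""
    name: str
    sources: set   # data sources this DPE can reach
    capacity: int  # how many jobs it can process simultaneously
    running: int = 0


def dispatch(job_source, engines):
    """Pick the least-loaded DPE that can reach the job's data source.

    Returns the chosen engine, or None if every eligible DPE is busy
    (in which case the job would wait in the queue).
    """
    eligible = [e for e in engines
                if job_source in e.sources and e.running < e.capacity]
    if not eligible:
        return None
    chosen = min(eligible, key=lambda e: e.running)
    chosen.running += 1
    return chosen
```

For example, with one DPE that reaches only a relational source and another that also reaches Snowflake, a Snowflake job goes to the second engine, and once that engine is at capacity, further Snowflake jobs wait.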

Communication between DPM and DPE

Data Processing Engine connects directly to Data Processing Module in one of two ways.

Two-way communication

By default, DPM and DPE use two-way communication where DPE informs DPM of its availability and DPM pushes jobs to DPE. The main advantage over one-way communication is that DPE does not sit idle while a job waits for it in DPM, which makes job processing quicker.

One-way communication

This communication option is best suited to cases where DPE is located in a network that should not receive inbound traffic. For example, this is useful when the Ataccama ONE Platform as a Service (PaaS) solution works with data that should not be sent into the corporate network for security reasons, or when working with a data source containing highly sensitive data located in an especially secure part of the corporate network.

With one-way communication, DPE reaches out to DPM to request new data processing jobs. Since these requests are sent periodically, Ataccama ONE might seem slow to respond to user requests: a new job is not picked up as soon as it is created but only at the next polling interval, which can be several minutes later.
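The polling pattern described above can be sketched as follows. This is an illustrative loop, not DPE's actual implementation; the `fetch_jobs` and `run_job` callables, the interval value, and the `max_cycles` parameter are all assumptions. The key point is that every exchange is initiated outbound by DPE, so no inbound connection into DPE's network is ever needed.

```python
import time

POLL_INTERVAL_SECONDS = 120  # illustrative; the real interval is configurable


def poll_for_jobs(fetch_jobs, run_job, max_cycles=None):
    """One-way mode: DPE initiates every exchange with DPM.

    fetch_jobs: callable that asks DPM for pending jobs (outbound request)
    run_job:    callable that processes one job and returns its result
    max_cycles: optional limit on polling cycles (None = poll forever)
    """
    cycle = 0
    results = []
    while max_cycles is None or cycle < max_cycles:
        for job in fetch_jobs():          # outbound request to DPM
            results.append(run_job(job))  # process locally, report back
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            time.sleep(POLL_INTERVAL_SECONDS)  # jobs created now wait here
    return results
```

The `time.sleep` line is where the perceived latency comes from: a job created just after a poll waits a full interval before any DPE sees it.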

Communication between DPE and data sources

DPE can connect to many different data sources using JDBC connections or dedicated connections for sources like Spark, Databricks, and Snowflake. By default, DPE uses SSL/TLS-encrypted connections for any data source that supports them. However, encryption is not required: DPE can still connect to unencrypted data sources, so you can reach legacy systems just as easily as modern ones.
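As a hedged illustration of the encrypted-by-default idea, the helper below assembles a PostgreSQL-style JDBC URL. The function name and the `use_tls` flag are assumptions made for this sketch; the `ssl` and `sslmode` parameters follow the PostgreSQL JDBC driver, and other drivers spell their TLS options differently.

```python
def build_jdbc_url(host, port, database, use_tls=True):
    """Assemble a PostgreSQL-style JDBC URL (illustrative helper).

    use_tls=True mirrors the default of preferring encrypted connections;
    passing use_tls=False allows connecting to legacy, unencrypted sources.
    """
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    if use_tls:
        # Parameter names as used by the PostgreSQL JDBC driver.
        url += "?ssl=true&sslmode=require"
    return url
```

Here encryption is opt-out rather than opt-in, which matches the described default behavior.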

How does DPE collect data?

Once DPE connects to your data source, it waits for instructions about how to process the data. These instructions can include profiling the data source, applying data quality (DQ) rules, and so on.

All data processing happens only in DPE, during which DPE produces metadata describing your data source. The resulting metadata does not contain any of your data.

With the default configuration, DPE also collects a small sample of records that do not pass DQ rules. DPE then sends this sample to DPM, from where the relevant parts are transferred to other components of ONE as needed.

To summarize, the data can be split into two categories. The first category is the metadata created as a result of data processing. As such, it describes both the data source and the catalog items in it and includes information like the number of catalog items, the number of records and attributes (columns), assigned business terms, and DQ results.

The second category corresponds to the optional sampling of records that do not satisfy DQ rules. These samples are used to help data stewards quickly see in what way the invalid records do not conform to the applied DQ rules. This information makes it much easier for data stewards to spot patterns and resolve DQ issues without having to query the data source again to retrieve all invalid records for a particular DQ rule.
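The two categories above can be made concrete with a small sketch. These dataclasses and field names are illustrative assumptions, not Ataccama's actual data model; they only show the distinction between metadata that describes the source and the optional sample of failing records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CatalogItemMetadata:
    """Category 1: metadata produced by processing.

    Describes the data (counts, terms, DQ results) but never contains it.
    """
    name: str
    record_count: int
    attribute_count: int
    business_terms: List[str]
    dq_validity_pct: float


@dataclass
class InvalidSample:
    """Category 2: optional small sample of records failing a DQ rule.

    Kept so data stewards can see how records fail without re-querying
    the source for every invalid record.
    """
    rule_name: str
    records: List[Dict[str, str]]  # a handful of failing rows, not the full set
```

Only the second category carries actual record values, and only for the small sample collected under the default configuration.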
