Snowflake Pushdown Processing
There are two options for data quality processing in ONE:
- The first option is for all data quality processing (DQ evaluation jobs, profiling, monitoring projects, and data observability) to be handled by Ataccama ONE Data Processing Engines (DPEs). Depending on the scale of the project, you might need multiple DPEs, and some data transfer between your data source and DPE is required.
- The second option is to use pushdown processing: this is supported for big data sources when Spark is enabled, or when using Snowflake.
This article outlines the benefits and limitations of using Snowflake pushdown processing in ONE and provides instructions to enable it.
This page is relevant for DQ processing configured and run from ONE, not ONE Desktop. Snowflake pushdown is not supported for plans.
Why use pushdown processing?
Using pushdown processing can lead to improvements in performance, security, and infrastructure, and consequently in scalability.
- Performance. Evaluate millions of records in seconds.
- Security. Data processing is executed directly in Snowflake, and only the results (and invalid samples, if configured) are returned. This means no data transfer, external services, or servers are required; all processing runs on secured and governed Snowflake warehouses. From the user perspective, the results are stored and appear the same as when DPE handles processing.
- Infrastructure. Utilize your existing infrastructure to evaluate data; there is no need to set up and manage new processing servers.
Prerequisites
Enabling pushdown processing involves two key steps. You must:
- Configure Snowflake to allow Ataccama to transfer functions. Snowflake is configured via a worksheet that defines the working database and grants access to defined user roles.
- Enable pushdown processing on the connection.
It is also necessary to make sure that DPE allows for automatic synchronization of user-defined functions. See DPE Configuration.
Configure Snowflake pushdown processing
Pushdown processing is enabled at the connection level.
To create a new Snowflake connection, follow the instructions in Snowflake Connection. To enable pushdown profiling on an existing connection:
- In Knowledge Catalog > Sources, select your Snowflake source.
- Select the Connections tab.
- For the required connection, use the three dots menu and select Edit.
- In the Pushdown processing section, select Enable pushdown processing.
After enabling pushdown processing on a new or existing source:
- In Working database, provide the name of the database you want to use for storing the stage, functions, and temporary tables created during profiling. You have two options:
  - Use the database that is being processed (default).
  - Create a custom database (recommended). Additional setup in Snowflake is required. See Create the working database in Snowflake.
- Select Add JDBC role if you need to share the Snowflake pushdown setting across different credentials (for example, to a common underlying role).
Create the working database in Snowflake
In ONE, you only need to enter the working database name, as described in the previous section. However, you need to execute a script in Snowflake independently before proceeding, so that Snowflake allows Ataccama to transfer functions.
In Snowflake, select + Worksheet, and follow these instructions. The following script creates the working database and grants access to defined user roles.
- Working database name (represented by the <working_db> placeholder in the example script): The name of the temporary database that is created to store Ataccama domain lookups and other temporary data. The database can be deleted after you finish using ONE. Multiple users can reuse the same database if they have access to it.
- Roles (represented by <pushdown_role> in the example script): Specify the Snowflake roles that can access the Ataccama working database. If no roles are specified here, all users have access to the database.
- Users (represented by <sample_user> and <another_user> in the example script): Assign the created role to users as required.
-- Create working database and stage
CREATE DATABASE IF NOT EXISTS <working_db>;
-- Run the remaining statements in the context of the working database (PUBLIC schema)
USE DATABASE <working_db>;
CREATE STAGE IF NOT EXISTS _ATC_ONE_STAGE;
-- Create role
CREATE ROLE IF NOT EXISTS <pushdown_role>;
-- Assign role to user
GRANT ROLE <pushdown_role> TO USER <sample_user>;
GRANT ROLE <pushdown_role> TO USER <another_user>;
-- Grant access to database
GRANT USAGE ON DATABASE <working_db> TO ROLE <pushdown_role>;
-- Grant access to schema
GRANT USAGE ON SCHEMA public TO ROLE <pushdown_role>;
GRANT CREATE TABLE ON SCHEMA public TO ROLE <pushdown_role>;
GRANT CREATE SEQUENCE ON SCHEMA public TO ROLE <pushdown_role>;
GRANT CREATE FUNCTION ON SCHEMA public TO ROLE <pushdown_role>;
-- Grant access to stage
GRANT READ ON STAGE _ATC_ONE_STAGE TO ROLE <pushdown_role>;
GRANT WRITE ON STAGE _ATC_ONE_STAGE TO ROLE <pushdown_role>;
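Optionally, before returning to ONE, you can sanity-check the setup with a standard Snowflake command (the role name below is the placeholder used in the script above):
-- Lists the privileges granted to the pushdown role, including the database, schema, and stage grants from the script
SHOW GRANTS TO ROLE <pushdown_role>;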
Granting privileges on the public schema of the working database is mandatory: no other schema can be used for pushdown.
If you use a custom stage name (defined using plugin.snowflake.ataccama.one.stage-name) instead of the _ATC_ONE_STAGE stage created by the script, you need to add the following statement to your script:
GRANT CREATE STAGE ON SCHEMA public TO ROLE <pushdown_role>;
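As noted earlier, the working database is only needed while you use pushdown processing. When you no longer need it, it can be removed with a standard Snowflake statement, for example (this also removes the stage, functions, and temporary tables it contains):
-- Remove the Ataccama working database and everything created in it
DROP DATABASE IF EXISTS <working_db>;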
Processing in Snowflake
Some limitations exist when working with pushdown processing in Snowflake, and some user actions are required.
The data quality processing configurations available in ONE are translated to Snowflake SQL, either by utilizing native SQL functions or by creating new functions. This translation is done by DPE, so at least one DPE is still required to handle the translation of these configurations to Snowflake functions.
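As a purely illustrative sketch of what this translation can produce (the table, column, and rule below are hypothetical; this is not the SQL that DPE actually generates), a simple rule such as "email is filled in and contains @" could be evaluated as a single Snowflake aggregate query:
-- Hypothetical example only: a simple DQ check expressed as one Snowflake query.
-- Table and column names are placeholders.
SELECT
    COUNT(*) AS total_records,
    COUNT_IF(EMAIL IS NOT NULL AND CONTAINS(EMAIL, '@')) AS passed_records,
    COUNT_IF(EMAIL IS NULL OR NOT CONTAINS(EMAIL, '@')) AS failed_records
FROM <source_database>.<source_schema>.CUSTOMERS;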
Where is it used?
Pushdown processing in Snowflake can be utilized for the following data quality functions:
- Profiling
- Data observability (excluding term detection and AI-powered anomaly detection)
- DQ evaluation
- Monitoring projects
A number of limitations apply (see Current limitations).
How does it compare to processing in ONE?
In most cases, there is no visual indication in ONE that processing is executed externally.
The exception is when you use a configuration that is not supported. In this case, you see the following warning: This condition can’t be applied on Snowflake with enabled pushdown.
If you want to use Snowflake processing, you need to define an alternative configuration. For full details of functions that are not yet supported, see Current limitations.
Similarly, in monitoring projects, if you try to apply an incompatible rule to a catalog item, you see a warning.
When you run full profiling in ONE, classification jobs (domain detection and fingerprints for term suggestions) are also triggered. If Snowflake pushdown profiling is enabled, by default, these classification jobs run on DPE on a sample of the data.
If you want to stop the classification jobs from being triggered, or to ensure they run in pushdown (on the full dataset), change the value of plugin.profiling.ataccama.one.profiling.features.pushdown.SNOWFLAKE.domain-detection in mmm-backend/etc/application.properties:
- plugin.profiling.ataccama.one.profiling.features.pushdown.SNOWFLAKE.domain-detection=SAMPLE: Classification jobs run on a sample of the data on DPE.
- plugin.profiling.ataccama.one.profiling.features.pushdown.SNOWFLAKE.domain-detection=NONE: No classification jobs run.
- plugin.profiling.ataccama.one.profiling.features.pushdown.SNOWFLAKE.domain-detection=PUSHDOWN: Classification jobs run in pushdown.
Current limitations
Listed in this section are the known limitations when working with pushdown processing.
- Performance issues, such as excessive memory and time consumption, can be expected when using Snowflake pushdown with views (database-level abstractions defined by SQL).
- Post-processing plans are not supported.
- Only simple rules are supported. Component and aggregation rules are not supported.
- Even when working with simple rules, some rule expressions are not supported. You are notified about this in the web application if you use an expression that is not supported.
- The size of the Snowflake warehouse used determines performance, so processing can be slower if a small warehouse is used (see the sketch after this list).
- Load components are not supported.
- Snowflake pushdown processing can’t be used for sample profiling. Sample profiling runs on Ataccama DPEs.
- DQ filters in monitoring projects can’t be used when pushdown is enabled.
- When profiling:
  - Quantiles are calculated only for numeric (integer, long, float), date, and datetime attribute types.
  - Frequency groups are not calculated.
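If profiling or DQ evaluation jobs are slow, one option is to run them on a larger warehouse, as mentioned in the limitation on warehouse size above. Resizing is done with a standard Snowflake statement; the warehouse name and target size below are examples, so adjust them to your environment and cost constraints:
-- Scale up the warehouse used by the Ataccama connection (example name and size)
ALTER WAREHOUSE MY_WAREHOUSE SET WAREHOUSE_SIZE = 'LARGE';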