
Snowflake Lineage Scanner

The Snowflake lineage scanner extracts metadata through online (JDBC) extraction.

What Snowflake statements and syntax are supported?

The following types of statements are supported:

  • CREATE TABLE AS SELECT

  • CREATE VIEW

  • CREATE MATERIALIZED VIEW

  • INSERT

  • MERGE

  • SELECT

  • UPDATE

The scanner also supports the following Snowflake objects for Design Time Lineage:

  • Views

  • Materialized views

Lineage for procedures (SQL, Java, or Python) is also supported. If lineage was captured from a procedure, the procedure name is available in the lineage. Currently, the procedure name is displayed as a comment in the SQL code preview window.

Lineage from procedures and other Snowflake objects is supported only indirectly, based on the queries captured in Snowflake's QUERY_HISTORY view.
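
For illustration, here are two hypothetical statements of the supported types listed above (all table and column names are made up) that would each contribute lineage once they appear in the query history:

-- CREATE TABLE AS SELECT: lineage from SRC_ORDERS to DM_ORDERS
CREATE TABLE DM_ORDERS AS
SELECT o.ORDER_ID, o.CUSTOMER_ID, o.AMOUNT
FROM SRC_ORDERS o;

-- MERGE: lineage from STG_CUSTOMERS to DIM_CUSTOMERS
MERGE INTO DIM_CUSTOMERS d
USING STG_CUSTOMERS s
  ON d.CUSTOMER_ID = s.CUSTOMER_ID
WHEN MATCHED THEN UPDATE SET d.FULL_NAME = s.FULL_NAME
WHEN NOT MATCHED THEN INSERT (CUSTOMER_ID, FULL_NAME) VALUES (s.CUSTOMER_ID, s.FULL_NAME);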

Limitations

Currently, the following statement types that could be relevant for lineage are not supported:

  • SWAP tables

  • CLONE table

Statements containing the following SQL constructs are not included in the lineage diagram (in Snowflake Enterprise Edition, they are supported without expression details):

  • VALUES clause

  • MATCH_RECOGNIZE

  • Column rename

  • Rare table functions (such as IDENTIFIER)

In addition, the following features are currently not supported:

  • Lineage from Snowflake shares

  • Lineage from/to stage (for example, the COPY command for loading files from S3 storage into a Snowflake table)

  • Streams (supported in Snowflake Enterprise Edition, without expression details)

  • Dynamic tables (supported in Snowflake Enterprise Edition, without expression details)

Prerequisites

The lineage scanner uses metadata from views located in the ACCOUNT_USAGE schema. For details, see the official Snowflake documentation: Account Usage.

Permissions and security

Granting IMPORTED PRIVILEGES on the shared SNOWFLAKE database provides access to all required views in the ACCOUNT_USAGE schema. This grant doesn't provide access to any of your data.

View name      | Database role     | Used in lineage toolkit | Enterprise Edition only | Purpose                     | Latency    | Reference
TABLES         | OBJECT_VIEWER     | Yes                     | No                      | Database metadata           | 45 minutes | Tables
COLUMNS        | OBJECT_VIEWER     | Yes                     | No                      | Database metadata           | 90 minutes | Columns
VIEWS          | OBJECT_VIEWER     | Yes                     | No                      | Database metadata           | 90 minutes | Views
QUERY_HISTORY  | GOVERNANCE_VIEWER | Yes                     | No                      | SQL history                 | 90 minutes | Query History
SESSIONS       | SECURITY_VIEWER   | Yes, partially          | No                      | Application name extraction | 3 hours    | Sessions
ACCESS_HISTORY | GOVERNANCE_VIEWER | Yes                     | Yes                     | Lineage information         | 3 hours    | Access History

To enable other roles to access the database and schemas and query the views, a user with the ACCOUNTADMIN role must run the following command:

USE ROLE ACCOUNTADMIN;
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE ata_mde_role;
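
As a quick sanity check (a sketch, assuming the ata_mde_role role from the command above), you can verify that the role can query the ACCOUNT_USAGE views before running the scanner:

USE ROLE ata_mde_role;
-- These should return row counts instead of a privilege error.
SELECT COUNT(*) FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES;
SELECT COUNT(*) FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY;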

Scanner configuration

Online (JDBC) extraction connects to Snowflake and extracts the necessary metadata and query log from the database. Currently, only username and password authentication is supported.

All fields marked with an asterisk (*) are mandatory.

Property Description

name*

Unique name for the scanner job.

sourceType*

Specifies the source type to be scanned. Must contain SNOWFLAKE.

description*

A human-readable description of the scan.

oneConnections

List of Ataccama ONE connection names for future automatic pairing.

inputDataCatalogFilePath

Path to a catalog file for initialization. Otherwise, an empty database is used.

Used only for troubleshooting.

accountIdentifier*

Snowflake account identifier. Displayed as the connection name on the Import Lineage screen in ONE.

See the official Snowflake documentation: Account name in your organization.

generateApplicationImpactLineage

Generates application impact lineage for each application that executed SELECT statements in the defined period. Can be used to see Snowflake objects referenced by unknown applications (such as reverse ETL tools, BI tools without lineage support, machine learning tools).

Possible values: true, false.

createUsageStatisticsReports

Generates a CSV file for each accessed column. See Usage statistics export.

Possible values: true, false.

jdbcUrl*

Full JDBC connection string. If account and all related connection attributes are configured, leave it empty.

account*

Refers to the part preceding 'snowflakecomputing.com' in the URL of your Snowflake connection. For instance, if your access URL is https://myorg-identifier.us-west-1.snowflakecomputing.com/, the corresponding account name would be myorg-identifier.us-west-1. If jdbcUrl is set, leave this empty.

username*

Snowflake username.

password*

Snowflake password. Can be encrypted.

warehouse

Snowflake warehouse. If not provided, the default warehouse assigned to the username is used.

role

Snowflake role. If not provided, the default role assigned to the username is used.

lineageFromPastNDays*

Lineage history in days. For example, a value of 10 means that the lineage contains only statements executed during the last 10 days, with a lag of three hours (lineage from the last three hours is not included due to the latency of the QUERY_HISTORY and SESSIONS views, see Account Usage).

At first execution, set it to a reasonably small value (1-2 days). This helps you estimate the lineage calculation duration for a longer period (see the sample query after this property list).

The maximum value is currently limited to 100 days.

includeDatabases

Restricts the lineage extraction to a specified list of databases. All source database objects (TABLE, VIEW) and the target database objects of an SQL query must be within the included databases list to have lineage from a query.

includeDatabases and excludeDatabases settings are mutually exclusive. Configure only one of them, or neither.

excludeDatabases

List of excluded databases. If one of the source database objects (TABLE, VIEW) or the target database objects of an SQL query is within the excluded databases list, the lineage for the query will not be included.

includeDatabases and excludeDatabases settings are mutually exclusive. Configure only one of them, or neither.

includeSchemas

Restricts the lineage extraction to a specified list of schemas. All source database objects (TABLE, VIEW) and the target database objects of an SQL query must be within the included schemas list to have lineage from a query.

includeSchemas and excludeSchemas settings are mutually exclusive. Configure only one of them, or neither.

excludeSchemas

List of excluded schemas. If one of the source database objects (TABLE, VIEW) or the target database objects of an SQL query is within the excluded schemas list, the lineage for the query will not be included.

includeSchemas and excludeSchemas settings are mutually exclusive. Configure only one of them, or neither.

includeApplications

List of included applications. It can be used to limit lineage extraction to specific applications.

Example values: dbt, Talend, PythonConnector, Tableau.

excludeApplications

List of excluded applications. It can be used to exclude specific applications from lineage extraction.

Example values: dbt, Talend, PythonConnector, Tableau.

excludeQueriesFromDbt

Disables extracting queries from dbt. This is useful when you plan to use the dbt scanner to create more detailed lineage from dbt, which also includes metadata from dbt.

Default value: true.

skipTableColumnsExtraction

If set to true, data dictionary metadata is not extracted. Set it to true only for troubleshooting or testing purposes, or in specific proof-of-concept use cases.

Default value: false.

skipViewsExtraction

If set to true, view metadata is not extracted. The lineage export will contain lineage only from the views present in the query history view. That is, if you set lineageFromPastNDays to five days, only views created or recreated within the past five days will have lineage.

Default value: false.
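
Before the first run, you can get a rough idea of how many statements a given lineageFromPastNDays value covers by counting them directly in the query history (a sketch, not part of the scanner configuration; adjust the number of days as needed):

-- Approximate number of statements the scanner would need to parse for the last 2 days
SELECT COUNT(*) AS captured_statements
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('day', -2, CURRENT_TIMESTAMP());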

Snowflake online (JDBC) extraction example configuration
{
   "scannerConfigs":[
      {
         "name":"snowflake-lineage",
         "accountIdentifier":"myorg-identifier",
         "sourceType":"SNOWFLAKE",
         "description":"Scan lineage demo export files",
         "generateApplicationImpactLineage":false,
         "createUsageStatisticsReports":false,
         "lineageFromPastNDays":5,
         "includeDatabases":[
            "STAGE_DEV",
            "BI_DEV"
         ],
         "excludeDatabases":[

         ],
         "includeSchemas":[
            ""
         ],
         "excludeSchemas":[
            "MY_SANDBOX"
         ],
         "excludeQueriesFromDbt":true,
         "excludeApplications":[
            "Talend"
         ],
         "connection":{
            "account":"myorg-identifier.us-west-1",
            "username":"ATA_LINEAGE_EXTRACTION_USER",
            "password":"@@ref:ata:[SNOWFLAKE_PASSWORD]",
            "warehouse":"SMALL_WAREHOUSE"
         }
      }
   ]
}

Application impact lineage

A table-level impact lineage is created for each combination of referenced Snowflake table and application.

An application here can be any tool querying Snowflake tables or views in the extracted period (for example, the last 10 days):

  • BI tools (such as Tableau, Power BI, Metabase)

  • Machine learning applications

  • Reverse ETL Tools

To turn on application impact lineage generation, set generateApplicationImpactLineage to true in the JSON configuration file. By default, this type of lineage is not generated in order to keep lineage extraction fast.
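
To preview which applications would get an impact lineage (a sketch based on the ACCOUNT_USAGE views listed earlier, not something the scanner requires you to run), you can join the query history with the session metadata:

-- Applications that executed statements in the last 10 days
SELECT s.CLIENT_APPLICATION_ID, COUNT(*) AS statements
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q
JOIN SNOWFLAKE.ACCOUNT_USAGE.SESSIONS s
  ON q.SESSION_ID = s.SESSION_ID
WHERE q.START_TIME >= DATEADD('day', -10, CURRENT_TIMESTAMP())
GROUP BY s.CLIENT_APPLICATION_ID
ORDER BY statements DESC;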

Usage statistics export

Usage statistics can be exported in CSV format. For example, the export can show the most queried columns of the Ataccama Datawarehouse (maximum of three columns per table).

The CSV files can be found in the folder holding the scanner execution output files (exec_<date>_<id>/<scanner-name>). Look for ColumnUsage.csv and TableUsage.csv.
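
Independently of the export, a rough table-level access count can be pulled straight from ACCESS_HISTORY (Enterprise Edition only). This is a sketch for getting a quick feel for the most accessed objects, not the scanner's own implementation:

-- Most accessed base objects in the last 5 days
SELECT f.VALUE:"objectName"::STRING AS object_name,
       COUNT(DISTINCT ah.QUERY_ID)  AS access_count
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah,
     LATERAL FLATTEN(input => ah.BASE_OBJECTS_ACCESSED) f
WHERE ah.QUERY_START_TIME >= DATEADD('day', -5, CURRENT_TIMESTAMP())
GROUP BY object_name
ORDER BY access_count DESC;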

SQL DQ (anomaly) detections

In addition to usage statistics, results of the following SQL validations are exported into CSV files as well:

  • Detection results of unreferenced tables are available in the file NotUsedFromSources.csv.

  • Detection results of orphaned WITH clauses are available in the file NotReferencedWithClauses.csv (an illustrative example follows below).

The CSV files can be found in the folder holding the scanner execution output files (exec_<date>_<id>/<scanner-name>).
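
As an illustration (hypothetical table names), a statement like the following would be reported in NotReferencedWithClauses.csv because the common table expression is defined but never used:

-- CANCELED_ORDERS is declared but never referenced in the main query
WITH CANCELED_ORDERS AS (
    SELECT ORDER_ID FROM SRC_ORDERS WHERE STATUS = 'CANCELED'
)
SELECT ORDER_ID, AMOUNT
FROM SRC_ORDERS;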


Troubleshooting online (JDBC) extraction

No active warehouse selected error

If you receive the following error:

No active warehouse selected in the current session.
Select an active warehouse with the 'use warehouse' command.

Try setting the warehouse in the connection configuration section of the JSON configuration file, then rerun the command.
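
Alternatively (a sketch, assuming the user and warehouse names from the example configuration), you can assign a default warehouse to the extraction user so that every new session starts with an active warehouse:

ALTER USER ATA_LINEAGE_EXTRACTION_USER SET DEFAULT_WAREHOUSE = 'SMALL_WAREHOUSE';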

JDBC driver communication error

If you receive the following error:

JDBC driver encountered communication error.
Message: HTTP status=403.

This can indicate that the configuration parameter account contains only the Snowflake accountIdentifier (for example, myorg-identifier) instead of accountIdentifier together with the cloudRegion (for example, myorg-identifier.us-west-1). Try updating the account parameter in the JSON configuration file, then rerun the command.
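
If you are unsure about the correct value, you can check the account locator and region from a Snowflake worksheet (a sketch; the returned region name may still need to be mapped to the region ID used in your connection URL):

SELECT CURRENT_ACCOUNT() AS account_locator,
       CURRENT_REGION()  AS region;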
