From Query History to Lineage

Motivation

This document explains how to construct lineage from a query log in Snowflake, addressing various challenges and employing specific methodologies to ensure accurate and relevant lineage representation.

Challenges of runtime Lineage

Lineage construction from runtime data involves several challenges:

Identification of relevant queries that contribute to lineage.
Exclusion of lineage generated by ETL/ELT tools for which specialized scanners are available or will be provided, such as dbt, known as "Operation Lineage."

Frequency of ETL operations

Some ETLs can be scheduled to run monthly, quarterly, while others can run even hourly or more frequently.
ETL operations may also vary in ingestion frequency:
- Batch
- Micro-batch
- Real-time

Design time vs. runtime lineage

Traditional lineage tools typically focus on design time lineage, derived from database object definitions. For Snowflake, this includes:

Views.
Tasks.
Procedures and functions.
Streams.

Constructing Snowflake Lineage

Our Snowflake lineage is derived from:

Runtime lineage (query log), augmented with design time objects lineage.
- Views and materialized Views definition lineage is the only design time lineage we apply.
- Views definition can be stable in time - so it will be not present in the query log (for the extracted query log time range - i.e.: past 5 days).
Deterministic logic to construct relevant lineage, considering:
- Time relevance and current validity.
- Existence of objects (not deleted) during the time of extraction.
- Exclusion of temporary objects (i.e. tables) and outdated lineage (most challenging part).

Runtime Lineage sources

Runtime lineage is extracted from the query log, not directly from Snowflake’s runtime lineage features.
Future plans include utilizing Snowflake’s ACCESS_HISTORY for queries not currently parsed.
- No expressions, only direct lineage.
We currently don’t support lineage versioning (multiple lineage version). However, this is not a Snowflake scanner limitation.
No AI is utilized (as of now).

Lineage validity

Maintaining lineage relevance involves:

Filtering based on object (database, schema) and database relevancy - i.e. exclude non-production, sandbox databases.
- Example: exclude ad-hoc backups of production tables.
  create table my_db.my_schema.dim_customer_backup as select * from dwh_prod.core.dim_customer;
Excluding queries from specific applications that produced the query log to avoid duplicated lineage.

Identifying Lineage-relevant statements

In Snowflake, "query" and "statement" are interchangeable terms. Relevant statements for lineage include:

CREATE TABLE AS SELECT (CTAS)
MERGE
INSERT (specifically INSERT-SELECT: inserting from other DB objects)
UPDATE
CREATE VIEW

DELETE is relevant for impact analysis, but it doesn’t affect lineage analysis - so it is excluded.

Excluding non-relevant Lineage sources

Certain applications and operations are excluded:

Ataccama ONE
dbt, as there’s a dedicated dbt scanner, and we want to avoid multiple lineages for single database objects
Fivetran
Ad-hoc queries, such as production table backups (CTAS statements)

Identifying statements from the latest ETL run

This is the most challenging part as we construct runtime lineage that should be valid by the time of extraction. If we have daily or hourly ETL jobs, we are interested to see the lineage from the most recent execution of these jobs.

Identifying the most recent statements relevant to the current lineage involves:

Analyzing the last execution of procedures.

Focusing on the last relevant statement execution for the target object.

Planned enhancements of this algorithm involve improving this logic so that it can be applied also for Snowflake Tasks (scheduled user-defined object - that can contain SQL statements and procedure calls), as well as the last execution of tasks that will have the highest priority.

Did you know that Databricks Unity Catalog does not make any "effort" to identify the valid lineage at the time of lineage extraction? Instead, it returns lineage from all queries, jobs, and notebooks within a certain time range (Last 90 days) - along with the timestamp for each lineage edge (from column → to column relationship). (Valid on 11/02/2024)

Procedure level deduplication

Snowflake procedure calls are deduplicated based on the most recent execution , ensuring that only relevant and current lineage is included.

The Snowflake query log includes 2 calls (invocations) of the procedure LOAD_TARGET_TABLE. In this example, both procedure calls would result in the same lineage, as we have recently refactored our code by replacing the combination of INSERT and UPDATE statements with a MERGE statement.

Without this deduplication, all 3 statements would be part of the lineage, resulting in duplicate lineage transformations in ONE.

Example

CREATE PROCEDURE LOAD_TARGET_TABLE()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
    MERGE INTO TARGET T USING SOURCE S ON T.ID = S.ID
    WHEN MATCHED THEN UPDATE SET T.NAME = S.NAME, T.AGE = S.AGE
    WHEN NOT MATCHED THEN INSERT (T.ID, T.NAME, T.AGE) VALUES (S.ID, S.NAME, S.AGE);
    RETURN 'Done';
END;

Statement level deduplication

The deduplication at the statement level is crucial for maintaining the accuracy and relevance of lineage.

By applying deduplication, we ensure that only the most recent and relevant versions of statements are included in the lineage, particularly for operations that frequently update or transform data.

This approach helps to avoid clutter in the lineage visualization and provides users with clear and actionable insights.

Supported statements

CREATE TABLE AS SELECT (CTAS)
CREATE VIEW

Simple example for CTAS statement deduplication:

Snowflake query log contains 2 different statements to create table TAB_1

create table db_1.schema_1.tab_1 as
select * from customer
where category = 'XYZ';

The latest query from November 2nd will be included in the lineage diagram.

Unsupported statements

For these statements (query types) we are currently not able to find the latest statement for a specified target object:

MERGE
INSERT (specifically INSERT-SELECT: inserting from other DB objects)
UPDATE

no deduplication lineage unique statement

Example for INSERT

Snowflake query log contains 2 different INSERT statements for TAB_1

Both insert will be included in the lineage. This is example is trivial, but in real world scenarios, i.e. when more than 15 inserts are loading a single target table (each from a different source system), when some inserts are executed hourly, some daily, to determine the lineage valid at the time of extraction is not straightforward.

insert into db_1.schema_1.tab_1 as
select * from customer
where category = 'XYZ';

Limitations

Currently only procedures with no arguments are supported.
Currently leaf procedures execution is evaluated. If procedure PROC_OUTER invokes procedure PROC_INNER, the procedure PROC_INNER last execution is taken into account.

Future directions

The evolution of lineage extraction could potentially leverage AI to refine the accuracy and relevance of lineage data, especially in complex scenarios where traditional methods fall short.

Was this page useful?