Databricks Lineage Scanner
The Databricks scanner supports two modes of lineage extraction:
- Online Databricks scanning using Databricks Unity Catalog.
- File-based Databricks lineage processing: offline processing of a CSV file pre-exported from the Databricks Unity Catalog system tables.
Online Databricks scanning
The online scanner uses the Unity Catalog system tables. A JDBC connection is established to query lineage from the system table system.access.column_lineage.
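A minimal sketch of such a query (an illustration only; the scanner's actual query also deduplicates records and joins the data dictionary, as the export query later on this page shows):

-- Simplified illustration: read recent column-level lineage events.
SELECT source_table_full_name,
       source_column_name,
       target_table_full_name,
       target_column_name,
       entity_type,
       entity_id,
       event_time
FROM system.access.column_lineage
WHERE source_column_name IS NOT NULL
  AND target_column_name IS NOT NULL
  AND event_time > current_date() - INTERVAL 30 DAYS;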
Lineage metadata enrichment
To enrich the lineage data with transformation information from Databricks notebooks, the scanner calls the respective Databricks workspace API. For details, see the official Databricks documentation: List contents.
Lineage post-processing
The lineage information from Unity Catalog system tables has several shortcomings that the scanner addresses by post-processing.
Non-existing database objects detection
The following data dictionary tables are accessed to determine whether the lineage database objects (tables, views, columns) existed at the time of lineage extraction (an example lookup follows the list):
- system.information_schema.tables
- system.information_schema.columns
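For example, a lookup of the following kind (an illustrative query, not the scanner's exact statement) confirms whether a table referenced in a lineage record still exists:

-- Returns a row only if the table still exists in the data dictionary;
-- my_catalog, my_schema, and customer are example names from this page.
SELECT table_catalog, table_schema, table_name
FROM system.information_schema.tables
WHERE table_catalog = 'my_catalog'
  AND table_schema = 'my_schema'
  AND table_name = 'customer';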
Lineage for external table referenced via cloud storage path
External tables referenced using their cloud storage path are not resolved in the lineage data. See the official Databricks documentation: Troubleshooting external table queries.
To support this use case, the scanner uses the Unity Catalog REST API.
In the following example, both statements insert data into the same external table. For the first statement, the scanner invokes the Unity Catalog REST API to look up the external table by its cloud storage path.
INSERT INTO delta.`abfss://ataccama-testing@ataccamadatalakegen2.dfs.core.windows.net/customer`
SELECT c.* FROM stage_customer c;
INSERT INTO my_catalog.my_schema.customer
SELECT c.* FROM stage_customer c;
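The path-to-table mapping that the scanner recovers through the REST API can also be inspected manually: DESCRIBE TABLE EXTENDED lists the table's storage location, which matches the abfss:// path used in the first statement.

-- The Location row in the output shows the cloud storage path
-- that identifies this external table.
DESCRIBE TABLE EXTENDED my_catalog.my_schema.customer;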
Online scanning prerequisites
The following prerequisites must be in place before you try to run the Databricks online lineage scanner:
- The system schema ACCESS must be enabled. Databricks system table schemas are disabled by default, and the ACCESS schema is required for lineage extraction. For instructions, see the official Databricks documentation: Enable system tables. A verification query follows this list.
- Databricks cluster or Databricks SQL warehouse for metadata extraction. For metadata extraction via JDBC, the scanner requires a Databricks all-purpose cluster or a Databricks SQL warehouse (see the configuration parameter httpPath). We recommend using a serverless SQL warehouse; size X-SMALL should be sufficient for most use cases. If the metadata extraction step takes a long time (for example, more than 30 minutes), consider using a larger SQL warehouse.
- Access to the relevant data dictionary tables.
- Access to Unity Catalog REST APIs.
- Access to Databricks REST APIs.
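A quick way to verify the first two prerequisites is to run a trivial query against the lineage table from the configured cluster or warehouse:

-- Smoke test: succeeds only if the ACCESS system schema is enabled
-- and the compute resource can read Unity Catalog system tables.
SELECT * FROM system.access.column_lineage LIMIT 1;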
Online scanner configuration
Fields that act as include or exclude filters (includeCatalogs, excludeCatalogs, includeSchemas, excludeSchemas) are case-insensitive and support SQL-like wildcards, such as % and _.
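For instance, the filters match names the same way SQL LIKE patterns do (an illustrative comparison; the scanner applies the patterns case-insensitively):

-- % matches any sequence of characters, _ matches exactly one:
SELECT 'gld_sales' LIKE 'gld%';    -- true
SELECT 'silver2'   LIKE 'silver_'; -- true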
When you run the Databricks scanner, you are asked for the master password and the value of DATABRICKS_TOKEN.
All fields marked with an asterisk (*) are mandatory.
Property | Description |
---|---|
name* | Unique name for the scanner job. |
sourceType* | Specifies the source type to be scanned. Must contain DATABRICKS. |
description | A human-readable description of the scan. |
 | List of Ataccama ONE connection names for future automatic pairing. |
lineageFromPastNDays | Lineage history in days. For example, if set to 30, only lineage from the last 30 days is extracted. |
host* | Databricks host domain. |
httpPath* | Databricks compute resources URL. To find out your HTTP path, see the official Databricks documentation: Get connection details for a Databricks compute resource. See also Determine HTTP path for an SQL warehouse. |
workspaceUrl* | Databricks workspace URL. |
token* | Personal Access Token (PAT) used for the JDBC connection and REST API authentication. |
includeCatalogs | Restricts the lineage extraction to a specified list of catalogs. All catalog objects referenced by a lineage record must match the filter for the record to be processed. |
excludeCatalogs | List of excluded catalogs. If one of the catalog objects referenced by a lineage record matches the filter, the record is skipped. |
includeSchemas | Restricts the lineage extraction to a specified list of schemas. |
excludeSchemas | List of excluded schemas. If one of the schema objects referenced by a lineage record matches the filter, the record is skipped. |

Example configuration:
{
"baseOutputFolderName":null,
"scannerConfigs":[
{
"sourceType":"DATABRICKS",
"name":"DatabricksScan",
"description":"Scan Databricks for us-east region",
"lineageFromPastNDays":30,
"host":"adb-4013958699284341.1.azuredatabricks.net",
"httpPath":"/sql/1.0/warehouses/31525eab5763a6ed",
"token":"@@ref:ata:[DATABRICKS_TOKEN]",
"workspaceUrl":"https://adb-4013958699284341.1.azuredatabricks.net",
"includeCatalogs":[
"prod_stage",
"prod_bi"
],
"excludeCatalogs":[
""
],
"includeSchemas":[
"silver",
"gld%"
],
"excludeSchemas":[
""
]
}
]
}
File-based Databricks lineage processing
Databricks system tables can be manually exported into a CSV file, which can then be processed to extract lineage information without a direct connection to Databricks.
The query below extracts only lineage from the last 7 days (INTERVAL 7 DAYS). Adjust this default time range before executing the query if you need a longer or shorter lineage history window.
SQL query to export lineage metadata into CSV
WITH filtered_column_lineage AS (
SELECT
account_id,
metastore_id,
workspace_id,
entity_type,
entity_id,
entity_run_id,
source_table_full_name,
source_table_catalog,
source_table_schema,
source_table_name,
source_path,
source_type,
lower(source_column_name) AS source_column_name,
target_table_full_name,
target_table_catalog,
target_table_schema,
target_table_name,
target_path,
target_type,
lower(target_column_name) AS target_column_name,
event_time
FROM system.access.column_lineage l
WHERE source_column_name IS NOT NULL
AND target_column_name IS NOT NULL
AND event_time > current_date() - INTERVAL 7 DAYS
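-- Deduplicate: QUALIFY keeps only the most recent event per
-- (source table, target table, source column, target column);
-- rows missing a source or target table name bypass the deduplication.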
QUALIFY
(source_table_name IS NULL OR target_table_name IS NULL OR
(row_number() OVER (PARTITION BY source_table_full_name, target_table_full_name, lower(source_column_name), lower(target_column_name) ORDER BY event_time DESC) = 1))
)
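-- Join the data dictionary so that dropped objects can be detected downstream:
-- a NULL *_from_data_dict column means the object was not found at export time.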
SELECT lin.*,
dense_rank() OVER (ORDER BY lin.source_table_full_name, lin.target_table_full_name, lin.entity_id, lin.entity_run_id, lin.event_time DESC) AS unique_transformation_id,
src_tab.table_name AS source_table_name_from_data_dict,
lower(src_col.column_name) AS source_column_name_from_data_dict,
trg_tab.table_name AS target_table_name_from_data_dict,
lower(trg_col.column_name) AS target_column_name_from_data_dict
FROM filtered_column_lineage lin
LEFT JOIN system.information_schema.tables src_tab ON
src_tab.table_catalog = lin.source_table_catalog AND
src_tab.table_schema = lin.source_table_schema AND
src_tab.table_name = lin.source_table_name
LEFT JOIN system.information_schema.columns src_col ON
src_col.table_catalog = lin.source_table_catalog AND
src_col.table_schema = lin.source_table_schema AND
src_col.table_name = lin.source_table_name AND
lower(src_col.column_name) = lin.source_column_name
LEFT JOIN system.information_schema.tables trg_tab ON
trg_tab.table_catalog = lin.target_table_catalog AND
trg_tab.table_schema = lin.target_table_schema AND
trg_tab.table_name = lin.target_table_name
LEFT JOIN system.information_schema.columns trg_col ON
trg_col.table_catalog = lin.target_table_catalog AND
trg_col.table_schema = lin.target_table_schema AND
trg_col.table_name = lin.target_table_name AND
lower(trg_col.column_name) = lin.target_column_name;
In the case of file-based processing, Databricks notebook metadata enrichment is not available.
File-based scan toolkit configuration
Property | Description |
---|---|
name* | Unique name of the scanner job. |
sourceType* | Specifies the source type to be scanned. Must contain DATABRICKS. |
description | A human-readable description of the scan. |
 | List of Ataccama ONE connection names for future automatic pairing. |
columnLineageFile* | Path to a column lineage CSV file. Can be an absolute or relative path. |

Legend: *mandatory

Example configuration:
{
"scannerConfigs": [
{
"sourceType": "DATABRICKS",
"name": "DatabricksScan File based",
"description": "Scan Databricks for us-east region",
"columnLineageFile": "work/databricks_column_lineage.csv"
}
]
}
What metadata is extracted from Databricks
- Column lineage (system.access.column_lineage)
  - Schema: Lineage system table schema
- Workspaces, directories, and notebooks (using REST), fields:
  - Path
  - Language
  - Creation/modification timestamps
  - Object ID
  - Resource ID
- Additional data dictionary metadata (using Unity Catalog REST API; a cross-check query follows this list):
  - Catalogs
  - Schemas
  - Tables
Known limitations
Scanner limitations (online and file-based)
- Cross lineage from/to a cloud storage path (S3, Azure Data Lake Storage) is not currently supported, including scenarios such as:
  - Loading data from cloud storage files
  - Unloading data from Databricks to cloud storage files

  Example of an unsupported statement, reading a CSV file on S3 into a Databricks table:

  CREATE TABLE test_table AS
  SELECT * FROM read_files(
    's3://<bucket>/<path>/<file>.csv',
    format => 'csv',
    header => true);

- Missing job information for notebooks executed from jobs. When lineage originates from a notebook that has been executed from a job's task, the Unity Catalog lineage table contains only the identification of the notebook entity. This is a limitation of Unity Catalog system tables, which record only the Databricks entity that is the direct source of the lineage event (the notebook).
FAQ
- What if we are using Databricks across multiple cloud regions: will cross-region lineage be available in Ataccama ONE? How should the scanner be configured?
  As there is a single Unity Catalog instance (and a single metastore) governing all catalogs, schemas, and tables within an Azure/AWS/GCP region, a single Databricks connection can capture lineage only from a single cloud region. Customers with Unity Catalog in multiple cloud regions need a separate lineage scan for each cloud region.
- I don't have lineage in Unity Catalog for some notebooks. What can be the reason?
  One reason can be that, in the case of a load process, the intermediary tables are cleaned up. Due to the absence of those temporary/worker tables, Databricks might not be able to generate lineage in Unity Catalog.