Databricks Lineage Scanner

The Databricks scanner supports two modes of lineage extraction:

Online Databricks scanning

The online scanner uses the Unity Catalog system tables. A JDBC connection is established to query the lineage from the system table system.access.column_lineage.
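
For illustration, the extraction is similar in spirit to the following simplified sketch (the scanner's actual statement is more involved; see the file-based export query later on this page):

-- Simplified sketch only: the real query also deduplicates events
-- and joins the data dictionary tables.
SELECT
  source_table_full_name,
  source_column_name,
  target_table_full_name,
  target_column_name,
  event_time
FROM system.access.column_lineage
WHERE event_time > current_date() - INTERVAL 10 DAYS;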

Lineage metadata enrichment

To enrich the lineage data with transformation information from Databricks notebooks, the Databricks Workspace API is called. For details, see the official Databricks documentation: List contents.
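
For illustration, the enrichment corresponds to a request like the following (the workspace URL, path, and response values are placeholders; the response is abbreviated):

GET https://<workspaceUrl>/api/2.0/workspace/list?path=/Users/<user>
Authorization: Bearer <token>

{
  "objects": [
    {
      "object_type": "NOTEBOOK",
      "path": "/Users/<user>/etl_notebook",
      "language": "PYTHON",
      "object_id": 123456789,
      "resource_id": "123456789",
      "created_at": 1700000000000,
      "modified_at": 1700000100000
    }
  ]
}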

Lineage post-processing

The lineage information from Unity Catalog system tables has several shortcomings that the scanner handles by post-processing it.

Non-existing database objects detection

The following Data Dictionary tables are accessed to determine whether the lineage database objects (tables, views, columns) existed at the time of lineage extraction:

  • system.information_schema.tables

  • system.information_schema.columns
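
For example, a lookup along the following lines (an illustrative sketch, not the scanner's exact query) confirms whether a table existed at extraction time:

-- Illustrative sketch: checks that my_catalog.my_schema.customer exists.
SELECT table_catalog, table_schema, table_name
FROM system.information_schema.tables
WHERE table_catalog = 'my_catalog'
  AND table_schema = 'my_schema'
  AND table_name = 'customer';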

Lineage for external table referenced via cloud storage path

External tables referenced using their cloud storage path are not resolved. See the official Databricks documentation: Troubleshooting external table queries.

To support this use case, the scanner uses the Unity Catalog REST API.

Both of the following statements insert data into the same external table. For the first one, the scanner invokes the Unity Catalog REST API to look up the external table by its cloud storage path.

INSERT INTO delta.`abfss://ataccama-testing@ataccamadatalakegen2.dfs.core.windows.net/customer`
SELECT c.* FROM stage_customer c;

INSERT INTO my_catalog.my_schema.customer
SELECT c.* FROM stage_customer c;
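
As an illustrative sketch of the lookup (the exact calls are internal to the scanner), the Unity Catalog tables API returns a storage_location attribute for each table, which can be matched against the abfss:// path from the first statement:

GET https://<workspaceUrl>/api/2.1/unity-catalog/tables?catalog_name=my_catalog&schema_name=my_schema
Authorization: Bearer <token>

Each table in the response carries full_name and storage_location fields; a match on storage_location resolves the cloud storage path to my_catalog.my_schema.customer.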

Online scanning prerequisites

The following prerequisites must be in place before you try to run the Databricks online lineage scanner:

  1. The system schema ACCESS must be enabled. Databricks system table schemas are disabled by default, and lineage extraction requires the ACCESS schema to be enabled.

    For instructions, see the official Databricks documentation: Enable system tables. A REST sketch is also shown after this list.

  2. Databricks cluster or Databricks SQL warehouse for metadata extraction. For metadata extraction via JDBC, the scanner requires a Databricks all-purpose cluster or a Databricks SQL warehouse (see the configuration parameter httpPath).

    We recommend using a serverless SQL warehouse. Size X-SMALL should be sufficient for most use cases. If the metadata extraction step takes very long (for example, more than 30 minutes), consider using a larger SQL warehouse.

  3. Access to the relevant data dictionary tables.

  4. Access to Unity Catalog REST APIs.

  5. Access to Databricks REST APIs.
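
For prerequisite 1, the ACCESS schema can also be enabled through the Databricks SystemSchemas REST API, sketched below (the metastore ID and workspace URL are placeholders; verify the exact endpoint version against the linked Databricks documentation):

PUT https://<workspaceUrl>/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/access
Authorization: Bearer <token>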

Online scanner configuration

Fields that act as include or exclude filters (includeCatalogs, excludeCatalogs, includeSchemas, excludeSchemas) are case-insensitive and support SQL-like wildcards, such as % and _.

You are asked for the master password and the value of DATABRICKS_TOKEN when you run the Databricks scanner.

All fields marked with an asterisk (*) are mandatory.

Property Description

name*

Unique name for the scanner job.

sourceType*

Specifies the source type to be scanned. Must contain DATABRICKS.

description*

A human-readable description of the scan.

oneConnections

List of Ataccama ONE connection names for future automatic pairing.

lineageFromPastNDays*

Lineage history in days. For example, if set to 10, the lineage extract will contain lineage from transformations executed in the past 10 days.

host*

Databricks host domain, without the protocol prefix (for example, adb-4013958699284341.1.azuredatabricks.net).

httpPath*

Databricks compute resources URL. To find out your HTTP path, see the official Databricks documentation: Get connection details for a Databricks compute resource.

workspaceUrl*

Databricks workspace URL.

token*

Personal Access Token (PAT) used for JDBC connection and REST API authentication.

includeCatalogs

Restricts the lineage extraction to the specified list of catalogs. Both the source catalog objects (TABLE, VIEW) and the target catalog objects of a query must be within the included catalogs list for its lineage to be extracted.

excludeCatalogs

List of excluded catalogs. If any of the source catalog objects (TABLE, VIEW) or the target catalog objects of a query are within the excluded catalogs list, its lineage is not included.

includeSchemas

Restricts the lineage extraction to a specified list of schemas.

excludeSchemas

List of excluded schemas. If any of the source catalog objects (TABLE, VIEW) or the target catalog objects of a query are within the excluded schemas list, its lineage is not included.

Databricks online scanner example configuration
{
   "baseOutputFolderName":null,
   "scannerConfigs":[
      {
         "sourceType":"DATABRICKS",
         "name":"DatabricksScan",
         "description":"Scan Databricks for us-east region",
         "lineageFromPastNDays":30,
         "host":"adb-4013958699284341.1.azuredatabricks.net",
         "httpPath":"/sql/1.0/warehouses/31525eab5763a6ed",
         "token":"@@ref:ata:[DATABRICKS_TOKEN]",
         "workspaceUrl":"https://adb-4013958699284341.1.azuredatabricks.net",
         "includeCatalogs":[
            "prod_stage",
            "prod_bi"
         ],
         "excludeCatalogs":[
            ""
         ],
         "includeSchemas":[
            "silver",
            "gld%"
         ],
         "excludeSchemas":[
            ""
         ]
      }
   ]
}

Determine the HTTP path for a SQL warehouse

The scanner builds the JDBC connection string from the host and httpPath configuration parameters. With the Databricks JDBC driver, the resulting URL takes roughly the following form (shown with the values from the example configuration above):

jdbc:databricks://adb-4013958699284341.1.azuredatabricks.net:443;httpPath=/sql/1.0/warehouses/31525eab5763a6ed

File-based Databricks lineage processing

Databricks system tables can be manually exported into a CSV file, which can then be processed to extract lineage information without a direct connection to Databricks.

Before executing the query, adjust the time range if needed. By default it is set to 7 days, which means that only lineage from the last 7 days is extracted.

SQL query to export Lineage metadata into CSV

WITH filtered_column_lineage AS (
  SELECT
    account_id,
    metastore_id,
    workspace_id,
    entity_type,
    entity_id,
    entity_run_id,
    source_table_full_name,
    source_table_catalog,
    source_table_schema,
    source_table_name,
    source_path,
    source_type,
    lower(source_column_name) AS source_column_name,
    target_table_full_name,
    target_table_catalog,
    target_table_schema,
    target_table_name,
    target_path,
    target_type,
    lower(target_column_name) AS target_column_name,
    event_time
  FROM system.access.column_lineage l
  WHERE source_column_name IS NOT NULL
    AND target_column_name IS NOT NULL
    AND event_time > current_date() - INTERVAL 7 DAYS
  QUALIFY
    (source_table_name IS NULL OR target_table_name IS NULL OR
    (row_number() OVER (PARTITION BY source_table_full_name, target_table_full_name, lower(source_column_name), lower(target_column_name) ORDER BY event_time DESC) = 1))
)
SELECT lin.*,
  dense_rank() OVER (ORDER BY lin.source_table_full_name, lin.target_table_full_name, lin.entity_id, lin.entity_run_id, lin.event_time DESC) AS unique_transformation_id,
  src_tab.table_name AS source_table_name_from_data_dict,
  lower(src_col.column_name) AS source_column_name_from_data_dict,
  trg_tab.table_name AS target_table_name_from_data_dict,
  lower(trg_col.column_name) AS target_column_name_from_data_dict
FROM filtered_column_lineage lin
LEFT JOIN system.information_schema.tables src_tab ON
  src_tab.table_catalog = lin.source_table_catalog AND
  src_tab.table_schema = lin.source_table_schema AND
  src_tab.table_name = lin.source_table_name
LEFT JOIN system.information_schema.columns src_col ON
  src_col.table_catalog = lin.source_table_catalog AND
  src_col.table_schema = lin.source_table_schema AND
  src_col.table_name = lin.source_table_name AND
  lower(src_col.column_name) = lin.source_column_name
LEFT JOIN system.information_schema.tables trg_tab ON
  trg_tab.table_catalog = lin.target_table_catalog AND
  trg_tab.table_schema = lin.target_table_schema AND
  trg_tab.table_name = lin.target_table_name
LEFT JOIN system.information_schema.columns trg_col ON
  trg_col.table_catalog = lin.target_table_catalog AND
  trg_col.table_schema = lin.target_table_schema AND
  trg_col.table_name = lin.target_table_name AND
  lower(trg_col.column_name) = lin.target_column_name;
In the case of file-based processing, Databricks Notebook metadata enrichment is not available.
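
The exported CSV should therefore contain the columns produced by the query above; for reference, the header row (column order follows the SELECT list) is:

account_id,metastore_id,workspace_id,entity_type,entity_id,entity_run_id,source_table_full_name,source_table_catalog,source_table_schema,source_table_name,source_path,source_type,source_column_name,target_table_full_name,target_table_catalog,target_table_schema,target_table_name,target_path,target_type,target_column_name,event_time,unique_transformation_id,source_table_name_from_data_dict,source_column_name_from_data_dict,target_table_name_from_data_dict,target_column_name_from_data_dict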

File-based scan toolkit configuration

Property Description

name*

Unique name for the scanner job.

sourceType*

Specifies the source type to be scanned. Must contain DATABRICKS.

description*

A human-readable description of the scan.

oneConnections

List of Ataccama ONE connection names for future automatic pairing.

columnLineageFile*

Path to the column lineage CSV file. The path can be absolute or relative.

All fields marked with an asterisk (*) are mandatory.

Example configuration
{
  "scannerConfigs": [
    {
      "sourceType": "DATABRICKS",
      "name": "DatabricksScan File based",
      "description": "Scan Databricks for us-east region",
      "columnLineageFile": "work/databricks_column_lineage.csv"
    }
  ]
}

What metadata is extracted from Databricks

  • Column lineage (system.access.column_lineage)

  • Workspaces, directories, and notebooks (using the Databricks REST API), with the following fields:

    • Path

    • Language

    • Creation/modification timestamps

    • Object ID

    • Resource ID

  • Additional data dictionary metadata (using Unity Catalog REST API):

    • Catalogs

    • Schemas

    • Tables
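
For illustration, this additional metadata corresponds to Unity Catalog REST API listing calls such as the following (the workspace URL and object names are placeholders; each request is authenticated with the configured token via an Authorization: Bearer header):

GET https://<workspaceUrl>/api/2.1/unity-catalog/catalogs
GET https://<workspaceUrl>/api/2.1/unity-catalog/schemas?catalog_name=my_catalog
GET https://<workspaceUrl>/api/2.1/unity-catalog/tables?catalog_name=my_catalog&schema_name=my_schema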

Known limitations

Limitations of Unity Catalog itself

Scanner limitations (online and file)

  • Cross lineage from/to a cloud storage path (S3, Azure Data Lake Storage) is not currently supported, including scenarios such as:

    • Loading data from cloud storage files

    • Unloading data from Databricks to cloud storage files

    • Example of an unsupported statement, reading a CSV file on S3 into a Databricks table:

      CREATE TABLE test_table AS
      SELECT * FROM read_files(
        's3://<bucket>/<path>/<file>.csv',
        format => 'csv',
        header => true);

  • Missing job information for notebooks executed from jobs. When lineage originates from a notebook that was executed from a job’s task, the Unity Catalog lineage table contains only the identification of the notebook entity. This is a limitation of Unity Catalog system tables, which record only the Databricks entity that is the direct source of the lineage event (the notebook).

File-based scanner limitations

  • Lack of detailed Notebook metadata, which affects the completeness of the lineage information.

  • Lineage from/to external tables is not captured accurately when the tables are referenced using the delta keyword (delta.`<cloud storage path>`) instead of the external table name.

FAQ

  1. What if we are using Databricks across multiple cloud regions? Will cross-region lineage be available in Ataccama ONE, and how should the scanner be configured?

    As a single Unity Catalog instance (and a single metastore) governs all catalogs, schemas, and tables within an Azure/AWS/GCP region, a single Databricks connection can capture lineage only from a single cloud region. Customers with Unity Catalog in multiple cloud regions need a separate lineage scan for each region.

  2. I don’t have lineage in Unity Catalog for some notebooks. What could be the reason?

    One possible reason is that intermediate tables are cleaned up as part of the load process. Due to the absence of those temp/worker tables, Databricks might not be able to generate lineage in Unity Catalog.
