
dbt Lineage Scanner

Introduction

This guide describes how the dbt scanner integrates with Snowflake, covering the supported integration methods, their configuration, and the metadata that is extracted. The scanner requires dbt version 1.0 or higher.

Integration methods

The dbt scanner facilitates metadata extraction through two primary integration methods:

File-based dbt scanner

Description

The file-based dbt scanner is intended primarily for dbt Core users but also works for dbt Cloud customers. It is particularly useful in environments where direct API access is limited or unavailable.

Limitations

File-based integration is limited in certain capabilities compared to Cloud API integration, including:

  • Separate configuration requirements for each dbt project

  • Inability to support runtime metadata extraction

Toolkit configuration

Property | Description
--- | ---
*name | Unique name of the scanner job.
*sourceType | Must contain DBT.
*description | Human-readable description.
oneConnections | List of Ataccama ONE connection names for future automatic pairing.
*file.path | Path to the dbt artifacts.
*file.manifest | dbt manifest file. Contains model, source, test, and lineage data. dbt documentation: docs.getdbt.com/reference/artifacts/manifest-json
file.catalog | dbt catalog file (optional). Contains schema data. dbt documentation: docs.getdbt.com/reference/artifacts/catalog-json

Legend: *mandatory

Example configuration
{
  "scannerConfigs": [
    {
      "name": "My dbt on-prem project",
      "sourceType": "DBT",
      "description": "Scan dbt on-prem project",
      "file": {
        "path": "c:\\dbt_input_files",
        "manifest": "manifest.json",
        "catalog": "catalog.json"
      }
    }
  ]
}
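
Before running a file-based scan, it can help to confirm that the artifact directory is usable. The following is a minimal sketch, not part of the scanner: the function name `check_dbt_artifacts` is hypothetical, and it assumes the manifest records the dbt version under `metadata.dbt_version`, as dbt-generated manifests do. It checks that the mandatory manifest exists, that the dbt version is 1.0 or higher, and notes a missing optional catalog.

```python
import json
from pathlib import Path


def check_dbt_artifacts(path, manifest="manifest.json", catalog="catalog.json"):
    """Return a list of problems found in a dbt artifact directory (empty if none)."""
    problems = []
    base = Path(path)
    manifest_file = base / manifest
    if not manifest_file.is_file():
        # Without the mandatory manifest nothing else can be checked.
        return [f"mandatory manifest not found: {manifest_file}"]

    with manifest_file.open(encoding="utf-8") as fh:
        data = json.load(fh)

    # Assumption: the dbt version is stored under metadata.dbt_version.
    version = str((data.get("metadata") or {}).get("dbt_version") or "")
    try:
        major = int(version.split(".")[0])
    except ValueError:
        major = None
    if major is None:
        problems.append(f"cannot read dbt version from manifest: {version!r}")
    elif major < 1:
        problems.append(f"dbt {version} is older than the required 1.0")

    # The catalog is optional, so a missing file is reported as a note only.
    if catalog and not (base / catalog).is_file():
        problems.append(f"note: optional catalog not found: {base / catalog}")
    return problems
```

Calling `check_dbt_artifacts("c:\\dbt_input_files")` with the values from the example configuration returns an empty list when both files are present and the manifest was produced by dbt 1.0 or later.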

Cloud API-based dbt scanner

Description

The Cloud API-based scanner provides comprehensive metadata and lineage extraction from all dbt projects that the provided dbt API token can access.

API choices

Whether to use DISCOVERY_API or ADMIN_API depends on your dbt Cloud deployment and the features it provides.

dbt Cloud deployments

For the most current information, see the dbt Cloud documentation.

Toolkit configuration

Property | Description
--- | ---
*name | Unique name of the scanner job.
*sourceType | Must contain DBT.
*description | Human-readable description.
oneConnections | List of Ataccama ONE connection names for future automatic pairing.
*connection.type | Possible values: DISCOVERY_API or ADMIN_API.
*connection.adminApiUrl | dbt Cloud Admin (REST) API URL. Possible URLs (as of 20.12.2023): Production (US): cloud.getdbt.com; Production (Europe): emea.dbt.com; Production (AU): au.dbt.com. dbt documentation: docs.getdbt.com/docs/dbt-cloud-apis/admin-cloud-api
*connection.discoveryApiUrl | Mandatory only for connection type DISCOVERY_API. Your dbt Discovery (GraphQL) API URL endpoint. See: docs.getdbt.com/docs/dbt-cloud-apis/service-tokens. dbt documentation: docs.getdbt.com/docs/dbt-cloud-apis/discovery-querying
*connection.accessToken | dbt service account token. For how to create one, see the dbt documentation: docs.getdbt.com/docs/dbt-cloud-apis/service-tokens
*onlyProductionEnvironments | Boolean value. If set to true, lineage is extracted only for dbt jobs executed in dbt environments marked as production environments. If omitted, it defaults to true. dbt documentation: docs.getdbt.com/docs/deploy/deploy-environments#set-as-production-environment
includeProjects | List of included dbt projects. Use it to limit lineage extraction to the specified dbt projects.
excludeProjects | List of excluded dbt projects.

💡 Configure either includeProjects or excludeProjects, or neither. The two filters are mutually exclusive, so setting excludeProjects when includeProjects is set makes no sense.

Legend: *mandatory

Example configuration
{
  "scannerConfigs": [
    {
      "name": "dbt ATA prod - Discovery API",
      "sourceType": "DBT",
      "description": "Scan dbt sources",
      "onlyProductionEnvironments": true,
      "includeProjects": ["Dwh", "Customer Mart"],
      "excludeProjects": [],
      "connection": {
        "type": "DISCOVERY_API",
        "adminApiUrl": "https://cloud.getdbt.com",
        "discoveryApiUrl": "https://metadata.cloud.getdbt.com/graphql",
        "accessToken":  "@@ref:ata:[DBT_TOKEN]"
      }
    }
  ]
}
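
The configuration rules for the Cloud API-based scanner can be checked before a scan is started. The sketch below is illustrative only: `validate_cloud_scanner_config` is a hypothetical helper, not a toolkit function, and it assumes that "must contain DBT" means the value equals `DBT`, as in the examples. It covers the mandatory properties, the rule that `discoveryApiUrl` is required only for `DISCOVERY_API`, and the mutual exclusivity of `includeProjects` and `excludeProjects`.

```python
def validate_cloud_scanner_config(cfg):
    """Return a list of rule violations for one Cloud API scannerConfigs entry."""
    errors = []
    for key in ("name", "sourceType", "description"):
        if not cfg.get(key):
            errors.append(f"missing mandatory property: {key}")
    # Assumption: "must contain DBT" is read as an exact match.
    if cfg.get("sourceType") not in (None, "", "DBT"):
        errors.append("sourceType must contain DBT")

    conn = cfg.get("connection") or {}
    conn_type = conn.get("type")
    if conn_type not in ("DISCOVERY_API", "ADMIN_API"):
        errors.append("connection.type must be DISCOVERY_API or ADMIN_API")
    for key in ("adminApiUrl", "accessToken"):
        if not conn.get(key):
            errors.append(f"missing mandatory property: connection.{key}")
    # The Discovery API URL is only needed when that API is selected.
    if conn_type == "DISCOVERY_API" and not conn.get("discoveryApiUrl"):
        errors.append("connection.discoveryApiUrl is mandatory for DISCOVERY_API")

    # Both filters non-empty at once is contradictory; empty lists are fine.
    if cfg.get("includeProjects") and cfg.get("excludeProjects"):
        errors.append("set includeProjects or excludeProjects, not both")
    return errors
```

Under these assumptions, an `ADMIN_API` entry that omits `discoveryApiUrl` passes, while an entry that lists projects in both filters is reported.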

Metadata extraction details

Metadata extracted from dbt projects is critical for maintaining up-to-date and accurate data landscapes. The following table outlines the specific types of metadata available for extraction from dbt projects, particularly when using dbt Cloud:

Metadata Type | Supported in dbt Version | Description
--- | --- | ---
Runtime metadata | dbt Cloud only | Includes dynamic metadata viewable in diagrams, such as dbt model descriptions, job names, and execution times.
dbt model description | All versions | Detailed descriptions of each model, which can include information like the purpose of the model, its design, and key considerations.
dbt job name | All versions | Name of the dbt job in which the model was last executed, which is crucial for tracking and managing dbt jobs.
Last run timestamp | All versions | Timestamp indicating when the model was last run, useful for monitoring model freshness and scheduling updates.
Last successful run timestamp | Optional | Timestamp of the last successful run, providing insights into the reliability and stability of the dbt processes.
Last run status | All versions | Current status of the last run, which helps in identifying successes or failures in recent executions.
Last run duration | All versions | Duration of the last run, important for performance analysis and optimization.
Last run processed rows | Certain databases | Number of rows processed in the last run, applicable only to specific databases and providing a measure of the volume of data handled.
Identification of hardcoded source object names | Backend only | Identifies dbt models where source object names are hardcoded, which is a development anti-pattern.
Identification of orphan CTEs | Backend only | Identifies dbt models with unreferenced WITH clauses, potentially indicating performance issues or incorrect impact analysis.

This detailed metadata extraction is instrumental in maintaining an efficient, transparent, and optimized data management environment. The information enables teams to monitor dbt project health, track changes, and plan improvements based on empirical data.

Supported database platforms

Database Platform | dbt Cloud Support | Datastore Level Lineage | Attribute Level Lineage | Python Models Support in dbt
--- | --- | --- | --- | ---
Snowflake | Yes | Yes | Yes* | Yes
PostgreSQL | Yes | Yes | Partially* | No
Amazon Redshift | Yes | Yes | Partially* | No
Microsoft Fabric | Yes | Yes | Partially* | Yes
Databricks | Yes | Yes | Partially* | Yes
Google BigQuery | Yes | Yes | In Next Scanner Release | Yes
Apache Spark | Yes | No | No | No
Starburst/Trino | Yes | No | No | No
Azure Synapse | No | Yes | Partially* | No
Other | No | No | No | No

Notes on attribute level lineage and future support plans:

  • Yes* - Attribute level lineage is supported, with one exception:

    • For dbt Python models, only datastore level lineage is supported. Column lineage for Python models is planned.

  • Partially* - Attribute level lineage is available only when:

    • dbt documentation generation is enabled (so that catalog.json is available).

    • The dbt model is an SQL model (not Python) and uses ANSI SQL syntax.

Other database technologies that use ANSI SQL syntax (for example, Oracle or MySQL) can be supported with minimal effort.

Once a dedicated internal scanner for a specific database technology (for example, Azure Synapse) is available, full attribute level lineage will be available for that technology.
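
The attribute level lineage rules can be read as a simple lookup. The sketch below is a rough illustration only: the dictionary mirrors the Attribute Level Lineage column of the support table, and the function name and parameters are hypothetical. It does not check the ANSI SQL condition, which depends on the SQL of each model.

```python
# Attribute level lineage per platform, copied from the support table.
# Platforms marked "No" or "In Next Scanner Release" are omitted on purpose.
ATTRIBUTE_LINEAGE = {
    "Snowflake": "yes",
    "PostgreSQL": "partial",
    "Amazon Redshift": "partial",
    "Microsoft Fabric": "partial",
    "Databricks": "partial",
    "Azure Synapse": "partial",
}


def attribute_lineage_expected(platform, is_python_model=False, has_catalog=False):
    """Return True if attribute level lineage can be expected for a model."""
    level = ATTRIBUTE_LINEAGE.get(platform)
    # Python models get datastore level lineage only, on every platform.
    if level is None or is_python_model:
        return False
    # "Partially" requires catalog.json (dbt documentation generation enabled).
    if level == "partial":
        return has_catalog
    return True
```

For example, a PostgreSQL SQL model is expected to have attribute level lineage only when `has_catalog` is true, while any Python model falls back to datastore level lineage.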

Unsupported features and future developments

This section highlights the dbt features that are currently not supported by the scanner, as well as insights into future developments and planned enhancements.

Currently unsupported dbt features

  • dbt Semantic Layer utilizing MetricFlow

  • Cross-project references (part of dbt Mesh)

  • dbt artifacts (projects, models) as catalog items

  • Enrichment of catalog items with metadata defined in dbt

    • For instance, custom metadata or tags defined at the table or column level in dbt are not available on catalog items (platforms such as Atlan do support this).

  • dbt tests

Future developments

Planning for future releases includes enhancements in these areas:

  • Full attribute level lineage for Azure Synapse, pending the development of a specific internal database technology scanner.

  • Support for column lineage in dbt Python models, which is planned but not yet implemented.

We want to keep users informed about the limitations of the current integration and what to expect in upcoming updates. With this clarity, users can better plan their dbt implementation strategies and understand how future changes might improve their operations.

Scanner special features

The dbt scanner supports the same SQL "anomaly" detection checks as the Snowflake scanner. See the SQL "DQ" (Anomaly) detections for more details.

FAQ

These are some common questions related to dbt scanner integration:

  • How much time will it take to scan lineage from dbt with around 500 models?

    • File input: 1 minute

    • Discovery API: 2 minutes

    • Admin API: 6 minutes
