dbt Lineage Scanner
Introduction
This guide details the dbt scanner integration, focusing on supported integration methods, their configuration, and metadata extraction. Make sure your project uses dbt version 1.0 or higher.
Integration methods
The dbt scanner facilitates metadata extraction through two primary integration methods:
- File-based dbt scanner: optimized for dbt Core users
- Cloud API-based dbt scanner: suitable for dbt Cloud deployments
File-based dbt scanner
Description
The file-based dbt scanner caters primarily to dbt Core users but is also functional for dbt Cloud customers. This method is particularly useful for environments where direct API access is limited or unavailable.
Limitations
File-based integration is limited in certain capabilities compared to Cloud API integration, including:
- Separate configuration requirements for each dbt project
- Inability to support runtime metadata extraction
Toolkit configuration
Property | Description
---|---
name | Unique name of the scanner job.
sourceType | Must contain DBT.
description | Human-readable description.
 | List of Ataccama ONE connection names for future automatic pairing.
file.path | Path to the dbt artifacts.
file.manifest | The dbt manifest file. Contains model, source, test, and lineage data. dbt documentation: docs.getdbt.com/reference/artifacts/manifest-json
file.catalog | The dbt catalog file (optional). Contains schema data. dbt documentation: docs.getdbt.com/reference/artifacts/catalog-json
Legend: *mandatory
{
"scannerConfigs": [
{
"name": "My dbt on-prem project",
"sourceType": "DBT",
"description": "Scan dbt on-prem project",
"file": {
"path": "c:\\dbt_input_files",
"manifest": "manifest.json",
"catalog": "catalog.json"
}
}
]
}
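Both input files are produced by dbt itself: running dbt docs generate writes target/manifest.json and target/catalog.json into the project directory, and these can then be copied to the configured path. To illustrate the kind of data the scanner reads from the manifest, here is a minimal, hypothetical sketch — the sample manifest is invented and heavily simplified (real manifests contain many more keys), and this is not the scanner's actual parsing code:

```python
# Minimal sketch: map each dbt model in a manifest to its parent nodes,
# using the manifest's "nodes" and "parent_map" sections.
def list_model_lineage(manifest: dict) -> dict:
    """Return {model unique_id: [parent unique_ids]} from a dbt manifest."""
    return {
        unique_id: manifest.get("parent_map", {}).get(unique_id, [])
        for unique_id, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") == "model"  # skip tests, seeds, etc.
    }

# Invented, simplified stand-in for json.load(open("target/manifest.json")).
sample_manifest = {
    "nodes": {
        "model.dwh.customers": {"resource_type": "model"},
        "test.dwh.not_null_customers_id": {"resource_type": "test"},
    },
    "parent_map": {
        "model.dwh.customers": ["source.dwh.raw.customers_raw"],
    },
}

print(list_model_lineage(sample_manifest))
# → {'model.dwh.customers': ['source.dwh.raw.customers_raw']}
```

In practice you would load the real file, e.g. with json.load(open("target/manifest.json")), instead of the inline sample.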
Cloud API-based dbt scanner
Description
The Cloud API-based scanner provides comprehensive metadata and lineage extraction from all dbt projects that the provided dbt API token can access.
API choices
Choose between DISCOVERY_API and ADMIN_API based on your dbt Cloud deployment and the features each API offers, as detailed below. For the most current information, check the dbt Cloud documentation.
Toolkit configuration
Property | Description
---|---
name | Unique name of the scanner job.
sourceType | Must contain DBT.
description | Human-readable description.
 | List of Ataccama ONE connection names for future automatic pairing.
connection.type | Possible values: DISCOVERY_API or ADMIN_API.
connection.adminApiUrl | dbt Cloud Admin (REST) API URL. Possible URLs (as of 20.12.2023): Production (US): cloud.getdbt.com; Production (Europe): emea.dbt.com; Production (AU): au.dbt.com. dbt documentation: docs.getdbt.com/docs/dbt-cloud-apis/admin-cloud-api
connection.discoveryApiUrl | Mandatory only for connection type DISCOVERY_API. Your dbt Discovery (GraphQL) API URL endpoint. dbt documentation: docs.getdbt.com/docs/dbt-cloud-apis/discovery-querying
connection.accessToken | dbt service account token. Refer to the dbt documentation on how to create one: docs.getdbt.com/docs/dbt-cloud-apis/service-tokens
onlyProductionEnvironments | Boolean value. If set to "true", lineage is extracted only for dbt jobs executed in dbt environments marked as production environments. If omitted, it defaults to "true". dbt documentation: docs.getdbt.com/docs/deploy/deploy-environments#set-as-production-environment
includeProjects | List of included dbt projects. Use it to limit lineage extraction to the specified projects. 💡 Configure includeProjects, excludeProjects, or neither. When includeProjects is set, excludeProjects has no effect, as the two filters are mutually exclusive.
excludeProjects | List of excluded dbt projects.
Legend: *mandatory
{
"scannerConfigs": [
{
"name": "dbt ATA prod - Discovery API",
"sourceType": "DBT",
"description": "Scan dbt sources",
"onlyProductionEnvironments": true,
"includeProjects": ["Dwh", "Customer Mart"],
"excludeProjects": [],
"connection": {
"type": "DISCOVERY_API",
"adminApiUrl": "https://cloud.getdbt.com",
"discoveryApiUrl": "https://metadata.cloud.getdbt.com/graphql",
"accessToken": "@@ref:ata:[DBT_TOKEN]"
}
}
]
}
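With connection type DISCOVERY_API, metadata is fetched by sending authenticated GraphQL queries to the configured discoveryApiUrl. The sketch below shows only the general request shape using Python's standard library; the query fields and the environment id are hypothetical placeholders, not the scanner's actual query — consult the Discovery API schema for real field names:

```python
import json
import urllib.request

def build_discovery_request(api_url: str, token: str, environment_id: int) -> urllib.request.Request:
    """Build an authenticated GraphQL POST request for the dbt Discovery API."""
    # Hypothetical query shape; check the Discovery API schema for real fields.
    query = """
    query Models($environmentId: BigInt!) {
      environment(id: $environmentId) {
        definition { models(first: 10) { edges { node { name } } } }
      }
    }
    """
    payload = json.dumps({"query": query, "variables": {"environmentId": environment_id}})
    return urllib.request.Request(
        api_url,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # the configured accessToken
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_discovery_request("https://metadata.cloud.getdbt.com/graphql", "dbtc_xxx", 1)
# The request would then be sent with urllib.request.urlopen(req).
```

The token and environment id here are dummy values; in a real setup the token comes from the accessToken property (for example via the @@ref secret reference shown above).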
Metadata extraction details
Metadata extracted from dbt projects is critical for maintaining up-to-date and accurate data landscapes. The following table outlines the specific types of metadata available for extraction from dbt projects, particularly when using dbt Cloud:
Metadata Type | Availability | Description
---|---|---
Runtime metadata | dbt Cloud only | Dynamic metadata viewable in diagrams, such as dbt model descriptions, job names, and execution times.
dbt model description | All versions | Detailed description of each model, which can include information such as the purpose of the model, its design, and key considerations.
dbt job name | All versions | Name of the dbt job in which the model was last executed; crucial for tracking and managing dbt jobs.
Last run timestamp | All versions | Timestamp indicating when the model was last run; useful for monitoring model freshness and scheduling updates.
Last successful run timestamp | Optional | Timestamp of the last successful run, providing insight into the reliability and stability of dbt processes.
Last run status | All versions | Status of the last run, which helps in identifying successes or failures in recent executions.
Last run duration | All versions | Duration of the last run; important for performance analysis and optimization.
Last run processed rows | Certain databases | Number of rows processed in the last run; applicable only to specific databases, it provides a measure of the volume of data handled.
Identification of hardcoded source object names | Backend only | Identifies dbt models where source object names are hardcoded, which is a development anti-pattern.
Identification of orphan CTEs | Backend only | Identifies dbt models with unreferenced (orphan) common table expressions.
This detailed metadata extraction is instrumental in maintaining an efficient, transparent, and optimized data management environment. The information enables teams to monitor dbt project health, track changes, and plan improvements based on empirical data.
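To make the "hardcoded source object names" check above concrete, a naive version of it could scan a model's raw SQL for FROM/JOIN targets that are not Jinja ref()/source() calls. This is an illustrative sketch only, not the scanner's backend implementation — a real check must, for example, also avoid flagging CTE names defined in the same model:

```python
import re

def find_hardcoded_relations(raw_sql: str) -> list:
    """Return FROM/JOIN targets that are plain identifiers rather than Jinja calls."""
    hardcoded = []
    for match in re.finditer(r"\b(?:from|join)\s+([^\s,()]+)", raw_sql, re.IGNORECASE):
        target = match.group(1)
        # {{ ref(...) }} / {{ source(...) }} targets start with the Jinja delimiter.
        if not target.startswith("{{"):
            hardcoded.append(target)
    return hardcoded

print(find_hardcoded_relations("select * from {{ source('raw', 'customers') }}"))
# → []
print(find_hardcoded_relations("select * from analytics.raw.customers"))
# → ['analytics.raw.customers']
```

The second call flags a direct schema-qualified table reference, which is exactly the anti-pattern the check reports.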
Supported database platforms
Database Platform | dbt Cloud Support | Datastore Level Lineage | Attribute Level Lineage | Python Models Support in dbt
---|---|---|---|---
Snowflake | Yes | Yes | Yes* | Yes
PostgreSQL | Yes | Yes | Partially* | No
Amazon Redshift | Yes | Yes | Partially* | No
Microsoft Fabric | Yes | Yes | Partially* | Yes
Databricks | Yes | Yes | Partially* | Yes
Google BigQuery | Yes | Yes | In next scanner release | Yes
Apache Spark | Yes | No | No | No
Starburst/Trino | Yes | No | No | No
Azure Synapse | No | Yes | Partially* | No
Other | No | No | No | No
Details on attribute-level lineage and future support plans are explained in the footnote below.
*Once a dedicated internal database technology scanner becomes available for a platform (e.g., Azure Synapse), full attribute-level lineage will be supported for it.
Unsupported features and future developments
This section highlights the dbt features that are currently not supported by the scanner, as well as insights into future developments and planned enhancements.
Currently unsupported dbt features
- dbt Semantic Layer utilizing MetricFlow
  - Introduced in dbt v1.6 - see: MetricFlow Documentation
- Cross-project references - as part of dbt Mesh
  - Introduced in dbt v1.6 - see: Cross-project References Documentation
  - Column lineage between cross-project references should be available, although it has not yet been tested/verified.
- dbt artifacts (projects, models) are not available as catalog items.
- Enrichment of catalog items with metadata defined in dbt.
  - For instance, custom metadata or tags defined at the table or column level in dbt are not available on catalog items (platforms like Atlan do support this).
- dbt tests are currently not supported.
Future Developments
Planning for future releases includes enhancements in these areas:
- Full attribute-level lineage for Azure Synapse, pending the development of a specific internal database technology scanner.
- Support for column lineage in dbt Python models, which is currently planned but not yet implemented.
We want to keep users informed about the limitations of the current integration and what to expect in upcoming updates. With this clarity, users can better plan their dbt implementation strategies and understand how future changes might improve their operations.
Scanner special features
The dbt scanner supports the same SQL "anomaly" detection checks as the Snowflake scanner. See SQL "DQ" (Anomaly) detections for more details.
FAQ
These are some common questions related to dbt scanner integration:
- How much time will it take to scan lineage from dbt with around 500 models?
  - File input: 1 minute
  - Discovery API: 2 minutes
  - Admin API: 6 minutes