BigQuery Lineage Scanner

Scanned and supported objects

While we strive for comprehensive lineage capture, certain dataflows and transformations might be incomplete or unavailable due to technical constraints. We continuously work to expand coverage and accuracy.

Supported statements types

CREATE TABLE AS SELECT
CREATE VIEW
INSERT
MERGE
SELECT
UPDATE

Supported database objects

Views
Materialized views

Limitations

Some SQL constructs might not be supported.

Fallback lineage provides table-level lineage only. See Fallback table-level lineage > Limitations.

Supported connectivity

The scanner connects to the BigQuery server using the following connection method:

Connector type: JDBC.
Authentication method: Service account key.

BigQuery permissions

Google Cloud Platform (GCP) roles and permissions for metadata extraction

To enable the extraction of metadata across multiple BigQuery projects, ensure that the following permissions and roles are correctly assigned to the service account. These permissions are required for each project where metadata will be extracted.

The same service account should be used for all projects.

Before you start

Keep in mind:

The following roles must be assigned per project where metadata extraction is required.
Use the same service account across all projects to streamline the metadata extraction process.

Step 1: Assign required GCP roles to the service account

The service account must have sufficient permissions to query metadata and access query history. Assign the following roles to the service account for each project:

BigQuery Metadata Viewer (roles/bigquery.metadataViewer)

Grants permissions to read metadata about datasets, tables, and views.
Needs to be set for all extracted projects.
Required to allow access to table, column, and view metadata.

BigQuery Resource Viewer (roles/bigquery.resourceViewer)

Grants permissions to view BigQuery workloads but does not grant permission to create or modify any slot reservations or commitments.
Needs to be set for all extracted projects.
Required to allow access to the query metadata.

BigQuery Job User (roles/bigquery.jobUser)

Grants permissions to run queries and list jobs.
Needs to be set in the project for which service account key is created.
Required to allow access to query history and enables running queries to extract metadata.

Step 2: Generate a service account key file

Once you assign the required roles to the service account, generate a service account key file for authentication:

Go to the Google Cloud Console.
Navigate to IAM & Admin > Service Accounts.
Select the service account that will be used for metadata extraction.
Select Keys > Add Key > Create New Key.
Choose the JSON key type and download the key file.

Keep this JSON key file secure, as it contains credentials for the service account. Use this key file to authenticate the service account when extracting metadata.

Step 3: Enable fallback table-level lineage (optional)

If you want to use the fallback lineage option, you need to:

Set enableFallbackTableLineage to true in the scanner configuration.

Enable the Data Lineage API in all GCP projects where you plan to use fallback table-level lineage.

Note that BigQuery Data Lineage has a maximum retention period of 30 days and there are costs associated with enabling lineage collection. For detailed information about enabling the API, retention periods, and pricing, see the official BigQuery Data Lineage documentation.

Assign the required datalineage.events.get permission to the service account.

You can grant this by assigning the Data Lineage Viewer role (roles/datalineage.viewer) to the service account for each project where fallback lineage will be used. Alternatively, create a custom role with the datalineage.events.get permission.

Scanner configuration

All fields marked with an asterisk (*) are mandatory.

Property Description

Property	Description
`name*`	Unique name for the scanner job.
`sourceType*`	Specifies the source type to be scanned. Must contain `BIGQUERY`.
`description*`	A human-readable description of the scan.
`includeProjectIds`	List of project IDs to include in the scan.
`excludeProjectIds`	List of project IDs to exclude from the scan.
`regionNames`	List of regions to scan (such as `us`, `eu`). Default value: `us`.
`skipTableColumnsExtraction`	If set to `true`, table column metadata is not extracted. Default value: `false`.
`skipViewsExtraction`	If set to `true`, view metadata is not extracted. Default value: `false`.
`lineageFromPastNDays`	Lineage history in days. For example, if set to `10`, the lineage extract will contain lineage from transformations executed in the past 10 days.
`enableFallbackTableLineage`	Enable fallback to Google’s Data Lineage API for table-level lineage when SQL parsing fails. Default value: `false`. When configured, table-level lineage is provided for queries that could not be parsed. See Fallback table-level lineage > Limitations.
`connection.projectId*`	The project ID of the project where the service account is created.

name*

Unique name for the scanner job.

sourceType*

Specifies the source type to be scanned. Must contain BIGQUERY.

description*

A human-readable description of the scan.

includeProjectIds

List of project IDs to include in the scan.

excludeProjectIds

List of project IDs to exclude from the scan.

regionNames

List of regions to scan (such as us, eu). Default value: us.

skipTableColumnsExtraction

If set to true, table column metadata is not extracted. Default value: false.

skipViewsExtraction

If set to true, view metadata is not extracted. Default value: false.

lineageFromPastNDays

Lineage history in days. For example, if set to 10, the lineage extract will contain lineage from transformations executed in the past 10 days.

enableFallbackTableLineage

Enable fallback to Google’s Data Lineage API for table-level lineage when SQL parsing fails. Default value: false.

When configured, table-level lineage is provided for queries that could not be parsed. See Fallback table-level lineage > Limitations.

connection.projectId*

The project ID of the project where the service account is created.

BigQuery scanner example configuration

{
   "scannerConfigs": [
      {
         "name": "example-config",
         "sourceType": "BIGQUERY",
         "description": "Example configuration for the BigQuery Lineage Scanner",
         "includeProjectIds":[
            "project1",
            "project2"
         ],
         "excludeProjectIds":[
            "project3"
         ],
         "regionNames":[
            "us",
            "eu"
         ],
         "skipTableColumnsExtraction": false,
         "skipViewsExtraction": false,
         "lineageFromPastNDays": 5,
         "enableFallbackTableLineage": false,
         "connection": {
            "projectId": "example-project-id"
         }
      }
   ]
}

Fallback table-level lineage

When SQL parsing fails to resolve lineage for certain BigQuery queries, the scanner can fall back to Google’s Data Lineage API to retrieve table-level lineage. This fallback mechanism helps ensure lineage coverage even when queries contain SQL constructs that cannot be parsed.

To enable fallback lineage, set enableFallbackTableLineage to true in your scanner configuration. See Scanner configuration for details.

How it works

The fallback lineage process goes through the following steps:

The scanner first attempts to extract lineage by parsing SQL queries from query history.
For queries that cannot be parsed or analyzed, the scanner identifies unprocessed target tables and views.
When fallback is configured, the scanner queries Google’s Data Lineage API to retrieve source tables for these unprocessed targets.
Table-level lineage relationships are then created from the API response.

The lineageFromPastNDays parameter applies to both the main lineage extraction and fallback lineage processing, determining how far back in history to look for lineage information.

Limitations

Fallback lineage provides table-level lineage only. Column-level lineage is not available through the fallback mechanism as it requires successful SQL parsing and analysis.
The fallback mechanism only processes target tables and views that were not successfully analyzed during the main lineage extraction phase.

Troubleshooting

Here are some common issues you might encounter while using the scanner, along with suggested solutions.

Cannot authenticate with the service account

Problem

Preflight checks do not pass and authentication with the service account fails. The logs contain a similar message: Preflight check failed: ….

Solution

Double-check that the service account key file or its content is correctly specified in the configuration.
If providing a file, ensure the file path can be accessed and is correctly referenced.

Cannot extract metadata from BigQuery

Problem

The scanner cannot extract metadata from BigQuery datasets. The logs contain a similar message: Failed to extract table columns for project <project-name> for region <region-name>.

Solution

Confirm that the service account has the necessary roles and permissions assigned in the GCP project.
Ensure the service account has BigQuery Metadata Viewer role assigned for the project.

Cannot extract lineage information

Problem

Jobs extraction fails. The logs contain a similar message: Failed to extract query jobs for project <project-name> for region <region-name>.

Solution

Check the role BigQuery Resource Viewer is assigned to the service account.
Check, that the regionNames are correctly specified in the configuration file.
Check that there is a lineage metadata to extract in the last lineageFromPastNDays days. If there isn’t any, increase the value appropriately.

Cannot extract metadata from specific projects

Problem

Metadata extraction fails for certain BigQuery projects. The logs contain a similar message for a specific project: Failed to extract metadata for project <project-name> for region <region-name>.

Solution

Verify that the project IDs are correctly specified in the configuration file.
Ensure that the service account has sufficient access to those projects.

Lineage is missing

Problem

The lineage is missing or incomplete even though the extraction finished without any issues.

This can be caused by multiple issues. Check the sql_analyzer_error_log.csv log file: for the queries present in the file, there is no lineage available at the moment.

Solution

Share the sql_analyzer_error_log.csv log file with the Ataccama Support team for further analysis.

Fallback lineage processing fails

Problem

Fallback lineage processing fails with permission errors or API unavailable errors. The logs contain messages such as Permission denied. Ensure the service account has 'datalineage.events.get' permission and Data Lineage API is enabled in project <project-id> or Data Lineage API is unavailable for location <location>.

Solution

Identify which GCP project is causing the error from the log message. The error is project-specific, so you need to address it for each affected project.
Ensure that the Data Lineage API is enabled in the specific GCP project(s) where the error occurs:
- Go to the Google Cloud Console.
- Navigate to APIs & Services > Library.
- Search for "Data Lineage API" and verify it is configured for the project.
Verify that the service account has the datalineage.events.get permission for the affected project:
- Assign the Data Lineage Viewer role (roles/datalineage.viewer) to the service account for each project where fallback lineage is used.
- Alternatively, create a custom role with the datalineage.events.get permission.
Check that the region specified in regionNames is supported by the Data Lineage API.
Verify that enableFallbackTableLineage is set to true in your scanner configuration if you intend to use this feature.

Was this page useful?