BigQuery Lineage Scanner
Overview
The BigQuery Lineage Scanner extracts detailed lineage information from Google BigQuery using the Simba JDBC driver. The scanner creates a lineage from runtime lineage information and views and materialized views metadata. It uses metadata from INFORMATION_SCHEMA tables to provide accurate, column-level lineage insights.
Supported statement types and SQL syntax
-
Supported Statements Types:
-
CREATE TABLE AS SELECT
-
CREATE VIEW
-
INSERT
-
MERGE
-
SELECT
-
UPDATE
-
-
Supported Database Objects for Design Time Lineage:
-
Views
-
Materialized Views
-
-
Not yet supported SQL constructs:
-
BigQuery-specific SQL constructs are not yet supported.
-
Note: Most of the unsupported syntax is planned to be supported in the upcoming versions of the scanner, based on the prioritization.
Prerequisites
Required GCP Roles and Privileges for Metadata Extraction
To enable the extraction of metadata across multiple BigQuery projects, ensure that the following permissions and roles are correctly assigned to the service account. These permissions are required for each project where metadata will be extracted. The same service account should be used for all projects.
1. Assign Required GCP Roles to the Service Account
The service account must have sufficient privileges to query metadata and access query history. Assign the following roles to the service account for each project:
-
BigQuery Metadata Viewer (roles/bigquery.metadataViewer)
-
Grants permissions to read metadata about datasets, tables, and views.
-
Needs to be set for all extracted projects.
-
Purpose: Allows access to table, column, and view metadata.
-
-
BigQuery Resource Viewer (roles/bigquery.resourceViewer)
-
Grants permissions to view BigQuery workloads, but does not grant permission to create or modify any slot reservations or commitments.
-
Needs to be set for all extracted projects.
-
Purpose: Allows access to the query metadata.
-
-
BigQuery Job User (roles/bigquery.jobUser)
-
Grants permissions to run queries and list jobs.
-
Needs to be set in the project for which service account key is created.
-
Purpose: Allows access to query history and enables running queries to extract metadata.
-
2. Generate a Service Account Key File
Once the required roles are assigned to the service account, generate a service account key file for authentication:
-
Go to the Google Cloud Console.
-
Navigate to IAM & Admin > Service Accounts.
-
Select the service account that will be used for metadata extraction.
-
Click on Keys > Add Key > Create New Key.
-
Choose the JSON key type and download the key file.
Keep this JSON key file secure, as it contains credentials for the service account. Use this key file to authenticate the service account when extracting metadata. |
Toolkit configuration
The scanner is configured using the following JSON structure:
Property | Description |
---|---|
|
Unique name of the scanner job |
|
Must contain BIGQUERY |
|
Human-readable description |
|
List of project IDs to include in the scan |
|
List of project IDs to exclude from the scan |
|
List of regions to scan (e.g., |
|
If set to true, table column metadata is not extracted. Default: false |
|
If set to true, view metadata is not extracted. Default: false |
|
Specifies how many past days of query history to analyze for lineage extraction |
|
The project ID of the project where the service account is created |
|
Path to the service account key file for authentication |
|
The contents of the service account key file for authentication |
|
Email of the service account used for authentication |
Legend: *mandatory
{
"scannerConfigs": [
{
"name": "example-config",
"sourceType": "BIGQUERY",
"description": "Example configuration for the BigQuery Lineage Scanner",
"includeProjectIds": ["project1", "project2"],
"excludeProjectIds": ["project3"],
"regionNames": ["us", "eu"],
"skipTableColumnsExtraction": false,
"skipViewsExtraction": false,
"lineageFromPastNDays": 5,
"connection": {
"projectId": "example-project-id",
"serviceAccountKeyPath": "/path/to/service-account-key.json",
"serviceAccountEmail": "example-service-account@example.com"
}
}
]
}
Troubleshooting
Below are common issues you might encounter while using the scanner, along with suggested solutions.
Issue 1: Scanner Fails to Authenticate with the Service Account
-
Problem: Preflight checks do not pass, and authentication with the service account fails.
-
In logs there is a message like
Preflight check failed: …
-
-
Solution:
-
Double-check that the service account key file or its content is correctly specified in the configuration.
-
If configuring file, ensure the file path is accessible and correctly referenced.
-
Issue 2: Scanner Fails to Extract Metadata from BigQuery
-
Problem: The scanner cannot extract metadata from BigQuery datasets.
-
In the logs, there is a message like
Failed to extract table columns for project <project-name> for region <region-name>.
-
-
Solution:
-
Confirm that the service account has the necessary roles and permissions assigned in the GCP project.
-
Ensure the service account has "BigQuery Metadata Viewer" role assigned for the project.
-
Issue 3: Scanner Fails to Extract Lineage Information
-
Problem: Jobs extraction fails.
-
In the logs, there is a message like
Failed to extract query jobs for project <project-name> for region <region-name>.
-
-
Solution:
-
Check the role BigQuery Resource Viewer is assigned to the service account.
-
Check, that the regionNames are correctly specified in the configuration file.
-
Check that there is a lineage metadata to extract in last
lineageFromPastNDays
days, if no, increase the value.
-
Issue 4: Scanner Fails to Extract Metadata from Specific Projects
-
Problem: Metadata extraction fails for certain BigQuery projects.
-
In the logs, there is a message like
Failed to extract metadata for project <project-name> for region <region-name>.
for a specific project.
-
-
Solution:
-
Verify that the project IDs are correctly specified in the configuration file.
-
Ensure that the service account has sufficient access to those projects.
-
Issue 5: The lineage is missing
-
Problem: The lineage is missing or incomplete, but extraction ran without any issues.
-
This can be caused by multiple issues.
-
Please check the
sql_analyzer_error_log.csv
log file, for queries present there, there is no lineage at the moment.
-
-
Solution:
-
Share the
sql_analyzer_error_log.csv
log file with Ataccama support for further analysis.
-
Was this page useful?