OpenLineage Lineage Scanner
OpenLineage is an open-source standard for collecting and analyzing data lineage. The framework tracks metadata of scheduled events and captures how data is transformed from sources to outputs.
OpenLineage integrates with data processing frameworks such as Apache Airflow, Apache Spark, and dbt to capture lineage information. For more information, see the official OpenLineage documentation.
Supported file formats
When datasets referenced in OpenLineage events are stored in S3 (Amazon Web Services), the scanner supports the following file formats:
- CSV
- Microsoft Excel
- Parquet
- Iceberg
If the scanner encounters an unsupported file format, the lineage information for that dataset is not captured. Support for additional formats can be added on request. For source technologies with complex path definitions (such as S3-style paths from non-S3 systems), you might also need to provide example paths.
Supported connectivity
The OpenLineage scanner supports two ways of reading OpenLineage events:
- Marquez API-based scanner: Reads events from a Marquez (or compatible OpenLineage) instance via REST API.
- Local file-based scanner: Reads OpenLineage event JSON files from a local directory. Available for standalone Docker installations only.
Marquez API-based scanner
The Marquez API-based scanner connects to a running Marquez instance and retrieves completed OpenLineage run events via its REST API.
If you do not have a Marquez instance, a standalone lineage scanner bundle that includes Marquez is available. See Marquez bundle for OpenLineage.
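Before configuring the scanner, it can help to confirm that the Marquez REST API is reachable. The following is a minimal sketch, not the scanner's own implementation. It assumes the base URL does not already include the /api/v1 prefix and uses the Marquez namespace-listing endpoint. The helper names and the http://localhost:5000 address are hypothetical.

```python
import json
from urllib.request import urlopen


def namespaces_url(api_url: str) -> str:
    """Build the namespace-listing URL from a Marquez base URL.

    Assumption: api_url is a bare base URL without the /api/v1 prefix.
    """
    return api_url.rstrip("/") + "/api/v1/namespaces"


def list_namespaces(api_url: str, fetch=None) -> list[str]:
    """Return the namespace names reported by the Marquez instance.

    `fetch` takes a URL and returns the response body as text. It defaults
    to a plain HTTP GET and can be swapped out for offline checks.
    """
    if fetch is None:
        def fetch(url: str) -> str:
            with urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8")
    payload = json.loads(fetch(namespaces_url(api_url)))
    return [ns["name"] for ns in payload.get("namespaces", [])]


if __name__ == "__main__":
    # Hypothetical address; replace with your own Marquez base URL.
    print(list_namespaces("http://localhost:5000"))
```

The names returned this way are candidates for the namespaces property described in the following configuration.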
Scanner configuration
All fields marked with an asterisk (*) are mandatory.
| Property | Description |
|---|---|
| name | Unique name for the scanner job. |
| sourceType | Specifies the source type to be scanned. Must be OPEN_LINEAGE. |
| description | A human-readable description of the scan. |
| apiUrl | REST API base URL of the Marquez instance. Mandatory if using your own Marquez instance. If using the Marquez bundle for OpenLineage, a default value is provided. |
| namespaces | Optional list of job namespaces to include. If omitted or left empty, all namespaces are scanned. |
| | Lower bound for the event time filter (ISO-8601 format). Use together with the upper bound, jobsUntil. |
| jobsUntil | Upper bound for the event time filter (ISO-8601 format). Example: 2025-02-16T00:00:00Z. Use together with the lower bound. |
| lineageFromPastNDays | Scans events from the past N days relative to the time the scan runs. Must be a positive integer. Example: 7. Use as a convenient alternative to an explicit lower bound. |
{
  "scannerConfigs": [
    {
      "name": "Apache Spark processing",
      "sourceType": "OPEN_LINEAGE",
      "description": "Scan Apache Spark jobs via OpenLineage",
      "namespaces": [],
      "lineageFromPastNDays": 7,
      "jobsUntil": "2025-02-16T00:00:00Z"
    }
  ]
}
The apiUrl field is omitted here because the Marquez bundle installation provides a default value.
The field is required only when using your own Marquez instance.
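To reason about which events fall inside the time filter, the window can be worked out by hand. The sketch below is illustrative only and rests on an assumption the documentation does not spell out: that lineageFromPastNDays sets the lower bound to the scan time minus N days, and that jobsUntil, when present, sets the upper bound. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def parse_iso(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def event_window(
    past_n_days: int, jobs_until: Optional[str], now: datetime
) -> Tuple[datetime, datetime]:
    """Return (lower, upper) bounds for the event time filter.

    Assumption: the lower bound is `now` minus N days, and the upper bound
    is jobsUntil when set, otherwise `now`.
    """
    if isinstance(past_n_days, bool) or not isinstance(past_n_days, int) or past_n_days <= 0:
        raise ValueError("lineageFromPastNDays must be a positive integer")
    lower = now - timedelta(days=past_n_days)
    upper = parse_iso(jobs_until) if jobs_until else now
    return lower, upper


if __name__ == "__main__":
    scan_time = datetime.now(timezone.utc)
    print(event_window(7, None, scan_time))
```

Under this assumption, a scan with lineageFromPastNDays set to 7 and a jobsUntil value older than seven days would produce an empty window, so the two values should be chosen together.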
Local file-based scanner
As an alternative to reading events from a live Marquez API, the scanner can read OpenLineage event files directly from a local directory. This is useful when you have already collected events as JSON files, or when network access to a Marquez instance is not available.
Before you begin
Before running the scan, place the OpenLineage event files into a directory under <install>/user-data/.
For example, create a folder named open_lineage_dir:
<install>/user-data/open_lineage_dir/
<install> is the path where the standalone Docker scanner is installed.
For installation details, see Standalone Lineage Scanner.
Alternatively, the scanner can read from any local path that is accessible to it, provided the necessary read permissions are in place.
Each file in the directory must be a valid JSON file containing exactly one OpenLineage event.
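A quick pre-flight check of the directory can catch files that do not meet this requirement. The following is a minimal sketch under stated assumptions: a file passes if it parses as JSON, holds a single JSON object (not an array), and carries the eventType and job keys. The key check is a heuristic chosen for illustration, not the scanner's actual validation, and the function name is hypothetical.

```python
import json
from pathlib import Path


def check_event_files(directory: str) -> dict[str, str]:
    """Map each problematic file name to a short reason.

    An empty result means every file passed the heuristic checks.
    """
    problems: dict[str, str] = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems[path.name] = f"not valid JSON: {exc}"
            continue
        if not isinstance(event, dict):
            # An array would mean several events, or none, in one file.
            problems[path.name] = "expected a single JSON object"
        elif "eventType" not in event or "job" not in event:
            problems[path.name] = "missing eventType or job"
    return problems


if __name__ == "__main__":
    import sys

    for name, reason in check_event_files(sys.argv[1]).items():
        print(f"{name}: {reason}")
```

Run it against the host-side directory (for example, the open_lineage_dir folder) before starting the scan.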
Scanner configuration
All fields marked with an asterisk (*) are mandatory.
| Property | Description |
|---|---|
| name | Unique name for the scanner job. |
| sourceType | Specifies the source type to be scanned. Must be OPEN_LINEAGE. |
| description | A human-readable description of the scan. |
| eventsDirectory | Path to the directory containing the OpenLineage event JSON files, as seen from inside the scanner container. |
| namespaces | Optional list of job namespaces to include. If omitted or left empty, all namespaces are scanned. |
| | Lower bound for the event time filter (ISO-8601 format). Use together with the upper bound, jobsUntil. |
| jobsUntil | Upper bound for the event time filter (ISO-8601 format). Use together with the lower bound. |
| lineageFromPastNDays | Scans events from the past N days relative to the time the scan runs. Must be a positive integer. Example: 7. Use as a convenient alternative to an explicit lower bound. |
{
  "scannerConfigs": [
    {
      "name": "OpenLineageLocal",
      "sourceType": "OPEN_LINEAGE",
      "description": "Scan OpenLineage events from local files",
      "eventsDirectory": "/opt/ataccama/lineage-scanning/user-data/open_lineage_dir",
      "namespaces": [],
      "lineageFromPastNDays": 7,
      "jobsUntil": null
    }
  ]
}
Source technology configuration
To emit OpenLineage events from your data processing tools, configure the appropriate integration:
Apache Airflow
To enable emitting OpenLineage events from Airflow, follow the official guidelines: Apache Airflow Provider for OpenLineage.
Apache Spark
Configuration details vary depending on your Spark environment:
- AWS Glue: See OpenLineage AWS Glue Quickstart.
- Databricks: See OpenLineage Databricks Quickstart.
- Local Spark setup: See OpenLineage Local Spark Quickstart.
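Across these environments, the OpenLineage Spark integration is typically driven by a small set of Spark properties. The sketch below assembles them for an HTTP transport pointed at a Marquez instance. The property keys come from the OpenLineage Spark integration; the helper names, URL, and namespace are hypothetical, and the OpenLineage Spark JAR must additionally be made available to Spark as described in the quickstart for your environment.

```python
def openlineage_spark_conf(marquez_url: str, namespace: str) -> dict[str, str]:
    """Return Spark properties that enable the OpenLineage listener.

    Events are sent over HTTP to the given Marquez base URL.
    """
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": marquez_url,
        "spark.openlineage.namespace": namespace,
    }


def as_submit_args(conf: dict[str, str]) -> list[str]:
    """Render the properties as --conf arguments for spark-submit."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical values; adjust to your Marquez instance and naming scheme.
    conf = openlineage_spark_conf("http://localhost:5000", "spark-jobs")
    print(" ".join(as_submit_args(conf)))
```

The namespace set here is the job namespace that the scanner's namespaces property filters on.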
dbt
For dbt projects, OpenLineage events can be collected via the dbt scanner integration. See dbt Lineage Scanner for configuration details.
Metadata extraction details
Each OpenLineage event represents one execution of a data transformation job and documents how input datasets were transformed into output datasets. The scanner processes the following from each completed event:
- Events: Only completed events (eventType: COMPLETE) are scanned.
- Jobs: The job associated with each event, including its name and namespace.
- Job inputs and outputs: Datasets used as inputs and outputs for each job, including associated data sources and attributes when available.
- Symlinks: Alternative dataset identifiers (for example, a Hive table that maps to an S3 path) that improve the accuracy of lineage mapping.
- Lineage: Source-to-target dataset paths, including attribute-level mappings.
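To see where these pieces live in an event, the sketch below pulls the job, the input and output datasets, and any symlink identifiers out of a single event. It assumes the standard OpenLineage layout, where alternative identifiers sit in a dataset's symlinks facet under identifiers. The function names are hypothetical, and this is not the scanner's own extraction logic.

```python
from typing import Any


def dataset_key(dataset: dict[str, Any]) -> tuple[str, str]:
    """Identify a dataset by its (namespace, name) pair."""
    return dataset["namespace"], dataset["name"]


def symlink_keys(dataset: dict[str, Any]) -> list[tuple[str, str]]:
    """Return alternative (namespace, name) identifiers from the symlinks facet."""
    identifiers = dataset.get("facets", {}).get("symlinks", {}).get("identifiers", [])
    return [(item["namespace"], item["name"]) for item in identifiers]


def summarize_event(event: dict[str, Any]) -> dict[str, Any]:
    """Collect the job, inputs, outputs, and symlinks of one event."""
    inputs = event.get("inputs", [])
    outputs = event.get("outputs", [])
    symlinks = {}
    for dataset in inputs + outputs:
        alternatives = symlink_keys(dataset)
        if alternatives:
            symlinks[dataset_key(dataset)] = alternatives
    return {
        "job": (event["job"]["namespace"], event["job"]["name"]),
        "inputs": [dataset_key(d) for d in inputs],
        "outputs": [dataset_key(d) for d in outputs],
        "symlinks": symlinks,
    }
```

In the Hive example from the list above, the S3 path would be the dataset key and the Hive table would appear among its symlinks.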
Limitations
Events without inputs or outputs
A lineage graph is only generated for events where both inputs and outputs arrays are non-empty.
Events with empty arrays are skipped and produce no lineage.
To learn more about the OpenLineage object model, see OpenLineage Object Model.
The following is an example of an event that does not produce any lineage (note the empty inputs and outputs):
{
  "eventTime": "2025-03-03T11:00:00.000000Z",
  "eventType": "COMPLETE",
  "job": {
    "name": "some-job-name",
    "namespace": "some-namespace"
  },
  "run": {
    "runId": "01950527-7c5c-7b19-988b-4fe86c93295c"
  },
  "inputs": [],
  "outputs": []
}
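The rule can be expressed as a small predicate, which is handy for checking collected events before a scan. This is a sketch of the documented behavior, not the scanner's code, and the function name is hypothetical.

```python
def produces_lineage(event: dict) -> bool:
    """Return True if an event is completed and has both inputs and outputs."""
    return (
        event.get("eventType") == "COMPLETE"
        and bool(event.get("inputs"))
        and bool(event.get("outputs"))
    )
```

Applied to the event above, the predicate returns False because both arrays are empty.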
Attribute-level transformation details
Lineage between attributes is captured even for complex transformations, but the details of the transformation logic are not preserved.
All attribute mappings are reported as direct mappings (type DIRECT, subtype IDENTITY), regardless of whether the underlying transformation was a CASE WHEN, an aggregation, or a simple column rename.
The transformation object in such cases looks as follows:
"transformations": [
  {
    "description": "",
    "masking": false,
    "subtype": "IDENTITY",
    "type": "DIRECT"
  }
]
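To inspect attribute-level mappings yourself, the sketch below lists source-to-target attribute pairs from an output dataset. It assumes the transformations array sits where the OpenLineage column lineage facet places it, under facets.columnLineage.fields.<target>.inputFields. The function name is hypothetical.

```python
def attribute_mappings(output_dataset: dict) -> list[tuple[str, str, list[str]]]:
    """Return (source attribute, target attribute, transformation subtypes) triples."""
    fields = output_dataset.get("facets", {}).get("columnLineage", {}).get("fields", {})
    mappings = []
    for target, detail in fields.items():
        for source in detail.get("inputFields", []):
            source_attr = f'{source["namespace"]}/{source["name"]}.{source["field"]}'
            subtypes = [t.get("subtype") for t in source.get("transformations", [])]
            mappings.append((source_attr, target, subtypes))
    return mappings
```

Because of the limitation described above, every subtype in the result is expected to be IDENTITY, even when the target attribute was computed by an aggregation or a conditional expression.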