OpenLineage Lineage Scanner
OpenLineage is an open-source framework for collecting and analyzing data lineage. It tracks metadata about scheduled job runs and captures how data is transformed from the inputs to the outputs of each run.
OpenLineage integrates with various data processing frameworks, such as Apache Airflow, Apache Spark, and dbt, to capture lineage information. For more information, see the official OpenLineage documentation.
Supported OpenLineage source technologies
The scanner supports processing the following file formats located in S3 within Amazon Web Services (AWS):
- CSV
- Microsoft Excel
- Parquet
- Iceberg
If the scanner encounters an unsupported file format, the lineage information is not captured. This support can be extended on demand by providing a metadata scan to the Ataccama Lineage team. In cases where the source technology includes complex path definitions (such as paths similar to S3 file paths), providing example paths might also be necessary.
Limitations
A valid lineage graph can only be generated for those OpenLineage events where the `inputs` and `outputs` arrays are not empty.
To learn more about the OpenLineage object model, see OpenLineage Object Model.
Here is an example of an OpenLineage event that does not produce any lineage:
```json
{
  "eventTime": "2025-03-03T11:00:00.000000Z",
  "eventType": "COMPLETE",
  "job": {
    "name": "some-job-name",
    "namespace": "some-namespace"
  },
  "run": {
    "runId": "01950527-7c5c-7b19-988b-4fe86c93295c"
  },
  "inputs": [],
  "outputs": []
}
```
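The limitation above can be sketched as a simple check. Assuming events are already parsed into Python dictionaries, an event contributes to the lineage graph only when it is completed and both arrays are non-empty (the function name below is illustrative, not part of the scanner):

```python
import json

def produces_lineage(event: dict) -> bool:
    """Return True if an OpenLineage event can contribute to a lineage graph.

    Per the limitation above, an event is only useful when both its
    `inputs` and `outputs` arrays are non-empty; the scanner also only
    processes completed events.
    """
    return (
        event.get("eventType") == "COMPLETE"
        and bool(event.get("inputs"))
        and bool(event.get("outputs"))
    )

# The example event above is COMPLETE but has empty inputs/outputs,
# so no lineage is produced.
empty_event = json.loads("""
{
  "eventTime": "2025-03-03T11:00:00.000000Z",
  "eventType": "COMPLETE",
  "job": {"name": "some-job-name", "namespace": "some-namespace"},
  "run": {"runId": "01950527-7c5c-7b19-988b-4fe86c93295c"},
  "inputs": [],
  "outputs": []
}
""")
print(produces_lineage(empty_event))  # False
```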
In addition, the details of transformations between attributes are not processed; only direct mappings between attributes are reported. Attribute-level lineage is still captured even for more complex operations, but the details of the transformations are omitted.
The following is an example of a simple transformation between attributes:
```json
"transformations": [
  {
    "description": "",
    "masking": false,
    "subtype": "IDENTITY",
    "type": "DIRECT"
  }
]
```
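For context, such transformation entries appear inside the OpenLineage column-level lineage dataset facet, attached to an output dataset's fields. A minimal sketch of the surrounding structure (the dataset, namespace, and field names are illustrative assumptions):

```json
"columnLineage": {
  "fields": {
    "full_name": {
      "inputFields": [
        {
          "namespace": "s3://some-bucket",
          "name": "customers.csv",
          "field": "name",
          "transformations": [
            {"type": "DIRECT", "subtype": "IDENTITY", "description": "", "masking": false}
          ]
        }
      ]
    }
  }
}
```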
Apache Airflow configuration
To enable emitting OpenLineage events, follow the official guidelines for Airflow integration: Apache Airflow Provider for OpenLineage.
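As a minimal sketch, once the provider package is installed, the Airflow integration can typically be pointed at a collector through the transport configuration; the URL and endpoint below are assumptions for a local OpenLineage-compatible endpoint, so substitute your own:

```shell
# Illustrative transport configuration for the OpenLineage Airflow provider,
# supplied as an Airflow configuration environment variable.
export AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}'
```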
Apache Spark configuration
Configuration details vary depending on the Apache Spark environment:
- AWS Glue - See OpenLineage AWS Glue Quickstart.
- Databricks - See OpenLineage Databricks Quickstart.
- Local Spark setup - See OpenLineage Local Spark Quickstart.
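For a local Spark setup, enabling event emission generally amounts to registering the OpenLineage listener and pointing it at a collector. A sketch of the relevant `spark-submit` flags follows; the package version, collector URL, and namespace are assumptions to adjust for your environment:

```shell
# Illustrative flags enabling the OpenLineage Spark listener.
spark-submit \
  --packages io.openlineage:openlineage-spark_2.12:1.9.1 \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://localhost:5000 \
  --conf spark.openlineage.namespace=some-namespace \
  my_job.py
```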
Metadata extraction details
The OpenLineage scanner extracts the following metadata from OpenLineage events. Each OpenLineage event corresponds to an individual execution of a data transformation job and documents precisely how input datasets are transformed into output datasets.
- Events: Detailed records of events during data workflows. Only completed events are scanned.
- Jobs: Details of the job each event is associated with, extracted for every processed event.
- Job inputs and outputs: Details of datasets used as inputs and outputs for each job, including associated data sources and attributes, when available.
- Symlinks: Additional dataset-specific metadata that enhances the accuracy of lineage mapping.
- Lineage: Comprehensive lineage paths linking source datasets to target datasets, including attribute-level transformations.
OpenLineage permissions and security
No special permissions are required to run the OpenLineage scanner.
Scanner configuration
All fields marked with an asterisk (*) are mandatory.

Property | Description |
---|---|
`name` | Unique name for the scanner job. |
`sourceType` | Specifies the source type to be scanned. Must be `OPEN_LINEAGE`. |
`description` | A human-readable description of the scan. |
`apiUrl` | REST API endpoint for data retrieval. |
`namespaces` | List of namespaces of jobs to be scanned. If not specified, all namespaces are scanned. |
`jobsSince` | Datetime from which jobs should be scanned. Use together with `jobsUntil` to limit the scanned time range. |
`jobsUntil` | Datetime until which jobs should be scanned. Use together with `jobsSince` to limit the scanned time range. |

The following is an example scanner configuration:
```json
{
  "scannerConfigs": [
    {
      "name": "Apache Spark processing",
      "sourceType": "OPEN_LINEAGE",
      "description": "Apache Spark processing using OpenLineage",
      "apiUrl": "http://localhost:5000/api/v1",
      "namespaces": [],
      "jobsSince": "2025-02-01T15:50:12.301Z",
      "jobsUntil": "2025-02-16T15:50:12.301Z"
    }
  ]
}
```