OpenLineage Lineage Scanner

OpenLineage is an open-source platform for collecting and analyzing data lineage. The framework tracks metadata of scheduled events and captures how data is transformed from the sources to outputs of such events.

OpenLineage integrates with various data processing frameworks, such as Apache Airflow, Apache Spark, or dbt, to capture lineage information. For more information, see the official OpenLineage documentation.

Supported OpenLineage source technologies

The scanner can process the following file formats stored in Amazon S3 (AWS):

CSV
Microsoft Excel
Parquet
Iceberg

If the scanner encounters an unsupported file format, the lineage information will not be captured. This support can be extended on demand by providing a metadata scan to the Ataccama Lineage team.

In cases where the source technology includes complex path definitions (such as paths similar to S3 file paths), providing example paths might also be necessary.

Limitations

A valid lineage graph can be generated only for OpenLineage events whose inputs and outputs arrays are both non-empty. To learn more about the OpenLineage object model, see OpenLineage Object Model.

Here is an example of an OpenLineage event that does not produce any lineage:

{
   "eventTime": "2025-03-03T11:00:00.000000Z",
   "eventType": "COMPLETE",
   "job": {
      "name": "some-job-name",
      "namespace": "some-namespace"
   },
   "run": {
      "runId": "01950527-7c5c-7b19-988b-4fe86c93295c"
   },
   "inputs": [],
   "outputs": []
}
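
The rule above can be checked before events are sent for scanning. The following is a minimal sketch, not part of the scanner: the function name produces_lineage and the dataset names in linked_event are hypothetical and used only for illustration.

```python
from typing import Any, Mapping


def produces_lineage(event: Mapping[str, Any]) -> bool:
    """Return True if the event has at least one input and one output dataset."""
    inputs = event.get("inputs") or []
    outputs = event.get("outputs") or []
    return len(inputs) > 0 and len(outputs) > 0


# The event from the example above: both arrays are empty.
empty_event = {
    "eventTime": "2025-03-03T11:00:00.000000Z",
    "eventType": "COMPLETE",
    "job": {"name": "some-job-name", "namespace": "some-namespace"},
    "run": {"runId": "01950527-7c5c-7b19-988b-4fe86c93295c"},
    "inputs": [],
    "outputs": [],
}

# Same event with hypothetical datasets attached (illustrative names only).
linked_event = {
    **empty_event,
    "inputs": [{"namespace": "s3://example-bucket", "name": "raw/orders.csv"}],
    "outputs": [{"namespace": "s3://example-bucket", "name": "curated/orders.parquet"}],
}

print(produces_lineage(empty_event))   # False
print(produces_lineage(linked_event))  # True
```

An event that is missing the inputs or outputs key entirely is treated the same way as one with an empty array.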

In addition, the details of transformations between attributes are not processed; only a direct mapping between attributes is reported. Lineage between attributes is still captured for more complex operations, but the details of those transformations are omitted.

The following is an example of a simple transformation between the attributes:

"transformations": [
   {
      "description": "",
      "masking": false,
      "subtype": "IDENTITY",
      "type": "DIRECT"
   }
]
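
To illustrate what reporting only a direct mapping means, here is a hedged sketch that reduces a column lineage facet to plain source-to-target attribute pairs and skips the transformations list. It assumes the facet layout of the OpenLineage column lineage dataset facet (facets.columnLineage.fields.<attribute>.inputFields); the function name direct_mappings and the dataset names are hypothetical, and this is not the scanner's actual implementation.

```python
from typing import Any, Dict, List, Tuple


def direct_mappings(output_dataset: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Collect (source attribute, target attribute) pairs, ignoring transformation details."""
    fields = (
        output_dataset.get("facets", {})
        .get("columnLineage", {})
        .get("fields", {})
    )
    target = output_dataset["name"]
    pairs = []
    for target_field, spec in fields.items():
        for source in spec.get("inputFields", []):
            # The "transformations" list on each input field is deliberately not read.
            pairs.append(
                (f'{source["name"]}.{source["field"]}', f"{target}.{target_field}")
            )
    return pairs


# Hypothetical output dataset (illustrative names only).
output_dataset = {
    "namespace": "s3://example-bucket",
    "name": "curated/orders.parquet",
    "facets": {
        "columnLineage": {
            "fields": {
                "order_id": {
                    "inputFields": [
                        {
                            "namespace": "s3://example-bucket",
                            "name": "raw/orders.csv",
                            "field": "id",
                            "transformations": [
                                {
                                    "description": "",
                                    "masking": False,
                                    "subtype": "IDENTITY",
                                    "type": "DIRECT",
                                }
                            ],
                        }
                    ]
                }
            }
        }
    },
}

print(direct_mappings(output_dataset))
# [('raw/orders.csv.id', 'curated/orders.parquet.order_id')]
```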

Apache Airflow configuration

To enable emitting OpenLineage events, follow the official guidelines for Airflow integration: Apache Airflow Provider for OpenLineage.

Apache Spark configuration

Configuration details vary depending on the Apache Spark environment:

AWS Glue - See OpenLineage AWS Glue Quickstart.
Databricks - See OpenLineage Databricks Quickstart.
Local Spark setup - See OpenLineage Local Spark Quickstart.

Metadata extraction details

The OpenLineage scanner extracts the following metadata from OpenLineage events. Each OpenLineage event corresponds to an individual execution of a data transformation job and documents how input datasets are transformed into output datasets.

Events: Detailed records of events during data workflows. Only completed events are scanned.
Jobs: Details of the jobs the event is associated with. This is extracted for each event processed.
Job inputs and outputs: Details of datasets used as inputs and outputs for each job, including associated data sources and attributes, when available.
Symlinks: Additional dataset-specific metadata that enhances the accuracy of lineage mapping.
Lineage: Comprehensive lineage paths linking source datasets to target datasets, including attribute-level mappings.

OpenLineage permissions and security

No special permissions are required to run the OpenLineage scanner.

Scanner configuration

All fields marked with an asterisk (*) are mandatory.

Property	Description
`name*`	Unique name for the scanner job.
`sourceType*`	Specifies the source type to be scanned. Must be `OPEN_LINEAGE`.
`description*`	A human-readable description of the scan.
`apiUrl*`	REST API endpoint for data retrieval.
`namespaces`	List of namespaces of jobs to be scanned. If not specified, all namespaces are scanned.
`jobsSince`	Datetime from which jobs should be scanned. Example: `2024-09-01T00:00:00Z`. Use together with `jobsUntil` to limit scanning to a specific time window. If not specified, all available jobs are scanned.
`jobsUntil`	Datetime until which jobs should be scanned. Example: `2024-09-30T00:00:00Z`. Use together with `jobsSince` to limit scanning to a specific time window. If not specified, all available jobs are scanned.

OpenLineage scanner example configuration

{
   "scannerConfigs": [
      {
         "name": "Apache Spark processing",
         "sourceType": "OPEN_LINEAGE",
         "description": "Apache Spark processing using OpenLineage",
         "apiUrl": "http://localhost:5000/api/v1",
         "namespaces": [],
         "jobsSince": "2025-02-01T15:50:12.301Z",
         "jobsUntil": "2025-02-16T15:50:12.301Z"
      }
   ]
}
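
Before submitting a configuration, it can be useful to check it against the property descriptions above. The following is a minimal sketch under stated assumptions: the helper names parse_utc and validate_scanner_config are hypothetical, and the check that jobsSince is not later than jobsUntil is a sanity check added here, not a documented scanner rule.

```python
from datetime import datetime
from typing import Any, Dict, List

# Properties marked with an asterisk (*) in the table above.
MANDATORY = ("name", "sourceType", "description", "apiUrl")


def parse_utc(value: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validate_scanner_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one scannerConfigs entry."""
    problems = []
    for key in MANDATORY:
        if not cfg.get(key):
            problems.append(f"missing mandatory property: {key}")
    if cfg.get("sourceType") not in (None, "", "OPEN_LINEAGE"):
        problems.append("sourceType must be OPEN_LINEAGE")

    window = {}
    for key in ("jobsSince", "jobsUntil"):
        if cfg.get(key):
            try:
                window[key] = parse_utc(cfg[key])
            except ValueError:
                problems.append(f"{key} is not a valid ISO 8601 datetime: {cfg[key]!r}")
    if len(window) == 2:
        try:
            if window["jobsSince"] > window["jobsUntil"]:
                problems.append("jobsSince is later than jobsUntil")
        except TypeError:
            # Raised when only one of the two values carries a UTC offset.
            problems.append("jobsSince and jobsUntil must both include a time zone")
    return problems


config = {
    "name": "Apache Spark processing",
    "sourceType": "OPEN_LINEAGE",
    "description": "Apache Spark processing using OpenLineage",
    "apiUrl": "http://localhost:5000/api/v1",
    "namespaces": [],
    "jobsSince": "2025-02-01T15:50:12.301Z",
    "jobsUntil": "2025-02-16T15:50:12.301Z",
}

print(validate_scanner_config(config))  # []
```

An empty result means that no problems were found by this sketch; it does not guarantee that the scanner accepts the configuration.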
