Open Lineage Scanner
Overview
OpenLineage is an open platform for the collection and analysis of data lineage. The framework tracks metadata about scheduled events and captures how data is transformed from the sources to the outputs of those events. OpenLineage integrates with various data processing frameworks, such as Apache Airflow, Apache Spark, or dbt, to capture lineage information. For more information, see the official OpenLineage documentation.
Apache Airflow configuration
To enable emitting OpenLineage events, follow the official Airflow integration guidelines: Apache Airflow Provider for OpenLineage.
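As an illustrative sketch only (the official provider guidelines remain authoritative), the Airflow provider is typically pointed at an OpenLineage-compatible HTTP endpoint by supplying a transport configuration, for example via the AIRFLOW__OPENLINEAGE__TRANSPORT environment variable or the openlineage section of airflow.cfg; the host and port below are placeholders:
{
  "type": "http",
  "url": "http://<openlineage-backend-host>:5000",
  "endpoint": "api/v1/lineage"
}
The scanner's apiUrl (see below) would then usually point at the REST API of the same backend, so that the emitted events can be retrieved during the scan.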
Metadata Extraction Details
We support the extraction of metadata from OpenLineage events and process the following metadata:
- Events: Detailed records of events during data workflows. We scan only completed events (a trimmed example event is shown after this list).
- Jobs: For each event, we scan the details of the job the event is associated with.
- Inputs and outputs of each job: Details of datasets used as inputs and outputs for each job, including associated data sources and attributes, when available.
- Symlinks: Additional dataset-specific metadata that enhances the accuracy of lineage mapping.
- Lineage: Comprehensive lineage paths linking source datasets to target datasets, including attribute-level transformations.
- Purpose: Each OpenLineage event corresponds to an individual execution of a data transformation job, documenting precisely how input datasets are transformed into output datasets.
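For illustration, a trimmed COMPLETE event of the kind the scanner processes might look as follows; the namespace, job, dataset, and field names are hypothetical, and facet boilerplate (such as _producer and _schemaURL) is omitted:
{
  "eventType": "COMPLETE",
  "eventTime": "2025-02-01T15:50:12.301Z",
  "run": { "runId": "01950527-7c5c-7b19-988b-4fe86c93295c" },
  "job": { "namespace": "spark_app", "name": "daily_orders_aggregation" },
  "inputs": [
    {
      "namespace": "s3://example-bucket",
      "name": "raw/orders.parquet",
      "facets": {
        "schema": {
          "fields": [
            { "name": "order_id", "type": "long" },
            { "name": "amount", "type": "double" }
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "s3://example-bucket",
      "name": "curated/daily_order_totals.parquet"
    }
  ]
}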
Scanner Configuration
The following configuration properties are available for the scanner:
Property | Description and Example Values |
---|---|
name* | Unique name for the scanner job. Example: "OpenLineageScan" |
sourceType* | Specifies the source type to be scanned. Must always be OPEN_LINEAGE. |
description | Human-readable description of the scan. Example: "Scan OPEN LINEAGE jobs" |
apiUrl* | REST API endpoint for data retrieval. Example: "http://<api-url>" |
namespaces | List of namespaces of jobs to be scanned. |
jobsSince | Datetime from which jobs should be scanned. Example: "2024-09-01T00:00:00Z" |
jobsUntil | Datetime until which jobs should be scanned. Example: "2024-09-30T00:00:00Z" |
Configuration Notes
Mandatory properties: name, sourceType, and apiUrl. If namespaces is omitted, all namespaces will be scanned. If jobsSince and jobsUntil are omitted, the scanner will include all available jobs; specifying these properties restricts scanning to the defined time window.
Example Configuration
Below is an example configuration for the OpenLineage Scanner:
{
"scannerConfigs": [
{
"name": "Apache Spark processing",
"sourceType": "OPEN_LINEAGE",
"description": "Apache Spark prosessing using OpenLineage",
"apiUrl": "http://localhost:5000/api/v1",
"namespaces" : [
],
"jobsSince" : "2025-02-01T15:50:12.301Z",
"jobsUntil" : "2025-02-16T15:50:12.301Z"
}
]
}
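As a variation, a configuration that restricts the scan to specific namespaces (the names below are hypothetical) and omits the time window, so that all available jobs in those namespaces are included, might look like this:
{
  "scannerConfigs": [
    {
      "name": "dbt production jobs",
      "sourceType": "OPEN_LINEAGE",
      "description": "Scan dbt jobs from production namespaces",
      "apiUrl": "http://localhost:5000/api/v1",
      "namespaces": [ "dbt_prod", "dbt_staging" ]
    }
  ]
}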
Supported Open Lineage Source Technologies
The scanner supports processing the following file formats located in S3 within AWS:
- CSV (comma-separated values files)
- Excel files
- Parquet
- Iceberg
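As an illustration, datasets stored in S3 typically appear in OpenLineage events with the bucket as the dataset namespace and the object key as the dataset name; the bucket, path, and table names below are hypothetical, and the symlinks facet is shown in trimmed form:
"inputs": [
  {
    "namespace": "s3://example-bucket",
    "name": "warehouse/sales/orders/data/part-00000.parquet",
    "facets": {
      "symlinks": {
        "identifiers": [
          {
            "namespace": "s3://example-bucket/warehouse",
            "name": "sales.orders",
            "type": "TABLE"
          }
        ]
      }
    }
  }
]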
Limitations of Open Lineage Scanning
- We can only generate a valid lineage graph for OpenLineage events whose "inputs" and "outputs" arrays are non-empty. More information on the OpenLineage object model can be found at openlineage.io/docs/next/spec/object-model/. Example of an OpenLineage event that does not produce any lineage:
{ "eventTime": "2025-03-03T11:00:00.000000Z", "eventType": "COMPLETE", "job": { "name": "some-job-name", "namespace": "some-namespace" }, "run": { "runId": "01950527-7c5c-7b19-988b-4fe86c93295c" } "inputs": [], "outputs": [], }
- Unsupported source technologies: If the scanner encounters an unsupported source technology, it will be unable to capture lineage information. To add support for new source technologies, our team will require a metadata scan. In cases where the source technology includes complex path definitions (such as paths similar to S3 file paths), providing example paths may also be necessary.
- We do not process the details of transformations between attributes and always report just a direct mapping between them. Lineage between attributes is captured even for more complicated operations, but details about the transformations are omitted.
Example of a simple transformation between attributes:
"transformations": [ { "description": "", "masking": false, "subtype": "IDENTITY", "type": "DIRECT" } ]