Open Lineage Scanner

Overview

OpenLineage is an open platform for collecting and analyzing data lineage. The framework tracks metadata about scheduled events and captures how data is transformed from the sources to the outputs of those events. OpenLineage integrates with various data processing frameworks, such as Apache Airflow, Apache Spark, and dbt, to capture lineage information. For more information, see the official OpenLineage documentation.

Apache Airflow configuration

To enable emitting OpenLineage events, follow the official Airflow integration guidelines: Apache Airflow Provider for OpenLineage.
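As a rough illustration, the Airflow provider can take its transport settings from environment variables that map to the [openlineage] configuration section. The sketch below only assembles those values. The URL, endpoint path, and namespace are placeholders chosen for this example and are not taken from this page, and the helper name airflow_openlineage_env is hypothetical.

```python
import json


def airflow_openlineage_env(url: str, namespace: str,
                            endpoint: str = "api/v1/lineage") -> dict:
    """Build environment variables for the Airflow OpenLineage provider.

    All argument values are assumptions for illustration; use the address
    of the backend that actually receives your OpenLineage events.
    """
    transport = {"type": "http", "url": url, "endpoint": endpoint}
    return {
        # JSON-encoded transport definition read by the provider
        "AIRFLOW__OPENLINEAGE__TRANSPORT": json.dumps(transport),
        # Namespace under which Airflow jobs are reported
        "AIRFLOW__OPENLINEAGE__NAMESPACE": namespace,
    }


if __name__ == "__main__":
    env = airflow_openlineage_env("http://localhost:5000", "my-airflow")
    for key, value in env.items():
        print(f"export {key}='{value}'")
```

The namespace chosen here is the value that would later be listed in the scanner's namespaces property.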

Apache Spark Configuration

Configuration details vary depending on the Apache Spark environment:
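One generic arrangement, shown only as a sketch, passes the OpenLineage listener and an HTTP transport as Spark properties. The URL and namespace below are placeholders, the helper names are hypothetical, and the OpenLineage Spark integration jar must also be available to the job (its coordinates and version are not shown here). Managed Spark environments may require a different mechanism.

```python
def spark_openlineage_conf(url: str, namespace: str) -> dict:
    """Return Spark properties that enable the OpenLineage listener.

    The url and namespace values are assumptions for illustration.
    """
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": url,
        "spark.openlineage.namespace": namespace,
    }


def as_submit_args(conf: dict) -> list:
    """Render the properties as repeated --conf arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = spark_openlineage_conf("http://localhost:5000", "my-spark")
    print(" ".join(as_submit_args(conf)))
```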

Metadata Extraction Details

We support the extraction of metadata from OpenLineage events and we process the following metadata:

  • Events: Detailed records of events during data workflows. We scan only completed events.

  • Jobs: For each event, we scan the details of the job that the event is associated with.

  • Inputs and Outputs of each Job: Details of the datasets used as inputs and outputs for each job, including associated data sources and attributes, when available.

  • Symlinks: Additional dataset-specific metadata that enhances the accuracy of lineage mapping.

  • Lineage: Comprehensive lineage paths linking source datasets to target datasets, including attribute-level transformations.

    • Purpose: Each OpenLineage event corresponds to an individual execution of a data transformation job and documents how input datasets are transformed into output datasets.
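To make the list above concrete, the following sketch reads one OpenLineage run event and pulls out the same pieces: the job, the input and output datasets, their attributes, and any symlinks. It is not the scanner's implementation. The function name summarize_event is hypothetical, and the sketch assumes the event follows the OpenLineage object model, with attributes in the "schema" dataset facet and symlinks in the "symlinks" dataset facet.

```python
from typing import Any, Dict, Optional


def summarize_event(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Summarize a run event; return None unless the event is COMPLETE."""
    if event.get("eventType") != "COMPLETE":
        return None  # only completed events are considered

    def describe(dataset: Dict[str, Any]) -> Dict[str, Any]:
        facets = dataset.get("facets", {})
        fields = facets.get("schema", {}).get("fields", [])
        symlinks = facets.get("symlinks", {}).get("identifiers", [])
        return {
            "namespace": dataset["namespace"],
            "name": dataset["name"],
            # attribute names, when the schema facet is present
            "attributes": [field["name"] for field in fields],
            # alternative identifiers for the same dataset
            "symlinks": [(s["namespace"], s["name"]) for s in symlinks],
        }

    job = event["job"]
    return {
        "job": (job["namespace"], job["name"]),
        "inputs": [describe(d) for d in event.get("inputs", [])],
        "outputs": [describe(d) for d in event.get("outputs", [])],
    }
```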

Permissions and Security

No special permissions are required to run the OpenLineage scanner.

Scanner Configuration

The following configuration properties are available for the scanner:

| Property | Description and Example Values |
| --- | --- |
| name* | Unique name for the scanner job. Example: "OpenLineageScan" |
| sourceType* | Specifies the source type to be scanned. Must always be OPEN_LINEAGE. |
| description | Human-readable description of the scan. Example: "Scan OPEN LINEAGE jobs" |
| apiUrl* | REST API endpoint for data retrieval. Example: "http://<api-url>" |
| namespaces | List of namespaces of jobs to be scanned. |
| jobsSince | Datetime from which jobs should be scanned. Example: "2024-09-01T00:00:00Z" |
| jobsUntil | Datetime until which jobs should be scanned. Example: "2024-09-30T00:00:00Z" |

Configuration Notes

Mandatory properties: name, sourceType, description, and apiUrl. If namespaces is omitted, all namespaces are scanned. If jobsSince and jobsUntil are omitted, the scanner includes all available jobs. Specifying these properties restricts scanning to the defined time window.
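The rules in these notes can be checked before a configuration is submitted. The sketch below is a hypothetical pre-flight check, not part of the scanner. The mandatory list follows the notes above, and the requirement that jobsSince precede jobsUntil is an assumption added for this example.

```python
from datetime import datetime

# Mandatory properties as listed in the configuration notes
MANDATORY = ("name", "sourceType", "description", "apiUrl")


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-09-01T00:00:00Z."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validate_scanner_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config looks valid."""
    problems = [f"missing mandatory property: {key}"
                for key in MANDATORY if not cfg.get(key)]
    if cfg.get("sourceType") not in (None, "OPEN_LINEAGE"):
        problems.append("sourceType must be OPEN_LINEAGE")
    since, until = cfg.get("jobsSince"), cfg.get("jobsUntil")
    # Assumption: a time window only makes sense if it is not empty
    if since and until and parse_ts(since) >= parse_ts(until):
        problems.append("jobsSince must be earlier than jobsUntil")
    return problems
```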

Example Configuration

Below is an example configuration for the OpenLineage Scanner:

{
  "scannerConfigs": [
    {
      "name": "Apache Spark processing",
      "sourceType": "OPEN_LINEAGE",
      "description": "Apache Spark processing using OpenLineage",
      "apiUrl": "http://localhost:5000/api/v1",
      "namespaces": [],
      "jobsSince": "2025-02-01T15:50:12.301Z",
      "jobsUntil": "2025-02-16T15:50:12.301Z"
    }
  ]
}
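When scans run on a schedule, the time window is often computed rather than typed by hand. The following sketch generates a configuration of the same shape for the last N days. The helper name rolling_window_config and the rolling-window approach are assumptions for illustration. The namespaces property is left out, which per the notes above means all namespaces are scanned.

```python
import json
from datetime import datetime, timedelta, timezone


def rolling_window_config(name, api_url, days, now=None):
    """Build a scanner configuration covering the last `days` days."""
    now = now or datetime.now(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # UTC timestamps, as in the property examples
    scanner = {
        "name": name,
        "sourceType": "OPEN_LINEAGE",
        "description": f"OpenLineage jobs from the last {days} days",
        "apiUrl": api_url,
        "jobsSince": (now - timedelta(days=days)).strftime(fmt),
        "jobsUntil": now.strftime(fmt),
    }
    return {"scannerConfigs": [scanner]}


if __name__ == "__main__":
    config = rolling_window_config(
        "Apache Spark processing", "http://localhost:5000/api/v1", 14)
    print(json.dumps(config, indent=2))
```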

Supported OpenLineage Source Technologies

The scanner supports processing the following file formats located in S3 within AWS:

  • CSV (comma-separated files)

  • Excel files

  • Parquet

  • Iceberg

Limitations of OpenLineage Scanning

  • We can generate a valid lineage graph only for OpenLineage events whose "inputs" and "outputs" arrays are both non-empty. For more information on the OpenLineage object model, see openlineage.io/docs/next/spec/object-model/. Example of an OpenLineage event that does not produce any lineage:

    {
      "eventTime": "2025-03-03T11:00:00.000000Z",
      "eventType": "COMPLETE",
      "job": {
        "name": "some-job-name",
        "namespace": "some-namespace"
      },
      "run": {
        "runId": "01950527-7c5c-7b19-988b-4fe86c93295c"
      },
      "inputs": [],
      "outputs": []
    }
  • Unsupported source technologies: If the scanner encounters an unsupported source technology, it will be unable to capture lineage information. To add support for new source technologies, our team will require a metadata scan. In cases where the source technology includes complex path definitions (such as paths similar to S3 file paths), providing example paths may also be necessary.

  • Transformation details between attributes are not processed; we always report a direct mapping between attributes. Lineage between attributes is captured even for more complicated operations, but the details of those transformations are omitted.

    Example of a simple transformation between the attributes:
    "transformations": [
        {
            "description": "",
            "masking": false,
            "subtype": "IDENTITY",
            "type": "DIRECT"
        }
    ]
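The first and last limitations can be sketched in a few lines. The functions below are hypothetical helpers, not the scanner's code. They assume the event follows the OpenLineage object model, with attribute-level lineage stored in the "columnLineage" facet of each output dataset. The first function reports whether an event can yield lineage at all. The second returns source-to-target attribute pairs and deliberately ignores the "transformations" entries, which mirrors the direct-mapping behavior described above.

```python
def produces_lineage(event: dict) -> bool:
    """True only when both the inputs and outputs arrays are non-empty."""
    return bool(event.get("inputs")) and bool(event.get("outputs"))


def direct_attribute_mappings(output_dataset: dict) -> list:
    """Return (source, target) attribute pairs for one output dataset.

    Transformation details (type, subtype, masking) are dropped, so an
    aggregation and an identity copy both appear as a plain mapping.
    """
    facets = output_dataset.get("facets", {})
    fields = facets.get("columnLineage", {}).get("fields", {})
    pairs = []
    for target, info in fields.items():
        for src in info.get("inputFields", []):
            source = f'{src["namespace"]}/{src["name"]}.{src["field"]}'
            pairs.append((source, f'{output_dataset["name"]}.{target}'))
    return pairs
```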
