OpenLineage Lineage Scanner

OpenLineage is an open-source standard for collecting and analyzing data lineage. The framework tracks metadata of scheduled events and captures how data is transformed from sources to outputs.

OpenLineage integrates with data processing frameworks such as Apache Airflow, Apache Spark, and dbt to capture lineage information. For more information, see the official OpenLineage documentation.

Supported file formats

When datasets referenced in OpenLineage events are stored in Amazon S3, the scanner supports the following file formats:

  • CSV

  • Microsoft Excel

  • Parquet

  • Iceberg

If the scanner encounters an unsupported file format, the lineage information for that dataset is not captured. Support for additional formats can be added on request.

For source technologies with complex path definitions (such as S3-style paths from non-S3 systems), you might also need to provide example paths.

Supported connectivity

The OpenLineage scanner supports two ways of reading OpenLineage events:

  1. Marquez API-based scanner: Reads events from a Marquez (or compatible OpenLineage) instance via REST API.

  2. Local file-based scanner: Reads OpenLineage event JSON files from a local directory. Available for standalone Docker installations only.

Marquez API-based scanner

The Marquez API-based scanner connects to a running Marquez instance and retrieves completed OpenLineage run events via its REST API.

If you do not have a Marquez instance, a standalone lineage scanner bundle that includes Marquez is available. See Marquez bundle for OpenLineage.

Scanner configuration

All fields marked with an asterisk (*) are mandatory.

Property Description

name*

Unique name for the scanner job.

sourceType*

Specifies the source type to be scanned. Must be OPEN_LINEAGE.

description*

A human-readable description of the scan.

apiUrl*

REST API base URL of the Marquez instance.

Mandatory if using your own Marquez instance.

If using the Marquez bundle for OpenLineage, the value defaults to @@ref:env:[MARQUEZ_API_URL] and can be skipped.

Example: localhost:5000/api/v1.

namespaces

Optional list of job namespaces to include. If omitted or left empty, all namespaces are scanned.

jobsSince

Lower bound for the event time filter (ISO-8601 format). Example: 2024-09-01T00:00:00Z.

Use together with jobsUntil to scan only a specific time window. If omitted, all available events are scanned.

jobsUntil

Upper bound for the event time filter (ISO-8601 format). Example: 2024-09-30T00:00:00Z.

Use together with jobsSince to scan only a specific time window. If omitted, all available events are scanned.

lineageFromPastNDays

Scans events from the past N days relative to the time the scan runs. Must be a positive integer. Example: 7 scans events from the last 7 days.

Use as a convenient alternative to jobsSince when you want a rolling time window. If jobsSince is also set, jobsSince takes precedence.
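The interaction between jobsSince, jobsUntil, and lineageFromPastNDays described above can be sketched in Python as follows. This is a minimal illustration of the documented precedence rule, not the scanner's actual implementation. The function names parse_iso and resolve_window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_iso(value):
    """Parse an ISO-8601 timestamp such as 2024-09-01T00:00:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def resolve_window(config, now=None):
    """Return (lower, upper) bounds; None means the bound is open."""
    now = now or datetime.now(timezone.utc)
    since = config.get("jobsSince")
    until = config.get("jobsUntil")
    days = config.get("lineageFromPastNDays")

    if since:
        # jobsSince takes precedence over lineageFromPastNDays.
        lower = parse_iso(since)
    elif days is not None:
        # Rolling window relative to the time the scan runs.
        lower = now - timedelta(days=days)
    else:
        lower = None

    upper = parse_iso(until) if until else None
    return lower, upper
```

With lineageFromPastNDays set to 7 and no jobsSince, the lower bound is seven days before the scan time. If jobsSince is also present, its value is used instead.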

Marquez API-based scanner example configuration
{
   "scannerConfigs": [
      {
         "name": "Apache Spark processing",
         "sourceType": "OPEN_LINEAGE",
         "description": "Scan Apache Spark jobs via OpenLineage",
         "namespaces": [],
         "lineageFromPastNDays": 7,
         "jobsUntil": "2025-02-16T00:00:00Z"
      }
   ]
}

The example omits the apiUrl field because the Marquez bundle installation provides a default value. The field is required only when you use your own Marquez instance.
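Before submitting a configuration, it can help to check the mandatory fields locally. The following Python sketch applies the rules from the property list above. The function check_scanner_config and its own_marquez flag are hypothetical helpers for illustration and are not part of the scanner.

```python
REQUIRED_FIELDS = ("name", "sourceType", "description")


def check_scanner_config(entry, own_marquez=False):
    """Return a list of problems found in one scannerConfigs entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"missing mandatory field: {field}")

    source_type = entry.get("sourceType")
    if source_type and source_type != "OPEN_LINEAGE":
        problems.append("sourceType must be OPEN_LINEAGE")

    # apiUrl has a default only in the Marquez bundle installation.
    if own_marquez and not entry.get("apiUrl"):
        problems.append("apiUrl is required when using your own Marquez instance")

    days = entry.get("lineageFromPastNDays")
    if days is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
            problems.append("lineageFromPastNDays must be a positive integer")
    return problems
```

An empty list means that no problems were found by these checks. The scanner itself might apply additional validation.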

Local file-based scanner

As an alternative to reading events from a live Marquez API, the scanner can read OpenLineage event files directly from a local directory. This is useful when you have already collected events as JSON files, or when network access to a Marquez instance is not available.

Before you begin

Before running the scan, place the OpenLineage event files into a directory under <install>/user-data/. For example, create a folder named open_lineage_dir:

<install>/user-data/open_lineage_dir/

<install> is the path where the standalone Docker scanner is installed. For installation details, see Standalone Lineage Scanner.

Alternatively, the scanner can read from any local path that is accessible to it, provided the necessary read permissions are in place.

Each file in the directory must be a valid JSON file containing exactly one OpenLineage RunEvent. The scanner does not check subdirectories.
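To verify a directory before scanning, a pre-check along the following lines might be useful. This Python sketch reads only the top level of the directory and reports files that are not valid JSON or do not contain a single JSON object. The check for the eventTime, run, and job keys is a heuristic based on the OpenLineage RunEvent structure, and the function name check_events_directory is hypothetical.

```python
import json
from pathlib import Path


def check_events_directory(directory):
    """Return {file name: problem} for files that would not qualify."""
    problems = {}
    # iterdir() lists only the top level, mirroring the note that
    # subdirectories are not checked.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems[path.name] = f"not valid JSON: {exc}"
            continue
        if not isinstance(event, dict):
            problems[path.name] = "expected a single JSON object (one RunEvent)"
        elif not all(key in event for key in ("eventTime", "run", "job")):
            problems[path.name] = "missing eventTime, run, or job"
    return problems
```

A file that contains a JSON array of several events is reported, because each file is expected to hold exactly one RunEvent.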

Scanner configuration

All fields marked with an asterisk (*) are mandatory.

Property Description

name*

Unique name for the scanner job.

sourceType*

Specifies the source type to be scanned. Must be OPEN_LINEAGE.

description*

A human-readable description of the scan.

eventsDirectory*

Path to the directory containing the OpenLineage event JSON files, as seen from inside the scanner container.

The <install>/user-data/ directory on the host is mapped to /opt/ataccama/lineage-scanning/user-data/ inside the container. For example, if you placed your files in <install>/user-data/open_lineage_dir, set this to /opt/ataccama/lineage-scanning/user-data/open_lineage_dir.

namespaces

Optional list of job namespaces to include. If omitted or left empty, all namespaces are scanned.

jobsSince

Lower bound for the event time filter (ISO-8601 format). Example: 2025-01-01T00:00:00Z.

Use together with jobsUntil to scan only a specific time window. If omitted, all events in the directory are scanned.

jobsUntil

Upper bound for the event time filter (ISO-8601 format). Example: 2025-01-31T00:00:00Z.

Use together with jobsSince to scan only a specific time window. If omitted, all events in the directory are scanned.

lineageFromPastNDays

Scans events from the past N days relative to the time the scan runs. Must be a positive integer. Example: 7 scans events from the last 7 days.

Use as a convenient alternative to jobsSince when you want a rolling time window. If jobsSince is also set, jobsSince takes precedence.

Local file-based scanner example configuration
{
   "scannerConfigs": [
      {
         "name": "OpenLineageLocal",
         "sourceType": "OPEN_LINEAGE",
         "description": "Scan OpenLineage events from local files",
         "eventsDirectory": "/opt/ataccama/lineage-scanning/user-data/open_lineage_dir",
         "namespaces": [],
         "lineageFromPastNDays": 7,
         "jobsUntil": null
      }
   ]
}
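The eventsDirectory value in this example is the container-side path of a host folder under <install>/user-data/. The translation can be sketched in Python as follows. The function name to_container_path and the install path /srv/scanner in the usage note are hypothetical, and the sketch assumes POSIX-style host paths.

```python
from pathlib import PurePosixPath

# Container-side mount point of <install>/user-data/, as documented above.
CONTAINER_USER_DATA = PurePosixPath("/opt/ataccama/lineage-scanning/user-data")


def to_container_path(host_path, install_dir):
    """Translate a host path under <install>/user-data/ to its container path."""
    host_user_data = PurePosixPath(install_dir) / "user-data"
    # relative_to raises ValueError if host_path is outside user-data.
    relative = PurePosixPath(host_path).relative_to(host_user_data)
    return str(CONTAINER_USER_DATA / relative)
```

For an installation in /srv/scanner, the host folder /srv/scanner/user-data/open_lineage_dir maps to /opt/ataccama/lineage-scanning/user-data/open_lineage_dir.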

Source technology configuration

To emit OpenLineage events from your data processing tools, configure the appropriate integration:

Apache Airflow

To enable emitting OpenLineage events from Airflow, follow the official guidelines: Apache Airflow Provider for OpenLineage.

Apache Spark

Configuration details vary depending on your Spark environment.

dbt

For dbt projects, OpenLineage events can be collected via the dbt scanner integration. See dbt Lineage Scanner for configuration details.

Metadata extraction details

Each OpenLineage event represents one execution of a data transformation job and documents how input datasets were transformed into output datasets. The scanner processes the following from each completed event:

  • Events: Only completed events (eventType: COMPLETE) are scanned.

  • Jobs: The job associated with each event, including its name and namespace.

  • Job inputs and outputs: Datasets used as inputs and outputs for each job, including associated data sources and attributes when available.

  • Symlinks: Alternative dataset identifiers (for example, a Hive table that maps to an S3 path) that improve the accuracy of lineage mapping.

  • Lineage: Source-to-target dataset paths, including attribute-level mappings.
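As an illustration of the symlinks item above, the following Python sketch collects the primary identifier of a dataset together with any alternative identifiers. It assumes that the identifiers are carried in the symlinks dataset facet defined by the OpenLineage specification (an identifiers list with namespace, name, and type). How the scanner resolves them internally is not described here. The function name dataset_identifiers and the sample values are hypothetical.

```python
def dataset_identifiers(dataset):
    """Return (namespace, name) pairs: the primary identifier, then symlinks."""
    identifiers = [(dataset["namespace"], dataset["name"])]
    symlinks = dataset.get("facets", {}).get("symlinks", {})
    for ident in symlinks.get("identifiers", []):
        identifiers.append((ident["namespace"], ident["name"]))
    return identifiers


# Hypothetical dataset: an S3 path that is also known as a Hive table.
sample = {
    "namespace": "s3://example-bucket",
    "name": "warehouse/orders",
    "facets": {
        "symlinks": {
            "identifiers": [
                {"namespace": "hive://metastore:9083", "name": "sales.orders", "type": "TABLE"}
            ]
        }
    },
}
print(dataset_identifiers(sample))
```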

Limitations

Events without inputs or outputs

A lineage graph is generated only for events in which both the inputs and outputs arrays are non-empty. Events in which either array is empty are skipped and produce no lineage. To learn more about the OpenLineage object model, see OpenLineage Object Model.

The following is an example of an event that does not produce any lineage (note the empty inputs and outputs):

{
   "eventTime": "2025-03-03T11:00:00.000000Z",
   "eventType": "COMPLETE",
   "job": {
      "name": "some-job-name",
      "namespace": "some-namespace"
   },
   "run": {
      "runId": "01950527-7c5c-7b19-988b-4fe86c93295c"
   },
   "inputs": [],
   "outputs": []
}
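To estimate in advance which events yield lineage, the two documented conditions (a COMPLETE event type, and non-empty inputs and outputs) can be combined into a simple predicate. This is a minimal Python sketch. The function name produces_lineage is hypothetical, and the scanner might apply further checks.

```python
def produces_lineage(event):
    """True if the event is COMPLETE and has both inputs and outputs."""
    return (
        event.get("eventType") == "COMPLETE"
        and bool(event.get("inputs"))
        and bool(event.get("outputs"))
    )
```

For the event shown above, the predicate returns False because both arrays are empty.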

Attribute-level transformation details

Lineage between attributes is captured even for complex transformations, but the details of the transformation logic are not preserved. All attribute mappings are reported as direct (type DIRECT, subtype IDENTITY), regardless of whether the underlying transformation was a CASE WHEN expression, an aggregation, or a simple column rename.

The transformation object in such cases looks as follows:

"transformations": [
   {
      "description": "",
      "masking": false,
      "subtype": "IDENTITY",
      "type": "DIRECT"
   }
]
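To show what remains once the transformation details are dropped, the following Python sketch flattens attribute-level lineage into plain source-to-target pairs. It assumes that the mappings come from the columnLineage facet on output datasets, as defined by the OpenLineage specification (fields, each with an inputFields list). Whether the scanner reads exactly this structure is an assumption. The function name attribute_mappings is hypothetical.

```python
def attribute_mappings(event):
    """Return ((src_namespace, src_name, src_field), (dst_namespace, dst_name, dst_field)) pairs."""
    mappings = []
    for output in event.get("outputs", []):
        facet = output.get("facets", {}).get("columnLineage", {})
        for target_field, detail in facet.get("fields", {}).items():
            for src in detail.get("inputFields", []):
                # The transformations list is ignored on purpose: only the
                # source-to-target pair is kept.
                mappings.append((
                    (src["namespace"], src["name"], src["field"]),
                    (output["namespace"], output["name"], target_field),
                ))
    return mappings
```

Each pair identifies one source attribute and one target attribute, independent of the transformation that connects them.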

Permissions and security

No special permissions are required to run the OpenLineage scanner.
