Configure AWS Glue with OpenLineage

This guide explains how to configure AWS Glue ETL jobs to send OpenLineage events to your Ataccama ONE orchestrator connection using the OpenLineage Spark listener. Once configured, your AWS Glue jobs emit pipeline lineage events — including dataset inputs, outputs, and job execution metadata — that appear in data observability.

Before you begin, ensure you have generated an API key and copied your endpoint URL.

Compatibility

AWS Glue version   Spark version   OpenLineage JAR
3.0                3.1             External JAR via --extra-jars
4.0                3.3             External JAR via --extra-jars
5.0                3.5             External JAR via --extra-jars and --user-jars-first=true

  • AWS Glue 3.0 and 4.0: You must provide the OpenLineage Spark JAR as an external dependency.

  • AWS Glue 5.0: The OpenLineage Spark listener is bundled in the AWS Glue runtime, but it may not include the latest features or fixes. For best results, use the external JAR approach even on AWS Glue 5.0 to ensure you have the latest OpenLineage version.

Upload the OpenLineage Spark JAR (for example, openlineage-spark_2.12-1.xx.0.jar) to an S3 bucket and reference it with the --extra-jars AWS Glue job argument. For AWS Glue 5.0, also set --user-jars-first=true to ensure the external JAR takes precedence over the built-in one. The public JAR is available on Maven Central.

Prerequisites

  • An AWS Glue ETL job (version 3.0, 4.0, or 5.0)

  • OpenLineage endpoint URL and API key from your ONE orchestrator connection — see Gather connection credentials

  • IAM permissions for your AWS Glue job role: S3 access, CloudWatch Logs access, and optionally Secrets Manager access if using it for API key management

  • OpenLineage Spark JAR uploaded to S3 (required for AWS Glue 3.0/4.0, recommended for AWS Glue 5.0)

    A custom Ataccama build of the OpenLineage Spark adapter is available on request. The publicly available JAR provides basic lineage tracking, but failed jobs are reported with a COMPLETE status. The Ataccama adapter correctly reports FAIL status with error details, enabling failure detection and alerting in ONE. Contact your Ataccama Customer Success Manager to obtain the custom JAR.
  • For VPC-bound jobs sending events to an external endpoint: a NAT Gateway or internet egress path

Configuration

Step 1: Obtain Ataccama ONE credentials

In Ataccama ONE, navigate to your orchestrator connection and copy:

  • Endpoint URL: the full URL in the format:

    https://<YOUR_INSTANCE>.ataccama.one/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage
  • API key: the authentication token for the connection

You need to split the endpoint URL into two parts for the Spark configuration:

Property             Value                             Example
transport.url        Base URL (without the API path)   https://<YOUR_INSTANCE>.ataccama.one
transport.endpoint   API path                          /gateway/openlineage/<CONNECTION_ID>/api/v1/lineage

The transport.url and transport.endpoint properties must be configured separately. The OpenLineage Spark listener does not accept the full URL in a single transport.url or transport.endpoint property.
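The split described above can be sketched in Python with the standard library. The function name split_endpoint_url and the host and connection ID in the example call are placeholders chosen for illustration, not values from a real instance.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_endpoint_url(full_url: str) -> Tuple[str, str]:
    """Split a full endpoint URL into (transport.url, transport.endpoint)."""
    parts = urlsplit(full_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not an absolute URL: {full_url!r}")
    # Scheme and host form the base URL; everything after the host is the API path.
    base_url = f"{parts.scheme}://{parts.netloc}"
    return base_url, parts.path


# Placeholder host and connection ID, for illustration only.
base_url, api_path = split_endpoint_url(
    "https://example.ataccama.one/gateway/openlineage/1234/api/v1/lineage"
)
print(base_url)  # https://example.ataccama.one
print(api_path)  # /gateway/openlineage/1234/api/v1/lineage
```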

Step 2: Configure OpenLineage Spark properties

There are two ways to configure the OpenLineage Spark listener: using --conf Spark properties or using an openlineage.yml configuration file. Both require the spark.extraListeners property to be set via --conf.

Option A: Spark properties via --conf

Add the following Spark configuration properties to your AWS Glue job using the --conf argument:

--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
--conf spark.openlineage.transport.type=http
--conf spark.openlineage.transport.url=https://<YOUR_INSTANCE>.ataccama.one
--conf spark.openlineage.transport.endpoint=/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage
--conf spark.openlineage.transport.auth.type=api_key
--conf spark.openlineage.transport.auth.apiKey=<YOUR_API_KEY>
--conf spark.openlineage.namespace=<YOUR_OPENLINEAGE_NAMESPACE>

Property                                  Description
spark.extraListeners                      Registers the OpenLineage Spark listener
spark.openlineage.transport.type          Transport protocol (http)
spark.openlineage.transport.url           Base URL of the Ataccama ONE connection endpoint
spark.openlineage.transport.endpoint      URL path of the Ataccama ONE connection endpoint
spark.openlineage.transport.auth.type     Authentication type (api_key)
spark.openlineage.transport.auth.apiKey   API key (the authentication token for the connection)
spark.openlineage.namespace               Logical namespace for grouping events (for example, aws_glue_production)

Option B: Configuration file (openlineage.yml)

As an alternative to --conf properties, you can provide an openlineage.yml configuration file. The OpenLineage Spark listener looks for this file at the path specified by the OPENLINEAGE_CONFIG environment variable, or in the default location (/tmp/openlineage.yml on AWS Glue).

# openlineage.yml
transport:
  type: http
  url: https://<YOUR_INSTANCE>.ataccama.one
  endpoint: /gateway/openlineage/<CONNECTION_ID>/api/v1/lineage
  auth:
    type: api_key
    apiKey: <YOUR_API_KEY>
namespace: <YOUR_OPENLINEAGE_NAMESPACE>

Even when using a configuration file, you must still set spark.extraListeners via --conf to register the listener.

Upload to S3 and download at runtime

Upload the configuration file to S3 and download it in your job script before initializing the Spark context.
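A minimal sketch of the download step, assuming the job script is written in Python and that boto3 is available in the job environment. The function name download_openlineage_config and the s3_client parameter are hypothetical. The destination is the default location named earlier in this guide.

```python
CONFIG_PATH = "/tmp/openlineage.yml"  # default location named in this guide


def download_openlineage_config(bucket, key, dest=CONFIG_PATH, s3_client=None):
    """Download the configuration file from S3 to the local path `dest`."""
    if s3_client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        s3_client = boto3.client("s3")
    # boto3 S3 client call: download_file(Bucket, Key, Filename)
    s3_client.download_file(bucket, key, dest)
    return dest
```

In the job script, the call would go before the Spark context is created, for example download_openlineage_config("<YOUR_BUCKET>", "config/openlineage.yml"), where the object key is a placeholder. The job role needs read access to that object.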

Generate at runtime

Generate the configuration file in the job script. This is useful when you need to inject secrets (for example, from AWS Secrets Manager) or build the configuration dynamically.

Because the API key never passes through --conf, this approach also avoids the special-character corruption described in Step 3.
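One way to generate the file is sketched below, assuming a Python job script. The function names are hypothetical. Values are quoted with json.dumps on the assumption that a JSON string is also a valid double-quoted YAML scalar, so characters such as = or # in the API key are written unchanged.

```python
import json

CONFIG_PATH = "/tmp/openlineage.yml"  # default location named in this guide


def render_openlineage_config(base_url, api_path, api_key, namespace):
    """Render the openlineage.yml content shown above as a string."""
    q = json.dumps  # produces a double-quoted, escaped string
    return (
        "transport:\n"
        "  type: http\n"
        f"  url: {q(base_url)}\n"
        f"  endpoint: {q(api_path)}\n"
        "  auth:\n"
        "    type: api_key\n"
        f"    apiKey: {q(api_key)}\n"
        f"namespace: {q(namespace)}\n"
    )


def write_openlineage_config(content, path=CONFIG_PATH):
    """Write the rendered configuration to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path
```

As with the download approach, the file must be written before the Spark context is initialized.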

External JAR configuration

For AWS Glue 3.0 and 4.0, add the external JAR:

--extra-jars s3://<YOUR_BUCKET>/jars/openlineage-spark_2.12-1.xx.0.jar

For AWS Glue 5.0, we recommend using the external JAR as well to ensure you have the latest OpenLineage features and fixes. Add both arguments:

--extra-jars s3://<YOUR_BUCKET>/jars/openlineage-spark_2.12-1.xx.0.jar
--user-jars-first=true

Step 3: Handle API key authentication

AWS Glue’s argument parser corrupts special characters (such as = in Base64-encoded strings) when passed through --conf. If your API key contains such characters, the --conf approach may result in HTTP 403 errors.

Store the API key in AWS Secrets Manager and fetch it at runtime in your AWS Glue job script. This avoids the --conf parsing issue entirely.

If your AWS Glue job runs inside a VPC (for example, it uses a JDBC connection), you need a Secrets Manager VPC endpoint for the job to reach Secrets Manager.
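Fetching the key at runtime might look like the following sketch. It assumes a Python job script, boto3 in the job environment, and a secret stored as a plain-text string rather than a JSON document. The function name fetch_api_key and the client parameter are hypothetical.

```python
def fetch_api_key(secret_id, client=None):
    """Return the API key stored as a plain-text secret in AWS Secrets Manager."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret was created as a plain string, not as key/value JSON.
    return response["SecretString"]
```

The returned value can then be written into openlineage.yml as described under Generate at runtime. The job role needs permission to read the secret.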

Step 4: Configure AWS Glue job arguments

The --conf properties are passed as a single string value in the AWS Glue job’s --conf argument. All --conf entries are concatenated with spaces into one string and assigned to the --conf key in the job’s default arguments.

In the AWS Console, add these as Job parameters under Job details > Advanced properties.
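The single-string form can be sketched as follows. This assumes the commonly used convention in which the argument key supplies the first --conf and each further property is prefixed with --conf inside the value. Verify the result against your own job before relying on it. The function name build_conf_value is hypothetical, and the placeholders are the ones used earlier in this guide.

```python
def build_conf_value(properties):
    """Join Spark properties into the single string used as the value of --conf."""
    pairs = [f"{name}={value}" for name, value in properties.items()]
    # The first pair has no prefix because the argument key itself is "--conf".
    return " --conf ".join(pairs)


spark_properties = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "https://<YOUR_INSTANCE>.ataccama.one",
    "spark.openlineage.transport.endpoint": "/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage",
    "spark.openlineage.transport.auth.type": "api_key",
    "spark.openlineage.transport.auth.apiKey": "<YOUR_API_KEY>",
    "spark.openlineage.namespace": "<YOUR_OPENLINEAGE_NAMESPACE>",
}

# Job default arguments as a key/value mapping.
default_arguments = {
    "--conf": build_conf_value(spark_properties),
    "--extra-jars": "s3://<YOUR_BUCKET>/jars/openlineage-spark_2.12-1.xx.0.jar",
    "--user-jars-first": "true",  # AWS Glue 5.0 only
}

print(default_arguments["--conf"])
```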

VPC and networking considerations

The networking requirements depend on whether your AWS Glue job uses a VPC connection.

Jobs without a VPC connection

AWS Glue jobs that do not use a VPC connection (no JDBC, no connection attached) run in AWS Glue’s managed network, which has internet access by default. These jobs can reach the Ataccama ONE endpoint without any additional networking configuration.

Jobs with a VPC connection

AWS Glue jobs that use a VPC connection (for example, a JDBC connection to RDS or Redshift) run inside your VPC subnet. These jobs do not have internet access by default and cannot reach external endpoints like Ataccama ONE.

To enable internet egress for VPC-bound jobs, you need a NAT Gateway:

  1. Create a NAT Gateway in a public subnet (a subnet with an Internet Gateway route).

  2. Add a route in the AWS Glue connection subnet’s route table: 0.0.0.0/0 to the NAT Gateway.

The NAT Gateway must be in a different subnet than the one used by the AWS Glue connection. The AWS Glue connection subnet should be a private subnet with a route to the NAT Gateway.

If you only need Secrets Manager access (not OpenLineage event delivery), you can create a Secrets Manager VPC endpoint instead of a NAT Gateway.

To deliver OpenLineage events to an external Ataccama ONE endpoint, a NAT Gateway (or equivalent internet egress path) is required.

Verify the connection

After running an AWS Glue job with OpenLineage configured:

  1. Navigate to Data Observability > Connections in ONE.

  2. Confirm events are being received — the connection status changes to Connected and job executions appear on the connection’s Overview tab.

Troubleshooting

No events appearing

  1. Verify that your AWS Glue job completed successfully.

  2. Check the job’s CloudWatch logs for OpenLineage transport errors.

  3. If the job runs in a VPC, confirm that a NAT Gateway or internet egress path is configured (see Jobs with a VPC connection).

  4. Verify that the --conf properties are correctly formatted as a single concatenated string.
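To help with the network check in step 3, a basic TCP reachability test can be run from inside the job script. This is only a sketch: the function name can_reach is hypothetical, and a successful TCP connection shows that an egress path exists, not that the API key or endpoint path is correct.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, logging the result of can_reach("<YOUR_INSTANCE>.ataccama.one") at the start of a VPC-bound job makes a missing NAT Gateway route visible in the CloudWatch logs.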

HTTP 403 errors

If the API key contains special characters (such as =), the AWS Glue argument parser may corrupt them when passed through --conf. Use the AWS Secrets Manager approach to avoid this issue.
