Configure AWS Glue with OpenLineage
This guide explains how to configure AWS Glue ETL jobs to send OpenLineage events to your Ataccama ONE orchestrator connection using the OpenLineage Spark listener. Once configured, your AWS Glue jobs emit pipeline lineage events — including dataset inputs, outputs, and job execution metadata — that appear in Data Observability in ONE.
Before you begin, ensure you have generated an API key and copied your endpoint URL.
Compatibility
| AWS Glue version | Spark version | OpenLineage JAR |
|---|---|---|
| 3.0 | 3.1 | External JAR via `--extra-jars` |
| 4.0 | 3.3 | External JAR via `--extra-jars` |
| 5.0 | 3.5 | Bundled in the runtime; external JAR via `--extra-jars` recommended |
- AWS Glue 3.0 and 4.0: You must provide the OpenLineage Spark JAR as an external dependency.
- AWS Glue 5.0: The OpenLineage Spark listener is bundled in the AWS Glue runtime, but it may not include the latest features or fixes. For best results, use the external JAR approach even on AWS Glue 5.0 to ensure you have the latest OpenLineage version.
Upload the OpenLineage Spark JAR (for example, openlineage-spark_2.12-1.xx.0.jar) to an S3 bucket and reference it with the --extra-jars AWS Glue job argument.
For AWS Glue 5.0, also set --user-jars-first=true to ensure the external JAR takes precedence over the built-in one.
The public JAR is available on Maven Central.
Prerequisites
- An AWS Glue ETL job (version 3.0, 4.0, or 5.0)
- OpenLineage endpoint URL and API key from your ONE orchestrator connection — see Gather connection credentials
- IAM permissions for your AWS Glue job role: S3 access, CloudWatch Logs access, and optionally Secrets Manager access if you use Secrets Manager for API key management
- OpenLineage Spark JAR uploaded to S3 (required for AWS Glue 3.0/4.0, recommended for AWS Glue 5.0)
- For VPC-bound jobs sending events to an external endpoint: a NAT Gateway or other internet egress path

A custom Ataccama build of the OpenLineage Spark adapter is available on request. The publicly available JAR provides basic lineage tracking, but failed jobs are reported with a COMPLETE status. The Ataccama adapter correctly reports FAIL status with error details, enabling failure detection and alerting in ONE. Contact your Ataccama Customer Success Manager to obtain the custom JAR.
Configuration
Step 1: Obtain Ataccama ONE credentials
In Ataccama ONE, navigate to your orchestrator connection and copy:
- Endpoint URL: the full URL in the format `https://<YOUR_INSTANCE>.ataccama.one/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage`
- API key: the authentication token for the connection
You need to split the endpoint URL into two parts for the Spark configuration:

| Property | Value | Example |
|---|---|---|
| `spark.openlineage.transport.url` | Base URL (without the API path) | `https://<YOUR_INSTANCE>.ataccama.one` |
| `spark.openlineage.transport.endpoint` | API path | `/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage` |

The `transport.url` and `transport.endpoint` must be configured as separate properties.
Combining the full URL into a single `transport.url` or `transport.endpoint` property does not work with the OpenLineage Spark listener.
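If you script your job setup, the split can be derived from the full endpoint URL with standard URL parsing. This is a minimal sketch; the instance name `acme` and connection ID `abc123` below are placeholder examples, not real values.

```python
from urllib.parse import urlsplit

def split_endpoint(full_url: str) -> tuple[str, str]:
    """Split the full Ataccama endpoint URL into the base URL
    (transport.url) and the API path (transport.endpoint)."""
    parts = urlsplit(full_url)
    return f"{parts.scheme}://{parts.netloc}", parts.path

# "acme" and "abc123" are placeholders for illustration only.
base, path = split_endpoint(
    "https://acme.ataccama.one/gateway/openlineage/abc123/api/v1/lineage"
)
# base == "https://acme.ataccama.one"
# path == "/gateway/openlineage/abc123/api/v1/lineage"
```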
Step 2: Configure OpenLineage Spark properties
There are two ways to configure the OpenLineage Spark listener: using --conf Spark properties or using an openlineage.yml configuration file.
Both require the spark.extraListeners property to be set via --conf.
Option A: Spark properties via --conf
Add the following Spark configuration properties to your AWS Glue job using the --conf argument:
```
--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
--conf spark.openlineage.transport.type=http
--conf spark.openlineage.transport.url=https://<YOUR_INSTANCE>.ataccama.one
--conf spark.openlineage.transport.endpoint=/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage
--conf spark.openlineage.transport.auth.type=api_key
--conf spark.openlineage.transport.auth.apiKey=<YOUR_API_KEY>
--conf spark.openlineage.namespace=<YOUR_OPENLINEAGE_NAMESPACE>
```
| Property | Description |
|---|---|
| `spark.extraListeners` | Registers the OpenLineage Spark listener |
| `spark.openlineage.transport.type` | Transport protocol (`http`) |
| `spark.openlineage.transport.url` | Base URL of the Ataccama ONE connection endpoint |
| `spark.openlineage.transport.endpoint` | URL path of the Ataccama ONE connection endpoint |
| `spark.openlineage.transport.auth.type` | Authentication type (`api_key`) |
| `spark.openlineage.transport.auth.apiKey` | API key (the authentication token for the connection) |
| `spark.openlineage.namespace` | Logical namespace for grouping events |
Option B: Configuration file (openlineage.yml)
As an alternative to --conf properties, you can provide an openlineage.yml configuration file.
The OpenLineage Spark listener looks for this file at the path specified by the OPENLINEAGE_CONFIG environment variable, or in the default location (/tmp/openlineage.yml on AWS Glue).
```yaml
# openlineage.yml
transport:
  type: http
  url: https://<YOUR_INSTANCE>.ataccama.one
  endpoint: /gateway/openlineage/<CONNECTION_ID>/api/v1/lineage
  auth:
    type: api_key
    apiKey: <YOUR_API_KEY>
namespace: <YOUR_OPENLINEAGE_NAMESPACE>
```
Even when using a configuration file, you must still set spark.extraListeners via --conf to register the listener.
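In a Glue PySpark script, one way to provide the file is to write it to the default location at the start of the job, before the SparkSession is created. This is a sketch under that assumption; the placeholder values must be replaced with your real instance, connection ID, key, and namespace.

```python
import textwrap

# Sketch: write the OpenLineage configuration to the default location
# (/tmp/openlineage.yml on AWS Glue) before the SparkSession is created.
# The placeholder values (<YOUR_INSTANCE> and so on) must be replaced.
OPENLINEAGE_YML = textwrap.dedent("""\
    transport:
      type: http
      url: https://<YOUR_INSTANCE>.ataccama.one
      endpoint: /gateway/openlineage/<CONNECTION_ID>/api/v1/lineage
      auth:
        type: api_key
        apiKey: <YOUR_API_KEY>
    namespace: <YOUR_OPENLINEAGE_NAMESPACE>
    """)

with open("/tmp/openlineage.yml", "w") as f:
    f.write(OPENLINEAGE_YML)
```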
External JAR configuration
For AWS Glue 3.0 and 4.0, add the external JAR:
```
--extra-jars s3://<YOUR_BUCKET>/jars/openlineage-spark_2.12-1.xx.0.jar
```
For AWS Glue 5.0, we recommend using the external JAR as well to ensure you have the latest OpenLineage features and fixes. Add both arguments:
```
--extra-jars s3://<YOUR_BUCKET>/jars/openlineage-spark_2.12-1.xx.0.jar --user-jars-first=true
```
Step 3: Handle API key authentication
AWS Glue’s argument parser corrupts special characters (such as = in Base64-encoded strings) when passed through --conf.
If your API key contains such characters, the --conf approach may result in HTTP 403 errors.
Recommended: AWS Secrets Manager
Store the API key in AWS Secrets Manager and fetch it at runtime in your AWS Glue job script.
This avoids the --conf parsing issue entirely.
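A minimal sketch of this pattern: the secret name `ataccama/openlineage-api-key` is a hypothetical example, and the boto3 call requires the job role to have `secretsmanager:GetSecretValue` permission. The helper builds the transport properties so they can be applied with `SparkSession.builder.config()` instead of `--conf`.

```python
# Fetching the key at runtime (sketch; boto3 is available in AWS Glue).
# The secret name "ataccama/openlineage-api-key" is a hypothetical example:
#   import boto3
#   api_key = boto3.client("secretsmanager").get_secret_value(
#       SecretId="ataccama/openlineage-api-key"
#   )["SecretString"]

def openlineage_properties(api_key: str, instance: str,
                           connection_id: str, namespace: str) -> dict:
    """Build the OpenLineage transport properties for SparkSession.builder."""
    return {
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": f"https://{instance}.ataccama.one",
        "spark.openlineage.transport.endpoint":
            f"/gateway/openlineage/{connection_id}/api/v1/lineage",
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
        "spark.openlineage.namespace": namespace,
    }

# Apply before the SparkSession is created, for example:
#   builder = SparkSession.builder
#   for key, value in openlineage_properties(api_key, "acme", "abc123",
#                                            "glue-production").items():
#       builder = builder.config(key, value)
# Note: spark.extraListeners must still be set via --conf.

props = openlineage_properties("EXAMPLE_KEY==", "acme", "abc123", "glue-prod")
```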
If your AWS Glue job runs inside a VPC (for example, it uses a JDBC connection), you need a Secrets Manager VPC endpoint for the job to reach Secrets Manager.
Step 4: Configure AWS Glue job arguments
The --conf properties are passed as a single string value in the AWS Glue job’s --conf argument.
All --conf entries are concatenated with spaces into one string and assigned to the --conf key in the job’s default arguments.
In the AWS Console, add these as Job parameters under Job details > Advanced properties.
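If you create the job programmatically (for example, with boto3 or the AWS CLI), the concatenation described above can be built like this. This is a sketch under the assumptions stated in the text: the first entry carries no `--conf` prefix because the job argument key itself is `--conf`.

```python
conf_entries = [
    "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type=http",
    "spark.openlineage.transport.url=https://<YOUR_INSTANCE>.ataccama.one",
    "spark.openlineage.transport.endpoint=/gateway/openlineage/<CONNECTION_ID>/api/v1/lineage",
    "spark.openlineage.transport.auth.type=api_key",
    "spark.openlineage.transport.auth.apiKey=<YOUR_API_KEY>",
    "spark.openlineage.namespace=<YOUR_OPENLINEAGE_NAMESPACE>",
]

# The "--conf" job-argument key supplies the first "--conf", so the entries
# are joined with " --conf " rather than each carrying its own prefix.
conf_value = " --conf ".join(conf_entries)

default_arguments = {
    "--conf": conf_value,
    "--extra-jars": "s3://<YOUR_BUCKET>/jars/openlineage-spark_2.12-1.xx.0.jar",
}
```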
VPC and networking considerations
The networking requirements depend on whether your AWS Glue job uses a VPC connection.
Jobs without a VPC connection
AWS Glue jobs that do not use a VPC connection (no JDBC, no connection attached) run in AWS Glue’s managed network, which has internet access by default. These jobs can reach the Ataccama ONE endpoint without any additional networking configuration.
Jobs with a VPC connection
AWS Glue jobs that use a VPC connection (for example, a JDBC connection to RDS or Redshift) run inside your VPC subnet. These jobs do not have internet access by default and cannot reach external endpoints like Ataccama ONE.
To enable internet egress for VPC-bound jobs, you need a NAT Gateway:
- Create a NAT Gateway in a public subnet (a subnet with an Internet Gateway route).
- Add a route in the AWS Glue connection subnet’s route table sending `0.0.0.0/0` to the NAT Gateway.
The NAT Gateway must be in a different subnet than the one used by the AWS Glue connection. The AWS Glue connection subnet should be a private subnet with a route to the NAT Gateway.
If you only need Secrets Manager access (not OpenLineage event delivery), you can create a Secrets Manager VPC endpoint instead of a NAT Gateway.
To deliver OpenLineage events to an external Ataccama ONE endpoint, a NAT Gateway (or equivalent internet egress path) is required.
Verify the connection
After running an AWS Glue job with OpenLineage configured:
- Navigate to Data Observability > Connections in ONE.
- Confirm events are being received — the connection status changes to Connected and job executions appear on the connection’s Overview tab.
Troubleshooting
No events appearing
- Verify that your AWS Glue job completed successfully.
- Check the job’s CloudWatch logs for OpenLineage transport errors.
- If the job runs in a VPC, confirm that a NAT Gateway or internet egress path is configured (see Jobs with a VPC connection).
- Verify that the `--conf` properties are correctly formatted as a single concatenated string.
HTTP 403 errors
If the API key contains special characters (such as =), the AWS Glue argument parser may corrupt them when passed through --conf.
Use the AWS Secrets Manager approach to avoid this issue.
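A quick check before choosing the `--conf` approach can be sketched as follows. `=` is the problem character documented above; `+` and `/` are included as a defensive assumption only, since they also appear in Base64 strings.

```python
# "=" is the documented problem character; "+" and "/" are a defensive
# assumption, since they also appear in Base64-encoded strings.
RISKY_CHARS = set("=+/")

def safe_for_conf(api_key: str) -> bool:
    """Return False if the key contains characters that the AWS Glue
    argument parser may corrupt when passed through --conf."""
    return not (RISKY_CHARS & set(api_key))
```

For example, `safe_for_conf("abc123")` returns `True`, while a padded Base64 key such as `"c2VjcmV0PQ=="` returns `False` and should be delivered via Secrets Manager instead.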