Configure Azure Data Factory with OpenLineage
This guide shows you how to set up the Azure Data Factory (ADF) OpenLineage integration to send pipeline execution events to your Ataccama ONE orchestrator connection.
Before you begin, ensure you have generated an API key and copied your endpoint URL.
| No modifications to your existing ADF pipelines are required. The integration uses Azure Monitor diagnostic logs, which ADF emits automatically. |
How the ADF integration works
The integration captures ADF pipeline activity by routing diagnostic logs through Azure infrastructure:
1. ADF emits diagnostic logs (pipeline runs, activity runs) to Azure Monitor.
2. Diagnostic settings forward these logs to an Azure Event Hub.
3. An Azure Function processes each event in real time.
4. The Function transforms events to OpenLineage format and sends them to your ONE orchestrator connection endpoint.

Optionally, the Function can enrich events by calling the ADF REST API to resolve dataset references to their physical locations (storage URIs, database tables). To enable this, set `RESOLVE_DATASETS=true` and provide the required API enrichment variables.
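The transformation step can be pictured as follows. This is an illustrative Python sketch, not the shipped code (the actual Function app is TypeScript); the diagnostic-record field names (`status`, `runId`, `pipelineName`) follow the Azure Monitor schema for ADF `PipelineRuns` logs, and the producer URI is a placeholder.

```python
import json
from datetime import datetime, timezone

# Map ADF run statuses to OpenLineage event types (illustrative mapping;
# the shipped Function may name things differently).
STATUS_TO_EVENT = {
    "InProgress": "START",
    "Succeeded": "COMPLETE",
    "Failed": "FAIL",
    "Cancelled": "ABORT",
}

def to_openlineage(record: dict, namespace: str = "adf") -> dict:
    """Convert one ADF PipelineRuns diagnostic record to an OpenLineage RunEvent."""
    return {
        "eventType": STATUS_TO_EVENT.get(record["status"], "OTHER"),
        "eventTime": record.get("timestamp") or datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/adf-openlineage",  # hypothetical producer URI
        "run": {"runId": record["runId"]},
        "job": {"namespace": namespace, "name": record["pipelineName"]},
    }

# Example diagnostic record, trimmed to the fields used above.
record = {"status": "Succeeded", "runId": "0b9c1234", "pipelineName": "CopySalesData",
          "timestamp": "2024-05-01T12:00:00Z"}
event = to_openlineage(record)
print(json.dumps(event, indent=2))
```

In the real pipeline, each Event Hub message carries a batch of such records, and the Function emits one OpenLineage event per record.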
Supported activity types
| Activity type | Lineage support |
|---|---|
| Copy Activity | Full: source and sink datasets, row counts, bytes written |
| Mapping Data Flow | Input and output datasets |
| Lookup | Source dataset |
| GetMetadata | Source dataset |
| Stored Procedure | Database and procedure reference |
| Script Activity | Database reference |
| Execute Pipeline | Parent-child pipeline linkage |
| Databricks Activities | Dataset references |
| HDInsight Activities | Dataset references |
| Custom Activities | Generic extraction |
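The table above implies a per-activity extraction strategy, with a generic fallback for unsupported types. A minimal sketch of such a dispatch (illustrative Python, not the shipped TypeScript; the output field names `rowsCopied` and `dataWritten` mirror Copy Activity run output but are assumptions here):

```python
def extract_copy(output: dict) -> dict:
    """Copy Activity: full lineage, including row counts and bytes written."""
    return {
        "inputs": output.get("source", []),
        "outputs": output.get("sink", []),
        "facets": {"rowsCopied": output.get("rowsCopied"),
                   "bytesWritten": output.get("dataWritten")},
    }

def extract_generic(output: dict) -> dict:
    """Fallback for custom/unsupported activity types: no dataset details."""
    return {"inputs": [], "outputs": [], "facets": {}}

# Dispatch table: activity type -> extractor. One entry per supported type.
EXTRACTORS = {
    "Copy": extract_copy,
    # "ExecuteDataFlow": ..., "Lookup": ..., and so on
}

def extract(activity_type: str, output: dict) -> dict:
    return EXTRACTORS.get(activity_type, extract_generic)(output)

lineage = extract("Copy", {"rowsCopied": 120, "dataWritten": 4096})
```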
Prerequisites
- Ataccama ADF integration package, which includes deployment templates and an Azure Function app. This package is not publicly distributed and is available on request; contact your Ataccama Customer Success Manager to obtain it before proceeding.
- Azure subscription with permissions to create resources
- Existing Azure Data Factory
- OpenLineage endpoint URL and API key from your ONE orchestrator connection (see Gather connection credentials)
- Azure CLI installed (required for Bicep and manual deployment)
- Terraform installed (required for Terraform deployment only)
- Azure Functions Core Tools (required for manual deployment only)
Deploy the integration
Choose one of the following deployment options.
Option 1: Deploy with Bicep
# Log in to Azure
az login
# Create a resource group (if needed)
az group create --name rg-adf-openlineage --location eastus
# Deploy the integration
az deployment group create \
--resource-group rg-adf-openlineage \
--template-file deploy/bicep/main.bicep \
--parameters \
dataFactoryResourceId="/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}" \
openLineageUrl="https://your-endpoint.com" \
openLineageApiKey="your-api-key"
Option 2: Deploy with Terraform
cd deploy/terraform
# Copy and edit the variables file
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
# Initialize and deploy
terraform init
terraform plan
terraform apply
Option 3: Manual deployment
1. Create an Event Hub namespace and hub to receive ADF diagnostic logs, along with a dedicated consumer group for the Azure Function:

   | The Azure Function uses a dedicated consumer group named `openlineage-processor` to process events independently from any other consumers on the hub. If you deploy using Bicep or Terraform, this consumer group is created automatically. |

   az eventhubs namespace create \
     --name adf-openlineage-ehns \
     --resource-group rg-adf-openlineage \
     --sku Standard

   az eventhubs eventhub create \
     --name adf-diagnostic-logs \
     --namespace-name adf-openlineage-ehns \
     --resource-group rg-adf-openlineage \
     --message-retention 1 \
     --partition-count 4

   az eventhubs eventhub consumer-group create \
     --name openlineage-processor \
     --eventhub-name adf-diagnostic-logs \
     --namespace-name adf-openlineage-ehns \
     --resource-group rg-adf-openlineage

2. Configure ADF diagnostic settings:

   az monitor diagnostic-settings create \
     --name openlineage-settings \
     --resource "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}" \
     --event-hub adf-diagnostic-logs \
     --event-hub-rule "/subscriptions/{sub}/resourceGroups/rg-adf-openlineage/providers/Microsoft.EventHub/namespaces/adf-openlineage-ehns/authorizationRules/RootManageSharedAccessKey" \
     --logs '[{"category":"PipelineRuns","enabled":true},{"category":"ActivityRuns","enabled":true},{"category":"TriggerRuns","enabled":true}]'

3. Deploy the Azure Function:

   cd function_app
   npm install
   npm run build
   func azure functionapp publish <your-function-app-name>
Configure the Azure Function
Environment variables
| Variable | Required | Description |
|---|---|---|
| `ADF_EVENT_HUB_CONNECTION` | Yes | Event Hub connection string |
| | Yes | Event Hub name |
| | Yes | OpenLineage endpoint URL from your ONE orchestrator connection |
| | Yes | API key from your ONE orchestrator connection (Bearer token) |
| | No | Namespace prefix for OpenLineage jobs (default: …) |
| `AZURE_SUBSCRIPTION_ID` | No* | Azure subscription ID, required for API enrichment |
| `ADF_RESOURCE_GROUP` | No* | ADF resource group, required for API enrichment |
| `ADF_FACTORY_NAME` | No* | ADF factory name, required for API enrichment |
| `ADF_API_CACHE_TTL_SECONDS` | No | Cache time-to-live for ADF API responses in seconds (default: 3600, one hour) |
| | No | Maximum number of cache entries (default: …) |
| | No | Resolve root pipeline via ADF API for parent-child lineage. Set to `true` to enable. |
| `RESOLVE_DATASETS` | No | Resolve dataset names via ADF API for enriched lineage. Set to `true` to enable. |
| | No | Log full OpenLineage event JSON to Application Insights. Set to `true` to enable. |
*Required only when API enrichment is enabled (`RESOLVE_DATASETS=true`).

| API enrichment is optional. Without it, the integration captures the pipeline and activity hierarchy and basic run metadata. With it enabled (`RESOLVE_DATASETS=true`), the Function calls the ADF REST API to resolve dataset references to their physical storage URIs and database tables. Enable API enrichment if you want detailed dataset-level lineage. |
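For local development of the Function, the settings above might look like the following `local.settings.json` sketch. Only variables whose names appear on this page are shown; all values are placeholders, and the endpoint URL and API key variables are omitted because their exact names are defined by the integration package.

```json
{
  "IsEncrypted": false,
  "Values": {
    "ADF_EVENT_HUB_CONNECTION": "Endpoint=sb://adf-openlineage-ehns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    "AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "ADF_RESOURCE_GROUP": "rg-adf-openlineage",
    "ADF_FACTORY_NAME": "my-factory",
    "RESOLVE_DATASETS": "true",
    "ADF_API_CACHE_TTL_SECONDS": "3600"
  }
}
```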
Required Azure permissions
Deployer permissions
The user or service principal running the deployment requires:
| Permission | Scope | Reason |
|---|---|---|
| | Resource group | Create resources (Function App, Event Hub, Storage, Key Vault, Application Insights) |
| | Data Factory | Create RBAC role assignments for the Function App's managed identity |
| Key Vault Secrets Officer | Key Vault | Write the OpenLineage API key secret into Key Vault (assigned automatically by Terraform) |
| | Data Factory | Create diagnostic settings to route logs to Event Hub |
| The Bicep template relies on ARM implicit deployer permissions for Key Vault secret creation. The Terraform template explicitly assigns Key Vault Secrets Officer to the deployer. |
Function App managed identity permissions
The following RBAC roles are assigned automatically by the Bicep and Terraform templates:
| Role | Scope | Reason |
|---|---|---|
| Reader | Data Factory | Read pipelines, datasets, and linked services via the ADF REST API (for API enrichment) |
| | Key Vault | Read the OpenLineage API key secret |
| The Function App authenticates with Event Hub using a connection string stored in `ADF_EVENT_HUB_CONNECTION`, configured automatically by the deployment templates. |
| If you deployed manually (not via Bicep or Terraform), you must assign these roles yourself. The user running these commands must have sufficient rights to create role assignments at the target scopes. |
Verify the connection
1. Run a pipeline in your Azure Data Factory.
2. Navigate to Data Observability > Connections in ONE.
3. Confirm events are being received: the connection status changes to Connected, and job executions appear on the connection's Overview tab.
| Events typically arrive within seconds of the ADF run completing. If you are using the Consumption plan for your Function App, allow a few extra seconds for a cold start on the first run. |
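If you want to check the endpoint URL and API key independently of ADF, you can construct and post a minimal OpenLineage event yourself. This is a hedged Python sketch: the payload shape follows the OpenLineage specification, the producer URI is a placeholder, and the actual `requests.post` call is shown in a comment because it needs a live endpoint.

```python
import json
import uuid
from datetime import datetime, timezone

ENDPOINT = "https://your-endpoint.com"   # from your ONE orchestrator connection
API_KEY = "your-api-key"                 # from your ONE orchestrator connection

def minimal_event() -> dict:
    """A minimal OpenLineage START event for a throwaway test job."""
    return {
        "eventType": "START",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/manual-test",  # hypothetical producer URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "adf", "name": "connection-test"},
    }

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = json.dumps(minimal_event())

# To actually send (requires the `requests` package and a live endpoint):
# import requests
# requests.post(ENDPOINT, data=payload, headers=headers, timeout=10).raise_for_status()
```

A 2xx response confirms the credentials and endpoint are valid; the test event then shows up as a run of the `connection-test` job.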
Limitations of the ADF integration
- Dataset-level lineage only: Column-level lineage is not captured; this is an ADF platform limitation.
- Data Flow transformations: Internal transformation steps within Mapping Data Flows are not captured; only input and output datasets are tracked.
- SQL query lineage: Stored Procedure and Script activities capture the procedure or script name, not the tables accessed within them.
- Parameterized datasets: Dataset parameters may not be fully resolved at runtime; the integration captures parameter names rather than resolved values.
- Unsupported linked service types: Linked services not in the supported list result in generic dataset references.
Troubleshooting
No events appearing
- Verify that ADF diagnostic settings are enabled and routing to your Event Hub.
- Run a pipeline; events are generated only when pipelines run.
- In the Azure Portal, check your Event Hub to confirm messages are arriving.
- Check your Function App logs in Application Insights for errors.
Missing datasets
- Enable API enrichment by setting `RESOLVE_DATASETS=true` and providing `AZURE_SUBSCRIPTION_ID`, `ADF_RESOURCE_GROUP`, and `ADF_FACTORY_NAME`.
- Verify that the Function App's managed identity has the `Reader` role on the Data Factory.
- Datasets are cached for one hour by default; wait for the cache to expire, or reduce `ADF_API_CACHE_TTL_SECONDS` so recent changes are picked up sooner.
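The caching behavior behind `ADF_API_CACHE_TTL_SECONDS` and the maximum-entries setting can be pictured as a small TTL cache sitting in front of the ADF REST API. This is an illustrative Python sketch only; the shipped Function is TypeScript and its internals may differ.

```python
import time

class TTLCache:
    """Tiny TTL cache: entries expire after ttl_seconds; capacity is bounded."""
    def __init__(self, ttl_seconds: float = 3600, max_entries: int = 1000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        hit = self._store.get(key)
        if hit is None or hit[0] < time.monotonic():
            self._store.pop(key, None)  # drop expired entry
            return None
        return hit[1]

    def put(self, key, value):
        if len(self._store) >= self.max_entries:
            # Evict the entry expiring soonest to stay within capacity.
            soonest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[soonest]
        self._store[key] = (time.monotonic() + self.ttl, value)

# Short TTL for demonstration; production would use the configured value.
cache = TTLCache(ttl_seconds=0.05, max_entries=2)
cache.put("dataset:Sales", {"uri": "abfss://container@account/sales"})
fresh = cache.get("dataset:Sales")    # hit while within TTL
time.sleep(0.06)
expired = cache.get("dataset:Sales")  # None once the TTL has elapsed
```

This is why a dataset renamed or re-pointed in ADF can keep resolving to its old location until the cached entry expires.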