Configure Azure Data Factory with OpenLineage

This guide shows you how to set up the Azure Data Factory (ADF) OpenLineage integration to send pipeline execution events to your Ataccama ONE orchestrator connection.

Before you begin, ensure you have generated an API key and copied your endpoint URL.

No modifications to your existing ADF pipelines are required. The integration uses Azure Monitor diagnostic logs, which ADF emits automatically.

How the ADF integration works

The integration captures ADF pipeline activity by routing diagnostic logs through Azure infrastructure:

  1. ADF emits diagnostic logs (pipeline runs, activity runs) to Azure Monitor.

  2. Diagnostic settings forward these logs to an Azure Event Hub.

  3. An Azure Function processes each event in real time.

  4. The Function transforms events to OpenLineage format and sends them to your ONE orchestrator connection endpoint.

Optionally, the Function can enrich events by calling the ADF REST API to resolve dataset references to their physical locations (storage URIs, database tables). To enable this, set RESOLVE_DATASETS=true and provide the required API enrichment variables.
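As a sketch, the enrichment variables can be set on the deployed Function App with the Azure CLI. The Function App name and the subscription, resource group, and factory values below are placeholders for your own:

```shell
# Enable API enrichment on the Function App (placeholder names are examples).
az functionapp config appsettings set \
  --name <your-function-app> \
  --resource-group rg-adf-openlineage \
  --settings \
    RESOLVE_DATASETS=true \
    AZURE_SUBSCRIPTION_ID=<subscription-id> \
    ADF_RESOURCE_GROUP=<adf-resource-group> \
    ADF_FACTORY_NAME=<factory-name>
```

The full list of variables is described in Configure the Azure Function.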

Supported activity types

Activity type          Lineage support

Copy Activity          Full: source and sink datasets, row counts, bytes written
Mapping Data Flow      Input and output datasets
Lookup                 Source dataset
GetMetadata            Source dataset
Stored Procedure       Database and procedure reference
Script Activity        Database reference
Execute Pipeline       Parent-child pipeline linkage
Databricks Activities  Dataset references
HDInsight Activities   Dataset references
Custom Activities      Generic extraction

Prerequisites

  • Ataccama ADF integration package, which includes deployment templates and an Azure Function app.

    This package is not publicly distributed and is available on request. Contact your Ataccama Customer Success Manager to obtain the integration package before proceeding.
  • Azure subscription with permissions to create resources

  • Existing Azure Data Factory

  • OpenLineage endpoint URL and API key from your ONE orchestrator connection — see Gather connection credentials

  • Azure CLI installed (required for Bicep and manual deployment)

  • Terraform installed (required for Terraform deployment only)

  • Azure Functions Core Tools (required for manual deployment only)

Deploy the integration

Choose one of the following deployment options.

Option 1: Deploy with Bicep

# Log in to Azure
az login

# Create a resource group (if needed)
az group create --name rg-adf-openlineage --location eastus

# Deploy the integration
az deployment group create \
  --resource-group rg-adf-openlineage \
  --template-file deploy/bicep/main.bicep \
  --parameters \
    dataFactoryResourceId="/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}" \
    openLineageUrl="https://your-endpoint.com" \
    openLineageApiKey="your-api-key"
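To confirm the deployment succeeded, you can inspect its state and outputs. This is a sketch: it assumes the default deployment name, which the Azure CLI derives from the template file name (here, main) when no name is given explicitly.

```shell
# Check the provisioning state and any outputs defined by the template.
az deployment group show \
  --resource-group rg-adf-openlineage \
  --name main \
  --query "{state: properties.provisioningState, outputs: properties.outputs}"
```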

Option 2: Deploy with Terraform

cd deploy/terraform

# Copy and edit the variables file
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values

# Initialize and deploy
terraform init
terraform plan
terraform apply

Option 3: Manual deployment

  1. Create an Event Hub namespace and hub to receive ADF diagnostic logs, along with a dedicated consumer group for the Azure Function:

    The Azure Function uses a dedicated consumer group named openlineage-processor to process events independently from any other consumers on the hub. If you deploy using Bicep or Terraform, this consumer group is created automatically.
    az eventhubs namespace create \
      --name adf-openlineage-ehns \
      --resource-group rg-adf-openlineage \
      --sku Standard
    
    az eventhubs eventhub create \
      --name adf-diagnostic-logs \
      --namespace-name adf-openlineage-ehns \
      --resource-group rg-adf-openlineage \
      --message-retention 1 \
      --partition-count 4
    
    az eventhubs eventhub consumer-group create \
      --name openlineage-processor \
      --eventhub-name adf-diagnostic-logs \
      --namespace-name adf-openlineage-ehns \
      --resource-group rg-adf-openlineage
  2. Configure ADF diagnostic settings:

    az monitor diagnostic-settings create \
      --name openlineage-settings \
      --resource "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}" \
      --event-hub adf-diagnostic-logs \
      --event-hub-rule "/subscriptions/{sub}/resourceGroups/rg-adf-openlineage/providers/Microsoft.EventHub/namespaces/adf-openlineage-ehns/authorizationRules/RootManageSharedAccessKey" \
      --logs '[{"category":"PipelineRuns","enabled":true},{"category":"ActivityRuns","enabled":true},{"category":"TriggerRuns","enabled":true}]'
  3. Deploy the Azure Function:

    cd function_app
    npm install
    npm run build
    func azure functionapp publish <your-function-app-name>
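With a manual deployment, you also wire up the Function App settings yourself. The following sketch retrieves the namespace-level connection string and sets the required variables; it assumes the Function App already exists and that you use the default RootManageSharedAccessKey authorization rule:

```shell
# Retrieve the Event Hub namespace connection string.
EH_CONN=$(az eventhubs namespace authorization-rule keys list \
  --namespace-name adf-openlineage-ehns \
  --resource-group rg-adf-openlineage \
  --name RootManageSharedAccessKey \
  --query primaryConnectionString -o tsv)

# Point the Function App at the Event Hub and your ONE endpoint.
az functionapp config appsettings set \
  --name <your-function-app-name> \
  --resource-group rg-adf-openlineage \
  --settings \
    ADF_EVENT_HUB_CONNECTION="$EH_CONN" \
    ADF_EVENT_HUB_NAME=adf-diagnostic-logs \
    OPENLINEAGE_URL="https://your-endpoint.com" \
    OPENLINEAGE_API_KEY="your-api-key"
```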

Configure the Azure Function

Environment variables

Variable                   Required  Description

ADF_EVENT_HUB_CONNECTION   Yes       Event Hub connection string
ADF_EVENT_HUB_NAME         Yes       Event Hub name
OPENLINEAGE_URL            Yes       OpenLineage endpoint URL from your ONE orchestrator connection
OPENLINEAGE_API_KEY        Yes       API key from your ONE orchestrator connection (Bearer token)
OPENLINEAGE_NAMESPACE      No        Namespace prefix for OpenLineage jobs (default: azure_data_factory)
AZURE_SUBSCRIPTION_ID      No*       Azure subscription ID, required for API enrichment
ADF_RESOURCE_GROUP         No*       ADF resource group, required for API enrichment
ADF_FACTORY_NAME           No*       ADF factory name, required for API enrichment
ADF_API_CACHE_TTL_SECONDS  No        Cache time-to-live for ADF API responses, in seconds (default: 3600)
ADF_API_CACHE_MAX_SIZE     No        Maximum number of cache entries (default: 1000)
RESOLVE_ROOT_PIPELINE      No        Resolve the root pipeline via the ADF API for parent-child lineage: true or false (default: false)
RESOLVE_DATASETS           No        Resolve dataset names via the ADF API for enriched lineage: true or false (default: false)
LOG_OPENLINEAGE_EVENTS     No        Log the full OpenLineage event JSON to Application Insights: true or false (default: false)

*Required only when API enrichment is enabled (RESOLVE_DATASETS=true).

API enrichment is optional. Without it, the integration captures pipeline and activity hierarchy and basic run metadata. With it enabled (RESOLVE_DATASETS=true), the Function calls the ADF REST API to resolve dataset references to their physical storage URIs and database tables. Enable API enrichment if you want detailed dataset-level lineage.

Required Azure permissions

Deployer permissions

The user or service principal running the deployment requires:

Permission                                   Scope           Reason

Contributor or higher                        Resource group  Create resources (Function App, Event Hub, Storage, Key Vault, Application Insights)
Owner or User Access Administrator           Data Factory    Create RBAC role assignments for the Function App's managed identity
Key Vault Secrets Officer                    Key Vault       Write the OpenLineage API key secret into Key Vault (assigned automatically by Terraform)
Microsoft.Insights/diagnosticSettings/write  Data Factory    Create diagnostic settings to route logs to Event Hub

The Bicep template relies on ARM implicit deployer permissions for Key Vault secret creation. The Terraform template explicitly assigns Key Vault Secrets Officer to the deployer.

Function App managed identity permissions

The following RBAC roles are assigned automatically by the Bicep and Terraform templates:

Role                    Scope         Reason

Reader                  Data Factory  Read pipelines, datasets, and linked services via the ADF REST API (for API enrichment)
Key Vault Secrets User  Key Vault     Read the OpenLineage API key secret

The Function App authenticates with Event Hub using a connection string stored in ADF_EVENT_HUB_CONNECTION, configured automatically by the deployment templates.

If you deployed manually (not via Bicep or Terraform), you must assign these roles yourself.

  1. Retrieve the Function App’s managed identity principal ID:

    PRINCIPAL_ID=$(az functionapp identity show \
      --name <your-function-app> \
      --resource-group rg-adf-openlineage \
      --query principalId -o tsv)
  2. Grant the Reader role on the Data Factory:

    az role assignment create \
      --role "Reader" \
      --assignee $PRINCIPAL_ID \
      --scope "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}"
  3. Grant the Key Vault Secrets User role:

    az role assignment create \
      --role "Key Vault Secrets User" \
      --assignee $PRINCIPAL_ID \
      --scope "/subscriptions/{sub}/resourceGroups/rg-adf-openlineage/providers/Microsoft.KeyVault/vaults/{vault-name}"

The user running these commands must have Owner or User Access Administrator role on the target resources. If you don’t have permission to assign roles, contact your Azure subscription administrator.
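After assigning the roles, you can confirm they took effect by listing the assignments for the Function App's identity (reusing the PRINCIPAL_ID variable from step 1):

```shell
# List the roles currently assigned to the Function App's managed identity.
az role assignment list \
  --assignee $PRINCIPAL_ID \
  --all \
  --query "[].{role: roleDefinitionName, scope: scope}" -o table
```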

Verify the connection

  1. Run a pipeline in your Azure Data Factory.

  2. Navigate to Data Observability > Connections in ONE.

  3. Confirm events are being received — the connection status changes to Connected and job executions appear on the connection’s Overview tab.

Events typically arrive within seconds of the ADF run completing. If you are using the Consumption plan for your Function App, allow a few extra seconds for a cold start on the first run.
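If the status does not change, you can rule out endpoint or credential problems by posting a minimal OpenLineage run event directly. This is a sketch only: it assumes your endpoint accepts standard OpenLineage HTTP POSTs at the configured URL with Bearer authentication, and the runId, job name, and producer values are illustrative. Depending on how strictly the endpoint validates payloads, additional fields (such as schemaURL) may be required.

```shell
# Post a minimal OpenLineage COMPLETE event and print the HTTP status code.
curl -sS -o /dev/null -w "%{http_code}\n" \
  -X POST "https://your-endpoint.com" \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "run": {"runId": "01234567-0123-0123-0123-012345678901"},
    "job": {"namespace": "azure_data_factory", "name": "connectivity_test"},
    "producer": "manual-connectivity-test"
  }'
```

A 2xx status code indicates the endpoint and API key are working; a 401 or 403 points to a credential problem.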

Limitations of the ADF integration

  • Dataset-level lineage only: Column-level lineage is not captured; this is an ADF platform limitation.

  • Data Flow transformations: Internal transformation steps within Mapping Data Flows are not captured; only input and output datasets are tracked.

  • SQL query lineage: Stored Procedure and Script activities capture the procedure or script name, not the tables accessed within them.

  • Parameterized datasets: Dataset parameters may not be fully resolved at runtime; the integration captures parameter names rather than resolved values.

  • Unsupported linked service types: Linked services not in the supported list will result in generic dataset references.

Troubleshooting

No events appearing

  1. Verify that ADF diagnostic settings are enabled and routing to your Event Hub.

  2. Run a pipeline — events are generated only when pipelines run.

  3. In the Azure Portal, check your Event Hub to confirm messages are arriving.

  4. Check your Function App logs in Application Insights for errors.
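Steps 1 and 3 can be checked from the Azure CLI as well. The following sketch uses the resource names from the deployment examples above; substitute your own placeholders:

```shell
# Confirm a diagnostic setting exists on the factory.
az monitor diagnostic-settings list \
  --resource "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}" \
  -o table

# Check that messages are arriving in the Event Hub namespace.
az monitor metrics list \
  --resource "/subscriptions/{sub}/resourceGroups/rg-adf-openlineage/providers/Microsoft.EventHub/namespaces/adf-openlineage-ehns" \
  --metric IncomingMessages \
  --interval PT1H
```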

Missing datasets

  1. Enable API enrichment by setting RESOLVE_DATASETS=true and providing AZURE_SUBSCRIPTION_ID, ADF_RESOURCE_GROUP, and ADF_FACTORY_NAME.

  2. Verify that the Function App’s managed identity has Reader role on the Data Factory.

  3. Datasets are cached for one hour by default; wait for the cache to expire, or increase ADF_API_CACHE_TTL_SECONDS.

ADF API rate limits

The ADF REST API has a limit of 12,500 read operations per hour. If you reach this limit, increase ADF_API_CACHE_TTL_SECONDS (default: 3600). Pipeline and dataset definitions rarely change, so a longer TTL (such as 7200 or higher) is generally safe.
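For example, the TTL can be raised on the Function App with a single app-settings update (the Function App name is a placeholder):

```shell
# Raise the cache TTL to two hours to reduce ADF API read operations.
az functionapp config appsettings set \
  --name <your-function-app> \
  --resource-group rg-adf-openlineage \
  --settings ADF_API_CACHE_TTL_SECONDS=7200
```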
