
Azure Databricks Connection

Prerequisites

Cluster configuration

The following property must be added to your Spark configuration:

spark.serializer org.apache.spark.serializer.KryoSerializer
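
This property is set on the Databricks cluster, for example under Advanced Options > Spark config in the cluster UI. If you manage the cluster definition as JSON through the Databricks Clusters API, the relevant fragment looks roughly like this (a hedged sketch, not a complete cluster specification):

{
  "spark_conf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
  }
}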

ADLS container permissions

The user account or Service Principal used for access must be assigned either the Storage Blob Data Owner or the Storage Blob Data Contributor role on the blob storage container. To set up these permissions, refer to the relevant Microsoft documentation.
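
As an illustration, the role can be assigned at container scope with the Azure CLI. This is a hedged sketch; the subscription, resource group, storage account, container, and Service Principal client ID are placeholders:

# Assign the Storage Blob Data Contributor role to a Service Principal at container scope
az role assignment create \
  --assignee <sp-client-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container>"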

Connections

To work with the Databricks cluster, you need to set up connections between the following components:

  • Processing virtual machine (VM) with DPE and Databricks cluster.

  • Processing VM with DPE and Ataccama dedicated ADLS Gen2 container.

  • Databricks cluster and Ataccama dedicated ADLS Gen2 container.

As a customer, you are responsible for setting up the connection from the Databricks cluster to your data lake.

Method 1: Authentication via token for cluster and Service Principal Name (SPN) for ADLS

  1. Add the following properties to /opt/ataccama/one/dpe/etc/application-SPARK_DATABRICKS.properties. Create a personal access token for a user who has access to the Databricks cluster and use it as { DBR_TOKEN } in the following configuration (a token-creation example follows this list).

    # Connection from DPE to Databricks:
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.cluster={ name-of-the-cluster-in-databricks }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.url=https://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.authType=PERSONAL_TOKEN
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.token={ DBR_TOKEN }
    
    # Specify location of Ataccama dedicated ADLS Gen2 container
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.point=/ataccama/dbr-tmp
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.url=abfss://{ container name }@{ storage name }.dfs.core.windows.net/{ path-to-the-folder }
    
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl=org.apache.hadoop.fs.adl.AdlFileSystem
    
    #  Connection from Processing VM with DPE application to Ataccama dedicated ADLS Gen2 container
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.auth.type=OAuth
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.id={ ABFSS_CLIENT_ID }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.secret={ ABFSS_CLIENT_SECRET }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
    
    # Connection from Databricks cluster to Ataccama dedicated ADLS Gen2 container
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.auth.type=OAuth
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.id={ ABFSS_CLIENT_ID }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.secret={ ABFSS_CLIENT_SECRET }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
  2. Now add the following metastore properties to /opt/ataccama/one/dpe/etc/application.properties:

    # Databricks as data source configuration
    plugin.metastoredatasource.ataccama.one.cluster.databricks.name={ CLUSTER_NAME }
    plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.databricks.client.jdbc.Driver
    plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class-path={ ATACCAMA_ONE_HOME }/dpe/lib/jdbc/DatabricksJDBC42.jar
    plugin.metastoredatasource.ataccama.one.cluster.databricks.url={ DBR_JDBC_STRING }
    plugin.metastoredatasource.ataccama.one.cluster.databricks.authentication=TOKEN
    plugin.metastoredatasource.ataccama.one.cluster.databricks.databricksUrl={ DBR_URL }
    plugin.metastoredatasource.ataccama.one.cluster.databricks.timeout=15m
    plugin.metastoredatasource.ataccama.one.cluster.databricks.profiling-sample-limit = 100000
    plugin.metastoredatasource.ataccama.one.cluster.databricks.full-select-query-pattern = SELECT {columns} FROM {table}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.preview-query-pattern = SELECT {columns} FROM {table} LIMIT {previewLimit}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.row-count-query-pattern = SELECT COUNT(*) FROM {table}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.sampling-query-pattern = SELECT {columns} FROM {table} LIMIT {limit}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-preview-query-pattern = SELECT * FROM ({dslQuery}) dslQuery LIMIT {previewLimit}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-import-metadata-query-pattern = SELECT * FROM ({dslQuery}) dslQuery LIMIT 0
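
The { DBR_JDBC_STRING } placeholder is the JDBC URL of the cluster. With DatabricksJDBC42.jar and token authentication it typically follows the pattern below; this is a hedged sketch, and the host, organization ID, and cluster ID are placeholders taken from your workspace:

plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:databricks://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<org-id>/<cluster-id>;AuthMech=3;UID=token;PWD={ DBR_TOKEN }

A personal access token can be generated in the Databricks workspace user settings or, assuming you already have a token to authenticate the request, through the Token API (a sketch; the comment and lifetime values are examples):

curl -X POST https://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net/api/2.0/token/create \
  -H "Authorization: Bearer <EXISTING_TOKEN>" \
  -d '{"comment": "Ataccama DPE", "lifetime_seconds": 7776000}'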

Method 2: SPN both for cluster and ADLS

Instead of a token, you can use a Service Principal that has access to the Databricks cluster. Use it in the following configuration; a sketch for verifying the Service Principal credentials follows these steps.

Your ClientID is the same as your ApplicationID.
  1. Add the following properties to /opt/ataccama/one/dpe/etc/application-SPARK_DATABRICKS.properties:

    # Connection from DPE to Databricks:
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.cluster={ name-of-the-cluster-in-databricks }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.url=https://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.authType=AAD_CLIENT_CREDENTIAL
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.tenantId={ tenantID }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientId={ clientID }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientSecret={ clientSecret }
    # The following resource value is always static and should not be changed
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
    
    # Specify location of Ataccama dedicated ADLS Gen2 container
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.point=/ataccama/dbr-tmp
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.url=abfss://{ container name }@{ storage name }.dfs.core.windows.net/{ path-to-the-folder }
    
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl=org.apache.hadoop.fs.adl.AdlFileSystem
    
    # Connection from DPE/Databricks to Ataccama dedicated ADLS Gen2 container
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.auth.type=OAuth
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.id={ clientID }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.secret={ clientSecret }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
    
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.auth.type=OAuth
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.id={ clientID }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.secret={ clientSecret }
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
  2. Now add the following metastore properties to /opt/ataccama/one/dpe/etc/application.properties:

    plugin.metastoredatasource.ataccama.one.cluster.databricks.name=Databricks Cluster
    plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:databricks://{ }
    plugin.metastoredatasource.ataccama.one.cluster.databricks.databricksUrl={ workspaceUrl }
    plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.databricks.client.jdbc.Driver
    plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class-path=${ataccama.path.root}/lib/runtime/jdbc/databricks/*
    plugin.metastoredatasource.ataccama.one.cluster.databricks.timeout=15m
    plugin.metastoredatasource.ataccama.one.cluster.databricks.authentication=INTEGRATED
    plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.authType=AAD_CLIENT_CREDENTIAL
    plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.tenantId=
    plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.clientId=
    plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.clientSecret=
    plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default
    plugin.metastoredatasource.ataccama.one.cluster.databricks.full-select-query-pattern = SELECT {columns} FROM {table}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.preview-query-pattern = SELECT {columns} FROM {table} LIMIT {previewLimit}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.row-count-query-pattern = SELECT COUNT(*) FROM {table}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.sampling-query-pattern = SELECT {columns} FROM {table} LIMIT {limit}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-preview-query-pattern = SELECT * FROM ({dslQuery}) dslQuery LIMIT {previewLimit}
    plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-import-metadata-query-pattern = SELECT * FROM ({dslQuery}) dslQuery LIMIT 0
    
    spring.profiles.active=SPARK_DATABRICKS
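
You can check that the Service Principal credentials are valid for the Databricks resource by requesting a token from the Azure AD v1.0 endpoint. This is a hedged verification sketch; the placeholders match those used in the configuration above:

curl -X POST https://login.microsoftonline.com/{ tenantID }/oauth2/token \
  -d "grant_type=client_credentials" \
  -d "client_id={ clientID }" \
  -d "client_secret={ clientSecret }" \
  -d "resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

A successful response contains an access_token field; an error usually points to a wrong tenant ID, client ID, or client secret.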

Configure Azure Key Vault

You can use Azure Key Vault instead of entering plain-text values in the configuration. Connect to Key Vault using either direct connection credentials or a managed identity (MSI).

Key Vault direct connection

plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.authType=AAD_CLIENT_CREDENTIAL
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.vaultUrl=https://<vault-url>.vault.azure.net/
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.tenantId={ tenantId }
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.clientId={ clientId }
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.clientSecret={ YOUR_SECRET }
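
The Service Principal used here must be able to read secrets from the vault. A hedged Azure CLI sketch, assuming the vault uses access policies rather than RBAC (the vault name is a placeholder):

az keyvault set-policy \
  --name <vault-name> \
  --spn { clientId } \
  --secret-permissions get list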

MSI connection

You can get the client ID of the managed identity by running the following cURL command on the processing VM:

curl 'http://169.254.169.254/metadata/identity/oauth2/token?resource=https://vault.azure.net&api-version=2018-02-01' -H "Metadata: true" | jq

Then configure the Key Vault connection to use the managed identity:

plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.authType=AAD_MANAGED_IDENTITY
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.keyvault.vaultUrl=https://<your_vault>.vault.azure.net/
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.keyvault.clientId=
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.keyvault.tenantId=
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

# Then reference secret values using the keyvault:SECRET:<name_of_secret> syntax, similar to the following:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.id=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.secret=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/ddb8592f..../oauth2/token

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.id=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.secret=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/ddb8592f...../oauth2/token
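
The secrets referenced through the keyvault:SECRET:<name> syntax must exist in the vault. A hedged sketch for creating one with the Azure CLI; the secret name and value are placeholders:

az keyvault secret set \
  --vault-name <your_vault> \
  --name <name_of_secret> \
  --value <secret_value>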

Drivers

This guide assumes the use of DatabricksJDBC42.jar. If you are using SparkJDBC42.jar instead, change the property values in the configurations as follows.

Instead of:

  • plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:databricks://{ }

  • plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.databricks.client.jdbc.Driver

Use:

  • plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:spark://{ }

  • plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.simba.spark.jdbc.Driver

If you are using DatabricksJDBC42.jar, proceed to Rearrange libraries.
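
Apart from the scheme, the JDBC URL for the Simba Spark driver typically follows the same pattern as the Databricks one shown earlier; a hedged sketch with placeholder host, organization ID, cluster ID, and token:

plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:spark://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<org-id>/<cluster-id>;AuthMech=3;UID=token;PWD={ DBR_TOKEN }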

Rearrange libraries

This step is not needed if you use SparkJDBC42.jar.

If you use DatabricksJDBC42.jar, you need to rearrange the drivers in the Ataccama libraries:

  1. Create a folder to store the new set of drivers:

    mkdir -p /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks
    cp DatabricksJDBC42.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
  2. Update /opt/ataccama/one/dpe/etc/application-SPARK_DATABRICKS.properties accordingly:

    1. Find the following line:

      plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.!exclude=!atc-hive-jdbc*!hive-jdbc*;!slf4j-api-*.jar;!kryo-*.jar;!scala*.jar;!commons-lang3*.jar!cif-dtdb*.jar
    2. Delete ;!slf4j-api-*.jar and !commons-lang3*.jar, which gives:

      plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.!exclude=!atc-hive-jdbc*!hive-jdbc*;!kryo-*.jar;!scala*.jar;!cif-dtdb*.jar
  3. In the same file, find:

    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.databricks=../../../lib/runtime/databricks/*

    Then add the newly-created folder, which gives:

    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.databricks=../../../lib/runtime/databricks/*;/opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/*
  4. If you have not already done so, copy DatabricksJDBC42.jar to the new folder, then run the following commands to add the required Ataccama libraries:

    cp /opt/ataccama/one/dpe/lib/runtime/ext/hadoop-* /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
    cp /opt/ataccama/one/dpe/lib/runtime/ext/commons-configuration2-2.1.1.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
    cp /opt/ataccama/one/dpe/lib/runtime/ext/avro-1.7.4.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
    cp /opt/ataccama/one/dpe/lib/runtime/ext/wildfly-openssl-*.Final.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
    cp /opt/ataccama/one/dpe/lib/runtime/ext/htrace-core4-4.1.0-incubating.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
    cp /opt/ataccama/one/dpe/lib/runtime/ext/jackson-core-asl-1.9.13-atlassian-6.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
  5. Finally, make sure that all JAR files are owned by the dpe user. You can do this using the following command:

    sudo chown -R dpe:dpe /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks
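
You can verify the contents and ownership of the new folder, for example:

ls -l /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks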
