Azure Databricks Connection
Prerequisites
Cluster configuration
The following property must be added to your Spark configuration (in Azure Databricks, this is entered under the cluster's Advanced options > Spark config):
spark.serializer org.apache.spark.serializer.KryoSerializer
ADLS container permissions
The user account or Service Principal must be assigned either the Storage Blob Data Owner or the Storage Blob Data Contributor role on the blob storage container.
To set up these permissions, refer to the following Microsoft article.
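As an illustrative sketch (not part of the original guide), the role can also be assigned with the Azure CLI; the subscription, resource group, storage account, and container names below are placeholders you need to replace:
az role assignment create \
  --assignee { SPN_CLIENT_ID } \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/{ subscriptionID }/resourceGroups/{ resourceGroup }/providers/Microsoft.Storage/storageAccounts/{ storageAccount }/blobServices/default/containers/{ containerName }"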
Connections
To work with the Databricks cluster, you need to set up the following connections:
- Between the processing virtual machine (VM) with DPE and the Databricks cluster.
- Between the processing VM with DPE and the Ataccama dedicated ADLS Gen2 container.
- Between the Databricks cluster and the Ataccama dedicated ADLS Gen2 container.
These connections are established via the properties set in Method 1: Authentication via token for cluster and Service Principal Name (SPN) for ADLS and Method 2: SPN both for cluster and ADLS, respectively.
As a customer, you are responsible for setting up the connection from the Databricks cluster to your data lake.
Method 1: Authentication via token for cluster and Service Principal Name (SPN) for ADLS
- Add the following properties to /opt/ataccama/one/dpe/etc/application-SPARK_DATABRICKS.properties. Create a token for a user who has access to the Databricks cluster and use it in the following configuration (a token-creation sketch follows this list):

# Connection from DPE to Databricks:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.cluster={ name-of-the-cluster-in-databricks }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.url=https://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.authType=PERSONAL_TOKEN
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.token={ DBR_TOKEN }

# Specify location of Ataccama dedicated ADLS Gen2 container
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.point=/ataccama/dbr-tmp
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.url=abfss://{ container name }@{ storage name }.dfs.core.windows.net/{ path-to-the-folder }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl=org.apache.hadoop.fs.adl.AdlFileSystem

# Connection from Processing VM with DPE application to Ataccama dedicated ADLS Gen2 container
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.id={ ABFSS_CLIENT_ID }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.secret={ ABFSS_CLIENT_SECRET }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token

# Connection from Databricks cluster to Ataccama dedicated ADLS Gen2 container
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.id={ ABFSS_CLIENT_ID }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.secret={ ABFSS_CLIENT_SECRET }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
- Now add the following metastore properties to /opt/ataccama/one/dpe/etc/application.properties (an example of the { DBR_JDBC_STRING } format also follows this list):

# Databricks as data source configuration
plugin.metastoredatasource.ataccama.one.cluster.databricks.name={ CLUSTER_NAME }
plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.databricks.client.jdbc.Driver
plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class-path={ ATACCAMA_ONE_HOME }/dpe/lib/jdbc/DatabricksJDBC42.jar
plugin.metastoredatasource.ataccama.one.cluster.databricks.url={ DBR_JDBC_STRING }
plugin.metastoredatasource.ataccama.one.cluster.databricks.authentication=TOKEN
plugin.metastoredatasource.ataccama.one.cluster.databricks.databricksUrl={ DBR_URL }
plugin.metastoredatasource.ataccama.one.cluster.databricks.timeout=15m
plugin.metastoredatasource.ataccama.one.cluster.databricks.profiling-sample-limit=100000
plugin.metastoredatasource.ataccama.one.cluster.databricks.full-select-query-pattern=SELECT {columns} FROM {table}
plugin.metastoredatasource.ataccama.one.cluster.databricks.preview-query-pattern=SELECT {columns} FROM {table} LIMIT {previewLimit}
plugin.metastoredatasource.ataccama.one.cluster.databricks.row-count-query-pattern=SELECT COUNT(*) FROM {table}
plugin.metastoredatasource.ataccama.one.cluster.databricks.sampling-query-pattern=SELECT {columns} FROM {table} LIMIT {limit}
plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-preview-query-pattern=SELECT * FROM ({dslQuery}) dslQuery LIMIT {previewLimit}
plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-import-metadata-query-pattern=SELECT * FROM ({dslQuery}) dslQuery LIMIT 0
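A personal access token is typically generated in the Databricks UI under User Settings > Access tokens. As a rough sketch (not part of the original guide), you can also create one through the Databricks Token API, assuming you can already authenticate, for example with an existing token:
# Hypothetical values: replace the workspace URL and the authenticating token
curl -X POST https://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net/api/2.0/token/create \
  -H "Authorization: Bearer <EXISTING_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "comment": "ataccama-dpe", "lifetime_seconds": 7776000 }'
# The JSON response contains token_value; use it as { DBR_TOKEN } above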
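The { DBR_JDBC_STRING } placeholder stands for the cluster's JDBC URL. With the Databricks driver it typically follows the pattern below; this is an illustrative example, and the host and httpPath values come from your cluster's JDBC/ODBC settings:
jdbc:databricks://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<HTTP_PATH>;AuthMech=3;UID=token;PWD={ DBR_TOKEN }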
Method 2: SPN both for cluster and ADLS
Instead of a token, you can use a Service Principal that has access to the Databricks cluster in the following configuration.
Your ClientID is the same as your ApplicationID.
- Add the following properties to /opt/ataccama/one/dpe/etc/application-SPARK_DATABRICKS.properties:

# Connection from DPE to Databricks:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.cluster={ name-of-the-cluster-in-databricks }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.url=https://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.authType=AAD_CLIENT_CREDENTIAL
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.tenantId={ tenantID }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientId={ clientID }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientSecret={ clientSecret }
# The following resource value is always static and should not be changed
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

# Specify location of Ataccama dedicated ADLS Gen2 container
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.point=/ataccama/dbr-tmp
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.url=abfss://{ container name }@{ storage name }.dfs.core.windows.net/{ path-to-the-folder }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl=org.apache.hadoop.fs.adl.AdlFileSystem

# Connection from DPE/Databricks to Ataccama dedicated ADLS Gen2 container
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.id={ clientID }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.secret={ clientSecret }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.id={ clientID }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.secret={ clientSecret }
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/{ tenantID }/oauth2/token
- Now add the following metastore properties to /opt/ataccama/one/dpe/etc/application.properties (a credential check sketch follows this list):

plugin.metastoredatasource.ataccama.one.cluster.databricks.name=Databricks Cluster
plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:databricks://{ }
plugin.metastoredatasource.ataccama.one.cluster.databricks.databricksUrl={ workspaceUrl }
plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.databricks.client.jdbc.Driver
plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class-path=${ataccama.path.root}/lib/runtime/jdbc/databricks/*
plugin.metastoredatasource.ataccama.one.cluster.databricks.timeout=15m
plugin.metastoredatasource.ataccama.one.cluster.databricks.authentication=INTEGRATED
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.authType=AAD_CLIENT_CREDENTIAL
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.tenantId=
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.clientId=
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.clientSecret=
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default
plugin.metastoredatasource.ataccama.one.cluster.databricks.full-select-query-pattern=SELECT {columns} FROM {table}
plugin.metastoredatasource.ataccama.one.cluster.databricks.preview-query-pattern=SELECT {columns} FROM {table} LIMIT {previewLimit}
plugin.metastoredatasource.ataccama.one.cluster.databricks.row-count-query-pattern=SELECT COUNT(*) FROM {table}
plugin.metastoredatasource.ataccama.one.cluster.databricks.sampling-query-pattern=SELECT {columns} FROM {table} LIMIT {limit}
plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-preview-query-pattern=SELECT * FROM ({dslQuery}) dslQuery LIMIT {previewLimit}
plugin.metastoredatasource.ataccama.one.cluster.databricks.dsl-query-import-metadata-query-pattern=SELECT * FROM ({dslQuery}) dslQuery LIMIT 0
spring.profiles.active=SPARK_DATABRICKS
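Before restarting DPE, you can verify that the Service Principal credentials are valid for the Databricks resource. This is an optional sketch, not part of the original guide; it requests a token from Azure AD using the same client-credentials flow and the static Databricks resource ID used above:
curl -s -X POST https://login.microsoftonline.com/{ tenantID }/oauth2/v2.0/token \
  -d "grant_type=client_credentials" \
  -d "client_id={ clientID }" \
  -d "client_secret={ clientSecret }" \
  -d "scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"
# A JSON response containing access_token confirms the SPN credentials work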
Configure Azure Key Vault
You can use Azure Key Vault instead of entering plain-text values in the configuration. Connect to Key Vault using either direct connection credentials or a Managed Service Identity (MSI).
Key Vault direct connection
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.authType=AAD_CLIENT_CREDENTIAL
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.vaultUrl=https://<vault-url>.vault.azure.net/
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.tenantId={ tenantId }
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.clientId={ clientId }
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.clientSecret={ YOUR_SECRET }
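The secrets referenced later as keyvault:SECRET:<name> must already exist in the vault. As an illustrative sketch (vault and secret names are placeholders), you can store them with the Azure CLI:
az keyvault secret set --vault-name <vault-name> --name <name_of_secret> --value "<secret-value>"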
MSI connection
You can get the client ID by running the following cURL command.
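The exact command is not preserved on this page; a commonly used equivalent, run on the processing VM itself, queries the Azure Instance Metadata Service (IMDS), whose JSON response includes a client_id field identifying the managed identity:
# Run on the processing VM; IMDS is only reachable from inside the VM
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net"
# The response JSON contains "client_id" along with an access token for Key Vault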
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.authType=AAD_MANAGED_IDENTITY
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.keyvault.vaultUrl=https://<your_vault>.vault.azure.net/
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.keyvault.clientId=
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.keyvault.tenantId=
plugin.metastoredatasource.ataccama.one.cluster.databricks.aad.keyvault.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
# Then reference secrets as keyvault:SECRET:<name_of_secret>, similar to the following:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.id=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.secret=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/ddb8592f..../oauth2/token
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.auth.type=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.id=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.secret=keyvault:SECRET:<name>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/ddb8592f...../oauth2/token
Drivers
This guide assumes the use of the DatabricksJDBC42.jar driver.
If you are using SparkJDBC42.jar instead, change the property values in the configurations as follows.
Instead of:
- plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:databricks://{ }
- plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.databricks.client.jdbc.Driver

Use:
- plugin.metastoredatasource.ataccama.one.cluster.databricks.url=jdbc:spark://{ }
- plugin.metastoredatasource.ataccama.one.cluster.databricks.driver-class=com.simba.spark.jdbc.Driver
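For reference, the legacy Simba Spark driver uses the same URL pattern as the Databricks driver with a different prefix. This is an illustrative sketch assuming token authentication; the host and httpPath come from your cluster's JDBC/ODBC settings:
jdbc:spark://<YOUR_DATABRICKS_CLUSTER>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<HTTP_PATH>;AuthMech=3;UID=token;PWD={ DBR_TOKEN }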
If you are using DatabricksJDBC42.jar, proceed to Rearrange libraries.
Rearrange libraries
This step is not needed if you use SparkJDBC42.jar.
If you use DatabricksJDBC42.jar, you need to rearrange the drivers in the Ataccama libraries:
- Create a folder to store the new set of drivers and copy the driver into it:
mkdir -p /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks
cp DatabricksJDBC42.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
- Update /opt/ataccama/one/dpe/etc/application-SPARK_DATABRICKS.properties accordingly:
  - Find the following line:
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.!exclude=!atc-hive-jdbc*!hive-jdbc*;!slf4j-api-*.jar;!kryo-*.jar;!scala*.jar;!commons-lang3*.jar!cif-dtdb*.jar
  - Delete ;!slf4j-api-*.jar and !commons-lang3*.jar, which gives:
    plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.!exclude=!atc-hive-jdbc*!hive-jdbc*;!kryo-*.jar;!scala*.jar;!cif-dtdb*.jar
- In the same file, find:
  plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.databricks=../../../lib/runtime/databricks/*
  Then add the newly created folder, which gives:
  plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.databricks=../../../lib/runtime/databricks/*;/opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/*
- Copy DatabricksJDBC42.jar to the new folder (if you have not done so already) and run the following commands to add the Ataccama libraries:
cp /opt/ataccama/one/dpe/lib/runtime/ext/hadoop-* /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
cp /opt/ataccama/one/dpe/lib/runtime/ext/commons-configuration2-2.1.1.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
cp /opt/ataccama/one/dpe/lib/runtime/ext/avro-1.7.4.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
cp /opt/ataccama/one/dpe/lib/runtime/ext/wildfly-openssl-*.Final.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
cp /opt/ataccama/one/dpe/lib/runtime/ext/htrace-core4-4.1.0-incubating.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
cp /opt/ataccama/one/dpe/lib/runtime/ext/jackson-core-asl-1.9.13-atlassian-6.jar /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks/.
- Finally, make sure that all jar files are owned by the DPE user. You can do this using the following command:
sudo chown -R dpe:dpe /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks
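To confirm the ownership change took effect (an optional check, not part of the original guide):
ls -l /opt/ataccama/one/dpe/lib/runtime/jdbc/databricks
# Every jar should list dpe as both owner and group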