Databricks Configuration
Ataccama ONE is currently tested to support Databricks 10.3 or higher.
The following properties are used to configure Databricks and are provided either through the Configuration Service or in the dpe/etc/application-SPARK_DATABRICKS.properties file.
Check that the spring.profiles.active property in the DPE application.properties file is set to the SPARK_DATABRICKS profile.
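For example, a minimal check of the profile setting in the DPE configuration (a sketch; the exact file location depends on your deployment) looks like this:
# dpe/etc/application.properties
spring.profiles.active=SPARK_DATABRICKS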
To add your Databricks cluster as a data source, see Metastore Data Source Configuration.
General configuration
Property | Data type | Description
---|---|---
… | String | Sets the script for customizing how Spark jobs are launched. For example, you can change which shell script is used to start the job. The script is located in the …. The default value for Linux is ….
… | String | Points to the location of Ataccama licenses needed for Spark jobs. The property should be configured only if licenses are stored outside of the standard locations, that is, the home directory of the user and the folder …. Default value: …
… | String | Sets the …. Default value: …
… | Boolean | Enables the debugging mode. The debug mode shows the driver and executor classloaders used in Spark, which lets you detect any classpath conflicts between the Ataccama classpath and the Spark classpath in the Spark driver and in the Spark executors. Default value: …
… | String | The name or the identifier of the cluster. If the property is not set, the most recently added cluster is used. Default value: …
… | Boolean | Tests whether the cluster has the correct mount and libraries. Turning the cluster check off speeds up launching jobs when the cluster libraries have already been installed and there are no changes made to the runtime. In that case, checking is skipped. Default value: …
… | Boolean | Enables automatic mounting of internal Azure Storage to Azure Databricks, which simplifies the installation and configuration process. This requires saving Azure Storage credentials in the configuration settings. Default value: …
… | Number | Specifies how many jobs can be run in parallel. Default value: …
… | Boolean | If set to …. Default value: …
… | Boolean | If set to …. Default value: …
Cluster libraries
The classpath with all required Java libraries needs to include the dpe/lib/* folder. The files in the classpath are then copied to the cluster.
You can also add custom properties for additional classpaths, for example, plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers, which would point to the location of database drivers.
To exclude specific files from the classpath, use the pattern !<file_mask>, for example, !guava-11*.
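For instance, a custom classpath property for JDBC drivers might look as follows (a sketch only; the folder path is a placeholder, and the cp.drivers suffix comes from the example above):
# Hypothetical additional classpath entry pointing to a folder with database drivers
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers=/opt/ataccama/dpe/lib/jdbc/*
# To exclude a library from the classpath, prefix its file mask with an exclamation mark, for example:
# !guava-11*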
Property | Data type | Description
---|---|---
… | String | The classpath delimiter used for libraries for Spark jobs. Default value: …
… | String | The libraries needed for Spark jobs. The files in the classpath are copied to the Databricks cluster. Spark jobs are stored as …. For Spark processing of Azure Data Lake Storage Gen2, an additional set of libraries is required: …. Default value: …
… | String | Points to additional override libraries that can be used for miscellaneous fixes. Default value: …
… | String | Points to additional Databricks libraries. Default value: …
… | String | Excludes certain Java libraries and drivers that are in the same classpath but are not needed for Spark processing. Each library needs to be prefixed by an exclamation mark (!). Default value: …
… | String | Excludes certain Java libraries and drivers that are in the local classpath. Each library needs to be prefixed by an exclamation mark (!). Default value: …
… | String | Points to the local external libraries. Default value: …
Authentication
Property | Data type | Description
---|---|---
… | String | The URL of the Databricks regional endpoint, for example, ….
… | String | Determines the type of authentication used with Databricks. The following authentication types are available: ….
… | String | An access token for the Databricks platform. This token is used for jobs that are executed through ONE Desktop. Otherwise, the token is provided when creating a connection to Databricks in ONE Web Application.
… | String | The username for Databricks. The username and password are used instead of an access token.
… | String | The password for Databricks. The username and password are used instead of an access token.
Azure Active Directory Authentication
For AAD authentication types, you need to specify plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d, which is the Resource ID of Databricks in Azure.
Azure AD Service Principal
For authentication using AAD Service Principal (SP), use the following properties:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.tenantId=tenantID
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientId=clientID
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientSecret=clientSecret
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
Azure AD Managed Identity
For authentication using AAD Managed Identity (MSI), use the following properties:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.tokenPropertyKey=AccessToken
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
Configure Service Principal and Secret Scope
It is best practice to store your secrets in a secure format. If you want to keep your credentials in Azure Key Vault instead of providing them in plain text, you need to configure your Azure Databricks cluster to read secrets from an Azure Key Vault-backed secret scope. First, create a Databricks secret scope and connect it to Azure Key Vault. Then specify the following properties in the Advanced Options of your Azure Databricks cluster configuration:
fs.azure.account.auth.type.acmeadls.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.acmeadls.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
#Enter your application (client) ID for the Azure AD application registration.
fs.azure.account.oauth2.client.id.acmeadls.dfs.core.windows.net <application-id>
#Replace <secret-scope-name> with the name of the corresponding Azure Databricks secret scope and <secret-name> with the name of the client secret in the secret scope.
#Specify the scope and secret name in the format {{secrets/<secret-scope-name>/<secret-name>}}.
fs.azure.account.oauth2.client.secret.acmeadls.dfs.core.windows.net {{secrets/<secret-scope-name>/<secret-name>}}
fs.azure.account.oauth2.client.endpoint.acmeadls.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
To use AAD Managed Identity or Service Principal with Databricks, you need to use SparkJDBC42.
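The Key Vault-backed secret scope itself can be created, for example, with the legacy Databricks CLI. This is only a sketch: it assumes the CLI is authenticated with an Azure AD token (a personal access token is not sufficient for Key Vault-backed scopes), and all bracketed values are placeholders for your environment.
# Create a secret scope backed by Azure Key Vault (placeholders: scope name, Key Vault resource ID, Key Vault DNS name)
databricks secrets create-scope --scope <secret-scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <key-vault-resource-id> --dns-name https://<key-vault-name>.vault.azure.net/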
Amazon Web Services
If you are using Databricks on Amazon Web Services, you need to set the following properties as well as the authentication options. The same user account can be used for mounting a file system and for running jobs in ONE.
Property | Data type | Description
---|---|---
… | String | The folder in the Databricks File System that is used as a mount point for the S3 folder. Default value: …
… | String | The location of the S3 folder for storing libraries and files for processing. The folder is then mounted to the directory in the Databricks File System defined in the property mount.point. Default value: …
… | String | The access key identifier for the Amazon S3 user account. Used when mounting the file system.
… | String | The secret access key for the Amazon S3 user account. Used when mounting the file system.
… | String | The access key identifier for the Amazon S3 user account. Used when configuring Ataccama jobs.
… | String | The secret access key for the Amazon S3 user account. Used when configuring Ataccama jobs.
… | Boolean | Enables fast upload of data. For more information, see How S3A Writes Data to S3. Default value: …
… | String | Configures how data is buffered during fast upload. Must be set to …. Default value: …
Azure Data Lake Storage Gen2
When using Databricks on Azure Data Lake Storage Gen2, it is necessary to configure these properties and enable authentication. You can use the same user account for mounting the file system and for running Ataccama jobs.
Property | Data type | Description
---|---|---
… | String | The libraries needed for Spark jobs. The files in the classpath are copied to the Databricks cluster. Spark jobs are stored as ….
… | String | The folder in the Databricks File System that is used as a mount point for the Azure Data Lake Storage Gen2 folder. Default value: …
… | String | The location of the ADLS Gen2 folder for storing libraries and files for processing. The folder is then mounted to the directory in the Databricks File System defined in the property mount.point. Default value: …
… | String | The type of access token provider. Used when mounting the file system. Default value: …
… | String | The token endpoint. A request is sent to this URL to refresh the access token. Used when mounting the file system. Default value: …
… | String | The client identifier of the ADLS account. Used when mounting the file system.
… | String | The client secret of the ADLS account. Used when mounting the file system.
… | String | Enables the Azure Data Lake File System. Default value: …
… | String | The type of access token provider. Used when running Ataccama jobs. Default value: …
… | String | The token endpoint used for refreshing the access token. Used when running Ataccama jobs. Default value: …
… | String | The client identifier of the ADLS account. Used when running Ataccama jobs.
… | String | The client secret of the ADLS account. Used when running Ataccama jobs.