Databricks Configuration
Ataccama ONE is currently tested to support Databricks 10.3 or higher.
The following properties are used to configure Databricks and are provided either through the Configuration Service or in the dpe/etc/application-SPARK_DATABRICKS.properties file.
Check that the spring.profiles.active property in the DPE application.properties file is set to the SPARK_DATABRICKS profile.
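For example, a minimal check of the profile setting in the DPE configuration (a sketch; the exact file location depends on your deployment) looks like this:
# dpe/etc/application.properties
spring.profiles.active=SPARK_DATABRICKS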
To add your Databricks cluster as a data source, see Metastore Data Source Configuration.
General configuration
Property | Data type | Description
---|---|---
… | String | Sets the script for customizing how Spark jobs are launched. For example, you can change which shell script is used to start the job. The script is located in the …. The default value for Linux is ….
… | String | Points to the location of Ataccama licenses needed for Spark jobs. The property should be configured only if licenses are stored outside of the standard locations, that is, the home directory of the user and the folder …. Default value: …
… | String | Sets the …. Default value: …
… | Boolean | Enables the debugging mode. The debug mode shows the driver and executor classloaders used in Spark, which lets you detect any classpath conflicts between the Ataccama classpath and the Spark classpath in the Spark driver and in the Spark executors. Default value: …
… | String | The name or the identifier of the cluster. If the property is not set, the most recently added cluster is used. Default value: …
… | Boolean | Tests whether the cluster has the correct mount and libraries. Turning the cluster check off speeds up launching jobs when the cluster libraries have already been installed and there are no changes made to the runtime. In that case, checking is skipped. Default value: …
… | Boolean | Enables automatic mounting of internal Azure Storage to Azure Databricks, which simplifies the installation and configuration process. This requires saving Azure Storage credentials in the configuration settings. Default value: …
… | Number | Specifies how many jobs can be run in parallel. Default value: …
… | Boolean | If set to …. Default value: …
… | Boolean | If set to …. Default value: …
Cluster libraries
The classpath with all required Java libraries needs to include the dpe/lib/* folder. The files in the classpath are then copied to the cluster.
You can also add custom properties for additional classpaths, for example, plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers, which would point to the location of database drivers.
To exclude specific files from the classpath, use the pattern !<file_mask>, for example, !guava-11*.
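For instance, a custom classpath property for JDBC drivers might look as follows (a sketch only; the folder path is a placeholder, and the cp.drivers suffix comes from the example above):
# Hypothetical additional classpath entry pointing to a folder with database drivers
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers=/opt/ataccama/dpe/lib/jdbc/*
# To exclude a library from the classpath, prefix its file mask with an exclamation mark, for example:
# !guava-11*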
Property | Data type | Description
---|---|---
… | String | The classpath delimiter used for libraries for Spark jobs. Default value: …
… | String | The libraries needed for Spark jobs. The files in the classpath are copied to the Databricks cluster. Spark jobs are stored as …. For Spark processing of Azure Data Lake Storage Gen2, an additional set of libraries is required: …. Default value: …
… | String | Points to additional override libraries that can be used for miscellaneous fixes. Default value: …
… | String | Points to additional Databricks libraries. Default value: …
… | String | Excludes certain Java libraries and drivers that are in the same classpath but are not needed for Spark processing. Each library needs to be prefixed by an exclamation mark (!). Default value: …
… | String | Excludes certain Java libraries and drivers that are in the local classpath. Each library needs to be prefixed by an exclamation mark (!). Default value: …
… | String | Points to the local external libraries. Default value: …
Authentication
Property | Data type | Description
---|---|---
… | String | The URL of the Databricks regional endpoint, for example, ….
… | String | Determines the type of authentication used with Databricks. The following authentication types are available: ….
… | String | An access token for the Databricks platform. This token is used for jobs that are executed through ONE Desktop. Otherwise, the token is provided when creating a connection to Databricks in ONE Web Application.
… | String | The username for Databricks. The username and password are used instead of an access token.
… | String | The password for Databricks. The username and password are used instead of an access token.
Azure Active Directory Authentication
For AAD authentication types, you need to specify plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d, which is the Resource ID of Databricks in Azure.
Azure AD Service Principal
For authentication using AAD Service Principal (SP), use the following properties:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.tenantId=tenantID
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientId=clientID
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.clientSecret=clientSecret
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
Azure AD Managed Identity
For authentication using AAD Managed Identity (MSI), use the following properties:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.tokenPropertyKey=AccessToken
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.aad.resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
Configure Service Principal and Secret Scope
It is best practice to store your secrets in a secure format. If you want to keep your credentials in Azure Key Vault instead of providing them in plain text, you need to configure your Azure Databricks cluster to read secrets from an Azure Key Vault-backed secret scope. First, create a Databricks secret scope and connect it to Azure Key Vault. Then specify the following properties in the Advanced Options of your Azure Databricks cluster configuration:
fs.azure.account.auth.type.acmeadls.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.acmeadls.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
#Enter your application (client) ID for the Azure AD application registration.
fs.azure.account.oauth2.client.id.acmeadls.dfs.core.windows.net <application-id>
#Replace <secret-scope-name> with the name of the corresponding Azure Databricks secret scope and <secret-name> with the name of the client secret in the secret scope.
#Specify the scope and secret name in the format {{secrets/<secret-scope-name>/<secret-name>}}.
fs.azure.account.oauth2.client.secret.acmeadls.dfs.core.windows.net {{secrets/<secret-scope-name>/<secret-name>}}
fs.azure.account.oauth2.client.endpoint.acmeadls.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
To use AAD Managed Identity or Service Principal with Databricks, you need to use SparkJDBC42.
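The Key Vault-backed secret scope itself can be created, for example, with the legacy Databricks CLI. This is only a sketch: it assumes the CLI is authenticated with an Azure AD token (a personal access token is not sufficient for Key Vault-backed scopes), and all bracketed values are placeholders for your environment.
# Create a secret scope backed by Azure Key Vault (placeholders: scope name, Key Vault resource ID, Key Vault DNS name)
databricks secrets create-scope --scope <secret-scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <key-vault-resource-id> --dns-name https://<key-vault-name>.vault.azure.net/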
Amazon Web Services
If you are using Databricks on Amazon Web Services, you need to set the following properties as well as the authentication options. The same user account can be used for mounting a file system and for running jobs in ONE.
Property | Data type | Description
---|---|---
… | String | The folder in the Databricks File System that is used as a mount point for the S3 folder. Default value: …
… | String | The location of the S3 folder for storing libraries and files for processing. The folder is then mounted to the directory in the Databricks File System defined in the property mount.point. Default value: …
… | String | The access key identifier for the Amazon S3 user account. Used when mounting the file system.
… | String | The secret access key for the Amazon S3 user account. Used when mounting the file system.
… | String | The access key identifier for the Amazon S3 user account. Used when configuring Ataccama jobs.
… | String | The secret access key for the Amazon S3 user account. Used when configuring Ataccama jobs.
… | Boolean | Enables fast upload of data. For more information, see How S3A Writes Data to S3. Default value: …
… | String | Configures how data is buffered during fast upload. Must be set to …. Default value: …
Azure Data Lake Storage Gen2
When using Databricks on Azure Data Lake Storage Gen2, it is necessary to configure these properties and enable authentication. You can use the same user account for mounting the file system and for running Ataccama jobs.
Property | Data type | Description
---|---|---
… | String | The libraries needed for Spark jobs. The files in the classpath are copied to the Databricks cluster. Spark jobs are stored as ….
… | String | The folder in the Databricks File System that is used as a mount point for the Azure Data Lake Storage Gen2 folder. Default value: …
… | String | The location of the ADLS Gen2 folder for storing libraries and files for processing. The folder is then mounted to the directory in the Databricks File System defined in the property mount.point. Default value: …
… | String | The type of access token provider. Used when mounting the file system. Default value: …
… | String | The token endpoint. A request is sent to this URL to refresh the access token. Used when mounting the file system. Default value: …
… | String | The client identifier of the ADLS account. Used when mounting the file system.
… | String | The client secret of the ADLS account. Used when mounting the file system.
… | String | Enables the Azure Data Lake File System. Default value: …
… | String | The type of access token provider. Used when running Ataccama jobs. Default value: …
… | String | The token endpoint used for refreshing the access token. Used when running Ataccama jobs. Default value: …
… | String | The client identifier of the ADLS account. Used when running Ataccama jobs.
… | String | The client secret of the ADLS account. Used when running Ataccama jobs.