
Databricks Configuration

Ataccama ONE is currently tested with Databricks 10.3 or higher.

The following properties configure Databricks and are provided in the dpe/etc/application-SPARK_DATABRICKS.properties file. Make sure that the spring.profiles.active property in the DPE application.properties file is set to the SPARK_DATABRICKS profile.
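For example, activating the profile can look as follows (a minimal sketch; if your deployment already activates other profiles, append SPARK_DATABRICKS to the existing comma-separated list instead):

```properties
# dpe/etc/application.properties
spring.profiles.active=SPARK_DATABRICKS
```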

To add your Databricks cluster as a data source, see sources:metastore-data-source-configuration.adoc.

Basic settings

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.exec

String

Sets the script for customizing how Spark jobs are launched. For example, you can change which shell script is used to start the job. The script is located in the bin/databricks folder in the root directory of the application.

The default value for Linux is ${ataccama.path.root}/bin/databricks/exec_databricks.sh.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dqc.licenses

String

Points to the location of the Ataccama licenses needed for Spark jobs. Configure this property only if licenses are stored outside of the standard locations, that is, the user's home directory and the runtime/license_keys folder.

Default value: ../../../lib/runtime/license_keys.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.env.JAVA_HOME

String

Sets the JAVA_HOME variable for running Spark jobs. For Spark jobs, JDK version 8 is required.

Default value: /usr/java/jdk1.8.0_65.

You need two versions of Java for DPE and Spark: DPE uses Java 17 while the Spark launcher requires Java 8.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.debug

Boolean

Enables the debugging mode. The debug mode shows the driver and the executor classloaders used in Spark, which lets you detect any classpath conflicts between the Ataccama classpath and the Spark classpath in the Spark driver and in the Spark executors.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.checkCluster

Boolean

Tests whether the cluster has the correct mount and libraries. Turning the cluster check off speeds up launching jobs when the cluster libraries have already been installed and there are no changes made to the runtime. In that case, checking is skipped.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.autoMount

Boolean

Enables automatic mounting of internal Azure Storage to Azure Databricks, which simplifies the installation and configuration process. This requires saving the Azure Storage credentials in the configuration settings.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.job.max_concurrent_runs

Number

Specifies how many jobs can be run in parallel.

Default value: 150.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.fsc.force

Boolean

If set to true, input files from the executor are synchronized with the files on the cluster at each launch or processing. If set to false, the cached versions of files and lookups are used instead, unless their respective checksums have changed.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.waitRunIfExceeds

Boolean

If set to true, initiated jobs are added to the queue if the maximum number of concurrent jobs is exceeded. If set to false, jobs are automatically canceled instead.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.point

String

The folder in the Databricks File System that is used as a mount point for the Azure Data Lake Storage Gen2 folder.

Default value: /mnt/ataccama.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.url

String

The location of the ADLS Gen2 folder for storing libraries and files for processing. The folder is then mounted to the directory in the Databricks File System defined in the property mount.point.

Default value: adl://…/tmp/dbr.

Cluster libraries

The classpath with all required Java libraries needs to include the dpe/lib/* folder. The files in the classpath are then copied to the cluster.

You can also add custom properties for additional classpaths, for example, plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers, which would point to the location of database drivers. To exclude specific files from the classpath, use the pattern !<file_mask>, for example, !guava-11*.
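As an illustration, such a custom classpath property with an exclusion pattern could look like the following. The cp.drivers suffix and the exclusion are taken from the examples above; the drivers path itself is hypothetical, so adjust it to where your database drivers actually reside:

```properties
# Hypothetical additional classpath entry pointing to database drivers,
# excluding Guava 11 builds that are not needed for Spark processing:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers=../lib/drivers/*;!guava-11*
```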

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cpdelim

String

The classpath delimiter used for libraries for Spark jobs.

Default value: ; (semicolon).

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.runtime

String

The libraries needed for Spark jobs. The files in the classpath are copied to the Databricks cluster. Spark jobs are stored as tmp/jobs/${jobId} in the root directory of the application.

For Spark processing of Azure Data Lake Storage Gen2, an additional set of libraries is required: ../../../lib/runtime/ext/*.

Default value: ../../../lib/runtime/*;../../../lib/jdbc/*;../../../lib/jdbc_ext/*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.ovr

String

Points to additional override libraries that can be used for miscellaneous fixes.

Default value: ../lib/ovr/*.jar.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.databricks

String

Points to additional Databricks libraries.

Default value: ../lib/runtime/databricks/*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.!exclude

String

Excludes certain Java libraries and drivers that are in the same classpath but are not needed for Spark processing. Each library needs to be prefixed by an exclamation mark (!). Libraries are separated by the delimiter defined in the cpdelim property.

We strongly recommend not providing a specific version for any entry as this can change in future releases. Instead, use a wildcard.

Default value: !atc-hive-jdbc*;!hive-jdbc*;!slf4j-api-*.jar;!kryo-*.jar;!scala*.jar;!commons-lang3*.jar;!cif-dtdb*.jar.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.!exclude

String

Excludes certain Java libraries and drivers that are in the local classpath. Each library needs to be prefixed by an exclamation mark (!). Libraries are separated by the delimiter defined in the cpdelim property.

Default value: !guava-11*;!hive-exec*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.ext

String

Points to the local external libraries.

Default value: ../lib/runtime/hadoop3/*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.excludeFromSync

String

Applies to version 14.5.1 and later.

Excludes certain Databricks libraries from file synchronization between the ONE runtime engine and the libraries installed in Databricks.

Libraries that are installed on the cluster and matched by this property are not uninstalled before the job run, even if they are not part of the ONE runtime. Other libraries (that is, those that are not part of the ONE runtime and not matched by the property) are uninstalled before the job starts.

The property can also be set to a valid regular expression.

Default value: dbfs:/FileStore/job-jars/.*.

Amazon Web Services

If you are using Databricks on Amazon Web Services, you need to set the following properties as well as the authentication options. The same user account can be used for mounting a file system and for running jobs in ONE.

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.aws.credentials.provider

String

Optional list of credential providers. Should be specified as a comma-separated list.

If unspecified, a default list of credential provider classes is queried in sequence.

We recommend using the Ataccama provider with the value com.ataccama.dqc.aws.auth.hadoop.AwsHadoopGeneralCredentialsProvider and appropriate additional configuration of the following properties. Otherwise, follow the instructions in the official Apache Hadoop documentation.

# Set one of the following values to enable S3 server authentication:
# AWS_INSTANCE_IAM - For use with IAM roles assigned to EC2 instance.
# AWS_ACCESS_KEY - For use with the access key and secret key.
# AWS_WEB_IDENTITY_TOKEN - For use with service accounts and assigning IAM roles to Kubernetes pods.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.authType

# Access key associated with the S3 account.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.accessKey

# Secret access key associated with the S3 account.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.secretKey

# The session token to validate temporary security credentials.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.sessionToken

# If "true", specifies that the assume role feature is enabled.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.enabled

# Specifies the AWS region where the Security Token Service (STS) should be used.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.region

# Specifies the Amazon Resource Name (ARN) of the IAM role to be assumed.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.roleArn

# Provides the optional external ID used in the trust relationship between accounts.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.externalId

# Specifies the session name, used to identify the connection in AWS logs.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.sessionName

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.s3n.awsAccessKeyId

String

The access key identifier for the Amazon S3 user account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.s3n.awsSecretAccessKey

String

The secret access key for the Amazon S3 user account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.access.key

String

The access key identifier for the Amazon S3 user account. Used when configuring Ataccama jobs.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.secret.key

String

The secret access key for the Amazon S3 user account. Used when configuring Ataccama jobs.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.fast.upload

Boolean

Enables fast upload of data. For more information, see How S3A Writes Data to S3.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.fast.upload.buffer

String

Configures how data is buffered during fast upload. Must be set to bytebuffer for AWS. For more information, see How S3A Writes Data to S3.

Default value: bytebuffer.
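Putting the job-related AWS settings together, the configuration fragment can look as follows. The access and secret key values are placeholders; substitute your own credentials or, preferably, rely on one of the credential providers described earlier:

```properties
# S3 credentials used when configuring Ataccama jobs (placeholder values)
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.access.key=<access-key-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.secret.key=<secret-access-key>
# Fast upload with the buffering mode required for AWS
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.fast.upload=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.fast.upload.buffer=bytebuffer
```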

Azure Data Lake Storage Gen2

When using Databricks on Azure Data Lake Storage Gen2, it is necessary to configure these properties and enable authentication. You can use the same user account for mounting the file system and for running Ataccama jobs.

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.access.token.provider.type

String

The type of access token provider. Used when mounting the file system.

Default value: ClientCredential.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.refresh.url

String

The token endpoint. A request is sent to this URL to refresh the access token. Used when mounting the file system.

Default value: https://login.microsoftonline.com/…/oauth2/token.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.client.id

String

The client identifier of the ADLS account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.credential

String

The client secret of the ADLS account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl

String

Enables the Azure Data Lake File System.

Default value: org.apache.hadoop.fs.adl.AdlFileSystem.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.access.token.provider.type

String

The type of access token provider. Used when running Ataccama jobs.

Default value: ClientCredential.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.refresh.url

String

The token endpoint used for refreshing the access token. Used when running Ataccama jobs.

Default value: https://login.microsoftonline.com/…/oauth2/token.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.client.id

String

The client identifier of the ADLS account. Used when running Ataccama jobs.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.credential

String

The client secret of the ADLS account. Used when running Ataccama jobs.
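As a combined illustration, the mounting and job-related ADLS Gen2 properties can be configured as follows. The tenant, client, and secret values are placeholders for your own Azure AD application registration:

```properties
# Mounting the file system (placeholder values)
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.access.token.provider.type=ClientCredential
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<tenant-id>/oauth2/token
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.client.id=<client-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.credential=<client-secret>

# Running Ataccama jobs
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl=org.apache.hadoop.fs.adl.AdlFileSystem
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.access.token.provider.type=ClientCredential
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<tenant-id>/oauth2/token
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.client.id=<client-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.credential=<client-secret>
```

The same Azure AD application can be used for both the mount.conf.* and conf.* property groups, which is why the placeholder values repeat.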

Databricks job cluster configuration

Depending on your use case, you can choose between two types of clusters to use with Databricks.

Cluster type Description Pros Cons

All-purpose cluster

Perpetually running generic cluster (with auto-sleep) that is configured once and can run multiple Spark jobs at once.

  • Pros: Faster start time.

  • Cons: /

Job cluster

The Databricks job scheduler creates a new job cluster whenever a job needs to be run and terminates the cluster after the job is completed.

  • Pros: Can result in lower Databricks cost. Dedicated resources for each job.

  • Cons: Slower start time because a cluster must be created every time a job is run (cluster pooling can be used to reduce startup time).

The following table includes the default configuration of jobCluster.

If you want to enable the job cluster when upgrading from an earlier version, add and define the necessary properties from the following table and Spark configuration in the dpe/etc/application-SPARK_DATABRICKS.properties file, and enable the job cluster:

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.enable=true

The property column in the following table and the Spark configuration include default or example values that you should replace depending on your use case.
Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.enable

Boolean

To enable the job cluster, set the property to true.

Default value: false.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.checkConfigUpdate

Boolean

Checks whether the configuration has changed since the last job cluster was created.

Default value: false.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.jobSuffix

String

Defines the suffix of jobs run on the cluster. For example, if jobSuffix=jobCluster, the job name could be DbrRunnerMain-jobCluster.

Default value: jobCluster.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.spark_version=11.3.x-scala2.12

String

Defines the Spark version used on the cluster.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.node_type_id=Standard_D3_v2

Or

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.instance_pool_id=0426-110823-bided16-pool-x2cinbmv

String

Sets the driver and the worker type. This property does not have a default value.

Set either node_type_id or, if you are using a cluster pool, instance_pool_id.

If you use instance_pool_id, it applies to both the driver and the worker nodes.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.num_workers

Integer

Defines the number of workers. If you use autoscaling, there is no need to define this property.

Default value: 3.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.enable

Boolean

Enables autoscaling. If set to true, the number of workers is scaled automatically depending on the resources needed at the time.

Default value: false.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.min_workers

Integer

Defines the minimum number of workers when using autoscaling. Only applies if autoscaling is enabled.

Default value: 1.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.max_workers

Integer

Defines the maximum number of workers when using autoscaling. Only applies if autoscaling is enabled.

Default value: 2.
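For instance, a minimal job cluster configuration with autoscaling can look as follows. The Spark version and node type are the example values from the table above; replace them to match your Databricks workspace:

```properties
# Enable the job cluster and describe its resources
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.enable=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.spark_version=11.3.x-scala2.12
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.node_type_id=Standard_D3_v2
# Scale between 1 and 2 workers based on load
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.enable=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.min_workers=1
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.max_workers=2
```

With autoscale.enable=true, the resources.num_workers property does not need to be set.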

Advanced job cluster options: Spark configuration

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.driver.extraJavaOptions=-Dsun.nio.ch.disableSystemWideOverlappingFileLockCheck=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.hadoopRDD.ignoreEmptySplits=false
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.serializer=org.apache.spark.serializer.KryoSerializer
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.databricks.delta.preview.enabled=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.kryoserializer.buffer.max=1024m

# Storage account (example for Azure Data Lake Storage Gen2)

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.auth.type.ataccamadatalakegen2.dfs.core.windows.net=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth2.client.id.ataccamadatalakegen2.dfs.core.windows.net=<application-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.key.ataccamadatalakegen2.dfs.core.windows.net=<account-key>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth.provider.type.ataccamadatalakegen2.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth2.client.secret.ataccamadatalakegen2.dfs.core.windows.net=<client-secret>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth2.client.endpoint.ataccamadatalakegen2.dfs.core.windows.net=https://login.microsoftonline.com/<directory-id>/oauth2/token
