
Databricks Configuration

Ataccama ONE is currently tested to support the following Databricks versions:

  • v14.5.0: Databricks 11.3 and 12.2.

  • v14.5.1 and later: Databricks 11.3, 12.2, and 13.3.

The following properties are used to configure Databricks and are provided in the dpe/etc/application-SPARK_DATABRICKS.properties file. Make sure that the spring.profiles.active property in the DPE application.properties file is set to the SPARK_DATABRICKS profile.
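
For example, in dpe/etc/application.properties (a minimal sketch; the profile value comes from this guide):

spring.profiles.active=SPARK_DATABRICKS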

To add your Databricks cluster as a data source, see Metastore Data Source Configuration.

Basic settings

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.exec

String

Sets the script for customizing how Spark jobs are launched. For example, you can change which shell script is used to start the job. The script is located in the bin/databricks folder in the root directory of the application.

The default value for Linux is ${ataccama.path.root}/bin/databricks/exec_databricks.sh.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dqc.licenses

String

Points to the location of the Ataccama licenses needed for Spark jobs. Configure this property only if the licenses are stored outside of the standard locations, that is, the user's home directory and the runtime/license_keys folder.

Default value: ../../../lib/runtime/license_keys.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.env.JAVA_HOME

String

Sets the JAVA_HOME variable for running Spark jobs. For Spark jobs, JDK version 8 is required.

Default value: /usr/java/jdk1.8.0_65.

You need two versions of Java for DPE and Spark: DPE uses Java 17 while the Spark launcher requires Java 8.
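
For example, a sketch that points the Spark launcher at a Java 8 installation while DPE itself runs on Java 17 (the JDK path is illustrative; adjust it to your system):

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.env.JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64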

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.debug

Boolean

Enables the debugging mode. The debug mode shows the driver and the executor classloaders used in Spark, which lets you detect any classpath conflicts between the Ataccama classpath and the Spark classpath in the Spark driver and in the Spark executors.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.checkCluster

Boolean

Tests whether the cluster has the correct mount and libraries. Turning the cluster check off speeds up launching jobs when the cluster libraries have already been installed and there are no changes made to the runtime. In that case, checking is skipped.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.autoMount

Boolean

Enables automatic mounting of the internal Azure Storage to Azure Databricks, which simplifies the installation and configuration process. This requires saving the Azure Storage credentials in the configuration settings.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.job.max_concurrent_runs

Number

Specifies how many jobs can be run in parallel.

Default value: 150.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.fsc.force

Boolean

If set to true, input files from the executor are synchronized with the files on the cluster at each launch or processing run. If set to false, cached versions of the files and lookups are used instead, unless their respective checksums have changed.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.waitRunIfExceeds

Boolean

If set to true, initiated jobs are added to the queue if the maximum number of concurrent jobs is exceeded. If set to false, jobs are automatically canceled instead.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.point

String

The folder in the Databricks File System that is used as a mount point for the Azure Data Lake Storage Gen2 folder.

Default value: /mnt/ataccama.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.url

String

The location of the ADLS Gen2 folder for storing libraries and files for processing. The folder is then mounted to the directory in the Databricks File System defined in the property mount.point.

Default value: adl://…/tmp/dbr.

Cluster libraries

The classpath with all required Java libraries needs to include the dpe/lib/* folder. The files in the classpath are then copied to the cluster.

You can also add custom properties for additional classpaths, for example, plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers, which would point to the location of database drivers. To exclude specific files from the classpath, use the pattern !<file_mask>, for example, !guava-11*.
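
For example, a hypothetical drivers classpath that also excludes older Guava builds (the folder path is illustrative):

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers=../../../lib/drivers/*;!guava-11*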

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cpdelim

String

The classpath delimiter used for libraries for Spark jobs.

Default value: ; (semicolon).

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.runtime

String

The libraries needed for Spark jobs. The files in the classpath are copied to the Databricks cluster. Spark jobs are stored as tmp/jobs/${jobId} in the root directory of the application.

For Spark processing of Azure Data Lake Storage Gen2, an additional set of libraries is required: ../../../lib/runtime/ext/*.

Default value: ../../../lib/runtime/*;../../../lib/jdbc/*;../../../lib/jdbc_ext/*.
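
For example, to make the ADLS Gen2 libraries available for Spark processing, you might extend the default value as follows (a sketch based on the paths above):

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.runtime=../../../lib/runtime/*;../../../lib/runtime/ext/*;../../../lib/jdbc/*;../../../lib/jdbc_ext/*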

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.ovr

String

Points to additional override libraries that can be used for miscellaneous fixes.

Default value: ../lib/ovr/*.jar.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.databricks

String

Points to additional Databricks libraries.

Default value: ../lib/runtime/databricks/*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.!exclude

String

Excludes certain Java libraries and drivers that are in the same classpath but are not needed for Spark processing. Each library needs to be prefixed by an exclamation mark (!). Libraries are separated by the delimiter defined in the cpdelim property.

We strongly recommend not providing a specific version for any entry as this can change in future releases. Instead, use a wildcard.

Default value: !atc-hive-jdbc*;!hive-jdbc*;!slf4j-api-*.jar;!kryo-*.jar;!scala*.jar;!commons-lang3*.jar;!cif-dtdb*.jar.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.!exclude

String

Excludes certain Java libraries and drivers that are in the local classpath. Each library needs to be prefixed by an exclamation mark (!). Libraries are separated by the delimiter defined in the cpdelim property.

Default value: !guava-11*;!hive-exec*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.ext

String

Points to the local external libraries.

Default value: ../lib/runtime/hadoop3/*.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.excludeFromSync

String

Applies to version 14.5.1 and later.

Excludes certain Databricks libraries from file synchronization between the ONE runtime engine and the libraries installed in Databricks.

Libraries that are installed on the cluster and matched by this property are not uninstalled before the job run, even if they are not part of the ONE runtime. Other libraries (that is, those that are not part of the ONE runtime and not matched by the property) are uninstalled before the job starts.

The property can also be set to a valid regular expression.

Default value: dbfs:/FileStore/job-jars/.*.
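
For example, to also keep a custom library folder installed (the second path is hypothetical), the property can be set to a regular expression matching both locations:

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.dbr.excludeFromSync=dbfs:/FileStore/job-jars/.*|dbfs:/FileStore/my-custom-libs/.*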

Amazon Web Services

If you are using Databricks on Amazon Web Services, you need to set the following properties as well as the authentication options. The same user account can be used for mounting a file system and for running jobs in ONE.

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.aws.credentials.provider

String

An optional, comma-separated list of credential providers.

If unspecified, a default list of credential provider classes is queried in sequence.

We recommend using the Ataccama provider, com.ataccama.dqc.aws.auth.hadoop.AwsHadoopGeneralCredentialsProvider, together with the following properties configured as appropriate. Otherwise, follow the instructions in the official Apache Hadoop documentation.

# Set one of the following values to enable S3 server authentication:
# AWS_INSTANCE_IAM - For use with IAM roles assigned to EC2 instance.
# AWS_ACCESS_KEY - For use with the access key and secret key.
# AWS_WEB_IDENTITY_TOKEN - For use with service accounts and assigning IAM roles to Kubernetes pods.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.authType

# Access key associated with the S3 account.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.accessKey

# Secret access key associated with the S3 account.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.secretKey

# The session token to validate temporary security credentials.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.sessionToken

# If "true", specifies that the assume role feature is enabled.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.enabled

# Specifies the AWS region where the Security Token Service (STS) should be used.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.region

# Specifies the Amazon Resource Name (ARN) of the IAM role to be assumed.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.roleArn

# Provides the optional external ID used in the trust relationship between accounts.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.externalId

# Specifies the session name, used to identify the connection in AWS logs.
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.assumeRole.sessionName
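
For example, a minimal sketch that selects the Ataccama provider and uses access key authentication (the <clusterId> and key values are placeholders):

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.aws.credentials.provider=com.ataccama.dqc.aws.auth.hadoop.AwsHadoopGeneralCredentialsProvider
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.authType=AWS_ACCESS_KEY
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.accessKey=<access-key-id>
plugin.metastoredatasource.ataccama.one.cluster.<clusterId>.launch-properties.conf.fs.ata.aws.secretKey=<secret-access-key>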

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.s3n.awsAccessKeyId

String

The access key identifier for the Amazon S3 user account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.s3n.awsSecretAccessKey

String

The secret access key for the Amazon S3 user account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.access.key

String

The access key identifier for the Amazon S3 user account. Used when configuring Ataccama jobs.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.secret.key

String

The secret access key for the Amazon S3 user account. Used when configuring Ataccama jobs.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.fast.upload

Boolean

Enables fast upload of data. For more information, see How S3A Writes Data to S3.

Default value: true.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.fast.upload.buffer

String

Configures how data is buffered during fast upload. Must be set to bytebuffer for AWS. For more information, see How S3A Writes Data to S3.

Default value: bytebuffer.
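
Because the same account can be used for both purposes, a sketch with placeholder keys might look like this:

# Used when mounting the file system:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.s3n.awsAccessKeyId=<access-key-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.fs.s3n.awsSecretAccessKey=<secret-access-key>

# Used when configuring Ataccama jobs:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.access.key=<access-key-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.s3a.secret.key=<secret-access-key>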

Azure Data Lake Storage Gen2

When using Databricks on Azure Data Lake Storage Gen2, it is necessary to configure these properties and enable authentication. You can use the same user account for mounting the file system and for running Ataccama jobs.

Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.access.token.provider.type

String

The type of access token provider. Used when mounting the file system.

Default value: ClientCredential.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.refresh.url

String

The token endpoint. A request is sent to this URL to refresh the access token. Used when mounting the file system.

Default value: https://login.microsoftonline.com/…/oauth2/token.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.client.id

String

The client identifier of the ADLS account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.credential

String

The client secret of the ADLS account. Used when mounting the file system.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.fs.adl.impl

String

Enables the Azure Data Lake File System.

Default value: org.apache.hadoop.fs.adl.AdlFileSystem.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.access.token.provider.type

String

The type of access token provider. Used when running Ataccama jobs.

Default value: ClientCredential.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.refresh.url

String

The token endpoint used for refreshing the access token. Used when running Ataccama jobs.

Default value: https://login.microsoftonline.com/…/oauth2/token.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.client.id

String

The client identifier of the ADLS account. Used when running Ataccama jobs.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.credential

String

The client secret of the ADLS account. Used when running Ataccama jobs.
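
For example, a sketch that uses the same service principal for mounting and for running jobs (the tenant ID, client ID, and secret are placeholders):

# Used when mounting the file system:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.access.token.provider.type=ClientCredential
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<tenant-id>/oauth2/token
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.client.id=<client-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.mount.conf.dfs.adls.oauth2.credential=<client-secret>

# Used when running Ataccama jobs:
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.access.token.provider.type=ClientCredential
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<tenant-id>/oauth2/token
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.client.id=<client-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.conf.dfs.adls.oauth2.credential=<client-secret>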

Databricks job cluster configuration

Depending on your use case, you can choose between two types of clusters to use with Databricks.

All-purpose cluster

A perpetually running generic cluster (with auto-sleep) that is configured once and can run multiple Spark jobs at once.

Pros:

  • Faster start time.

Cons: none listed.

Job cluster

The Databricks job scheduler creates a new job cluster whenever a job needs to be run and terminates the cluster after the job is completed.

Pros:

  • Can result in lower Databricks cost.

  • Dedicated resources for each job.

Cons:

  • Slower start time because a cluster must be created every time a job is run (cluster pooling can be used to reduce startup time).

The following table lists the default configuration of the job cluster.

If you are upgrading from an earlier version and want to enable the job cluster, add and define the necessary properties from the following table and Spark configuration in the dpe/etc/application-SPARK_DATABRICKS.properties file, and set the job-cluster setting to enable=true:

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.enable=true

The Property column in the following table and the Spark configuration include default or example values that you should replace depending on your use case.
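
A minimal job cluster configuration might then look like the following sketch (values are the defaults and examples from the table; adjust them to your workspace):

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.enable=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.spark_version=11.3.x-scala2.12
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.node_type_id=Standard_D3_v2
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.num_workers=3
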
Property Data type Description

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.enable

Boolean

To enable the job cluster, set the property to true.

Default value: false.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.checkConfigUpdate

Boolean

Checks whether the configuration has changed since the last job cluster was created.

Default value: false.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.jobSuffix

String

Defines the suffix of jobs run on the cluster. For example, if jobSuffix=jobCluster, the job name could be DbrRunnerMain-jobCluster.

Default value: jobCluster.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.spark_version=11.3.x-scala2.12

String

Defines the Spark version used on the cluster.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.node_type_id=Standard_D3_v2

Or

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.instance_pool_id=0426-110823-bided16-pool-x2cinbmv

String

Sets the driver and the worker type. This property does not have a default value.

Set either node_type_id or, if you are using an instance pool, instance_pool_id.

If you specify an instance_pool_id, it is used for both the driver and the worker nodes.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.num_workers

Integer

Defines the number of workers. If you use autoscaling, there is no need to define this property.

Default value: 3.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.enable

Boolean

Enables autoscaling. If set to true, the number of workers is scaled automatically depending on the resources needed at the time.

Default value: false.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.min_workers

Integer

Defines the minimum number of workers when using autoscaling. Only applies if autoscaling is enabled.

Default value: 1.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.max_workers

Integer

Defines the maximum number of workers when using autoscaling. Only applies if autoscaling is enabled.

Default value: 2.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.resources.enable_elastic_disk

Boolean

When enabled, this cluster dynamically acquires additional disk space when its Spark workers are running low on disk space. This feature requires specific AWS permissions to function correctly. For more information, see the official Databricks documentation.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.ssh_public_keys

Array of String

SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name ubuntu on port 2200. Up to 10 keys can be specified.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.custom_tags

ClusterTag

An object containing a set of tags for cluster resources. Databricks tags all cluster resources (such as AWS instances and EBS volumes) with these tags in addition to default_tags.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.cluster_log_conf.dbfs.destination

String

DBFS destination. For example, dbfs:/my/path.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.cluster_log_conf.s3

S3StorageInfo

S3 location of the cluster log. The destination and either region or endpoint must be provided. For example, { "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.aws_attributes.spot_bid_price_percent

INT32

The max price for AWS spot instances, as a percentage of the corresponding instance type’s on-demand price. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered. For safety, this field is limited to be no more than 10000.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.aws_attributes.first_on_demand

INT32

The first first_on_demand nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, first_on_demand nodes will be placed on on-demand instances and the remainder will be placed on availability instances. This value does not affect cluster size and cannot be mutated over the lifetime of a cluster.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.aws_attributes.availability

AwsAvailability

Availability type used for all subsequent nodes past the first_on_demand ones. If first_on_demand is zero, this availability type will be used for the entire cluster. Examples are SPOT, ON_DEMAND, and SPOT_WITH_FALLBACK.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.init_scripts.workspace

WorkspaceStorageInfo

Workspace location of init script. Destination must be provided. For example, { "workspace" : { "destination" : "/Users/someone@domain.com/init_script.sh" } }.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.init_scripts.s3

S3StorageInfo

S3 location of the init script. The destination and either region or endpoint must be provided. For example, { "s3": { "destination" : "s3://init_script_bucket/prefix", "region" : "us-west-2" } }.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.init_scripts.dbfs

DbfsStorageInfo

(Deprecated) DBFS location of init script. Destination must be provided. For example, { "dbfs" : { "destination" : "dbfs:/home/init_script" } }.

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark_env_vars

SparkEnvPair

An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (that is, export X='Y') while launching the driver and workers. For example: {"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"}.

Descriptions for the enable_elastic_disk, ssh_public_keys, custom_tags, cluster_log_conf, spot_bid_price_percent, first_on_demand, availability, workspace, dbfs, s3, and spark_env_vars fields are as defined in the official Databricks documentation.
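
For example, a sketch that enables autoscaling (the worker bounds are illustrative values, not defaults):

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.enable=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.min_workers=1
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.autoscale.max_workers=8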

Advanced job cluster options: Spark configuration

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.driver.extraJavaOptions=-Dsun.nio.ch.disableSystemWideOverlappingFileLockCheck=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.hadoopRDD.ignoreEmptySplits=false
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.serializer=org.apache.spark.serializer.KryoSerializer
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.databricks.delta.preview.enabled=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.spark.kryoserializer.buffer.max=1024m

# Storage account (example for Azure Data Lake Storage Gen2)

plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.auth.type.ataccamadatalakegen2.dfs.core.windows.net=OAuth
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth2.client.id.ataccamadatalakegen2.dfs.core.windows.net=<application-id>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.key.ataccamadatalakegen2.dfs.core.windows.net=<account-key>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth.provider.type.ataccamadatalakegen2.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth2.client.secret.ataccamadatalakegen2.dfs.core.windows.net=<secret>
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.job-cluster.spark-conf.fs.azure.account.oauth2.client.endpoint.ataccamadatalakegen2.dfs.core.windows.net=https://login.microsoftonline.com/<directory-id>/oauth2/token
