Hadoop Configuration
The following properties are used to configure Hadoop and are provided either through the Configuration Service, in the DPE deployment, or in the configuration file `dpe/etc/application-SPARK_HADOOP.properties`.

Configuring Spark using Hadoop is currently supported only on the Linux platform.
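Properties on this page share the prefix `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.` and are often shortened to `SPARK.<suffix>` in the text. As a minimal sketch, an override in `dpe/etc/application-SPARK_HADOOP.properties` looks like this (the deploy mode key and value are illustrative, not a documented default):

```properties
# Every Spark-on-Hadoop option uses the common SPARK launch-type prefix.
# The deployMode value below is an illustrative example of such an override.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.submit.deployMode=cluster
```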
General configuration
Property | Data type | Description
---|---|---
 | String | Sets the script for customizing how Spark jobs are launched. For example, you can change which shell script is used to start the job.
 | String | Points to the location of Ataccama licenses needed for Spark jobs. Configure this property only if licenses are stored outside of the standard locations, that is, the home directory of the user and the license folder.
 | String | Configures Spark driver options.
Cluster libraries
The classpath with all required Java libraries needs to include the `dpe/lib/*` folder. The files in the classpath are then copied to the cluster. You can also add custom properties for additional classpaths, for example, `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers`, which would point to the location of database drivers. To exclude specific files from the classpath, use the pattern `!<file_mask>`, for example, `!guava-11*`.
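For instance, a custom classpath property with an exclusion could be sketched as follows (the `cp.drivers` suffix comes from the example above; the path is illustrative, and the semicolon assumes the default classpath delimiter):

```properties
# Copy JDBC drivers to the cluster, but skip conflicting Guava 11 builds.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cp.drivers=../lib/jdbc/*;!guava-11*
```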
Property | Data type | Description
---|---|---
`plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cpdelim` | String | The classpath delimiter used for libraries for Spark jobs. Default value: `;` (semicolon).
 | String | The libraries needed for Spark jobs. The files in the classpath are copied to the Hadoop cluster.
 | String | Excludes certain Java libraries and drivers that are in the same classpath but are not needed for Spark processing. Each library needs to be prefixed by an exclamation mark (`!`).
 | String | Required when using Kerberos and YARN on some Hadoop distributions, such as Cloudera. Points to the keytab file that stores the Kerberos principal and the corresponding encrypted key that is generated from the principal password.
 | String | Required when using Kerberos and YARN on some Hadoop distributions, such as Cloudera. The name of the Kerberos principal.
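On Kerberized Cloudera clusters, the keytab and principal can be supplied through Spark's own YARN options, since any `spark.`-prefixed property is passed through to Spark (see Spark configuration below). The key suffixes mirror standard Spark settings, and the values are illustrative assumptions, not values taken from this page:

```properties
# Assumed passthrough of Spark's YARN Kerberos options; adjust the path and principal.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.yarn.keytab=/etc/security/keytabs/dpe.keytab
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.yarn.principal=dpe@EXAMPLE.COM
```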
Using Apache Knox
This configuration option is available starting with versions 13.3.2, 13.5.0, and later.
Apache Knox is an Apache gateway service that facilitates safe communication between your own Hadoop clusters and outside systems such as Ataccama. When you use Apache Knox, you need to have a ONE Runtime Server with the Remote Executor component configured on your edge node.

In this deployment option, Data Processing Engine (DPE) is located outside of the Hadoop DMZ and communicates with Hadoop and the ONE Runtime Server via REST API through Apache Knox. DPE Spark jobs such as profiling, evaluations, and plans are started on the ONE Runtime Server instance (instead of on DPE) on the edge node that has the Remote Executor enabled.

DPE itself uses two Knox services: Hive JDBC via Knox and ONE Runtime Server via a newly created Ataccama-specific Knox service. Browsing a data source such as Hive is done via a JDBC request from DPE through Knox and then directly to your Hive.

To configure DPE for Apache Knox, follow these steps:
1. Install ONE Runtime Server with Remote Executor on your edge node.
2. Configure your Knox to redirect requests to Remote Executor. For more information, see How to Secure Remote Executor with Apache Knox.
3. Enable `application-SPARK_KNOX.properties` in `application.properties` by adding `spring.profiles.active=SPARK_KNOX`.
4. Configure your DPE using an additional `application-SPARK_KNOX.properties` file in the `dpe/etc` folder (a complete example follows these steps):

   Property | Data type | Description
   ---|---|---
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cpdelim` | String | The classpath delimiter used for libraries for Spark jobs. Default value: `;` (semicolon).
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.runtime` | String | Value: `../../../lib/runtime/*;../../../lib/jdbc/*;../../../lib/jdbc_ext/*`. Classpath properties are defined using `lcp` to indicate a local classpath, so that JAR files are not copied to Hadoop via DPE and Knox but rather the libraries installed on the edge node are used directly.
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.ovr` | String | Value: `../../../lib/ovr/*`.
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.ext` | String | Value: `../../../lib/ext/*`.
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.gateway` | String | The URL where Apache Knox redirects all communication. Value: `https://<host>:8443/gateway/default/bde/executor`. The default port for Apache Knox is `8443`.
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.auth` | String | To use Apache Knox, set to `basic`. DPE uses a username and password to authenticate with Knox.
   `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.exec` | String | The script for Apache Knox. Value: `${ataccama.path.root}/bin/knox/exec_knox.sh`.

5. Configure your metastore to be able to browse your data source. See Metastore Data Source Configuration.
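Putting step 4 together, a minimal `dpe/etc/application-SPARK_KNOX.properties` assembled from the values in the table above might look like this (replace `<host>` with your Knox host):

```properties
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.cpdelim=;
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.runtime=../../../lib/runtime/*;../../../lib/jdbc/*;../../../lib/jdbc_ext/*
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.ovr=../../../lib/ovr/*
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.lcp.ext=../../../lib/ext/*
# Knox gateway endpoint; the default Knox port is 8443.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.gateway=https://<host>:8443/gateway/default/bde/executor
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.auth=basic
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.exec=${ataccama.path.root}/bin/knox/exec_knox.sh
```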
Authentication
There are two available modes of authentication: basic and Kerberos. For basic authentication, you only need to provide the username (the `SPARK.basic.user` property).
Property | Data type | Description
---|---|---
`plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.basic.user` | String | The username for data processing on the cluster. Used only for basic authentication.
`plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.auth` | String | The type of authentication. Available options: basic and Kerberos. To use Apache Knox, set to `basic`.
 | Boolean | If the property is not set, the option is enabled by default. Used only for Kerberos authentication.
 | String | Points to the Kerberos configuration file, typically called `krb5.conf`.
 | String | Points to the keytab file that stores the Kerberos principal and the corresponding encrypted key that is generated from the principal password. Used only for Kerberos authentication. To use a Kerberos ticket instead of the keytab file, comment out this property.
 | String | The name of the Kerberos principal. The principal is a unique identifier that Kerberos uses to assign tickets for granting access to different services. Used only for Kerberos authentication.
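A minimal basic-authentication setup, using the two property names that appear above (the username value is illustrative):

```properties
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.auth=basic
# Illustrative username; set this to the account used for processing on the cluster.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.basic.user=dpe_user
```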
Advanced configuration
Temporary storage
The following properties define the default folders for storing files on the Hadoop Distributed File System (HDFS) or on the local file system.
Property | Data type | Description
---|---|---
 | String | The location of the folder for user files on HDFS.
 | String | Points to the location of libraries for Spark on HDFS.
 | String | Indicates where the Hadoop worker node is located on the local file system.
 | Number | Defines the default permissions for HDFS, specifically the umask applied when files and folders are created.
Spark configuration
All properties with the `spark` prefix are passed to Spark. Two properties are required: `SPARK.spark.master` and `SPARK.spark.io.compression.codec`. You can also add new custom properties by following this pattern: `plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.<custom_property_name>`.
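For example, the required master setting and a custom passthrough property can be sketched as follows (`yarn` follows from the YARN-only requirement stated below; the memory-overhead line is an illustrative custom property, not a documented one):

```properties
# Required: Spark can only be run on YARN.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.master=yarn
# Illustrative custom property, passed through to Spark as spark.executor.memoryOverhead.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.executor.memoryOverhead=1g
```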
Property | Data type | Description
---|---|---
`plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.master` | String | The master node of the Spark cluster. Spark can only be run on YARN; other modes are not supported. Required. Must be set to `yarn`.
 | String | Specifies how Spark is deployed: using worker nodes (cluster mode) or locally (client mode).
 | String | Defines which support is used for the external catalog implementation on Spark.
 | String | The Hive metastore version.
 | String | Points to the libraries for the Hive metastore. You can use the libraries that are bundled with Spark.
 | String | The URL of the Spark history server.
 | Boolean | Enables event logging on Spark.
 | String | Defines where Cloudera or Hortonworks logs are stored.
`plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.io.compression.codec` | String | Sets the compression method for internal data. Required.
 | Number | Determines how many times YARN tries to submit the application. For debugging, set the value to a number higher than 1.
 | String | The name of the queue to which YARN submits the application.
 | Boolean | If enabled, only the latest partition is processed. The latest partition corresponds to the first partition listed when all partitions are sorted in alphanumeric order. If that partition consists of two or more columns, the higher-level partition is used; however, all lower-level partitions are included as well.
Performance optimization
Property | Data type | Description
---|---|---
 | Boolean | Enables dynamic allocation of resources on Spark.
 | Number | Defines the maximum number of executors that can be used on Spark. Used if dynamic allocation of resources is enabled.
 | Number | Defines the minimum number of executors that are used on Spark. Used if dynamic allocation of resources is enabled.
 | Number | Specifies how many cores are used for each executor. If the property is set, multiple executors can use the same worker node, provided that the node has sufficient resources for all executors.
 | Number | If dynamic allocation of resources is not enabled, specifies how many executors are used on Spark. If dynamic allocation is enabled, the number specified here is used as the initial number of executors, provided that the value is higher than the configured minimum number of executors.
 | String | Determines how much memory is allocated to each executor process.
 | Boolean | Enables the shuffle service. This is helpful for dynamic allocation, as the shuffle files of executors are saved and executors that are no longer needed can be removed with no issues.
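Because any `spark.`-prefixed property is passed through to Spark, a dynamic-allocation setup can be sketched with the standard Spark keys; the key suffixes and numbers below are assumptions based on Spark's own property names, not values taken from this page:

```properties
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.dynamicAllocation.enabled=true
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.dynamicAllocation.minExecutors=2
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.dynamicAllocation.maxExecutors=10
# The shuffle service keeps shuffle files available so idle executors can be released safely.
plugin.executor-launch-model.ataccama.one.launch-type-properties.SPARK.spark.shuffle.service.enabled=true
```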
Spark environment variables, debugging, and configuration
Property | Data type | Description
---|---|---
 | Boolean | Enables the debugging mode. The debug mode shows the driver and the executor classloaders used in Spark, which lets you detect any classpath conflicts between the driver and the executor.
 | String | Points to the folder containing Hadoop client configuration files for the given cluster.
 | Boolean | Required for Cloudera’s CDH distribution of Hadoop.
 | Boolean | Enables the debug logging level for Log4j.
 | String | The name of the script that launches the bundled application on the given cluster.
 | Boolean | Enables the YARN timeline service. The timeline service is used to collect and provide application data and metrics about the current and historical states of the application.