Databricks Integration Overview

What is Databricks integration?

Databricks integration enables Ataccama ONE to connect to Databricks, offering multiple integration approaches to suit different use cases and requirements:

  • Spark processing integration leverages Spark pushdown capabilities to execute processing directly on Databricks clusters, making it the optimal choice for high-volume data scenarios with billions of records that require distributed computing power

  • JDBC connection integration provides standard database connectivity without pushdown capabilities, where all processing occurs on Data Processing Engine (DPE), making it suitable for moderate-scale data operations

Both approaches allow you to browse and work with Databricks tables directly from ONE. The Spark integration additionally enables complex data processing workflows to execute on Databricks infrastructure with full cluster scaling benefits.

The choice between integration types depends on your data volume, processing complexity, licensing considerations, and operational needs. Each approach uses a distinct architecture: Spark integration requires multiple connections and operational storage for distributed processing, while JDBC integration uses a direct connection model. See Spark processing capabilities and JDBC connection capabilities, respectively.

Supported environments

Spark integration

Databricks runtime versions

  • 11.3 LTS

  • 12.2 LTS

  • 13.3 LTS

  • 14.3 LTS

  • 15.4 LTS

Supported cluster types

  • Databricks All Purpose Cluster

  • Job Clusters (in combination with a SQL Warehouse for interactive tasks)

JDBC integration

Databricks runtime versions

  • 11.3 LTS

  • 12.2 LTS

  • 13.3 LTS

  • 14.3 LTS

  • 15.4 LTS

Supported cluster types

  • Databricks SQL Warehouse

Spark processing capabilities

Architecture

Databricks Spark Integration Architecture

The Spark integration establishes multiple connections to enable distributed data processing:

Connection requirements

  1. Data Processing Engine (DPE) to Databricks

    • Job Submission Connection (Databricks REST API): For submitting and monitoring Spark processing jobs (see the sketch after this list).

    • JDBC connection: For browsing tables, importing metadata, and running processing jobs.

  2. DPE to Operational Storage

    • Purpose: Stores Ataccama processing files (JARs, lookup tables, temporary data).

    • Options:

      • ADLS Gen2 Container for Azure Databricks

      • AWS S3 Bucket for Databricks on AWS

        You can optionally configure Databricks volume storage, an alternative configuration approach that provides an abstraction layer over your existing ADLS or S3 storage and eliminates the need for direct storage mounting.
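
To make the job submission connection more concrete, here is a minimal sketch, assuming a placeholder workspace URL, personal access token, cluster configuration, and JAR, of the kind of Databricks Jobs REST API call (runs/submit) used to launch a Spark job on a job cluster. It illustrates the API shape only; it is not the exact payload Ataccama ONE sends.

```python
# Illustrative only: submit a one-off Spark JAR job to a Databricks job cluster
# via the Jobs REST API. All values below are placeholders, not Ataccama's payload.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"  # placeholder personal access token

payload = {
    "run_name": "example-processing-job",
    "tasks": [
        {
            "task_key": "process",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            "spark_jar_task": {
                "main_class_name": "com.example.ProcessingJob",  # hypothetical class
                "parameters": ["--input", "abfss://container@account.dfs.core.windows.net/tmp/input"],
            },
            "libraries": [{"jar": "dbfs:/FileStore/jars/processing.jar"}],  # placeholder JAR
        }
    ],
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print("Submitted run:", response.json()["run_id"])
```

Monitoring typically happens over the same API, for example by polling /api/2.1/jobs/runs/get with the returned run_id until the run reaches a terminal state.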

What you can do with Spark processing

  • Large-scale distributed data processing using Databricks Spark clusters.

  • Scalable data transformations and DQ rules processing.

  • Execution of complex data processing plans on Databricks infrastructure.

  • Processing scales automatically with cluster size.

Licensing requirements

A Spark processing license is required to enable distributed data processing using Ataccama’s Spark processing engine.

A Spark license is required for the following integrations:

  • Azure Databricks Spark integration

  • AWS Databricks Spark integration

Note: JDBC connection setup (Databricks JDBC) supports both platforms without requiring Spark licensing.

Authentication options for Spark processing

Azure Databricks

Choose from two authentication approaches:

  • Personal Access Token (PAT)

  • Entra ID Service Principal

Databricks on AWS

Currently supports:

  • Personal Access Token (PAT)

JDBC connection capabilities

Architecture

Databricks JDBC Integration Architecture

The JDBC integration provides a simpler, non-pushdown connection approach:

Connection requirements

  • DPE to Databricks: Standard JDBC connection for all database operations

This simplified architecture eliminates the need for operational storage and REST API connections, making it well-suited for metadata operations and moderate-scale processing tasks.
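
As a rough illustration of what such a connection looks like, the following is a minimal sketch using the Databricks JDBC driver through the jaydebeapi Python package. The hostname, HTTP path, token, and driver JAR path are placeholders; in practice, ONE builds and manages this connection from the connection string you configure.

```python
# A minimal sketch, assuming the standard Databricks JDBC driver and a
# SQL Warehouse endpoint. All identifiers below are placeholders.
import jaydebeapi

JDBC_URL = (
    "jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443;"
    "transportMode=http;ssl=1;"
    "httpPath=/sql/1.0/warehouses/abcdef1234567890;"  # SQL Warehouse HTTP path
    "AuthMech=3;UID=token;PWD=dapiXXXXXXXXXXXXXXXX"   # PAT authentication
)

conn = jaydebeapi.connect(
    "com.databricks.client.jdbc.Driver",           # Databricks JDBC driver class
    JDBC_URL,
    jars="/opt/drivers/DatabricksJDBC42.jar",      # placeholder path to the driver JAR
)
try:
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES IN default")       # simple metadata query
    for row in cursor.fetchall():
        print(row)
    cursor.close()
finally:
    conn.close()
```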

What you can do with JDBC connections

JDBC connections provide the same database functionality as other standard database integrations. For example, you can:

  • Browse and explore Databricks tables and metadata.

  • Import table schemas and metadata into ONE.

  • Run data processing jobs such as profiling and DQ evaluation.

  • Write data to Databricks tables (for example, when exporting invalid records via post-processing plans).

Note, however, that processing occurs entirely on DPE and does not benefit from Databricks cluster scaling.

Licensing requirements

No Spark processing license is required for JDBC-only operations.

For cataloging and profiling use cases, the JDBC-only approach eliminates Spark licensing requirements. However, since processing is limited to DPE server capabilities without cluster scaling, consider Spark integration for large-volume scenarios that exceed DPE capacity.

Authentication options for JDBC connections

Azure Databricks

Choose from two authentication approaches:

  • Personal Access Token (PAT)

  • Entra ID Service Principal

Databricks on AWS

Choose from two authentication approaches:

  • Personal Access Token (PAT)

  • OAuth M2M Service Principal
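
At the connection-string level, the two approaches differ mainly in the authentication properties passed to the Databricks JDBC driver. The following is a minimal sketch, assuming the property names documented for the Databricks JDBC driver (AuthMech, Auth_Flow, OAuth2ClientId, OAuth2Secret); verify them against your driver version. All hostnames, paths, and secrets are placeholders.

```python
# Illustrative connection strings only; every value below is a placeholder.

# Personal Access Token (PAT): AuthMech=3 with UID=token and the PAT as the password.
PAT_URL = (
    "jdbc:databricks://dbc-12345678-abcd.cloud.databricks.com:443;"
    "transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/abcdef1234567890;"
    "AuthMech=3;UID=token;PWD=dapiXXXXXXXXXXXXXXXX"
)

# OAuth M2M service principal: AuthMech=11 with the service principal's client ID
# and OAuth secret (property names assumed from the Databricks JDBC driver docs).
OAUTH_M2M_URL = (
    "jdbc:databricks://dbc-12345678-abcd.cloud.databricks.com:443;"
    "transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/abcdef1234567890;"
    "AuthMech=11;Auth_Flow=1;"
    "OAuth2ClientId=00000000-0000-0000-0000-000000000000;"
    "OAuth2Secret=<oauth-secret-placeholder>"
)
```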

Unity Catalog support

Unity Catalog integration is available as an optional enhancement for both Azure and AWS deployments. It is accessible through both Spark processing and JDBC connections, with some specific limitations and requirements for each approach:

Unity Catalog with Spark integration

When using Spark integration with Unity Catalog enabled, Ataccama ONE supports only Dedicated access mode, not Standard access mode.

Unity Catalog with Databricks JDBC connection

  • The three-level hierarchy ([catalog].[schema].[table]) is not supported.

  • JDBC connections point to the legacy metastore by default, not Unity Catalog. You can specify at most one catalog using ConnCatalog=<name> (see the example after this list).

  • When using an all-purpose cluster, it must be a single-user cluster owned by the service principal you’re authenticating as.
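
To illustrate the ConnCatalog behavior mentioned above, here is a short sketch reusing the placeholder connection string from earlier on this page; appending the property pins the connection to a single catalog.

```python
# Illustrative only: selecting a single Unity Catalog catalog via ConnCatalog.
JDBC_URL_WITH_CATALOG = (
    "jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443;"
    "transportMode=http;ssl=1;"
    "httpPath=/sql/1.0/warehouses/abcdef1234567890;"
    "AuthMech=3;UID=token;PWD=dapiXXXXXXXXXXXXXXXX;"
    "ConnCatalog=my_catalog"  # without this property, the legacy metastore is used
)
```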

Next steps

For detailed configuration instructions, see:

  • Spark processing setup

  • JDBC connection setup
