Integration of Databricks with CADP
This document explains how to configure and integrate CipherTrust Manager with Databricks.
Built on open source and open standards, a lakehouse simplifies your data estate by eliminating the silos that historically complicate data and AI. It provides:
An architecture for integration, storage, processing, governance, sharing, analytics, and AI.
A unified approach to working with structured and unstructured data, with one end-to-end view of data lineage and provenance.
A toolset for Python and SQL, notebooks and IDEs, batch and streaming, and all major cloud providers.
Different capabilities are available to protect data in Databricks. The most secure option is Bring Your Own Encryption (BYOE), which includes:
Data Ingest – Used with Batch Data Transformation (BDT) or User Defined Functions (UDF).
Data Access – User-defined functions for column-level protection.
Note
The focus of this integration is on Data Access: protecting sensitive data in Databricks columns by using CipherTrust Application Data Protection (CADP).
For more information, see the Databricks documentation.
Architecture
Databricks runs on all three major cloud service providers (CSPs): AWS, Azure, and GCP. An advantage of Databricks is that although each CSP has its own way of creating functions and gateways, the integration process is the same on all of them.
Here is an example of how this integration works:
Supported Product Versions
Note
This integration has been validated in the field by a partner or in a customer environment with the following software versions. It is recommended to test the integration in a non-production environment with desired software versions before deploying it to production. Thales will provide best-effort support.
CipherTrust Manager
CipherTrust Manager 2.16 and higher
CADP for Java
CADP for Java 8.13 and higher
Databricks Compute LTS
Databricks Compute 14.3 and higher
Note
This integration guide was validated with CADP for Java 8.16.0.000 using Java 1.8.
Note
This document does not contain all the available notebook samples. For a complete listing, please refer to the GitHub repository.
Prerequisites
The steps performed for this integration are described in the Databricks documentation: https://docs.databricks.com/en/udf/unity-catalog.html.
Ensure that CADP for Java is installed and configured. Refer to Quick Start.
Ensure that the CipherTrust Manager is installed and configured. Refer to the CipherTrust Manager Documentation for details.
Databricks communicates with the CipherTrust Manager using the Network Attached Encryption (NAE) Interface. Ensure that the NAE interface is configured. For more details, refer to the CipherTrust Manager Documentation.
Ensure that the port configured on NAE interface is accessible from Databricks.
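To confirm the last prerequisite, a quick TCP probe can be run from a Databricks notebook before configuring the UDFs. The sketch below is a minimal example; the hostname `cm.example.com` is a placeholder for your CipherTrust Manager, and port 9000 (the default NAE port) should be replaced with whatever port is configured on your NAE interface:

```python
import socket

def nae_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder values -- substitute your CipherTrust Manager hostname and the
# port configured on its NAE interface (9000 is the default):
# print(nae_port_reachable("cm.example.com", 9000))
```

If this returns False, check network routing, firewall rules, and the NAE interface configuration on CipherTrust Manager before proceeding.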
Java UDFs are currently supported only on Databricks compute clusters. Here is an example of a Databricks notebook query accessing data on a compute cluster:
%sql
select ThalesencryptCharUDF(c_name) as enc_name, c_name from samples.tpch.customer
limit 50
For more information on alternatives to the UDF approach, see the Databricks SQL Data Warehouse section.