Integration of Databricks with CRDP
This document describes how to configure and integrate CipherTrust RESTful Data Protection (CRDP) and CipherTrust Manager with Databricks.
Built on open source and open standards, a lakehouse simplifies your data estate by eliminating the silos that historically complicate data and AI. It provides:
An architecture for integration, storage, processing, governance, sharing, analytics, and AI.
An approach to how you work with structured and unstructured data.
One end-to-end view of data lineage and provenance.
A toolset for Python and SQL, notebooks and IDEs, batch and streaming, and all major cloud providers.
Different capabilities are available to protect data in Databricks. The most secure option is Bring Your Own Encryption (BYOE), which includes:
Data Ingest – Used with Batch Data Transformation (BDT) or User Defined Functions (UDF).
Data Access – User-defined functions (UDFs) for column-level protection.
Note
The focus of this integration is on Data Access: protecting sensitive data in Databricks columns by using CipherTrust RESTful Data Protection (CRDP).
For more information, see the Databricks documentation.
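As background for what the UDFs in this integration do per value, the following is a minimal sketch of calling a CRDP protect-style REST endpoint from Java. The base URL, endpoint path (`/v1/protect`), and JSON field names (`protection_policy_name`, `data`) are assumptions for illustration only; consult the CRDP API reference for your release.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a CRDP "protect" call. Endpoint path and
// JSON field names are assumptions; check your CRDP API documentation.
public class CrdpClient {

    // Build the JSON body for a protect request.
    // Field names ("protection_policy_name", "data") are assumptions.
    public static String buildProtectPayload(String policyName, String plaintext) {
        return "{\"protection_policy_name\":\"" + policyName
                + "\",\"data\":\"" + plaintext + "\"}";
    }

    // POST the payload to the CRDP container and return the raw JSON response.
    public static String protect(String crdpBaseUrl, String policyName, String plaintext)
            throws Exception {
        URL url = new URL(crdpBaseUrl + "/v1/protect"); // assumed path
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        byte[] body = buildProtectPayload(policyName, plaintext)
                .getBytes(StandardCharsets.UTF_8);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();
        }
    }
}
```

In a UDF-based deployment, a call like this would run once per row (or per batch, if the API supports it), so network latency between the Databricks cluster and the CRDP container directly affects query time.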
Architecture
Databricks can run on all three major cloud service providers: AWS, Azure, and GCP. An advantage of Databricks is that, even though each CSP has its own unique way to create functions and gateways, there is only one method to learn with Databricks.
Listed below is an example of how this integration works:
Supported Product Versions
Note
This integration has been validated in the field by a partner or in a customer environment with the following software versions. It is recommended to test the integration in a non-production environment with desired software versions before deploying it to production. Thales will provide best-effort support.
CipherTrust Manager
CipherTrust Manager 2.16 and higher
CRDP 1.0
Databricks Compute LTS
Databricks Compute 14.3 or higher.
This integration is validated using Java 1.8 only. Higher versions of Java are not supported by Databricks.
Prerequisites
The steps performed for this integration are described in the following Databricks documentation: https://docs.databricks.com/en/udf/unity-catalog.html.
Ensure that the CRDP container is installed and configured. Refer to https://thalesdocs.com/ctp/con/crdp/latest/admin/crdp-deploy_alternative/index.html.
Ensure that the CipherTrust Manager is installed and configured. Refer to the CipherTrust Manager Documentation for details.
Databricks communicates with the CipherTrust Manager using the Network Attached Encryption (NAE) Interface. Ensure that the NAE interface is configured. For more details, refer to the CipherTrust Manager Documentation.
Ensure that the port configured on NAE interface is accessible from Databricks.
Java UDFs are currently supported only on Databricks compute clusters. Here is an example of a query run from a Databricks notebook accessing data on a compute cluster:
SELECT ThalesencryptCharUDF(c_name) AS enc_name, c_name FROM samples.tpch.customer
LIMIT 50
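A Java UDF such as ThalesencryptCharUDF would wrap a CRDP protect call per row and return the protected value from the JSON response. The following is a minimal sketch of the response-parsing half, assuming the CRDP response is a flat JSON object containing a `protected_data` field; that field name is an assumption for illustration, not the documented CRDP schema.

```java
// Hypothetical helper for a Java UDF such as ThalesencryptCharUDF:
// extracts the protected value from a CRDP JSON response.
// The response field name "protected_data" is an assumption.
public class CrdpResponseParser {

    // Naive extraction of a string field from a flat JSON object,
    // e.g. {"protected_data":"XyZ123"} -> "XyZ123".
    // Sufficient for a sketch; a real UDF should use a proper JSON parser.
    public static String extractField(String json, String field) {
        String key = "\"" + field + "\":\"";
        int start = json.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = json.indexOf('"', start);
        return end < 0 ? null : json.substring(start, end);
    }
}
```

Both this helper and the HTTP call it parses must compile against Java 1.8, per the validated versions above.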
For more information on alternatives to the UDF approach, refer to the Databricks SQL warehouse documentation.