Databricks SQL data warehouse
This section describes how to access protected data from a Databricks SQL Warehouse. A Databricks SQL Warehouse does not support UDFs, so the reveal or decrypt function must be invoked on a Compute Cluster before the data reaches the SQL Warehouse.
Here's the workflow:
Compute Cluster Processing:
Read the data from the customers table, which might reside in the SQL Warehouse, into a Spark DataFrame on your Compute Cluster.
Apply your ThalesencryptCharUDF to the name column to create the enc_name column (a sketch of this step follows the workflow).
Write the transformed data, including the encrypted enc_name column, to a new table or view that the SQL Warehouse can access. You can save the new table in the same SQL Warehouse or in a different storage location.
SQL Warehouse Query:
In your SQL Warehouse, you can now query the newly created table or view containing the pre-encrypted data.
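The UDF step of the workflow can be sketched as follows. This is a minimal sketch, not the full integration: thales_encrypt_char is a hypothetical placeholder for the call into your Thales client that backs ThalesencryptCharUDF, and the output table name is illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.appName("EncryptCustomers").getOrCreate()
def thales_encrypt_char(value: str) -> str:
    # Placeholder: call your Thales CADP/CRDP client here and return the
    # protected value. This stub only marks where that call belongs.
    raise NotImplementedError("wire this to your Thales integration")
# Register the function as a Spark UDF and apply it column by column on the cluster
encrypt_char_udf = udf(thales_encrypt_char, StringType())
customers_df = spark.table("my_catalog.my_schema.customer")
encrypted_df = customers_df.withColumn("enc_name", encrypt_char_udf("c_name"))
# Write a table that the SQL Warehouse can query (illustrative name)
encrypted_df.write.format("delta").mode("overwrite").saveAsTable("my_catalog.my_schema.customers_enc")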
Use case for CADP
In this use case, the sensitive data is loaded as part of an Extract, Load, Transform (ELT) flow and must be protected on the compute cluster before it reaches the SQL Warehouse.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName("EncryptedTable").getOrCreate()
# Read the clear-text names and collect them into a list
query_result = spark.sql("SELECT c_name FROM my_catalog.my_schema.customer")
c_names = [row.c_name for row in query_result.collect()]
# mode and datatype must match your CADP configuration
# (the reveal example below uses mode = "revealbulk")
encrypted_names = thales_cadp_python_function_bulk(c_names, mode, datatype)
# Ensure c_names and encrypted_names have the same length
if len(c_names) != len(encrypted_names):
    raise ValueError("Mismatch between c_name count and encrypted_name count!")
# Create a DataFrame with both columns
data = list(zip(c_names, encrypted_names))
schema = StructType([
    StructField("c_name", StringType(), True),
    StructField("encrypted_c_name", StringType(), True)
])
encrypted_df = spark.createDataFrame(data, schema)
encrypted_df.show()
# Write the DataFrame to a Delta table in Databricks
encrypted_df.write.format("delta").mode("overwrite").saveAsTable("my_catalog.my_schema.encrypted_customers_cadp_char")
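Before moving to the warehouse, you can sanity-check the write from the Compute Cluster. This preview uses the table name from the example above:
# Preview the encrypted table written above
spark.sql("SELECT c_name, encrypted_c_name FROM my_catalog.my_schema.encrypted_customers_cadp_char LIMIT 5").show(truncate=False)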
Once the data is protected, you can unprotect it for the specific use cases that require clear text in the SQL Warehouse.
# Step 1: Fetch encrypted data from the table
encrypted_query_result = spark.sql("SELECT c_name, encrypted_c_name FROM my_catalog.my_schema.encrypted_customers_cadp_char")
# Step 2: Collect once and extract both columns, so the two lists stay aligned
rows = encrypted_query_result.collect()
c_names = [row.c_name for row in rows]
encrypted_names = [row.encrypted_c_name for row in rows]
# Step 3: Decrypt the data
mode = "revealbulk"  # Ensure the mode is set to reveal
decrypted_names = thales_cadp_python_function_bulk(encrypted_names, mode, datatype)
# Step 4: Create a DataFrame with decrypted values
data = list(zip(c_names, decrypted_names))  # Preserve original c_name
schema = StructType([
    StructField("c_name", StringType(), True),
    StructField("decrypted_c_name", StringType(), True)
])
decrypted_df = spark.createDataFrame(data, schema)
# Step 5: Write the decrypted data to a table
decrypted_df.write.format("parquet").mode("overwrite").saveAsTable("hive_metastore.default.decrypted_customers_cadp_char")
In your SQL Warehouse:
select * from hive_metastore.default.decrypted_customers_cadp_char limit 14;
Output
c_name decrypted_c_name
Customer#000412570 Customer#000412570
Customer#000412571 Customer#000412571
Customer#000412572 Customer#000412572
Customer#000412573 Customer#000412573
Customer#000412574 Customer#000412574
Customer#000412575 Customer#000412575
Customer#000412576 Customer#000412576
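The output confirms the round trip: each decrypted_c_name matches its original c_name. To perform the same check programmatically on the Compute Cluster instead of by eye, a minimal sketch against the table written above is:
# Round-trip check: count rows where decryption did not recover the original
mismatches = spark.sql("""
    SELECT COUNT(*) AS n
    FROM hive_metastore.default.decrypted_customers_cadp_char
    WHERE c_name <> decrypted_c_name
""").first()["n"]
if mismatches != 0:
    raise ValueError(f"{mismatches} rows failed the round-trip check")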
Use case for CRDP
In this use case, the sensitive data is loaded as part of an Extract, Load, Transform (ELT) flow and must be protected on the compute cluster before it reaches the SQL Warehouse.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName("EncryptedTable").getOrCreate()
# Read the clear-text names and collect them into a list
query_result = spark.sql("SELECT c_name FROM my_catalog.my_schema.customer")
c_names = [row.c_name for row in query_result.collect()]
# mode and datatype must match your CRDP configuration
# (the reveal example below uses mode = "revealbulk")
encrypted_names = thales_crdp_python_function_bulk(c_names, mode, datatype)
# Ensure c_names and encrypted_names have the same length
if len(c_names) != len(encrypted_names):
    raise ValueError("Mismatch between c_name count and encrypted_name count!")
# Create a DataFrame with both columns
data = list(zip(c_names, encrypted_names))
schema = StructType([
    StructField("c_name", StringType(), True),
    StructField("encrypted_c_name", StringType(), True)
])
encrypted_df = spark.createDataFrame(data, schema)
encrypted_df.show()
# Write the DataFrame to a Delta table in Databricks
encrypted_df.write.format("delta").mode("overwrite").saveAsTable("my_catalog.my_schema.encrypted_customers_crdp_char")
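As in the CADP case, you can preview the newly written table from the Compute Cluster before querying it in the warehouse:
# Preview the encrypted table written above
spark.sql("SELECT c_name, encrypted_c_name FROM my_catalog.my_schema.encrypted_customers_crdp_char LIMIT 5").show(truncate=False)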
Once the data is protected, you can unprotect it for the specific use cases that require clear text in the SQL Warehouse.
# Step 1: Fetch encrypted data from the table
encrypted_query_result = spark.sql("SELECT c_name, encrypted_c_name FROM my_catalog.my_schema.encrypted_customers_crdp_char")
# Step 2: Collect once and extract both columns, so the two lists stay aligned
rows = encrypted_query_result.collect()
c_names = [row.c_name for row in rows]
encrypted_names = [row.encrypted_c_name for row in rows]
# Step 3: Decrypt the data
mode = "revealbulk"  # Ensure the mode is set to reveal
decrypted_names = thales_crdp_python_function_bulk(encrypted_names, mode, datatype)
# Step 4: Create a DataFrame with decrypted values
data = list(zip(c_names, decrypted_names))  # Preserve original c_name
schema = StructType([
    StructField("c_name", StringType(), True),
    StructField("decrypted_c_name", StringType(), True)
])
decrypted_df = spark.createDataFrame(data, schema)
# Step 5: Write the decrypted data to a table
decrypted_df.write.format("parquet").mode("overwrite").saveAsTable("hive_metastore.default.decrypted_customers_crdp_char")
In your SQL Warehouse:
select * from hive_metastore.default.decrypted_customers_crdp_char limit 14;
Output
c_name decrypted_c_name
Customer#000412570 Customer#000412570
Customer#000412571 Customer#000412571
Customer#000412572 Customer#000412572
Customer#000412573 Customer#000412573
Customer#000412574 Customer#000412574
Customer#000412575 Customer#000412575
Customer#000412576 Customer#000412576
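As with CADP, the decrypted values match the originals. The same programmatic round-trip check applies, pointed at the CRDP table:
# Round-trip check for the CRDP table
mismatches = spark.sql("""
    SELECT COUNT(*) AS n
    FROM hive_metastore.default.decrypted_customers_crdp_char
    WHERE c_name <> decrypted_c_name
""").first()["n"]
if mismatches != 0:
    raise ValueError(f"{mismatches} rows failed the round-trip check")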