Architecture
This section describes the main components of Thales Data Discovery and Classification (DDC) and how they operate together to provide the DDC solution. Before you begin the deployment, review the diagram included in this section to get a feel for what a typical DDC deployment looks like. The concepts used in this diagram are introduced in the later sections of this document and explained at length in the Data Discovery and Classification Administration Guide.
At the heart of the DDC solution is CipherTrust Manager, on which the DDC Server runs. From here, users interact with the DDC GUI or use the DDC APIs to create classification profiles, add data stores, launch scans, and generate reports.
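To make this workflow concrete, the sketch below drives those steps over HTTP with Python. The endpoint paths, payload fields, and token handling shown are hypothetical placeholders, not the documented DDC API; refer to the DDC API documentation for the actual interface.

```python
# Illustrative sketch of driving DDC through a REST API. All endpoint
# paths and payload fields below are hypothetical placeholders, not
# the documented DDC API; consult the API reference for the real contract.
import requests

CM_HOST = "https://ciphertrust.example.com"  # your CipherTrust Manager
HEADERS = {"Authorization": "Bearer <token>"}  # token from the auth API

# 1. Create a classification profile (hypothetical endpoint/fields).
profile = requests.post(
    f"{CM_HOST}/api/v1/data-discovery/classification-profiles",
    json={"name": "pci-profile", "infotypes": ["credit-card", "email"]},
    headers=HEADERS, verify=True,
).json()

# 2. Register a data store to scan (hypothetical endpoint/fields).
store = requests.post(
    f"{CM_HOST}/api/v1/data-discovery/datastores",
    json={"name": "hr-share", "type": "nfs", "path": "/mnt/hr"},
    headers=HEADERS, verify=True,
).json()

# 3. Launch a scan that applies the profile to the data store.
requests.post(
    f"{CM_HOST}/api/v1/data-discovery/scans",
    json={"profile": profile["id"], "datastore": store["id"]},
    headers=HEADERS, verify=True,
)
```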
Where to Install the DDC Agents
DDC supports a number of different data stores. To access these data stores, the DDC Server communicates with one or more DDC Agents. The DDC Agent is a software component that scans a data store for Infotypes (such as credit card numbers, email addresses, and so on) that are part of a classification profile. All collected data is sent from the Agent to the DDC Server, which stores it, together with any user-requested reports, on an external Hadoop cluster.
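As a conceptual illustration only, the following Python sketch shows the kind of pattern matching an Infotype scan performs. The regular expressions are deliberately simplified and are not DDC's actual detection logic, which is considerably more sophisticated.

```python
# Conceptual sketch of what an Infotype scan does: match content
# against patterns that identify sensitive data. These regexes are
# simplified stand-ins, not DDC's real detection rules.
import re

INFOTYPES = {
    "credit-card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_text(text):
    """Return the set of Infotypes found in a chunk of text."""
    return {name for name, pattern in INFOTYPES.items()
            if pattern.search(text)}

print(scan_text("Contact jane.doe@example.com, card 4111 1111 1111 1111"))
# -> {'credit-card', 'email'} (set order may vary)
```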
Generally speaking, if you are scanning data stores that are local to a Windows or Linux server (no network shares), install the DDC Agent on the server where the data is located. For all other types of storage (the top part of the figure), install the DDC Agent on a proxy server.
Note
A Windows Proxy is needed to connect to databases.
As an example, suppose you wish to scan an NFS share. In this case, mount the NFS share on the proxy server and install the DDC Agent there. To scan the share, specify the mount point of the NFS share when creating the scan, as in the sketch below. For DDC Agent requirements and the types of data stores supported, see Agent Configurations. For information on securing the deployment, refer to Hardening the Deployment.
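A minimal sketch of this preparation follows, assuming an example export name and mount point; the mount itself is ordinary NFS administration, not a DDC-specific step.

```python
# Prepare an NFS share on the proxy server before creating the scan.
# Host names and paths are examples only. Run as root on the proxy
# server where the DDC Agent is installed.
import subprocess

EXPORT = "filer.example.com:/export/hr"  # NFS export to scan
MOUNT_POINT = "/mnt/hr"                  # path you give DDC as the scan target

# Equivalent to: mount -t nfs filer.example.com:/export/hr /mnt/hr
subprocess.run(["mount", "-t", "nfs", EXPORT, MOUNT_POINT], check=True)

# When creating the scan in the DDC GUI or API, specify MOUNT_POINT
# ("/mnt/hr") as the location to scan.
```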
How DDC Uses Hadoop
DDC uses Hadoop to generate reports from scans and to store their results (report data). Thales Data Platform (TDP) is the only Hadoop flavor currently supported for this purpose. This cluster is distinct from any Hadoop cluster that DDC supports as a data store, that is, a cluster where the user's own data resides.
DDC uses Spark and Livy to process the data and stores the results in HDFS. Tez is required in order to use Spark.
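For context, Spark work reaches the cluster through Livy's standard REST batch interface. The sketch below shows a generic submission of that kind; the host and job file are placeholders, and in a DDC deployment such calls are issued internally by DDC (and routed through Knox, described below), not by the user.

```python
# Generic illustration of submitting a Spark application through
# Livy's standard REST batch API (POST /batches). This is the public
# Livy contract, not a DDC-specific API; host and file are placeholders.
import requests

LIVY_URL = "http://livy.example.com:8998"  # Livy's default port

resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "hdfs:///jobs/report-job.py",  # Spark application kept in HDFS
        "name": "example-report-job",
    },
)
batch = resp.json()
print(batch["id"], batch["state"])  # e.g. 0 starting
```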
The DDC Server retrieves the results of the scan from the DDC Agent and stores this information in TDP, together with any reports that are generated. It is imperative that your TDP cluster be highly available, to avoid losing any data store scans or reports.
DDC also requires Apache Knox as a single point of access to the TDP cluster (both Livy and HDFS), so that all communications are protected with TLS and authenticated. As a result, you only need to connect DDC to Knox. For information on configuring DDC to use TDP, see Configuring CipherTrust Manager.
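To illustrate the single entry point, the sketch below reaches WebHDFS through a Knox gateway using Knox's standard URL scheme. The gateway host, topology name ("default"), credentials, and CA certificate path are assumptions; your Knox topology determines the actual URL.

```python
# Reach HDFS through Knox: the gateway proxies WebHDFS (and Livy)
# behind a single TLS endpoint and handles authentication. Host,
# topology, credentials, and CA path below are example assumptions.
import requests

KNOX = "https://knox.example.com:8443/gateway/default"

# List an HDFS directory over WebHDFS, proxied by Knox. The same
# gateway also fronts Livy, so DDC connects only to Knox.
resp = requests.get(
    f"{KNOX}/webhdfs/v1/tmp?op=LISTSTATUS",
    auth=("admin", "admin-password"),        # Knox-managed authentication
    verify="/etc/ssl/certs/knox-ca.pem",     # validate Knox's TLS certificate
)
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```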