5. Post Install
Updating and Exporting the Knox Server Certificate
When the Knox service is installed by Ambari, a self-signed certificate is created internally. However, this certificate uses SHA1 and a key size of 1024 bits. The update_knox_cert.sh script below creates a new certificate using SHA256 and a key size of 2048 bits, and replaces the existing one with it.
Run this script in an SSH session on the TDP instance:
/root/setup/update_knox_cert.sh
When this script is run, it will prompt for the Knox master secret.
This secret is the one that you set in the step "7. Hadoop Services Installation Through Ambari > Customize Services" (CREDENTIALS > Knox Master Secret).
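The exact contents of update_knox_cert.sh are specific to the TDP image, but for reference only, a script of this kind typically regenerates the Knox gateway identity with keytool along the following lines. The keystore path and alias shown here are the usual HDP defaults and are assumptions, not taken from the script:
# Illustrative sketch only -- not the actual contents of update_knox_cert.sh.
KEYSTORE=/usr/hdp/current/knox-server/data/security/keystores/gateway.jks
read -s -p "Knox master secret: " MASTER; echo
# Remove the old 1024-bit/SHA1 identity, then generate a 2048-bit/SHA256 replacement.
keytool -delete -alias gateway-identity -keystore "$KEYSTORE" -storepass "$MASTER"
keytool -genkeypair -alias gateway-identity -keyalg RSA -keysize 2048 -sigalg SHA256withRSA \
  -dname "CN=$(hostname -f)" -validity 365 -keystore "$KEYSTORE" -storepass "$MASTER" -keypass "$MASTER"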
Finally, you need to export the SSL certificate of the Knox server and configure DDC to talk to Hadoop. You need to obtain the certificate from the host where Knox is installed.
Put the Knox server certificate in a file, by using this command:
echo -n | openssl s_client -connect <IP_OF_KNOX_SERVER>:8443 | \
  sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/hadoop.cert
Copy the certificate to the system where you will connect to the CipherTrust/DDC GUI.
Issue this command to display the certificate in the terminal window:
cat /tmp/hadoop.cert
Now, you can copy the certificate from the terminal window and paste it in its own file on your machine.
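Optionally, before uploading the file, you can verify that the exported certificate uses SHA256 and a 2048-bit key:
# Check the signature algorithm and key size of the exported certificate.
openssl x509 -in /tmp/hadoop.cert -noout -text | grep -E 'Signature Algorithm|Public-Key'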
Tip
You will need to upload this file when configuring Data Discovery and Classification to talk to Thales Data Platform, so be sure to save it locally. Refer to the Configuring TDP section in the Data Discovery and Classification documentation to learn more.
You can find the comprehensive procedure on the Hortonworks documentation pages.
Ambari additional configuration
To execute the instructions below, you will need to access Ambari. Refer to Accessing Ambari for further information.
Enabling Namenode HA
Note
These steps are optional for a multiple-node cluster, but they are recommended.
In the Ambari toolbar on the left, expand Services, then click HDFS.
Go to the ACTIONS menu and click Enable NameNode HA.
Follow each step in the 'Enable NameNode HA Wizard'.
In the Get Started step, input the Nameservice ID.
In Select Hosts, keep all the defaults and click NEXT.
In the Review step, click NEXT.
In Create Checkpoint, perform each step as listed. NEXT is only clickable after you complete all the steps.
In Configure Components, wait for the deployment of each component, then click NEXT.
In Initialize JournalNodes, perform each subsequent step. NEXT is only clickable after you complete all the steps.
In Start Components, wait for each process to complete. Then click NEXT.
In Initialize Metadata, perform each step, then click NEXT. Then click OK on the pop-up confirmation window that follows.
In Finalize HA Setup, wait for each process to complete, then click DONE.
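Once the wizard finishes, you can optionally confirm that one NameNode is active and the other is in standby. A minimal check, run as the hdfs user on the TDP instance:
su - hdfs -c 'hdfs haadmin -getAllServiceState'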
Livy configuration
Update Livy configuration for Spark2
In the Ambari toolbar on the left, expand Services, then click Spark2.
Select the CONFIGS tab, then below it click ADVANCED.
Expand the Custom spark2-defaults section, then click the Add Property... link. In the Add Property popup, click the "multiple tags" icon to enable the Bulk property add mode. Then enter the following text, replacing <zookeeper-node-hostname> with the ZooKeeper node hostname (or IP):
spark.yarn.appMasterEnv.ZK_URL_DDC = <zookeeper-node-hostname>:2181
Note
You can point to more than one Zookeeper node to ensure high availability on the DDC connection. To do that, use a semicolon (;) to separate the nodes. For example:
spark.yarn.appMasterEnv.ZK_URL_DDC = <zookeeper-node1-hostname>:2181;<zookeeper-node2-hostname>:2181
Expand the Custom livy2-conf section, then click Add Property.... In the Add Property popup, click the "multiple tags" icon to enable the Bulk property add mode. Enter the following text.
livy.server.session.state-retain.sec = 24h
Expand the Advanced livy2-conf section. Update the following entry:
- livy.server.csrf_protection.enabled: false
Click SAVE and then restart Spark2.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
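Optionally, you can confirm that each ZooKeeper node referenced in ZK_URL_DDC is reachable. A quick check using nc (netcat) and the ZooKeeper "ruok" four-letter command, which should answer "imok" (this assumes four-letter-word commands are enabled on the ZooKeeper servers):
echo ruok | nc <zookeeper-node-hostname> 2181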
Update Livy configuration for Knox
In the Ambari toolbar on the left, expand Services, then click Knox.
Select the CONFIGS tab.
Expand the Advanced topology section.
For the Spark/Livy configuration, add this entry one line before </topology>.
For a single node Spark2 Server:
<service>
  <role>LIVYSERVER</role>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER>:8999</url>
</service>
For multiple node Spark2 Servers:
<service>
  <role>LIVYSERVER</role>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER1>:8999</url>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER2>:8999</url>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER3>:8999</url>
  ...
</service>
Tip
This topology will enable you to configure, in Data Discovery and Classification's Hadoop Services, HDFS with /gateway/default/webhdfs/v1 as the URI and LIVY with /gateway/default/livy/v1 as the URI. Refer to the Configuring TDP section in the Data Discovery and Classification documentation to learn more.
Click SAVE, then restart Knox.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
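After Knox restarts, you can optionally verify the new Livy route through the gateway. A minimal check with curl, assuming the demo LDAP user guest/guest-password (substitute credentials from your own identity provider):
curl -k -u guest:guest-password "https://<IP_OF_KNOX_SERVER>:8443/gateway/default/livy/v1/sessions"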
Update HDFS configuration for Knox
If your cluster has NameNode HA enabled, it is recommended that you configure Knox accordingly to achieve higher availability.
In the Ambari toolbar on the left, expand Services, then click Knox.
Select the CONFIGS tab.
Expand the Advanced topology section and add the following sections:
<provider>
  <role>ha</role>
  <name>HaProvider</name>
  <enabled>true</enabled>
  <param>
    <name>WEBHDFS</name>
    <value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
  </param>
</provider>
Add an entry for the Phoenix Query Server.
If the cluster has one PQS installed:
<service>
  <role>AVATICA</role>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER>:8765</url>
</service>
If the cluster has multiple PQS installed, you have to list all PQS nodes:
<service>
  <role>AVATICA</role>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER1>:8765</url>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER2>:8765</url>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER3>:8765</url>
</service>
Click SAVE then restart Knox.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
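Optionally, you can verify that WebHDFS is reachable through Knox after the HA provider change. A minimal check with curl, again assuming the demo LDAP user guest/guest-password:
curl -k -u guest:guest-password "https://<IP_OF_KNOX_SERVER>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"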
HBase configuration
Update Zookeeper configuration for HBase
In the Ambari toolbar on the left, expand Services, then click HBase.
Select the CONFIGS tab, then below it click ADVANCED.
Expand the Advanced hbase-site section. Update the following entry:
- ZooKeeper Znode Parent: /hbase
Note
This property will be added to the hbase-site.xml configuration file.
Click SAVE then restart all affected components.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
Update PQS configuration for HBase
Apache Phoenix requires additional modification. You need to enable namespace mapping and map the system tables to the namespace. For this procedure we are using the Ambari UI.
In the Ambari toolbar on the left, expand Services, then click HBase.
Expand the Custom hbase-site section. Click the Add Property... link. In the Add Property popup, click the "multiple tags" icon to enable the Bulk property add mode. Then enter the following text:
phoenix.schema.isNamespaceMappingEnabled=true
phoenix.schema.mapSystemTablesToNamespace=true
Click SAVE then restart all affected components.
Note
After you click SAVE you may see a few warnings, one of which pertains to the HBase configuration:
Values greater than 6.4GB are not recommended. Maximum amount of memory each HBase RegionServer can use.
You can disregard this warning and just click PROCEED ANYWAY to continue.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
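Optionally, after the restart you can confirm that the namespace mapping properties reached the deployed HBase configuration. A minimal check on an HBase node, assuming the standard /etc/hbase/conf location:
grep -A 1 'phoenix.schema' /etc/hbase/conf/hbase-site.xml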
Updating HBase Site Configuration in Spark
To update the HBase site configuration, copy the hbase-site.xml file to Spark.
Run the following command on the HBase Master node:
cp /etc/hbase/3.1.6-316/0/hbase-site.xml /etc/spark2/3.1.6-316/0/
Also run the following command on all nodes where the HBase Master server and RegionServers are installed:
cp /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/
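If you prefer not to log in to each node separately, a minimal sketch of the same copy run over SSH is shown below. The host names are placeholders for your actual HBase Master and RegionServer nodes, and passwordless SSH as root is assumed:
# Run the copy on every HBase node in one pass (placeholder host names).
for host in <HBASE_MASTER_HOSTNAME> <REGIONSERVER1_HOSTNAME> <REGIONSERVER2_HOSTNAME>; do
  ssh root@"$host" 'cp /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/'
done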
Knox Logging
Changing Knox Log Level
Warning
This is a highly recommended step.
Knox uses Log4j to keep track of its log messages. The default log level, which is INFO, may quickly cause the log file to use up the available space and DDC scans to fail.
Because of this, you must change the log level to ERROR by editing the log configuration file. You can do this through Ambari by following these steps:
In the Ambari toolbar on the left, click to expand Services, then click Knox.
Select the CONFIGS tab and scroll down to "Advanced gateway-log4j".
Open the "Advanced gateway-log4j" section and modify these parameters:
Change the log4j.logger.audit parameter to:
log4j.logger.audit=ERROR, auditfile
Add these two additional parameters at the end:
log4j.appender.auditfile.MaxFileSize=10MB
log4j.appender.auditfile.MaxBackupIndex=10
Click SAVE and then restart Knox.
Purging the Knox Log Directory
If after applying these changes your Knox log grows too quickly, you may have to purge the log directory. First, however, you need to check if this is the case.
Check if Ambari is displaying a "NameNode Directory Status" error. This error indicates a failed directory (that is, one or more directories are reporting as not healthy).
Check Ambari for a "Failed directory count" message to find out which directories are reporting problems. If the error message shows "Failed directory count: 1", the failed directory may be the logs directory.
In the terminal, check the free disk space for /var/log/ by issuing the command:
df -h
If the output of the command shows no free disk space for the /var/log/ directory, remove all the log files.
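A minimal sketch of checking and purging the Knox logs, assuming the default HDP log location /var/log/knox (adjust the path if your installation differs):
df -h /var/log
du -sh /var/log/knox
# Remove only the rotated log files; leave the files currently in use by Knox.
rm -f /var/log/knox/gateway*.log.*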
Start LDAP demo server
This step is applicable for demo purposes only.
Warning
It is your responsibility as the system owner to properly configure the authentication to secure the production environment. For demo purposes you can opt for the built-in LDAP server but for everything else you should configure your company's authentication/federation provider.
To set up the embedded LDAP for demo usage, follow these steps:
In the Ambari toolbar on the left, expand Services, then click Knox.
Go to the ACTIONS menu and click Start Demo LDAP. Without starting the demo LDAP you cannot use the Knox logins.
Spark Tuning
Warning
This is a highly recommended step.
Spark can be configured by adjusting some properties via Ambari. The official Spark documentation describes many properties covering every aspect of Spark's behavior, but this section covers only some of the most important ones.
Property Name | Default value | Purpose |
---|---|---|
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. |
spark.driver.memory | 1 GB | Amount of memory to use for the driver process. |
spark.executor.memory | 1 GB | Amount of memory to use per executor process, in MiB unless otherwise specified. |
spark.executor.cores | 1 in YARN mode | The number of cores to use on each executor. For standalone and Mesos coarse-grained modes, see the Spark documentation. |
spark.task.cpus | 1 | Number of cores to allocate for each task. |
spark.executor.instances | 2 | The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large. |
Caution
The following instructions are recommended if you have at least 8 CPUs / 32 GB RAM per cluster node.
To increase the resources dedicated to the Spark jobs, you will need to access Ambari. Refer to Accessing Ambari for further information.
In the Ambari toolbar on the left, expand Services, then click Spark2.
Select the CONFIGS tab, then below it click ADVANCED.
Expand Custom spark2-defaults and then click Add Property....
Add the following properties in the text box:
spark.driver.cores=3
spark.driver.memory=3g
spark.executor.cores=3
spark.executor.memory=3g
spark.executor.instances=3
Click SAVE then restart all affected components.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
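As a rough sizing illustration (not a prescription): with these values, each Spark application requests three 3-core / 3 GB executors plus a 3-core / 3 GB driver, that is, roughly 12 cores and 12 GB of memory across the cluster before YARN and Spark overhead, which is why at least 8 CPUs / 32 GB RAM per node is recommended.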
Preparing HDFS for DDC
Warning
This procedure is only needed for a new TDP cluster and a new CM. You should omit it for any existing working environment (TDP + CM).
Creating HDFS User for DDC
By default, HDFS has permission checking enabled. Follow the procedure below to properly configure HDFS and have it ready to be used by DDC. For this example, and throughout the following sections of this document, we will use the user bob and the group hdfs.
Tip
You can see if this permission checking is enabled by inspecting the dfs.permissions.enabled property in the hdfs-site.xml file.
SSH to the TDP instance and log in as root.
Create the user in the Linux OS where HDFS is installed, and assign it to the desired group:
useradd -g hdfs bob
passwd bob
Next, you need to configure the new user in HDFS.
Switch to the hdfs OS user:
su - hdfs
Run the following commands to set up the user:
hdfs dfs -mkdir /user/bob
hdfs dfs -chown -R bob:hdfs /user/bob
hdfs dfs -ls /user
The result for the last command should include:
drwxr-xr-x - bob hdfs 0 2021-02-28 03:46 /user/bob
Tip
You can only use the following characters in the name of the HDFS folder: A to Z, a to z, 0 to 9, '_' and '-'.
Still using the hdfs user, the last step is to refresh the NameNode groups:
hdfs dfsadmin -refreshUserToGroupsMappings
From now on, DDC can use the bob user to run jobs that read and write files in the /user/bob directory.
Optionally, you can assign a quota for the new user:
hdfs dfsadmin -setSpaceQuota 30g /user/bob
Return to the root prompt:
exit
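Optionally, you can verify the new user's group mapping and quota before moving on. A minimal check, run from the root prompt:
su - hdfs -c 'hdfs groups bob'
su - hdfs -c 'hdfs dfs -count -q -h /user/bob'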
Lastly, you need to configure the new user in LDAP.
Warning
It is your responsibility as the system owner to properly configure the authentication to secure the production environment. For demo purposes you can opt for the built-in LDAP server but for everything else you should configure your company's authentication/federation provider.
To set up the user bob in the embedded LDAP for demo usage, follow these instructions. To execute the instructions below, you will need to access Ambari. Refer to Accessing Ambari for further information.
In the Ambari toolbar on the left, expand Services, then click Knox.
Select the CONFIGS tab and expand the Advanced users-ldif section.
Add this text at the bottom of the text box, replacing <user-password> with the desired user password:
dn: uid=bob,ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: bob
sn: bob
uid: bob
userPassword:<user-password>
Click SAVE.
Click the ACTIONS button and then Stop Demo LDAP to stop it.
Click the ACTIONS button and then Start Demo LDAP to start it with the new configuration.
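Optionally, you can verify that the new user authenticates through Knox once the demo LDAP is running again. A minimal check with curl, replacing <user-password> with the password you set above:
curl -k -u bob:<user-password> "https://<IP_OF_KNOX_SERVER>:8443/gateway/default/webhdfs/v1/user/bob?op=LISTSTATUS"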
Note
This HDFS user must be configured on the Knox connection, inside the DDC Connection Manager (in the CM Connections Management page, select Access Management > Connections Management).
Creating an HDFS Directory for DDC
For DDC to utilize the Hadoop Distributed File System (HDFS), you need to create a directory under HDFS. DDC will use this space for storing scan results and reports. You can create it through the command line. You need to do it only on the primary node. This directory can have any name, but for this example, and throughout the following sections of this document, we will use /ciphertrust_ddc.
SSH to the TDP instance and log in as root.
Switch to the Linux user hdfs, who has permissions to create and destroy folders:
su - hdfs
You need to create the /ciphertrust_ddc directory in HDFS by issuing this command:
hdfs dfs -mkdir /ciphertrust_ddc
Because the default permissions will not allow the scans to write to the Hadoop database, you have to change the access permissions. By issuing this command you will grant full access rights for the DDC user configured in Creating HDFS User, that is, bob:
hdfs dfs -chown -R bob:hdfs /ciphertrust_ddc
By running hdfs dfs -ls /, the result should include:
drwxr-xr-x - bob hdfs 0 2021-08-12 03:17 /ciphertrust_ddc
Return to the root prompt:
exit
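Optionally, you can confirm that the DDC user can write to the new directory. A minimal check, run from the root prompt and assuming the bob user created earlier:
su - bob -c 'hdfs dfs -touchz /ciphertrust_ddc/ddc_write_test && hdfs dfs -rm /ciphertrust_ddc/ddc_write_test'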
Tip
These steps will enable you to configure HDFS in Data Discovery and Classification's Hadoop Services with /ciphertrust_ddc as the Folder. Refer to the Configuring TDP section in the Data Discovery and Classification documentation to learn more.
Administration tasks
HDFS Administration
Once you have configured HDFS using the instructions above, you can use the user interface to browse through HDFS sub-directories. Refer to Browsing HDFS via User Interface.
Optionally, you can configure HDFS to use an ACL. Refer to HDFS ACLs.
Configure LDAP Authentication
In production environments, it is mandatory to delegate authentication to an external LDAP server. Refer to Configuring LDAP Authentication for instructions on how to configure Thales Data Platform with an LDAP server.
Your TDP is now ready to be used.