5. Post Install
Updating and Exporting the Knox Server Certificate
When the Knox service is installed by Ambari, a self-signed certificate is created internally. However, this certificate uses SHA1 and a key size of 1024 bits. The update_knox_cert.sh script below creates a new certificate using SHA256 and a key size of 2048 bits, and replaces the existing one with it.
Run this script in an SSH session on the TDP instance:
/root/setup/update_knox_cert.sh
When this script is run, it will prompt for the Knox master secret.
This secret is the one that you set in the step "7. Hadoop Services Installation Through Ambari > Customize Services" (CREDENTIALS > Knox Master Secret).
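The exact contents of update_knox_cert.sh are specific to the TDP image, but for reference only, a script of this kind typically regenerates the Knox gateway identity with keytool along the following lines. The keystore path and alias shown here are the usual HDP defaults and are assumptions, not taken from the script:
# Illustrative sketch only -- not the actual contents of update_knox_cert.sh.
KEYSTORE=/usr/hdp/current/knox-server/data/security/keystores/gateway.jks
read -s -p "Knox master secret: " MASTER; echo
# Remove the old 1024-bit/SHA1 identity, then generate a 2048-bit/SHA256 replacement.
keytool -delete -alias gateway-identity -keystore "$KEYSTORE" -storepass "$MASTER"
keytool -genkeypair -alias gateway-identity -keyalg RSA -keysize 2048 -sigalg SHA256withRSA \
  -dname "CN=$(hostname -f)" -validity 365 -keystore "$KEYSTORE" -storepass "$MASTER" -keypass "$MASTER"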
Finally, you need to export the SSL certificate of the Knox server and configure DDC to talk to Hadoop. You need to obtain the certificate from the host where Knox is installed.
Put the Knox server certificate in a file, by using this command:
echo -n | openssl s_client -connect <IP_OF_KNOX_SERVER>:8443 | \
  sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/hadoop.cert
Copy the certificate to the system where you will connect to the CipherTrust/DDC GUI.
Issue this command to display the certificate in the terminal window:
cat /tmp/hadoop.cert
Now, you can copy the certificate from the terminal window and paste it in its own file on your machine.
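Optionally, before uploading the file, you can verify that the exported certificate uses SHA256 and a 2048-bit key:
# Check the signature algorithm and key size of the exported certificate.
openssl x509 -in /tmp/hadoop.cert -noout -text | grep -E 'Signature Algorithm|Public-Key'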
Tip
You will need to upload this file when configuring Data Discovery and Classification to talk to Thales Data Platform, so be sure to save it locally. Refer to the Configuring TDP section in the Data Discovery and Classification documentation to learn more.
You can find the comprehensive procedure on the Hortonworks documentation pages.
Ambari additional configuration
To execute the instructions below, you will need to access Ambari. Refer to Accessing Ambari for further information.
Enabling Namenode HA
Note
These steps are optional for a multiple-node cluster, but they are recommended.
In the Ambari toolbar on the left, expand Services, then click HDFS.
Go to the ACTIONS menu and click Enable NameNode HA.
Follow each step in the 'Enable NameNode HA Wizard'.
In the Get Started step, input the Nameservice ID.
In Select Hosts, keep all the defaults and click NEXT.
In the Review step, click NEXT.
In Create Checkpoint, perform each step as listed. NEXT is only clickable after you complete all the steps.
In Configure Components, wait for the deployment of each component, then click NEXT.
In Initialize JournalNodes, perform each subsequent step. NEXT is only clickable after you complete all the steps.
In Start Components, wait for each process to complete. Then click NEXT.
In Initialize Metadata, perform each step, then click NEXT. Then click OK on the pop-up confirmation window that follows.
In Finalize HA Setup, wait for each process to complete, then click DONE.
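Once the wizard finishes, you can optionally confirm that one NameNode is active and the other is in standby. A minimal check, run as the hdfs user on the TDP instance:
su - hdfs -c 'hdfs haadmin -getAllServiceState'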
Livy configuration
Update Livy configuration for Spark2
In the Ambari toolbar on the left, expand Services, then click Spark2.
Select the CONFIGS tab, then below it click ADVANCED.
Expand the Custom spark2-defaults section, then click the Add Property... link. In the Add Property popup, click the "multiple tags" icon to enable the Bulk property add mode. Then enter the following text, replacing <zookeeper-node-hostname> with the ZooKeeper node hostname (or IP):
spark.yarn.appMasterEnv.ZK_URL_DDC = <zookeeper-node-hostname>:2181
Note
You can point to more than one Zookeeper node to ensure high availability on the DDC connection. To do that, use a semicolon (;) to separate the nodes. For example:
spark.yarn.appMasterEnv.ZK_URL_DDC = <zookeeper-node1-hostname>:2181;<zookeeper-node2-hostname>:2181
Expand the Custom livy2-conf section, then click Add Property.... In the Add Property popup, click the "multiple tags" icon to enable the Bulk property add mode. Enter the following text.
livy.server.session.state-retain.sec = 24h
Expand the Advanced livy2-conf section. Update the following entry:
- livy.server.csrf_protection.enabled: false
Click SAVE and then restart Spark2.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
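Optionally, you can confirm that each ZooKeeper node referenced in ZK_URL_DDC is reachable. A quick check using nc (netcat) and the ZooKeeper "ruok" four-letter command, which should answer "imok" (this assumes four-letter-word commands are enabled on the ZooKeeper servers):
echo ruok | nc <zookeeper-node-hostname> 2181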
Update Livy configuration for Knox
In the Ambari toolbar on the left, expand Services, then click Knox.
Select the CONFIGS tab.
Expand the Advanced topology section.
For the Spark/Livy configuration, add this entry one line before </topology>.
For a single node Spark2 Server:
<service>
  <role>LIVYSERVER</role>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER>:8999</url>
</service>
For multiple node Spark2 Servers:
<service>
  <role>LIVYSERVER</role>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER1>:8999</url>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER2>:8999</url>
  <url>http://<HOSTNAME_OF_SPARK2_SERVER3>:8999</url>
  ...
</service>
Tip
This topology will enable you to configure, in Data Discovery and Classification's Hadoop Services, HDFS with /gateway/default/webhdfs/v1 as the URI and LIVY with /gateway/default/livy/v1 as the URI. Refer to the Configuring TDP section in the Data Discovery and Classification documentation to learn more.
Click SAVE, then restart Knox.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
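After Knox restarts, you can optionally verify the new Livy route through the gateway. A minimal check with curl, assuming the demo LDAP user guest/guest-password (substitute credentials from your own identity provider):
curl -k -u guest:guest-password "https://<IP_OF_KNOX_SERVER>:8443/gateway/default/livy/v1/sessions"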
Update HDFS configuration for Knox
If your cluster has NameNode HA enabled, it is recommended that you configure Knox accordingly to achieve higher availability.
In the Ambari toolbar on the left, expand Services, then click Knox.
Select the CONFIGS tab.
Expand the Advanced topology section and add the following sections:
<provider>
  <role>ha</role>
  <name>HaProvider</name>
  <enabled>true</enabled>
  <param>
    <name>WEBHDFS</name>
    <value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
  </param>
</provider>
Add an entry for the Phoenix Query Server.
If the cluster has one PQS installed:
<service>
  <role>AVATICA</role>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER>:8765</url>
</service>
If the cluster has multiple PQS installed, you have to list all PQS nodes:
<service>
  <role>AVATICA</role>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER1>:8765</url>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER2>:8765</url>
  <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER3>:8765</url>
</service>
Click SAVE then restart Knox.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
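Optionally, you can verify that WebHDFS is reachable through Knox after the HA provider change. A minimal check with curl, again assuming the demo LDAP user guest/guest-password:
curl -k -u guest:guest-password "https://<IP_OF_KNOX_SERVER>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"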
HBase configuration
Update Zookeeper configuration for HBase
In the Ambari toolbar on the left, expand Services, then click HBase.
Select the CONFIGS tab, then below it click ADVANCED.
Expand the Advanced hbase-site section. Update the following entry:
- ZooKeeper Znode Parent: /hbase
Note
This property will be added to the hbase-site.xml configuration file.
Click SAVE then restart all affected components.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
Update PQS configuration for HBase
Apache Phoenix requires additional modification. You need to enable namespace mapping and map the system tables to the namespace. For this procedure we are using the Ambari UI.
In the Ambari toolbar on the left, expand Services, then click HBase.
Expand the Custom hbase-site section. Click the Add Property... link. In the Add Property popup, click the "multiple tags" icon to enable the Bulk property add mode. Then enter the following text:
phoenix.schema.isNamespaceMappingEnabled=true
phoenix.schema.mapSystemTablesToNamespace=true
Click SAVE then restart all affected components.
Note
After you click SAVE you may see a few warnings, one of which pertains to the HBase configuration:
Values greater than 6.4GB are not recommended. Maximum amount of memory each HBase RegionServer can use.
You can disregard this warning and just click PROCEED ANYWAY to continue.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
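Optionally, after the restart you can confirm that the namespace mapping properties reached the deployed HBase configuration. A minimal check on an HBase node, assuming the standard /etc/hbase/conf location:
grep -A 1 'phoenix.schema' /etc/hbase/conf/hbase-site.xml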
Updating HBase Site Configuration in Spark
To update the HBase site configuration, copy the hbase-site.xml file to Spark.
Run the following command on the HBase Master node:
cp /etc/hbase/3.1.6-316/0/hbase-site.xml /etc/spark2/3.1.6-316/0/
Also run the following command on all nodes where the HBase Master server and RegionServers are installed:
cp /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/
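If you prefer not to log in to each node separately, a minimal sketch of the same copy run over SSH is shown below. The host names are placeholders for your actual HBase Master and RegionServer nodes, and passwordless SSH as root is assumed:
# Run the copy on every HBase node in one pass (placeholder host names).
for host in <HBASE_MASTER_HOSTNAME> <REGIONSERVER1_HOSTNAME> <REGIONSERVER2_HOSTNAME>; do
  ssh root@"$host" 'cp /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/'
done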
Knox Logging
Changing Knox Log Level
Warning
This is a highly recommended step.
Knox uses Log4j to keep track of its log messages. The default log level, which is INFO, may quickly cause the log file to use up the available space and DDC scans to fail.
Because of this, you must change the log level to ERROR by editing the log configuration file. You can do this through Ambari by following these steps:
In the Ambari toolbar on the left, click to expand Services, then click Knox.
Select the CONFIGS tab and scroll down to "Advanced gateway-log4j".
Open the "Advanced gateway-log4j" section and modify these parameters:
Change the log4j.logger.audit parameter to:
log4j.logger.audit=ERROR, auditfile
Add these two additional parameters at the end:
log4j.appender.auditfile.MaxFileSize=10MB
log4j.appender.auditfile.MaxBackupIndex=10
Click SAVE and then restart Knox.
Purging the Knox Log Directory
If after applying these changes your Knox log grows too quickly, you may have to purge the log directory. First, however, you need to check if this is the case.
Check if Ambari is displaying a "NameNode Directory Status" error. This error indicates a failed directory (that is, one or more directories are reporting as not healthy).
Check Ambari for a "Failed directory count" message to find out which directories are reporting problems. If the error message shows "Failed directory count: 1", the failed directory may be the logs directory.
In the terminal, check the free disk space for /var/log/ by issuing the command:
df -h
If the output of the command shows no free disk space for the /var/log/ directory, remove all the log files.
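A minimal sketch of checking and purging the Knox logs, assuming the default HDP log location /var/log/knox (adjust the path if your installation differs):
df -h /var/log
du -sh /var/log/knox
# Remove only the rotated log files; leave the files currently in use by Knox.
rm -f /var/log/knox/gateway*.log.*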
Start LDAP demo server
This step is applicable for demo purposes only.
Warning
It is your responsibility as the system owner to properly configure the authentication to secure the production environment. For demo purposes you can opt for the built-in LDAP server but for everything else you should configure your company's authentication/federation provider.
To set up the embedded LDAP for demo usage, follow these steps:
In the Ambari toolbar on the left, expand Services, then click Knox.
Go to the ACTIONS menu and click Start Demo LDAP. Without starting the demo LDAP you cannot use the Knox logins.
Spark Tuning
Warning
This is a highly recommended step.
Spark can be configured by adjusting some properties via Ambari. The official Spark documentation describes many properties covering every aspect of Spark's behavior, but this section covers only some of the most important ones.
Property Name | Default value | Purpose |
---|---|---|
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. |
spark.driver.memory | 1 GB | Amount of memory to use for the driver process. |
spark.executor.memory | 1 GB | Amount of memory to use per executor process, in MiB unless otherwise specified. |
spark.executor.cores | 1 in YARN mode | The number of cores to use on each executor. For standalone and Mesos coarse-grained modes, see the Spark documentation. |
spark.task.cpus | 1 | Number of cores to allocate for each task. |
spark.executor.instances | 2 | The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large. |
Caution
The following instructions are recommended if you have at least 8 CPUs / 32 GB RAM per cluster node.
To increase the resources dedicated to the Spark jobs, you will need to access Ambari. Refer to Accessing Ambari for further information.
In the Ambari toolbar on the left, expand Services, then click Spark2.
Select the CONFIGS tab, then below it click ADVANCED.
Expand Custom spark2-defaults and then click Add Property....
Add the following properties in the text box:
spark.driver.cores=3
spark.driver.memory=3g
spark.executor.cores=3
spark.executor.memory=3g
spark.executor.instances=3
Click SAVE then restart all affected components.
A message at the top of the screen will tell you that a restart is required, along with an orange RESTART button. Click that button and select Restart All Affected.
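As a rough sizing illustration (not a prescription): with these values, each Spark application requests three 3-core / 3 GB executors plus a 3-core / 3 GB driver, that is, roughly 12 cores and 12 GB of memory across the cluster before YARN and Spark overhead, which is why at least 8 CPUs / 32 GB RAM per node is recommended.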
Preparing HDFS for DDC
Warning
This procedure is only needed for a new TDP cluster and a new CM. You should omit it for any existing working environment (TDP + CM).
Creating HDFS User for DDC
By default, HDFS has permission checking enabled. Follow the procedure below to properly configure HDFS and have it ready to be used by DDC. For this example, and throughout the following sections of this document, we will use the user bob and the group hdfs.
Tip
You can see if this permission checking is enabled by inspecting the dfs.permissions.enabled property in the hdfs-site.xml file.
SSH to the TDP instance and log in as root.
Create the user in the Linux OS where HDFS is installed, and assign it to the desired group:
useradd -g hdfs bob
passwd bob
Next, you need to configure the new user in HDFS.
Switch to the hdfs OS user:
su - hdfs
Run the following commands to set up the user:
hdfs dfs -mkdir /user/bob
hdfs dfs -chown -R bob:hdfs /user/bob
hdfs dfs -ls /user
The result for the last command should include:
drwxr-xr-x - bob hdfs 0 2021-02-28 03:46 /user/bob
Tip
You can only use the following characters in the name of the HDFS folder: A to Z, a to z, 0 to 9, '_' and '-'.
Still using the hdfs user, the last step is to refresh the NameNode groups:
hdfs dfsadmin -refreshUserToGroupsMappings
From now on, DDC can use the bob user to run jobs that read and write files in the /user/bob directory.
Optionally, you can assign a quota for the new user:
hdfs dfsadmin -setSpaceQuota 30g /user/bob
Return to the root prompt:
exit
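Optionally, you can verify the new user's group mapping and quota before moving on. A minimal check, run from the root prompt:
su - hdfs -c 'hdfs groups bob'
su - hdfs -c 'hdfs dfs -count -q -h /user/bob'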
Lastly, you need to configure the new user in LDAP.
Warning
It is your responsibility as the system owner to properly configure the authentication to secure the production environment. For demo purposes you can opt for the built-in LDAP server but for everything else you should configure your company's authentication/federation provider.
To set up the user bob in the embedded LDAP for demo usage, follow these instructions. To execute the instructions below, you will need to access Ambari. Refer to Accessing Ambari for further information.
In the Ambari toolbar on the left, expand Services, then click Knox.
Select the CONFIGS tab and expand the Advanced users-ldif section.
Add this text at the bottom of the text box, replacing <user-password> with the desired user password:
dn: uid=bob,ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: bob
sn: bob
uid: bob
userPassword:<user-password>
Click SAVE.
Click the ACTIONS button and then Stop Demo LDAP to stop it.
Click the ACTIONS button and then Start Demo LDAP to start it with the new configuration.
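Optionally, you can verify that the new user authenticates through Knox once the demo LDAP is running again. A minimal check with curl, replacing <user-password> with the password you set above:
curl -k -u bob:<user-password> "https://<IP_OF_KNOX_SERVER>:8443/gateway/default/webhdfs/v1/user/bob?op=LISTSTATUS"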
Note
This HDFS user must be configured on the Knox connection, inside the DDC Connection Manager (in the CM Connections Management page, select Access Management > Connections Management).
Creating an HDFS Directory for DDC
For DDC to utilize the Hadoop Distributed File System (HDFS), you need to create a directory under HDFS. DDC will use this space for storing scan results and reports. You can create it through the command line. You need to do it only on the primary node. This directory can have any name, but for this example, and throughout the following sections of this document, we will use /ciphertrust_ddc.
SSH to the TDP instance and log in as root.
Switch to the Linux user hdfs, who has permissions to create and destroy folders:
su - hdfs
You need to create the /ciphertrust_ddc directory in HDFS by issuing this command:
hdfs dfs -mkdir /ciphertrust_ddc
Because the default permissions will not allow the scans to write to the Hadoop database, you have to change the access permissions. By issuing this command you will grant full access rights for the DDC user configured in Creating HDFS User, that is, bob:
hdfs dfs -chown -R bob:hdfs /ciphertrust_ddc
By running hdfs dfs -ls /, the result should include:
drwxr-xr-x - bob hdfs 0 2021-08-12 03:17 /ciphertrust_ddc
Return to the root prompt:
exit
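Optionally, you can confirm that the DDC user can write to the new directory. A minimal check, run from the root prompt and assuming the bob user created earlier:
su - bob -c 'hdfs dfs -touchz /ciphertrust_ddc/ddc_write_test && hdfs dfs -rm /ciphertrust_ddc/ddc_write_test'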
Tip
These steps will enable you to configure HDFS in Data Discovery and Classification's Hadoop Services with /ciphertrust_ddc as the Folder. Refer to the Configuring TDP section in the Data Discovery and Classification documentation to learn more.
Administration tasks
HDFS Administration
Once you have configured HDFS using the instructions above, you can use the user interface to browse through HDFS sub-directories. Refer to Browsing HDFS via User Interface.
Optionally, you can configure HDFS to use an ACL. Refer to HDFS ACLs.
Configure LDAP Authentication
In production environments, it is mandatory to delegate authentication to an external LDAP server. Refer to Configuring LDAP Authentication for instructions on how to configure Thales Data Platform with an LDAP server.
Your TDP is now ready to be used.