Preparing Ambari
Configure Ambari
Run this script to configure the host as an Ambari Server:
/root/setup/ambari_setup.sh
The script prompts you to set the Ambari admin password. You will use these credentials to access the Ambari UI.
When the script finishes, it shows you the private key that has been set up for the node. Save this key; you will need it later when Ambari prompts for the SSH private key of the Ambari server (see Install Options).
You can always view the private key later by issuing the following command as root:
cat .ssh/id_rsa
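If you also want a local copy of the key (for example, to paste into the Ambari Install Options screen later), a minimal sketch is to copy it with scp; the hostname below is a placeholder, not a value from this guide:
# Copy the generated private key from the Ambari server to your workstation (placeholder hostname)
scp root@<AMBARI_SERVER_HOSTNAME>:/root/.ssh/id_rsa ./ambari_id_rsa
chmod 600 ./ambari_id_rsa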
Steps to Set up Additional Nodes in the Hadoop Cluster
Note
For a demo installation this step is optional because a one-node Hadoop deployment is sufficient for demo purposes.
The OVA image can be used to create additional nodes for the cluster.
Perform the steps in Initial Login Credentials and Set the Hostname.
Instead of running the ambari_setup.sh script in step 5, run the following script:
/root/setup/node_setup.sh
This script prompts you for the public key of the Ambari server. Retrieve the public key from the Ambari server; it is the content of either of these two files (an example of retrieving it follows the list):
/root/.ssh/authorized_keys
/root/.ssh/id_rsa.pub
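For example, you can print the public key on the Ambari server and copy it into the node_setup.sh prompt:
# On the Ambari server: display the public key to paste into the node_setup.sh prompt
cat /root/.ssh/id_rsa.pub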
Once the node is set up, log in to the Ambari server and add the node to the cluster with the following steps:
Go to the Ambari UI.
Click the Hosts button in the menu on the left. You should see the list of nodes already present on the cluster.
Click Actions -> Add New Hosts.
Note
Because the host names field accepts multiple entries, you can add several nodes at the same time. However, the previous steps are required on every node.
The Ambari server must be able to resolve the hostname of each node, and each node must be reachable from it.
In the SSH private key field, paste the private key of the Ambari server node (typically, the first node of your cluster).
Click Register and Confirm.
Now, follow the wizard instructions and add the desired services to the new node.
Browse the Ambari Server
The Ambari server is configured to use TLS (HTTPS) on port 443 with a self-signed certificate.
Access the Ambari GUI by using the hostname configured in the step Set the Hostname.
Log in as admin with the password that you set in the step Configure Ambari.
Click LAUNCH INSTALL WIZARD.
Tip
Ambari UI not responding: If you cannot access the Ambari UI, try restarting the Ambari service. Open an SSH session on the machine and run the command:
ambari-server restart
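Before restarting, you can also check whether the Ambari server process is running:
# Show whether the Ambari server process is up; restart it if it is stopped or unresponsive
ambari-server status
ambari-server restart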
Hadoop Services Installation Through Ambari
Name the Cluster
Choose a name that you consider appropriate and descriptive for your cluster:
Select Version
At this point, you should see the UI in this state:
In the Repositories screen, make sure that you select the Use Local Repository option and fill in the Base URL fields as shown in the screenshot above. The only OS needed in the Ambari configuration is Red Hat 7, with the string file:///var/repo/hdp
as the value for the three paths. If any other OS is listed, you can remove it.
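If you want to confirm that the local repository referenced by that Base URL is actually present on the node, you can list it from an SSH session (the exact directory layout under /var/repo/hdp depends on the OVA image):
# The local HDP repository directory should exist and contain the repository contents
ls /var/repo/hdp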
Install Options
In this screen, you configure the hosts. Enter the hostnames of all nodes that will be in your cluster and paste the private key that you saved in the step Configure Ambari.
Click REGISTER AND CONFIRM. After that, the installation will start for all nodes. This can take a few minutes.
Note
When this step finishes, you may see the message "Some warnings were encountered while performing checks against the registered hosts above". This is common and usually not a reason for concern, so you can skip it.
Choose Services
Here, we are going to select the services that will be available on our virtual machine. These are the required services for DDC:
HDFS
YARN + MapReduce2
HBase
Spark
Hive
Tez
ZooKeeper
Ambari Metrics
Knox
In the list shown in the GUI, we only need to select those. In the next step, we'll add PQS and Livy to the cluster.
Note
If any of the required services fails to start automatically, you may need to start it manually. In case of repeated problems, check the service log for additional troubleshooting information. Refer to the article "Viewing Service Logs" on the Hortonworks documentation pages for details.
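If you prefer the command line, a service can also be started through the Ambari REST API. This is only a sketch; the host, admin password, cluster name, and the HBASE service name below are placeholders, not values from this guide:
# Start a stopped service (example: HBASE) through the Ambari REST API over HTTPS
curl -k -u admin:<ADMIN_PASSWORD> -H 'X-Requested-By: ambari' \
  -X PUT -d '{"RequestInfo":{"context":"Start HBASE"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  https://<AMBARI_HOST>/api/v1/clusters/<CLUSTER_NAME>/services/HBASE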
Assign Masters
In our example, as we are only using one node, it will be the master for all the services. In production environments, you can select other nodes as the service master.
Assign Slaves and Clients
Select "Livy for Spark2 Server" in all nodes that you want this service running. You will need to remember which hosts you selected when you reach the "Update Livy configuration for Knox" step.
Caution
This step is very important. Make sure that you select "Livy for Spark2 Server" on at least one of your hosts because it is a hard requirement for DDC.
Customize Services
Note
Services can be installed during the initial TDP installation or at any later time. If they are configured during the installation, you do not have to restart the installed services. You can also add a service after the installation, but in that case you will have to restart the affected services.
At this point, you only need to provide credentials for different services and review the installation.
Create the Knox master secret password and store it somewhere.
For Hive you have to add a database.
Go to the Hive tab, then click Database.
Select Hive Database -> New MySQL.
Set the credentials for the database.
Then, keep clicking NEXT until you reach the screen that shows DEPLOY, and click it.
When the installation completes, click NEXT and then click COMPLETE.
For more detailed information about the above procedure, refer to the Hortonworks official webpage.
Configuring Knox
The Apache Knox Gateway is a system that provides a single point of authentication and access for Hadoop services in a cluster. While Knox supports many authentication/federation providers, this document only describes configuring LDAP. For details on the other providers, check the Apache Knox Gateway online documentation:
https://knox.apache.org/books/knox-0-5-0/knox-0-5-0.html#Authentication
Warning
It is your responsibility as the system owner to properly configure the authentication to secure the production environment. For demo purposes you can opt for the built-in LDAP server but for everything else you should configure your company's authentication/federation provider.
A change in the Knox configuration is needed to provide access from the DDC client. There are two ways a user could configure Knox for DDC. One is to modify the default topology and the other is to create a new topology. If you want to create a new topology, refer to the Knox documentation.
Note
Knox must also be DNS addressable, through a network DNS or by adding the DNS entry as described in the "CipherTrust Manager Administration Guide".
Updating Knox Topology
To set up the embedded LDAP for demo usage, and other settings, follow these steps:
Go to Knox configuration in the Ambari server UI.
(applicable for demo purposes only) Go to the ACTIONS menu and click Start Demo LDAP. Without starting the demo LDAP you cannot use the Knox logins.
Expand the Advanced topology section.
Add an entry for the Phoenix Query Server.
Single node configuration:
<service>
    <role>AVATICA</role>
    <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER>:8765</url>
</service>
(optional) If the cluster has multiple PQS instances installed, you have to list all PQS nodes:
<service>
    <role>AVATICA</role>
    <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER1>:8765</url>
    <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER2>:8765</url>
    <url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER3>:8765</url>
    ...
</service>
(optional for creating a cluster) If NameNode HA is enabled on the cluster, add the following sections:
<provider>
    <role>ha</role>
    <name>HaProvider</name>
    <enabled>true</enabled>
    <param>
        <name>WEBHDFS</name>
        <value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
    </param>
</provider>
Update the NAMENODE service URL value to reflect your nameservice ID. This is the dfs.internal.nameservices parameter in your hdfs-site.xml.
<service>
    <role>NAMENODE</role>
    <url>hdfs://yournameserviceid</url>
</service>
Add a WebHDFS URL for each NameNode to your WEBHDFS service entry, like so:
<service>
    <role>WEBHDFS</role>
    <url>http://yournameservice1.openstacklocal:50070/webhdfs</url>
    <url>http://yournameservice2.openstacklocal:50070/webhdfs</url>
</service>
(applicable for non-demo purposes) Example for LDAP:
<provider>
    <role>authentication</role>
    <name>ShiroProvider</name>
    <enabled>true</enabled>
    <param>
        <name>main.ldapRealm</name>
        <value>org.apache.shiro.realm.ldap.JndiLdapRealm</value>
    </param>
    <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
    </param>
    <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://localhost:33389</value>
    </param>
    <param>
        <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
        <value>simple</value>
    </param>
    <param>
        <name>urls./**</name>
        <value>authcBasic</value>
    </param>
</provider>
Click SAVE, then restart all the affected components.
At the top of the screen, a message tells you that a restart is required and an orange RESTART button appears. Click that button and select Restart All Affected.
With this configuration you can configure Hadoop Services in Thales CipherTrust Data Discovery and Classification:
- For HDFS, use /gateway/default/webhdfs/v1 as the URI and /ciphertrust_ddc as the Folder.
- For Livy, use /gateway/default/livy/v1 as the URI.
For more details on "Configuring HDFS" and "Configuring Livy", please check the "Thales CipherTrust Data Discovery and Classification Deployment Guide".
With this configuration you can configure PQS Services in Thales CipherTrust Transparent Encryption:
- Use /gateway/default/avatica as the URI and ciphertrust_cte as the schema.
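As a quick smoke test of the topology (a sketch, assuming the demo LDAP is running and the default admin/admin-password credentials described in Knox Authentication Information in Ambari below), you can call the WebHDFS endpoint through Knox:
# List the HDFS root through the Knox gateway; a JSON FileStatuses response indicates
# that Knox authentication and the WEBHDFS service entry are working.
curl -ik -u admin:admin-password \
  "https://<HOSTNAME_OF_KNOX_SERVER>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"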
For instructions on configuring an external LDAP or Active Directory, refer to the Apache Knox online documentation:
https://knox.apache.org/books/knox-0-5-0/knox-0-5-0.html#Authentication
Knox Authentication Information in Ambari
When you are using the demo LDAP embedded with Knox, there is a file with the user credentials. To view these default credentials, select:
Knox > CONFIGS > Advanced-users-ldif
Scroll down to the # entry for sample user admin to find the userPassword, which by default is admin-password.
Note
In production environments, when you are using a real LDAP or Active Directory, you have to use a user with admin permissions on the service.
Knox Logging
Changing Knox Log Level
Knox uses Log4j to keep track of its log messages. The default log level, INFO, may quickly cause the log file to use up the available disk space and DDC scans to fail.
Because of that, you must change the log level to ERROR by editing the log configuration file. You can do this through Ambari by following these steps:
In the Ambari toolbar on the left, click to expand Services, then click Knox.
Select the CONFIGS tab and scroll down to "Advanced gateway-log4j".
Open the "Advanced gateway-log4j" section and modify these parameters:
Change the log4j.logger.audit parameter to:
log4j.logger.audit=ERROR, auditfile
Add these two additional parameters at the end:
log4j.appender.auditfile.MaxFileSize=10MB
log4j.appender.auditfile.MaxBackupIndex=10
Click SAVE and then restart Knox.
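After Knox restarts, you can verify from an SSH session that the settings were applied to the deployed configuration file; the /usr/hdp/current/knox-server/conf path is the usual HDP location and is an assumption:
# The three audit log settings should appear in the active gateway-log4j configuration
grep -E 'log4j.logger.audit|MaxFileSize|MaxBackupIndex' /usr/hdp/current/knox-server/conf/gateway-log4j.properties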
Purging the Knox Log Directory
If, after applying these changes, your Knox log still grows too quickly, you may have to purge the log directory. First, however, you need to check whether this is the case.
Check if Ambari is displaying a "NameNode Directory Status" error. This error indicates a failed directory (that is, one or more directories are reporting as not healthy).
Check Ambari for a "Failed directory count" message to find out which directories are reporting problems. If the error message is showing "Failed directory count: 1" it may be the logs directory.
In the terminal, check the free disk space for /var/log/ by issuing the command:
df -h
If the output of the command shows no free disk space for the /var/log/ directory, remove all the log files.
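A minimal sketch of the check and cleanup follows; the /var/log/knox path is the usual Knox log location on HDP and is an assumption, so adjust it if your installation differs:
# Check free space and find the largest log directories
df -h /var/log
du -sh /var/log/* | sort -h | tail
# If the Knox gateway/audit logs are filling the disk, remove the rotated log files
rm -f /var/log/knox/gateway-audit.log.* /var/log/knox/gateway.log.*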
Configuring Phoenix Query Server (HBase)
Apache Phoenix requires additional configuration. You need to enable namespace mapping and map the system tables to the namespace. For this procedure, we use the Ambari UI.
In the advanced HBase configuration, scroll down to Custom hbase-site (HBase > CONFIGS > ADVANCED > Custom hbase-site).
Use the Add Property... link to add these two additional properties (as illustrated in the image further below):
phoenix.schema.isNamespaceMappingEnabled=true
phoenix.schema.mapSystemTablesToNamespace=true
Note
These properties will be added to the hbase-site.xml configuration file.
Click SAVE, then restart all the affected components.
At the top of the screen, a message tells you that a restart is required and an orange RESTART button appears. Click that button and select Restart All Affected.
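After the restart, you can confirm from an SSH session that the properties were pushed to the HBase client configuration; the /etc/hbase/conf path is the standard HDP location and is an assumption:
# Verify that the namespace-mapping properties landed in hbase-site.xml
grep -A1 'phoenix.schema' /etc/hbase/conf/hbase-site.xml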
Creating DDC Directory Under HDFS
For DDC to utilize Hadoop Distributed File System (HDFS), you need to create a directory under HDFS. DDC will use this space for storing scan results and reports. You can create it through the command line. You need to do it only on the primary node. This directory can have any name, but for this example and throughout the following sections of this document, we will use /ciphertrust_ddc.
Tip
You will need this path later to configure CipherTrust. See "Configuring HDFS" in the "Thales CipherTrust Data Discovery and Classification Deployment Guide".
Creating the DDC Directory Using the Command Line
SSH to the TDP instance and log in as root.
Switch to the hdfs user, who has permissions to create and destroy folders:
su - hdfs
You need to create the /ciphertrust_ddc directory in HDFS by issuing this command:
hdfs dfs -mkdir /ciphertrust_ddc
Because the default permissions do not allow the scans to write to the Hadoop database, you have to change the access permissions. Issuing this command grants full access rights to all users:
hdfs dfs -chmod 777 /ciphertrust_ddc
Note
After entering the last command, enter exit to return to the root prompt.
Once you have created the HDFS directory using the CLI, you can optionally use the user interface to browse through its sub-directories. Please refer to Browsing HDFS via User Interface.
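To confirm that the directory exists with the expected permissions, you can also list the HDFS root as the hdfs user:
# The /ciphertrust_ddc entry should be listed with permissions drwxrwxrwx
hdfs dfs -ls / | grep ciphertrust_ddc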
Updating and Exporting the Knox Server Certificate
When the Knox service is installed by Ambari, a self-signed certificate is created internally. However, this certificate uses SHA-1 and a key size of 1024 bits. The update_knox_cert.sh script below replaces the certificate with a new one that uses SHA-256 and a key size of 2048 bits.
Run this script in an SSH session on the TDP instance:
/root/setup/update_knox_cert.sh
When this script is run, it will prompt for the Knox master secret.
This secret is the one that you set in the step "7. Hadoop Services Installation Through Ambari > Customize Services" (CREDENTIALS > Knox Master Secret).
Finally, you need to export the SSL certificate of the Knox server so that DDC can be configured to talk to Hadoop. Obtain the certificate from the node where Knox is installed.
Put the Knox server certificate in a file, by using this command:
echo -n | openssl s_client -connect <IP_OF_KNOX_SERVER>:8443 | \
  sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/knox.cert
Copy the certificate to the system where you will connect to the CipherTrust/DDC GUI.
Issue this command to display the certificate in the terminal window:
cat /tmp/knox.cert
Now, you can copy the certificate from the terminal window and paste it in its own file on your machine. You will need this file when configuring DDC to talk to Hadoop. Refer to "Configuring CipherTrust Manager" in the "Thales CipherTrust Data Discovery and Classification Deployment Guide".
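Optionally, before pasting the certificate into DDC, you can check that it is the regenerated one; it should report a SHA-256 signature and a 2048-bit key:
# Inspect the exported Knox certificate's signature algorithm and key size
openssl x509 -in /tmp/knox.cert -noout -text | grep -E 'Signature Algorithm|Public-Key'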
You can find the comprehensive procedure on the Hortonworks documentation pages.