Troubleshooting
This section contains information on troubleshooting the various issues you may encounter in TDP.
Error conditions
TDP node down
How to identify?
In the Hosts menu of Ambari, the node is shown as down. Related alerts also appear in the Ambari alerts section.
How to fix?
A node can be down for several reasons, but the usual checks are:
- If you are using a virtualization environment or cloud, check that the machine is up and running.
- If you are using an on-premise cluster, check that the server is healthy.
- Check if all partitions have enough space, especially the logs partition.
- Empty the log partition if necessary.
- Try to reinstall the node, or delete it and create a new one.
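The partition check in the list above can be sketched as a small shell helper. This is a minimal sketch, not part of TDP: the function name `full_partitions` and the 90% threshold are assumptions, and it reads `df -P` output on standard input so that it can also be run against saved output.

```shell
# Print "<mount point> <usage>" for every partition at or above a threshold.
# $1: threshold in percent (assumed default: 90)
# Reads `df -P` output on stdin: the header line is skipped, column 5 is the
# usage percentage and column 6 is the mount point.
full_partitions() {
  threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit) print $6 " " $5
  }'
}

# Typical use on the affected node:
#   df -P | full_partitions 90
```

An empty result means no partition has reached the threshold.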
HDFS service / nodes down
How to identify?
HDFS has a red alert on one or more nodes.
How to fix?
- Check the error on the alert screen to find out why HDFS cannot start.
- Try to reinstall HDFS if you cannot find a better solution.
PQS service / nodes down
How to identify?
PQS has a red alert, so the service is not working.
How to fix?
On the alert screen you can find the details of the failure, so you can use that information to look for a solution.
Try restarting the service, or reinstall it if nothing else works.
Knox service / nodes down
How to identify?
- You cannot reach Knox from outside of the cluster.
- Knox is returning error 50X.
- Ambari is showing an alert related to Knox.
How to fix?
- Check the Knox logs in /var/log/knox/.
- Check that the Knox configuration is correct.
- Check that the node where Knox is hosted has enough disk space in all partitions, especially in the logs partition.
- Restart or reinstall the service.
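When reviewing the Knox logs, it can help to pull out only the most recent error lines. The helper below is a minimal sketch: the function name `recent_errors` is hypothetical, and `gateway.log` is the usual Knox log file name but should be treated as an assumption for your installation.

```shell
# Print the last N lines of a log file that contain ERROR or FATAL.
# $1: log file
# $2: number of lines to show (assumed default: 20)
recent_errors() {
  grep -E 'ERROR|FATAL' "$1" | tail -n "${2:-20}"
}

# Typical use on the node that hosts Knox (file name is an assumption):
#   recent_errors /var/log/knox/gateway.log 50
```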
Not enough disk space on a node
How to identify?
- Ambari cannot start any service on that node.
- Ambari is showing timeouts of some services trying to connect to that node.
How to fix?
- Analyze all the partitions to see which partition is full.
- Remove logs from the /var/log/* folders.
- Delete old files from HDFS.
- If HDFS is full, add more disks to the machines or add more machines to increase the HDFS space limit.
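To decide which logs to remove first, you can list the largest entries under the log directory. This is a minimal sketch with a hypothetical function name, `largest_logs`. The HDFS commands in the trailing comments are standard `hdfs dfs` subcommands, shown only as a starting point.

```shell
# List the largest entries directly under a directory, biggest first.
# Sizes are in KiB, as reported by `du -sk`.
# $1: directory to inspect (assumed default: /var/log)
# $2: number of entries to show (assumed default: 10)
largest_logs() {
  dir="${1:-/var/log}"
  count="${2:-10}"
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Typical use on the affected node:
#   largest_logs /var/log 10
#
# For HDFS itself, overall capacity and per-folder usage can be inspected with:
#   hdfs dfs -df -h /
#   hdfs dfs -du -h /
```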
Not enough RAM
Please refer to Estimating the resources needed.
Not enough CPU
Please refer to Estimating the resources needed.
HDFS not accessible via Knox using TDP credentials in Kylo
How to identify?
You are getting 40X errors when trying to connect to Knox.
How to fix?
- Check that the Knox authentication service is up and running.
- Check that the user exists on the auth service.
- Check that the path exists on HDFS and has the right permissions.
- Check that the Knox topology includes the WebHDFS configuration.
- Check that HDFS is configured properly in Kylo:
  - User / password
  - Topology
  - Path
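The checks above can be reproduced outside Kylo by calling WebHDFS through Knox directly. The sketch below makes several assumptions: the gateway path is the Knox default `gateway`, the host `knox.example.com`, port `8443`, and topology `default` are placeholders, and the function names and the mapping from status code to likely cause are illustrative rather than an official diagnosis.

```shell
# Build the Knox URL that lists an HDFS path through WebHDFS.
# $1: Knox host, $2: Knox port, $3: topology name, $4: absolute HDFS path
knox_webhdfs_url() {
  printf 'https://%s:%s/gateway/%s/webhdfs/v1%s?op=LISTSTATUS\n' "$1" "$2" "$3" "$4"
}

# Map an HTTP status code to the check it most likely points at.
explain_status() {
  case "$1" in
    200) echo "ok: credentials, topology and path are accepted" ;;
    401) echo "authentication failed: check user/password and the auth service" ;;
    403) echo "access denied: check the HDFS path permissions for the user" ;;
    404) echo "not found: check the HDFS path and the topology name" ;;
    *)   echo "unexpected status $1: check the Knox logs" ;;
  esac
}

# Typical use (placeholders; -k skips TLS verification and is for testing only):
#   url="$(knox_webhdfs_url knox.example.com 8443 default /ciphertrust_ddc)"
#   code="$(curl -sk -u "$TDP_USER:$TDP_PASSWORD" -o /dev/null -w '%{http_code}' "$url")"
#   explain_status "$code"
```

Using the same user, topology, and path that are configured in Kylo makes it easier to tell a Kylo misconfiguration from a cluster-side problem.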
Livy not accessible via Knox using TDP credentials in Kylo
How to identify?
You are getting 40X errors when trying to connect to Livy.
How to fix?
- Check that the Knox authentication service is up and running.
- Check that the user exists on the auth service.
- Check that the Knox topology has the configuration of Livy server.
- Check that the Livy service is configured properly in Kylo:
  - User / password
  - Topology
Wrong / expired Knox TLS certificate
How to identify?
You are getting a Wrong Certificate error when configuring the TDP connection on Kylo.
How to fix?
Generate a new certificate for Knox by following the steps in Updating and Exporting the Knox Server Certificate.
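Before regenerating the certificate, you may want to confirm that expiry is really the cause. The sketch below relies on the standard `openssl x509 -checkend` option. The function name `cert_valid_for_days`, the host `knox.example.com`, and port `8443` are assumptions.

```shell
# Return 0 if the PEM certificate is still valid N days from now, 1 otherwise.
# $1: path to a PEM certificate
# $2: number of days to look ahead
cert_valid_for_days() {
  openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Typical use against a running Knox (host and port are placeholders):
#   echo | openssl s_client -connect knox.example.com:8443 2>/dev/null \
#     | openssl x509 -outform PEM > /tmp/knox.pem
#   openssl x509 -in /tmp/knox.pem -noout -subject -enddate
#   cert_valid_for_days /tmp/knox.pem 30 || echo "certificate expires within 30 days"
```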
Auto-remediation does not remediate the results of a specific scan
How to identify?
The file is not remediated even after scanning its location with remediation enabled.
How to fix?
- Check that the path is configured properly on the scan.
- Check that the policy is configured correctly.
- You can generate a report to find out the classification profiles found on the file(s) and check if your policy covers those classification profiles.
- If the file was created after the last scan, you need to run the scan again.
TDP is too slow to process the results of a specific scan
Please refer to Estimating the resources needed.
TDP is too slow to remediate the results of a specific scan
Please refer to Estimating the resources needed.
TDP is too slow to generate a specific report
Please refer to Estimating the resources needed.
Expired/corrupted Ambari CA certificates
How to identify?
The Ambari agent logs errors for failed certificate verification, causing the Ambari console to lose heartbeat from all nodes.
How to fix?
The Ambari Certificate Authority (CA) issues certificates which are valid for 365 days (1 year). Renew Ambari CA certificates following the instructions provided on this page.
Note
In the Ambari Agent section (step 11), the correct path from which to delete files is /var/lib/ambari-agent/keys/.
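To confirm expiry before renewing, you can print the end date of each certificate in the agent keys directory. This is a minimal sketch: the function name `list_cert_expiry` is hypothetical, and it assumes the certificates in that directory use the `.crt` extension.

```shell
# Print the expiry date of every *.crt file in a directory.
# $1: directory to scan (default: the Ambari agent keys path noted above)
list_cert_expiry() {
  dir="${1:-/var/lib/ambari-agent/keys}"
  for cert in "$dir"/*.crt; do
    [ -f "$cert" ] || continue   # skip the literal pattern when nothing matches
    printf '%s: ' "$cert"
    openssl x509 -in "$cert" -noout -enddate
  done
}

# Typical use on an agent node:
#   list_cert_expiry
```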
TDP Silent Installation Failed
How to identify?
Occasionally, the TDP silent installation may encounter issues. The installer might stall midway or terminate without displaying any errors.
How to fix?
- Verify network connectivity
- Ensure DNS functionality is intact
- Verify ESX resource allocation, such as memory and CPU
Routine checks
All required services are up and running
You can check the service health on the Ambari console. Any problem triggers an alert that can be checked from the ring icon above.
These are the required services for DDC to run:
Check HDFS folders and permissions for the TDP user configured in Kylo
Using SSH
Open an SSH session on the NameNode and run the following command:
$ hdfs dfs -ls /<ddc_folder_parent_path>/
For example, if the folder configured on DDC is /ciphertrust_ddc/, you should run:
$ hdfs dfs -ls /
And the result should be:
[hdfs@sjdpe03ddc200-tdp315-node1 ~]$ hdfs dfs -ls /
Found 6 items
drwxr-xr-x - admin hdfs 0 2021-10-21 07:47 /ciphertrust_ddc
Here you can see that the /ciphertrust_ddc folder is owned by the admin user and assigned to the hdfs group, which is correct.
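The same ownership check can be scripted. The helper below is a minimal sketch with a hypothetical name, `check_hdfs_owner`. It reads `hdfs dfs -ls` output on standard input and assumes the column layout shown in the example above, with the owner in column 3, the group in column 4, and the path last.

```shell
# Check the owner and group of one entry in `hdfs dfs -ls` output.
# $1: HDFS path to look for, $2: expected owner, $3: expected group
# Returns 0 only if the path is listed with that owner and group.
check_hdfs_owner() {
  awk -v path="$1" -v owner="$2" -v group="$3" '
    $NF == path { found = 1; ok = ($3 == owner && $4 == group) }
    END { exit (found && ok) ? 0 : 1 }
  '
}

# Typical use on the NameNode:
#   hdfs dfs -ls / | check_hdfs_owner /ciphertrust_ddc admin hdfs \
#     && echo "ownership is correct" || echo "ownership mismatch or folder missing"
```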
Using browser
In Ambari, open the HDFS menu tab and go to the active NameNode UI. Then click Utilities and select Browse the File System. From there you can check the same folder details:
Check Knox HA Configuration
To check whether more than one Knox node is active, go to the Knox menu in Ambari and check the active Knox instances. There you can see which nodes have the service enabled, along with any state details.
Check DDC Spark jobs on YARN
Go to Ambari > YARN > ResourceManager UI.
It is possible that the link in Ambari is wrong. If you find this message, remove the 'ui2' string from the URL.
There you can see all the jobs triggered on your cluster and their state:
Possible states are:
- Accepted: The job has been submitted and is waiting for enough resources to become available.
- Running: The job is being executed.
- Failed: The job has ended with errors.
- Succeeded: The job has finished successfully.
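The same information is available from the command line through the standard `yarn application -list` and `yarn logs` commands. The filter below is a minimal sketch: the function name `apps_in_state` is hypothetical, and it assumes that each application line starts with its `application_...` ID and contains the state as a separate upper-case token. Because it matches any column, it also matches the final status (for example `SUCCEEDED`).

```shell
# Print the IDs of applications whose listing line contains a given state.
# $1: state token in upper case, for example ACCEPTED, RUNNING, FAILED
# Reads `yarn application -list` output on stdin.
apps_in_state() {
  awk -v state="$1" '
    $1 ~ /^application_/ {
      for (i = 2; i <= NF; i++) if ($i == state) { print $1; break }
    }'
}

# Typical use on a cluster node:
#   yarn application -list -appStates ALL | apps_in_state FAILED
#   yarn logs -applicationId <application_id>
```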
If you click on the ID of any job, you can find links to the job logs at the bottom of the page:
Check installed TDP version
In Ambari, go to Stack and Versions in the Cluster Admin menu:
Then click on Versions: