Troubleshooting

This section contains information on troubleshooting the various issues you may encounter in TDP.

Error conditions

TDP node down

How to identify?

In the Hosts menu of Ambari, you can see that a node is down. You can also see related alerts in the Ambari alerts section.

How to fix?

A node can be down for several reasons, but the usual checks are:

If you are using a virtualization environment or cloud, check that the machine is up and running.
If you are using an on-premise cluster, check that the server is healthy.
Check if all partitions have enough space, especially the logs partition.
- Empty the log partition if necessary.
Try to reinstall the node, or delete it and create a new one.

HDFS service / nodes down

How to identify?

HDFS has a red alert on one or more nodes.

How to fix?

Check the error on the alert screen to find out why HDFS cannot start.
Try to reinstall HDFS if you cannot find a better solution.

PQS service / nodes down

How to identify?

PQS has a red alert, so the service is not working.

How to fix?

On the alert screen you can find the details of the failure, so you can use that information to look for a solution.

Try restarting the service, or reinstall it if nothing else works.

Knox service / nodes down

How to identify?

You cannot reach Knox from outside of the cluster.
Knox is returning error 50X.
Ambari is showing an alert related to Knox.

How to fix?

Check the Knox logs in /var/log/knox/.
Check that the Knox configuration is correct.
Check that the node where Knox is hosted has enough disk space in all partitions, especially in the logs partition.
Restart or reinstall the service.
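
As a starting point for the log check, the following sketch prints the most recent error lines from a log folder. The function name `recent_errors`, the ERROR/FATAL pattern, and the `*.log` file naming are assumptions for illustration; adjust them to the log files actually present on your Knox node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TDP): print the last N lines that contain
# ERROR or FATAL from every *.log file in a folder.
recent_errors() {
  local dir="$1"
  local count="${2:-20}"   # assumed default: last 20 matching lines
  # -h suppresses file names so the output contains only the log lines
  grep -h -E 'ERROR|FATAL' "$dir"/*.log 2>/dev/null | tail -n "$count"
}

# Usage on the node that hosts Knox:
#   recent_errors /var/log/knox 50
```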

Not enough disk space on a node

How to identify?

Ambari cannot start any service on that node.
Ambari is showing timeouts of some services trying to connect to that node.

How to fix?

Analyze all the partitions to see which partition is full.
Remove logs from /var/log/* folders.
Delete old files from HDFS.
If HDFS is full, try to add more disk to the machines or add more machines to increase the HDFS space limit.
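
The partition analysis above can be sketched as a small filter over `df -P` output. The function name `full_partitions` and the 90% default threshold are illustrative assumptions, not TDP settings.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TDP): reads `df -P` output on stdin and
# prints "<mount point> <use%>" for every file system at or above a threshold.
# Mount points that contain spaces are not handled.
full_partitions() {
  local threshold="${1:-90}"   # assumed default of 90%
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)        # strip the trailing % sign
    if (use + 0 >= t) print $6, use "%"
  }'
}

# Usage on the affected node:
#   df -P | full_partitions 90
#   du -sk /var/log/* 2>/dev/null | sort -rn | head   # largest log folders, in KB
#   hdfs dfs -df -h /                                  # overall HDFS usage
```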

Not enough RAM

Please refer to Estimating the resources needed.

Not enough CPU

Please refer to Estimating the resources needed.

HDFS not accessible via Knox using TDP credentials in Kylo

How to identify?

You are getting 40X errors when trying to connect to Knox.

How to fix?

Check that the Knox authentication service is up and running.
Check that the user exists on the auth service.
Check that the path exists on HDFS and has the right permissions.
Check that the Knox topology includes the WebHDFS configuration.
Check that HDFS is configured properly in Kylo:
- User / password
- Topology
- Path
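
To narrow down which of these checks is failing, it can help to call WebHDFS through Knox directly, outside Kylo. The sketch below assumes the Knox gateway path is `gateway`; the host name, port, topology name, and user in the usage lines are placeholders, and the status explanations are typical meanings rather than an exhaustive list.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of TDP or Kylo).

# Build the WebHDFS LISTSTATUS URL exposed through Knox.
# Assumes the gateway path "gateway"; adjust if your deployment differs.
knox_webhdfs_url() {
  local host="$1" port="$2" topology="$3" path="$4"
  printf 'https://%s:%s/gateway/%s/webhdfs/v1/%s?op=LISTSTATUS\n' \
    "$host" "$port" "$topology" "${path#/}"   # drop a leading slash from the path
}

# Map an HTTP status code to the most likely check from the list above.
explain_status() {
  case "$1" in
    200) echo "OK: credentials, topology and path are valid" ;;
    401) echo "Unauthorized: check the user and password on the auth service" ;;
    403) echo "Forbidden: check the HDFS permissions on the path" ;;
    404) echo "Not found: check the topology name and the HDFS path" ;;
    *)   echo "Unexpected status $1: check the Knox logs" ;;
  esac
}

# Usage (curl prompts for the password; -k skips TLS verification, testing only):
#   url="$(knox_webhdfs_url knox.example.com 8443 default /ciphertrust_ddc)"
#   code="$(curl -sk -u admin -o /dev/null -w '%{http_code}' "$url")"
#   explain_status "$code"
```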

Livy not accessible via Knox using TDP credentials in Kylo

How to identify?

You are getting 40X errors when trying to connect with Livy.

How to fix?

Check that the Knox authentication service is up and running.
Check that the user exists on the auth service.
Check that the Knox topology includes the Livy server configuration.
Check that the Livy service is configured properly in Kylo:
- User / password
- Topology

Wrong / expired Knox TLS certificate

How to identify?

You are getting a Wrong Certificate error when configuring the TDP connection on Kylo.

How to fix?

Generate a new certificate for Knox. Follow the steps in Updating and Exporting the Knox Server Certificate.
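
Before regenerating the certificate, you may want to confirm that expiry is the actual cause. The following sketch computes the days left from the `notAfter=` line printed by `openssl x509 -noout -enddate`. The function name `cert_days_left` is hypothetical, the host and port in the usage lines are placeholders, and the date parsing relies on GNU `date`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TDP): days left before a certificate expires.
#   $1: the "notAfter=..." line printed by `openssl x509 -noout -enddate`
#   $2: current time in epoch seconds (optional, defaults to now)
# Requires GNU date for the -d option. A negative result means already expired.
cert_days_left() {
  local end="${1#notAfter=}"
  local now="${2:-$(date +%s)}"
  local end_epoch
  end_epoch="$(date -d "$end" +%s)" || return 1
  echo $(( (end_epoch - now) / 86400 ))
}

# Usage against a Knox gateway:
#   line="$(echo | openssl s_client -connect knox.example.com:8443 2>/dev/null \
#           | openssl x509 -noout -enddate)"
#   cert_days_left "$line"
```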

Auto-remediation does not remediate the results of a concrete scan

How to identify?

The file is not remediated even after scanning its location with remediation enabled.

How to fix?

Check that the path is configured properly on the scan.
Check that the policy is configured correctly.
- You can generate a report to find out which classification profiles were found in the file(s) and check whether your policy covers those classification profiles.
If the file was created after the last scan, you need to run the scan again.

Expired/corrupted Ambari CA certificates

How to identify?

The Ambari agent logs errors for failed certificate verification, causing the Ambari console to lose heartbeat from all nodes.

How to fix?

The Ambari Certificate Authority (CA) issues certificates which are valid for 365 days (1 year). Renew Ambari CA certificates following the instructions provided on this page.

Note

In the Ambari Agent section (step 11), the correct path for deleting files is /var/lib/ambari-agent/keys/.

TDP Silent Installation Failed

How to identify?

Occasionally, the TDP silent installation may encounter issues. The installer might stall midway or terminate without displaying any errors.

How to fix?

Verify network connectivity
Ensure DNS functionality is intact
Verify ESX resource allocation, such as memory and CPU

Routine checks

All required services are up and running

You can check the service health on the Ambari console. Any problem triggers an alert that can be checked by clicking the bell icon at the top of the page.

These are the required services for DDC to run:

Check HDFS folders and permissions for the TDP user configured in Kylo

Using SSH

Open an SSH session on the NameNode and run the command:

$ hdfs dfs -ls /<ddc_folder_parent_path>/

For example, if the folder configured on DDC is /ciphertrust_ddc/, you should run:

$ hdfs dfs -ls /

And the result should be:

[hdfs@sjdpe03ddc200-tdp315-node1 ~]$ hdfs dfs -ls /
Found 6 items
drwxr-xr-x   - admin  hdfs          0 2021-10-21 07:47 /ciphertrust_ddc

Here you can see that the /ciphertrust_ddc folder is owned by the admin user and the hdfs group, which is correct.
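
If you want to script this check, the owner and group can be read from the third and fourth columns of the listing. The sketch below is one possible approach; the function name `check_hdfs_owner` is hypothetical, and the expected owner and group are taken from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TDP): reads `hdfs dfs -ls` output on stdin
# and succeeds only if the given path is listed with the expected owner and group.
check_hdfs_owner() {
  local path="$1" owner="$2" group="$3"
  awk -v p="$path" -v o="$owner" -v g="$group" '
    # Columns: permissions, replication, owner, group, size, date, time, path
    $NF == p { found = 1; ok = ($3 == o && $4 == g) }
    END { exit (found && ok) ? 0 : 1 }
  '
}

# Usage on the NameNode:
#   hdfs dfs -ls / | check_hdfs_owner /ciphertrust_ddc admin hdfs && echo "OK"
```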

Using browser

In Ambari, open the HDFS menu tab and go to the active NameNode UI. Then click Utilities and then Browse the File System. From there you can check the same folder details:

Check Knox HA Configuration

To check whether more than one Knox node is active, go to the Knox menu in Ambari and check the active Knox instances. There you can see which nodes have the service enabled, along with any state details.

Check DDC Spark jobs on YARN

Go to Ambari > YARN > ResourceManager UI.

The link in Ambari may be wrong. If you get an error message instead of the ResourceManager UI, remove the 'ui2' string from the URL.

There you can see all the jobs triggered on your cluster and the state:

Possible states are:

Accepted: The job has been submitted and is waiting for enough resources to become available.
Running: The job is being executed.
Failed: The job has ended with errors.
Succeeded: The job has finished successfully.

If you click on the ID of any job, you can find links to the job logs at the bottom of the page:

Check installed TDP version

In Ambari, go to Stack and Versions in the Cluster Admin menu:

Then click on Versions:
