Performance Tuning
This section contains information on tuning TDP performance. Keep in mind, however, that each case must be studied separately, as you may have different volumes and time requirements.
Estimating the resources needed
Minimum requirements
Before you begin to tune the TDP performance, refer to Thales Data Platform Deployment and the requirements for the platform where you installed TDP ("Public Cloud Images" or "Private Cloud Images") to check if your system meets the minimum requirements.
The following sections provide guidance on how to tune performance, but each case must be studied separately, since each client will have different volumes and time requirements.
Spark tuning
Spark can be configured by adjusting properties via Ambari. The official Spark documentation describes properties that configure every aspect of Spark's behavior, but this section covers only some of the most important ones.
Property Name | Default value | Purpose |
---|---|---|
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. |
spark.driver.memory | 1 GB | Amount of memory to use for the driver process. |
spark.executor.memory | 1 GB | Amount of memory to use per executor process, in MiB unless otherwise specified. |
spark.executor.cores | 1 in YARN mode; all available cores in standalone and Mesos coarse-grained modes | The number of cores to use on each executor. |
spark.task.cpus | 1 | Number of cores to allocate for each task. |
spark.executor.instances | 2 | The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large. |
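When reasoning about these properties, it helps to estimate how many executors of a given size fit on one worker node. The sketch below is illustrative only; the node size and executor sizes are assumptions, not values taken from a real cluster.

```python
def executors_per_node(node_cores, node_memory_gb,
                       executor_cores, executor_memory_gb):
    """How many executors of the given size fit on one worker node."""
    by_cores = node_cores // executor_cores
    by_memory = int(node_memory_gb // executor_memory_gb)
    return min(by_cores, by_memory)

# Assuming an 8-core / 32 GB node and the Spark defaults (1 core, 1 GB per executor):
print(executors_per_node(8, 32, 1, 1))  # 8 -- limited by cores, not by memory
```

Note that in practice YARN also reserves per-executor memory overhead (see `spark.executor.memoryOverhead` in the Spark documentation), so the real packing is slightly tighter than this simple division suggests.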
Caution
The following instructions are recommended if you have at least 8 CPUs / 32 GB RAM per cluster node.
To increase the resources dedicated to Spark jobs, you need to access Ambari; refer to Accessing Ambari for further information.
You can find the Spark properties in Ambari: Spark2 => Configs => Advanced => Custom spark2-defaults. After changing any of these values, Ambari will ask you to restart some services.
In the Ambari toolbar on the left, expand Services, then click Spark2.
Select the CONFIGS tab, then below it click ADVANCED.
Expand Custom spark2-defaults and then click Add Property....
Add the following properties in the text box:
spark.driver.cores=3
spark.driver.memory=3g
spark.executor.cores=3
spark.executor.memory=3g
spark.executor.instances=3
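Before applying these values, it is worth checking what they add up to across the cluster. A minimal sketch of that arithmetic, using the values above:

```python
# Total resources one application with the settings above will request from YARN.
driver_cores, driver_memory_gb = 3, 3
executor_cores, executor_memory_gb = 3, 3
executor_instances = 3

total_cores = driver_cores + executor_cores * executor_instances              # 3 + 9
total_memory_gb = driver_memory_gb + executor_memory_gb * executor_instances  # 3 + 9

print(total_cores, total_memory_gb)  # 12 12
```

A single application therefore books 12 cores and 12 GB in total, spread across the driver and three executors; your cluster must have at least that much free for the job to leave the Accepted state.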
Check job execution
To get the details of the job execution (performance, executors, memory consumed, and so on), go to the YARN Resource Manager UI. To get there, click YARN and then the ResourceManager UI link in the Quick Links menu.
Note
The link in the Ambari UI refers the user to http://&lt;yarn-node&gt;:8088/ui2/, but this link is wrong. Remove the /ui2/ part from the URL to reach the Resource Manager UI.
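The URL fix described in the note can be expressed as a trivial string rewrite. The hostname below is a placeholder, not a real node:

```python
# The Quick Links entry points at the /ui2/ page; the classic Resource Manager
# UI lives at the same host and port without that path segment.
def classic_rm_url(quick_link):
    return quick_link.replace("/ui2/", "/")

print(classic_rm_url("http://yarn-node.example.com:8088/ui2/"))
# http://yarn-node.example.com:8088/
```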
Once there, you can find a list of all the jobs. Each job can be in one of these states:
- Accepted: The job has been submitted to the cluster, and YARN is waiting to have enough resources to execute the job.
- Running: The job has resources assigned and it's executing.
- Succeeded: The job has finished successfully.
- Failed: The job has failed due to an unexpected error.
- Canceled: The job has been manually canceled by the user.

DDC launches four different jobs on the cluster:
- ScanProcessorDDC: Processes the scan data to create the data lake.
- ScanReporterDDC: Generates the reports.
- DDCPqsTagger: Henry connector for remediation.
- DataObjectReporterDDC: Creates the data object report based on a dynamic query.
You can click on any application ID to go to the job execution details. There you can check the executors used to run the job, the memory assigned, and even more advanced information like the SQL execution plan. Let us take a look at an example of a ScanProcessorDDC job to understand what to watch for when deciding whether you need to increase the minimum recommended configuration.
Resource Manager UI
In this screenshot, you can see that the org.thales.ScanProcessorDDC job has finished successfully. You can also see how much time it took from when it was submitted to when it was completed.
Job details screen
If you click on the application ID link, you will see the job details page.
- Here, you can see how long the job took to complete, and the tracking URL, which is detailed in the next screenshot.
- Here you can check the job logs.
Job event timeline
After clicking on the tracking URL link, you can see this screen:
- Here you can check how many executors were added.
- In this area you can see how much time the job takes to complete.
- This is the list of jobs that make up the application.
- This link takes you to the environment screen (below).
- Here you can go to the executors details.
Environment screen
On this screen you can confirm that the properties you configured have been applied by Spark correctly.
Executors screen
This screen is the most useful for understanding whether each job has enough nodes assigned, or whether you need to add more resources or nodes. If you check this screen for a finished job, you can get the following information:
- RAM assigned to the application, and whether there is any dead node.
- Workers assigned to this particular job.
- How much memory was allocated for this job on each node.
- How much time each node spent on the execution.
However, if you want to know how much memory is being consumed during the execution, you need to open the Executors view while the job is running. This is easier with a scan or report that takes some time to complete.
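The same executor metrics shown in this view are also exposed by Spark's monitoring REST API under /api/v1/applications/&lt;app-id&gt;/executors. The sketch below parses a trimmed sample of such a response; the executor IDs and byte counts are made up for illustration.

```python
import json

# Trimmed, invented sample of the JSON returned by
# /api/v1/applications/<app-id>/executors (memoryUsed / maxMemory are in bytes).
sample = json.loads("""
[
  {"id": "driver", "memoryUsed": 52428800,  "maxMemory": 1073741824},
  {"id": "1",      "memoryUsed": 805306368, "maxMemory": 1073741824},
  {"id": "2",      "memoryUsed": 268435456, "maxMemory": 1073741824}
]
""")

for executor in sample:
    used_pct = 100 * executor["memoryUsed"] / executor["maxMemory"]
    print(f'executor {executor["id"]}: {used_pct:.0f}% of storage memory used')
```

An executor that sits near 100% here while shuffle read/write keeps growing is the kind of signal discussed in the tuning section below.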
How do I know what to do to improve TDP performance?
Obtaining a perfect configuration requires some trial-and-error testing by adjusting the properties described above. However, as a general rule, you can choose to increase the default configuration if you observe any of these signs:
You begin to notice out-of-memory exceptions during any DDC Spark job, especially the ScanProcessorDDC.
In the Executors view, during an execution, you observe that the job uses all the memory booked and the shuffle read/write grows beyond the nominal values. In this case, you can add more nodes and increase the spark.executor.instances value to improve the job distribution.
You observe that YARN can only run one job at a time. This can indicate that you allocated too many resources per job, so YARN can't run more than one job at the same time. You can solve this problem by decreasing the memory/cores allocated to the jobs, or by adding more executors so that YARN has more resources to use across all the jobs.
Keep in mind that, with the same number of executors (without horizontal scaling), increasing the resources (memory and cores) for each job decreases the number of concurrent jobs, so you need to test in order to find the right balance for your scenario.
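The trade-off above can be made concrete with a back-of-the-envelope calculation. The cluster size below is an assumption for illustration:

```python
def max_concurrent_executors(cluster_cores, cluster_memory_gb,
                             executor_cores, executor_memory_gb):
    """Upper bound on how many executors the cluster can run at once."""
    return min(cluster_cores // executor_cores,
               int(cluster_memory_gb // executor_memory_gb))

# Illustrative 3-node cluster with 24 cores / 96 GB in total.
small = max_concurrent_executors(24, 96, 1, 1)   # default-sized executors
large = max_concurrent_executors(24, 96, 3, 3)   # tuned 3-core / 3 GB executors
print(small, large)  # 24 8 -- bigger executors mean fewer can run concurrently
```

Tripling each executor's cores and memory cuts the maximum number of simultaneously running executors from 24 to 8 on this hypothetical cluster, which is exactly why larger per-job allocations reduce job concurrency.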