Performance Tuning
This section contains information on tuning TDP performance. Keep in mind, however, that each case must be studied separately, as data volumes and time requirements differ from one deployment to another.
How Do I Know What To Do To Improve TDP Performance?
Obtaining an optimal configuration requires some trial-and-error testing by adjusting the properties described in this section. However, as a general rule, consider increasing the default configuration if you observe any of the following signals:
You begin to notice out-of-memory exceptions during any DDC Spark job, especially ScanProcessorDDC.
In the executors view, during an execution, you observe that the jobs use all of the memory allocated to them and that the shuffle read/write grows beyond its nominal values. In this case, you can add more nodes and increase the spark.executor.instances value to improve job distribution.
You observe that YARN can only run one job at a time. This can indicate that you allocated too many resources per job, leaving YARN unable to run more than one job concurrently. You can solve this problem by decreasing the memory/cores allocated to the jobs, or by adding more nodes (and executors) so that YARN has more resources to share across all the jobs.
Keep in mind that, with the same number of nodes (no horizontal scaling), increasing the resources (memory and cores) allocated to each job decreases the number of jobs that can run concurrently, so you need to test in order to find the right balance for your scenario.
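As a rough illustration of this trade-off (the node size and property values below are assumptions chosen for the example, not recommendations, and YARN memory overhead is ignored for simplicity), you can estimate how many executors fit on a node by dividing the node's YARN-managed memory and cores by the per-executor settings:

```
# Hypothetical node managed by YARN: 32 GB RAM, 8 cores (assumed values)
spark.executor.memory=3g
spark.executor.cores=3

# Executors per node limited by memory: floor(32 / 3) = 10
# Executors per node limited by cores:  floor(8 / 3)  = 2
# The smaller value wins, so this node can host 2 executors.
# Doubling spark.executor.cores to 6 leaves room for only 1 executor
# per node (floor(8 / 6) = 1), halving concurrency on the same hardware.
```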
Spark Tuning
Spark can be configured by adjusting properties via Ambari. The official Spark documentation describes the many properties that control every aspect of Spark's behavior; this section covers only some of the most important ones.
Property Name | Default value | Purpose |
---|---|---|
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. |
spark.driver.memory | 1 GB | Amount of memory to use for the driver process. |
spark.executor.memory | 1 GB | Amount of memory to use per executor process, in MiB unless otherwise specified. |
spark.executor.cores | 1 in YARN mode | The number of cores to use on each executor. In standalone and Mesos coarse-grained modes, the default is all available cores on the worker. |
spark.task.cpus | 1 | Number of cores to allocate for each task. |
spark.executor.instances | 2 | The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large. |
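For reference, the same properties can also be set per application when a job is submitted manually with spark-submit; the flags below map directly onto the table entries, and the application class and JAR are placeholders. Values passed this way apply only to that submission:

```
# Hypothetical spark-submit invocation setting the properties from the table
# (application class and JAR are placeholders)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 3 \
  --driver-memory 3g \
  --executor-cores 3 \
  --executor-memory 3g \
  --num-executors 3 \
  --conf spark.task.cpus=1 \
  --class com.example.ExampleJob \
  example-job.jar
```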
Caution
The following settings are recommended only if each cluster node has at least 8 CPUs and 32 GB of RAM.
To increase the resources dedicated to the Spark jobs, you will need to access Ambari; refer to Accessing Ambari for further information.
In the Ambari toolbar on the left, expand Services, then click Spark2.
Select the CONFIGS tab, then below it click ADVANCED.
Expand Custom spark2-defaults and then click Add Property....
Add the following properties in the text box:
spark.driver.cores=3
spark.driver.memory=3g
spark.executor.cores=3
spark.executor.memory=3g
spark.executor.instances=3
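After saving the configuration and restarting the services that Ambari marks as requiring a restart, you can verify that the new values are in effect, for example in the Environment tab of the Spark UI of a running application. The quick check below is a sketch that assumes an HDP-style layout, where Ambari distributes the client configuration to /etc/spark2/conf; the exact path may differ in your installation.

```
# Quick check on a cluster node (path is typical for HDP and may vary):
grep -E 'spark\.(driver|executor)\.' /etc/spark2/conf/spark-defaults.conf
```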