Performance tuning your Hadoop/Spark clusters to use cloud storage

Tuesday, June 19
11:00 AM - 11:40 AM
Executive Ballroom 210A/E

Remote storage separates compute from storage, which opens up a new world of highly scalable and cost-effective storage. Remote storage in the cloud built to the HDFS standard has features that make it a great choice for storing and analyzing petabytes of data at a time. Customers can have unlimited storage capacity, with no limit on the number or size of files. At that scale, I/O performance becomes an increasingly important consideration when analyzing the data. For all workloads, remote storage in the cloud can provide excellent performance when the relevant knobs are tuned correctly.

When running workloads atop remote storage, the most important thing is to maximize the throughput used between the compute layer and the storage layer. Oftentimes, the compute layer isn’t large enough to issue enough parallel reads and writes to saturate the throughput available in the storage layer. The result is poor performance, because available resources are underutilized. We have heard about this problem in many customer conversations, so we’ve identified areas where users can tune performance to improve their job run times.

The compute layer consists of the physical layer, the YARN layer, and the workload layer, and each has considerations that can increase the concurrency of reads and writes to the store. In the physical layer, the size and number of nodes in the user’s cluster should be chosen to suit each workload. In the YARN layer, memory allocation and the number of containers are two of the variables that can be tuned for all workloads. Additionally, setting the number of tasks so that all available resources are used will further improve job run times. In the workload layer, we dive deeper into Spark and Hive, each of which has specific knobs that can be modified for performance tuning. Taking advantage of the nuances of each workload will help the user extract every last bit of performance from remote storage.

In Spark, we first determine the number of applications running on the cluster to understand how many resources are available to our workload. Then we set executor memory and executor cores to optimize for an I/O-intensive workload. Lastly, we determine the number of executors based on the available resources and the memory and core settings chosen in the previous step.
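
The three Spark steps above can be sketched as simple arithmetic. This is a minimal illustration, not the method presented in the session: the function name `plan_executors`, the even split of resources across applications, and the overhead rule (the larger of 384 MB or 10% of executor memory) are assumptions made for the example, and the cluster figures in the sample call are hypothetical. The sketch also ignores the container used by the driver or application master.

```python
def plan_executors(nodes, yarn_mem_per_node_mb, cores_per_node,
                   num_apps, executor_mem_mb, executor_cores):
    """Size Spark executors for one application's share of a YARN cluster."""
    # Step 1: assume resources are split evenly across concurrent applications.
    mem_share_mb = nodes * yarn_mem_per_node_mb // num_apps
    core_share = nodes * cores_per_node // num_apps

    # Step 2: each executor container needs its heap plus off-heap overhead.
    # Assumption: overhead is max(384 MB, 10% of executor memory).
    overhead_mb = max(384, executor_mem_mb // 10)
    container_mb = executor_mem_mb + overhead_mb

    # Step 3: the executor count is capped by whichever resource runs out first.
    by_memory = mem_share_mb // container_mb
    by_cores = core_share // executor_cores
    num_executors = min(by_memory, by_cores)

    return {
        "spark.executor.memory": f"{executor_mem_mb}m",
        "spark.executor.cores": executor_cores,
        "spark.executor.instances": num_executors,
    }


if __name__ == "__main__":
    # Hypothetical cluster: 8 nodes, 100 GB of YARN memory and 16 cores each,
    # shared by 2 applications.
    conf = plan_executors(nodes=8, yarn_mem_per_node_mb=102400,
                          cores_per_node=16, num_apps=2,
                          executor_mem_mb=4096, executor_cores=2)
    for key, value in conf.items():
        print(f"{key}={value}")
```

In this sketch, smaller executors raise the number of executors that fit, which is one way to increase the number of concurrent reads and writes against the store; whether memory or cores becomes the limit depends on the cluster.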

In Hive, memory plays the main role in determining how many YARN containers can run concurrently. By tuning the memory, more YARN containers can be created to run more tasks in parallel. Next, controlling the split-waves and mapper size will help to ensure that all available containers are used.
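
The relationship between container memory, concurrent containers, and mapper size can be sketched as follows. This is an illustration under stated assumptions rather than the tuning procedure from the session: it assumes Hive running on Tez, where settings such as `hive.tez.container.size` and `tez.grouping.split-waves` are the likely knobs, and the function name `hive_parallelism` and all figures in the sample call are hypothetical.

```python
def hive_parallelism(total_yarn_mem_mb, container_mb, input_bytes, split_waves):
    """Estimate concurrent containers and a mapper size that keeps them busy."""
    # Memory decides how many YARN containers can run at the same time.
    containers = total_yarn_mem_mb // container_mb

    # Split-waves is treated here as a multiplier on available containers,
    # giving the number of map tasks to aim for.
    target_tasks = max(1, int(containers * split_waves))

    # Mapper size: input divided across the target tasks (ceiling division).
    bytes_per_mapper = -(-input_bytes // target_tasks)

    return {
        "containers": containers,
        "target_tasks": target_tasks,
        "bytes_per_mapper": bytes_per_mapper,
    }


if __name__ == "__main__":
    # Hypothetical cluster with 800 GB of YARN memory and 1 TB of input.
    for container_mb in (4096, 2048):
        plan = hive_parallelism(total_yarn_mem_mb=819200,
                                container_mb=container_mb,
                                input_bytes=10**12,
                                split_waves=1.7)
        print(container_mb, plan)
```

Halving the container size in this sketch doubles the number of concurrent containers, and the mapper size shrinks accordingly so that every container has work to do.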

By taking a comprehensive approach to performance tuning, the user can maximize the throughput used between compute and storage. Attendees will walk away with a strong understanding of how to tune their workloads correctly when their data is stored in remote storage in the cloud. Fast performance removes the last big blocker for enterprises and makes it a no-brainer for them to shift their data to remote storage.


Stephen Wu
Senior Program Manager
Stephen Wu is a senior program manager for big data at Microsoft.