Uncovering an Apache Spark 2 Benchmark – Configuration, Tuning and Test Results

Tuesday, June 19
4:00 PM - 4:40 PM
Grand Ballroom 220C

Apache Spark is increasingly adopted as an alternative processing framework to MapReduce because of its ability to speed up batch, interactive, and streaming analytics. Spark enables new analytics use cases such as machine learning and graph analysis with its rich, easy-to-use programming libraries. It also offers the flexibility to run analytics on data stored in Hadoop, across object stores, and within traditional databases. This makes Spark an ideal platform for accelerating cross-platform analytics on-premises and in the cloud. Building on the success of the Spark 1.x releases, Spark 2.x delivers major improvements in the areas of API, performance, and Structured Streaming. In this paper, we give a high-level view of the Apache Spark framework, then focus on the improvements in Apache Spark 2.x that we consider most important. We then share the results of a real-world benchmark effort, detail the Spark and environment configuration changes made in our lab, discuss the benchmark results, and provide a reference architecture example for those interested in taking Spark 2.x for their own test drive. This presentation stresses the value of upgrading from Spark 1 to Spark 2: performance testing shows a 2.3x improvement on SparkSQL workloads similar to TPC Benchmark™ DS (TPC-DS).


Mark Lochbihler
Principal Architect
Mark is currently in his fifth year within Hortonworks Partner Engineering and has 29 years of experience working with Advanced Analytics, Distributed Computing, and Data platforms. He is currently focused on helping customers leverage Advanced Analytics, IoT, and Big Data capabilities to achieve a competitive advantage. Mark has a BS in Computer Science from North Carolina State University and also holds a Six Sigma Black Belt.
Viplava Madasu
Big Data Systems Engineer
Hewlett Packard Enterprise
Viplava Madasu is a Big Data Systems Engineer at Hewlett Packard Enterprise, where he currently evaluates emerging big data technologies and creates reference architectures for HPE converged infrastructure platforms. Previously, he developed software in several groups at HPE, in the application server middleware, Java HotSpot JVM, and SQL database engine areas. He holds a Master's degree in Computer Science from the Indian Institute of Technology, Kharagpur.