Evaluation of TPC-H on Spark and Spark SQL in ALOJA

Thursday, April 19
2:50 PM - 3:30 PM
Convention Hall I - C

The evaluation of TPC-H on Spark and Spark SQL in ALOJA was conducted at the Big Data Lab as part of a master's degree in Management Information Systems at the Johann Wolfgang Goethe University in Frankfurt, Germany. Part of the analysis was carried out in close collaboration with the Barcelona Supercomputing Center.

This research had two goals: to integrate a TPC-H benchmark written in Scala for Spark into ALOJA, an open-source, public platform for automated and cost-efficient benchmarking, and to compare the runtime of Spark Scala, with and without the Hive Metastore, against Spark SQL. We also evaluate how different file formats, and the compression applied to the underlying data, affect performance. The performance evaluation produced varied results for both benchmarks. Further investigation attempts to detect possible bottlenecks and other irregularities, and we examine the physical plans to explain the behavior of Spark’s engine. Our experiments show, among other things, that (1) Spark Scala performs better for heavy expression calculation, and (2) Spark SQL is the better choice when strong data access locality is combined with heavyweight parallel execution. Overall, each API has its own advantages and disadvantages.
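
As a rough illustration of what comparing a programmatic API with a SQL API means, the sketch below computes one TPC-H-style query two ways and checks that the answers agree. It is not Spark and not the benchmark code from this work: Python's built-in sqlite3 stands in for the SQL side, a plain Python expression stands in for the programmatic side, and the sample rows are invented. The query follows the shape of TPC-H Q6 with its commonly used default parameters; the abstract does not say which queries drove the reported differences, so the choice of Q6 is an assumption.

```python
import math
import sqlite3

# Invented sample rows (NOT generated TPC-H data):
# (l_quantity, l_extendedprice, l_discount, l_shipdate)
ROWS = [
    (10, 1000.0, 0.06, "1994-03-15"),  # qualifies
    (30, 2000.0, 0.06, "1994-06-01"),  # rejected: quantity too large
    (5, 500.0, 0.02, "1994-07-20"),    # rejected: discount out of range
    (12, 1500.0, 0.06, "1995-02-01"),  # rejected: shipped too late
    (20, 3000.0, 0.06, "1994-12-31"),  # qualifies
]

# Q6-style filter and aggregate, written declaratively.
Q6_SQL = """
SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1994-01-01'
  AND l_shipdate < '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24
"""


def revenue_declarative(rows):
    """SQL side: the engine (here sqlite3) decides how to run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lineitem ("
        "l_quantity INTEGER, l_extendedprice REAL, "
        "l_discount REAL, l_shipdate TEXT)"
    )
    conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", rows)
    (revenue,) = conn.execute(Q6_SQL).fetchone()
    conn.close()
    return revenue


def revenue_programmatic(rows):
    """Programmatic side: the filter and the expression are spelled out by hand."""
    return sum(
        price * discount
        for quantity, price, discount, shipdate in rows
        # ISO-8601 date strings compare correctly as plain strings.
        if "1994-01-01" <= shipdate < "1995-01-01"
        and 0.05 <= discount <= 0.07
        and quantity < 24
    )


if __name__ == "__main__":
    sql_result = revenue_declarative(ROWS)
    code_result = revenue_programmatic(ROWS)
    print(sql_result, code_result, math.isclose(sql_result, code_result))
```

Both paths return the same revenue for the two qualifying rows. A benchmark such as the one described here asks the follow-up question: given identical results, which formulation does the engine execute faster, and why.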

Surprisingly, our findings are split between Spark SQL and Spark Scala. Contrary to our expectations, Spark Scala did not outperform Spark SQL in all aspects. Instead, the results support the idea that Spark implements its optimizations differently in its core and in its Spark SQL extension. The Spark SQL API, which sits on top of Spark, provides extra information about the underlying structured data, which is probably used to perform additional optimizations.

In conclusion, our research demonstrates that there are differences in how query execution plans are generated, which is in line with similar findings of inefficient joins, and it underlines the importance of our benchmark for identifying disparities and bottlenecks.

SPEAKERS

Raphael Radowitz
Quality Specialist
SAP Labs Korea
Raphael Radowitz is a Developer at SAP Labs Korea with comprehensive, detailed and up-to-date experience working with global teams on creating and implementing databases, project management, report generation, Spark and Hadoop. He recently submitted his master's thesis, Evaluation of TPC-H on Spark and SparkSQL in ALOJA, and obtained a master's degree in Management Information Systems at the Johann Wolfgang Goethe University in Frankfurt, Germany.

His core fields of expertise include:
• Databases, Hadoop Ecosystem, Apache Spark and Business Intelligence
• Reporting Technologies
• Project Management and Microsoft Project 2007 to 2013

ACADEMIC STUDIES
• 2017 Master of Science: Management Information Systems at Johann Wolfgang Goethe University Frankfurt am Main, Germany
• 2013 Semester abroad, Sungkyunkwan University (성균관대학교), Seoul, South Korea
• 2012 Student of the Master Degree programme Management Information Systems (1 semester) at Technische Hochschule Mittelhessen (THM), University of Applied Sciences, Germany
• 2012 Bachelor of Science: Management Information Systems at Fachhochschule Frankfurt am Main, University of Applied Sciences Frankfurt, Germany
• 2009 Semester abroad, Konkuk University (건국대학교), Seoul, South Korea