The Evaluation of TPC-H on Spark and Spark SQL in ALOJA was conducted at the Big Data Lab in fulfillment of the master's degree in Management Information Systems at the Johann Wolfgang Goethe University in Frankfurt, Germany. Part of the analysis was carried out in close collaboration with the Barcelona Supercomputing Center.
The aim of this research was to integrate a TPC-H benchmark for Spark Scala into ALOJA, an open-source, public platform for automated and cost-efficient benchmarking, and to evaluate the runtime of Spark Scala, with and without the Hive Metastore, against that of Spark SQL. We also evaluate several file formats with different compression schemes applied to the underlying data, and their impact on performance. The performance evaluation produced varied results for both benchmarks. Further investigation seeks to identify possible bottlenecks and other irregularities, and to explain them by examining the physical plans, thereby improving the understanding of Spark’s engine. Our experiments show, inter alia, that (1) Spark Scala performs better in cases of heavy expression calculation, and (2) Spark SQL is the better choice in cases of strong data access locality combined with heavyweight parallel execution. Overall, the results are mixed, and each API has its own advantages and disadvantages.
Surprisingly, our findings are spread fairly evenly between Spark SQL and Spark Scala: contrary to our expectations, Spark Scala did not outperform Spark SQL in all aspects. Instead, the results support the idea that Spark implements its optimizations differently for its core and for its extension, Spark SQL. The API on top of Spark provides extra information about the structure of the underlying data, which is probably used to perform additional optimizations.
In conclusion, our research demonstrates that query execution plans are generated differently for the two APIs, which is consistent with similar findings in which such differences led to inefficient joins, and it underlines the importance of our benchmark for identifying disparities and bottlenecks.