Fast and Reliable Apache Spark SQL Releases

Thursday, March 21
11:50 AM - 12:30 PM
Room 120-121

In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1,200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. To automatically identify correctness issues and performance regressions, we have built a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.

Randomized query testing extends the coverage of typical unit test suites, while micro-benchmarks and application-like benchmarks measure new features and ensure that existing ones do not regress. We will discuss the approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we found through random query generation, as well as critical performance regressions we were able to diagnose within hours thanks to our automated benchmarking tools.
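
As a rough illustration of random query generation combined with random data generation, the sketch below runs randomly generated filter queries against an engine and compares each result with a simple reference evaluation. It is a minimal sketch and not the framework presented in the talk: SQLite stands in for Spark SQL so that the example runs with the Python standard library alone, and the table layout and all function names (`random_rows`, `random_predicate`, `run_engine`, `run_reference`, `fuzz`) are hypothetical.

```python
import operator
import random
import sqlite3

# Hypothetical two-column integer table used only for this illustration.
COLUMNS = ["a", "b"]
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
}


def random_rows(rng, n):
    """Random data generation: n rows of small integers (no NULLs)."""
    return [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n)]


def random_predicate(rng):
    """Random query generation: one comparison of a column with a constant."""
    return rng.choice(COLUMNS), rng.choice(sorted(OPS)), rng.randint(-5, 5)


def run_engine(rows, pred):
    """Evaluate the query on the engine under test (SQLite as a stand-in)."""
    col, op, value = pred
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # col and op come from fixed whitelists above; the constant is bound.
    sql = f"SELECT a, b FROM t WHERE {col} {op} ?"
    result = conn.execute(sql, (value,)).fetchall()
    conn.close()
    return sorted(result)


def run_reference(rows, pred):
    """Evaluate the same query with a straightforward Python filter."""
    col, op, value = pred
    idx = COLUMNS.index(col)
    return sorted(r for r in rows if OPS[op](r[idx], value))


def fuzz(seed, iterations=200):
    """Return every (rows, predicate) pair on which the two sides disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(iterations):
        rows = random_rows(rng, rng.randint(0, 20))
        pred = random_predicate(rng)
        if run_engine(rows, pred) != run_reference(rows, pred):
            mismatches.append((rows, pred))
    return mismatches
```

Calling `fuzz(seed)` with a fixed seed makes any mismatch reproducible, which matters when a randomly generated query exposes a correctness issue that then has to be debugged. A fuller setup would presumably generate much richer queries (joins, aggregates, NULL handling) and compare engine versions or configurations against each other rather than against a hand-written filter.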


Nicolas Poggi
Sr Performance Engineer
Nicolas is a researcher overseeing the performance and scalability of new Spark releases at Databricks, where he, along with the Amsterdam SQL performance team, is implementing the new benchmarking and monitoring infrastructure for the Databricks cloud platform. Previously, he led a project on upcoming architectures for Big Data processing at the Barcelona Supercomputing Center (BSC) - Microsoft Research joint center. Nicolas received his Ph.D. in Distributed Systems and Computer Architecture at UPC/BarcelonaTech, where he still contributes as part of the HPC and Data Centric Computing research groups.
Bogdan Ghit
Software Engineer
Bogdan Ghit is a computer scientist and software engineer at Databricks, where he works on optimizing the SQL performance of Apache Spark. Prior to joining Databricks, Bogdan pursued his Ph.D. at Delft University of Technology, where he worked broadly on datacenter scheduling with a focus on data analytics frameworks such as Hadoop and Spark. His thesis led to a large number of publications in top conferences such as ACM SIGMETRICS and ACM HPDC.