This talk will highlight two recent additions to Apache Spark 2.3: vectorized UDFs and parallel cross-validation. Vectorized UDFs not only improve performance, they also open up more possibilities by using Pandas for the input and output of the UDF. Parallel cross-validation speeds up the tuning of ML models by making full use of your Spark cluster resources. Bryan will discuss the details of Apache Arrow in Spark, how it is relevant to the rest of the big data ecosystem, and what is in store for the future. We will share performance results from using parallelism in cross-validation and describe some ongoing work on optimizing ML pipelines.
This talk will also touch on how we are building on this work to enhance the end-to-end enterprise AI lifecycle in open source for developers and data scientists. Finally, we will highlight some of the other relevant projects at the Center for Open Source Data and AI Technologies and how you can contribute to them.