Optimizing your SparkML pipelines using the latest features in Spark 2.3

Thursday, June 21
12:20 PM - 1:00 PM
Grand Ballroom 220B

This talk will highlight the recent additions of vectorized UDFs and parallel cross-validation in Apache Spark 2.3. Vectorized UDFs not only improve performance but also open up more possibilities by using Pandas for the input and output of the UDF. Parallel cross-validation speeds up the tuning of ML models by making fuller use of your Spark cluster's resources. Bryan will discuss the details of Apache Arrow in Spark, how it is relevant to the rest of the big data ecosystem, and what is in store for the future. We will share performance results from using parallelism in cross-validation and some ongoing work on optimizing ML pipelines.
This talk will also touch upon how we are leveraging this work to enhance the end-to-end Enterprise AI lifecycle in Open Source for developers and data scientists. Finally, we will highlight some of the other relevant projects at the Center for Open Source Data and AI Technologies and how you can contribute to them.
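
The idea behind parallel cross-validation can be sketched in plain Python. This is only a conceptual illustration, not Spark's implementation: the toy one-parameter model and the names `k_folds`, `fit_and_score`, and `parallel_cross_validate` are hypothetical and introduced here for illustration. It assumes each (parameter, fold) combination can be fitted independently, so the fits can be handed to a thread pool whose size plays the role of the `parallelism` parameter that Spark 2.3 added to `CrossValidator`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean


def k_folds(n_rows, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    indices = list(range(n_rows))
    fold_size = n_rows // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any leftover rows.
        stop = n_rows if i == k - 1 else start + fold_size
        yield indices[:start] + indices[stop:], indices[start:stop]


def fit_and_score(xs, ys, train, valid, reg):
    """Toy model (hypothetical): regularized slope through the origin.

    Returns the negative mean squared error on the validation rows,
    so that a larger score is better.
    """
    num = sum(xs[i] * ys[i] for i in train)
    den = sum(xs[i] ** 2 for i in train) + reg
    slope = num / den
    return -mean((ys[i] - slope * xs[i]) ** 2 for i in valid)


def parallel_cross_validate(xs, ys, reg_grid, k=3, parallelism=4):
    """Score every (reg, fold) pair concurrently and average per reg value."""
    folds = list(k_folds(len(xs), k))
    tasks = list(product(reg_grid, folds))

    def run(task):
        reg, (train, valid) = task
        return fit_and_score(xs, ys, train, valid, reg)

    # Each fit is independent, so a thread pool can run them side by side.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(run, tasks))

    avg = {
        reg: mean(s for (r, _), s in zip(tasks, scores) if r == reg)
        for reg in reg_grid
    }
    best = max(avg, key=avg.get)
    return best, avg
```

Calling `parallel_cross_validate(xs, ys, [0.0, 100.0])` returns the best regularization value together with the average score for each candidate. In this sketch the threads only demonstrate the scheduling pattern; in a real Spark job each fit would launch distributed work on the cluster, which is where running several fits at once can pay off.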

SPEAKERS

Vijay Bommireddipalli
Program Director: Center for Open Source Data and AI Technologies
IBM
Vijay Bommireddipalli leads IBM’s Center for Open Source Data and AI Technologies (CODAIT – http://codait.org), formerly known as the Spark Technology Center. His team focuses on the creation and curation of Open Source Data and AI technologies at IBM. He joined IBM after finishing his MS in Computer Engineering at the University of Massachusetts – Dartmouth. He has expertise in various technologies, including Apache Spark and the Big Data ecosystem, Data Persistence, Data Management tooling, and Data Warehousing. He has presented extensively on these topics at conferences worldwide.
Bryan Cutler
Software Engineer - Center for Open Source Data and AI Technologies
IBM
Bryan Cutler is a software engineer at IBM’s Center for Open Source Data and AI Technologies, where he works on big data analytics and machine learning systems. He is a committer on Apache Spark in the areas of ML, SQL, Core, and Python, and a committer on the Apache Arrow project. His interests are in pushing the boundaries of software to build high-performance tools that are also a snap to use.