Productionizing Spark ML pipelines with the portable format for analytics

Thursday, April 19
2:00 PM - 2:40 PM
Room V

The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering, and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.

This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. Because execution of Spark ML pipelines is tightly coupled with the Spark SQL runtime, deployment using Spark is ill-suited to the needs of real-time predictive applications.

In this talk, I will introduce the Portable Format for Analytics (PFA) for portable, open, and standardized deployment of data science pipelines and analytic applications. I will also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA, and compare it with the available alternatives, including PMML, MLeap, ONNX, and Apple's CoreML.
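
To give a rough sense of what a portable scoring document looks like, the sketch below builds a minimal PFA document as a plain Python dictionary and serializes it to JSON. A PFA document declares an input type, an output type, and an action; this one adds 100 to each input. The `evaluate_toy` function is a hypothetical, hand-written stand-in that understands only this single document shape. It is not a PFA engine and is not part of Aardpfark or any PFA library; it only illustrates that the scoring logic lives in the document rather than in the Spark runtime.

```python
import json

# Minimal PFA document: takes a double, returns a double, adds 100 to the input.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

# The document is plain JSON, so it can be shipped to any system that can read it.
serialized = json.dumps(pfa_document, sort_keys=True)


def evaluate_toy(document, value):
    """Hypothetical evaluator for illustration only.

    Supports exactly one action of the form {"+": [a, b]}, where each
    argument is either the symbol "input" or a numeric literal.
    """
    (expression,) = document["action"]
    ((operator, arguments),) = expression.items()
    if operator != "+":
        raise ValueError("this toy evaluator only understands '+'")
    left, right = (value if arg == "input" else arg for arg in arguments)
    return left + right


# Score a value using only the deserialized document.
result = evaluate_toy(json.loads(serialized), 2.0)
```

A real deployment would hand the same JSON to a full PFA scoring engine instead of the toy function above.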

Nick Pentreath
Principal Engineer
Nick Pentreath is a principal engineer at IBM's Center for Open-Source Data & AI Technologies (CODAIT), where he works on open-source machine learning projects. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.