The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering, and model selection, while important, is only one part of the picture. A critical missing piece is the deployment and management of models, along with the integration between the model creation and deployment phases.
This is particularly challenging when deploying Apache Spark ML pipelines for low-latency scoring. Because execution of Spark ML pipelines is tightly coupled to the Spark SQL runtime, deploying with Spark itself is ill-suited to the needs of real-time predictive applications.
In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open, and standardized deployment of data science pipelines and analytic applications. I will also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA, and compare it with other available alternatives, including PMML, MLeap, ONNX, and Apple’s Core ML.