Quick! Quick! Exploration!: A framework for searching a predictive model on Apache Spark

Thursday, June 21
9:30 AM - 10:10 AM
Executive Ballroom 210C/G

Research and development of machine learning (ML) algorithms is a hot topic in data analytics. Novel open-source (OSS) ML libraries, such as Google's TensorFlow and the University of Washington's XGBoost, are released continually.

As the choices of ML algorithms and libraries multiply, model selection is becoming a serious pain point of data analytics in many business use cases. Despite advances in ML technology, achieving high accuracy still requires hyperparameter tuning over a large search space. Data scientists have to execute ML algorithms hundreds to thousands of times while switching OSS libraries and hyperparameter configurations, a process that can last several days. Data preprocessing is another major headache, because model selection across multiple ML OSS libraries requires converting the data to each library's format and saving the converted data to storage.
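To see why the search space balloons so quickly, consider a purely illustrative hyperparameter grid for a single gradient boosting model family (the parameter names and values below are hypothetical, not the framework's actual search space):

```python
from itertools import product

# Hypothetical hyperparameter grid for one gradient-boosting-tree model family.
grid = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [100, 300, 500],
    "subsample": [0.6, 0.8, 1.0],
}

# Enumerate every combination of hyperparameter values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 4 * 4 * 3 * 3 = 144 configurations for this one family
```

Multiply 144 configurations by several model families and libraries, and training each candidate serially for minutes quickly adds up to days.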

To address this pain, we developed a high-speed framework for searching for a predictive model using Apache Spark. Our framework automatically performs hyperparameter tuning across typical black-box models (e.g., multilayer perceptron models and gradient boosting tree models) and white-box models (e.g., decision tree models and linear models). It currently employs TensorFlow and XGBoost as is, and is designed to integrate future ML OSS releases. Our framework reduces the time to train and select among hundreds of predictive models from days to hours by leveraging Spark's high-speed in-memory computing architecture.
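The core pattern is to fan candidate configurations out to parallel workers, score each trained model, and keep the best. The minimal sketch below stands in for that loop using Python's `concurrent.futures` in place of Spark's distributed execution; the `evaluate` function, its toy scoring rule, and the parameter values are all hypothetical placeholders for actually fitting an XGBoost or TensorFlow model:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    """Stand-in for training a model and returning its validation score.
    A real evaluator would fit, e.g., an XGBoost or TensorFlow model."""
    # Toy score that peaks at max_depth=5, learning_rate=0.1 (illustrative only).
    return 1.0 - abs(config["max_depth"] - 5) * 0.05 - abs(config["learning_rate"] - 0.1)

configs = [
    {"max_depth": d, "learning_rate": lr}
    for d in (3, 5, 7)
    for lr in (0.01, 0.1, 0.3)
]

# Evaluate candidate configurations in parallel; in the actual framework this
# fan-out runs on Spark executors so hundreds of models are trained in memory.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, configs))

best_score, best_idx = max(zip(scores, range(len(configs))))
print(configs[best_idx], best_score)
```

Because each candidate is independent, this pattern parallelizes cleanly, which is what lets an in-memory cluster engine like Spark cut wall-clock time from days to hours.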

In this talk we present an overview of the framework architecture, design challenges and solutions, experimental results, and Spark technical tips we found through the development. High-speed model selection with our framework also demonstrates that, with thorough hyperparameter tuning, even white-box models can achieve accuracy competitive with black-box models. Our framework thus offers a practical option to choose white-box models when the interpretability of predictions is a barrier to model serving in real business operations.


Masato Asahara
NEC System Platform Research Laboratories
Masato Asahara (Ph.D.) currently leads the development of Spark-based machine learning and data analytics systems that fully automate predictive modeling. Masato received his Ph.D. from Keio University and has worked at NEC for eight years as a researcher in distributed computing systems and computing resource management technologies.
Yoshiki Takahashi
Tokyo Institute of Technology
Yoshiki Takahashi is a master's student in computer science at the graduate school of Tokyo Institute of Technology. His academic research proposal was accepted at SysML 2018, a conference that has attracted attention since its earlier workshop era at NIPS. He worked on the development of a Spark-based machine learning platform for automatic predictive modeling during his internship at NEC Data Science Research Laboratories in 2017. He received his B.S. from Tokyo Institute of Technology in 2017.