Research and development of machine learning (ML) algorithms is a hot topic in data analytics. New open-source (OSS) ML libraries are continuously released, such as Google's TensorFlow and the University of Washington's XGBoost.
As the choices of ML algorithms and libraries increase, model selection is becoming a serious pain point of data analytics in many business use cases. Despite advances in ML technologies, achieving high accuracy essentially requires hyperparameter tuning over a large search space. Data scientists have to execute ML algorithms hundreds to thousands of times, switching OSS libraries and hyperparameter configurations, and such runs can last several days. Data preprocessing is another big headache, because model selection across multiple ML OSS libraries requires converting the data into each library's format and saving the converted data to storage.
To address this pain, we developed a high-speed framework for predictive model search using Apache Spark. Our framework automatically performs hyperparameter tuning across typical black-box models (e.g., multilayer perceptron and gradient boosting tree models) and white-box models (e.g., decision tree and linear models). It currently employs TensorFlow and XGBoost as is, and is open to integration with future releases of ML OSS. Our framework reduces the time to train and select hundreds of predictive models from days to hours by leveraging Spark's high-speed in-memory computing architecture.
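The core loop the framework parallelizes can be sketched as follows. This is a minimal, illustrative Python sketch, not the actual implementation: a toy scoring function stands in for training with TensorFlow or XGBoost, and a thread pool stands in for Spark executors; all names and the search-space contents are assumptions for illustration.

```python
# Sketch of parallel model selection over a multi-library search space.
# In the real framework, each configuration would be trained and validated
# as a distributed Spark task; here evaluate() is a stand-in.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space spanning a black-box and a white-box model family.
SEARCH_SPACE = {
    "gradient_boosting": {"max_depth": [3, 6], "eta": [0.1, 0.3]},
    "decision_tree": {"max_depth": [3, 6]},
}

def expand(space):
    """Enumerate every (model, hyperparameter-dict) pair in the grid."""
    for model, params in space.items():
        keys = list(params)
        for values in product(*(params[k] for k in keys)):
            yield model, dict(zip(keys, values))

def evaluate(config):
    """Stand-in for train + validate; returns (validation_score, config)."""
    model, params = config
    # Toy score so the example runs without any ML library installed.
    score = params["max_depth"] / 10 + params.get("eta", 0.0)
    return score, config

def select_model(space):
    configs = list(expand(space))
    # The pool plays the role of Spark executors evaluating configs in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(evaluate, configs))
    return max(results)  # best (score, (model, params))

best_score, (best_model, best_params) = select_model(SEARCH_SPACE)
```

The key design point this sketch captures is that every hyperparameter configuration is an independent unit of work, which is what lets Spark spread the search across a cluster instead of running it sequentially on one machine.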
In this talk we present an overview of the framework architecture, design challenges and solutions, experimental results, and Spark technical tips we discovered during development. High-speed model selection with our framework also demonstrates that, with thorough hyperparameter tuning, even white-box models can achieve accuracy competitive with that of black-box models. Our framework thus offers a practical option to choose white-box models when interpretability of predictions is a barrier to model serving in real business operations.