Accelerating XGBoost applications with GPU and Spark

Tuesday, June 19
2:50 PM - 3:30 PM
Grand Ballroom 220C

XGBoost is a library designed and optimized for generalized gradient boosting. It provides state-of-the-art performance on typical supervised machine learning problems, is used in more than half of the machine learning challenges on Kaggle, and has a large user base in industry.

Although XGBoost performs better than other gradient-boosting implementations, training an XGBoost model is still time-consuming, and getting a highly accurate model usually requires extensive parameter tuning. Both create a strong need to speed up the whole process. There are two ways to do so: use more powerful hardware such as GPUs, or leverage a distributed computation framework such as Apache Spark. The latest version of XGBoost supports parallel tree construction algorithms on GPUs, which can significantly improve model training performance. XGBoost can also be integrated with Spark to build a unified machine learning pipeline on massive data, with optimized parallel parameter tuning.
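As a rough illustration of how the two directions combine, the following Python sketch expands a small hyperparameter search space into independent parameter sets, each requesting GPU tree construction. It is a minimal sketch under stated assumptions, not the speakers' implementation: the helper names (make_param_grid, train_one) and the search values are hypothetical, and "gpu_hist" is the tree_method value that XGBoost releases of this period used to select the GPU histogram algorithm (newer releases select the GPU through a separate device parameter instead).

```python
from itertools import product


def make_param_grid(base, search_space):
    """Expand a search space into a list of complete parameter dicts."""
    keys = sorted(search_space)
    grid = []
    for values in product(*(search_space[k] for k in keys)):
        params = dict(base)               # copy so each trial is independent
        params.update(zip(keys, values))  # overlay this trial's values
        grid.append(params)
    return grid


# Assumption: "gpu_hist" requests GPU histogram-based tree construction.
BASE_PARAMS = {"objective": "binary:logistic", "tree_method": "gpu_hist"}

# Hypothetical search space, chosen only for illustration.
SEARCH_SPACE = {"max_depth": [4, 6, 8], "eta": [0.05, 0.1]}


def train_one(params, dtrain, num_boost_round=100):
    """Train one model for one parameter set (needs xgboost and a GPU)."""
    import xgboost as xgb  # imported lazily so the grid helper runs without it
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)


GRID = make_param_grid(BASE_PARAMS, SEARCH_SPACE)
```

Each entry of GRID is an independent trial, so a cluster scheduler such as Spark could hand the trials to separate GPU-equipped executors. How the trials are actually distributed in the talk's pipeline is not specified here.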

In this talk, we will cover the implementation and performance improvements of the GPU-based XGBoost algorithm, summarize our model tuning experience and best practices, share insights on how to build a heterogeneous data analytics and machine learning pipeline based on Spark in a GPU-equipped YARN cluster, and show how to push a model into production.


Yanbo Liang
Staff software engineer
Yanbo is a staff software engineer at Hortonworks. His main interests center on implementing effective machine learning and deep learning algorithms and models in areas such as recommendation systems and natural language processing. He is an Apache Spark PMC member and contributes to many other open source projects, such as TensorFlow and Apache MXNet. He delivered the implementation of some core Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom, working on machine learning and distributed systems.
Mingjie Tang
Member of Technical Staff
Mingjie Tang is an engineer at Hortonworks, where he works on Spark SQL, Spark MLlib, and Spark Streaming. He has broad research interests in database management systems, similarity query processing, data indexing, big data computation, data mining, and machine learning. Mingjie completed his PhD in computer science at Purdue University.