Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning


Thursday, March 21
11:00 AM - 11:40 AM
Room 127-128

Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is that big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together, because big data jobs and deep learning jobs are executed in very different ways.

Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can tell Spark whether each stage of a pipeline should run in MapReduce mode or barrier mode, which makes it easy to embed distributed deep learning training as a Spark stage and simplifies the training workflow. In this talk, I will demonstrate step by step how to build a real-world pipeline that combines data processing in Spark with deep learning training in TensorFlow. I will also share best practices and hands-on experience to show the power of this new feature and to prompt more discussion on the topic.
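
As a rough illustration of what a barrier-mode stage can look like in PySpark, here is a minimal sketch. It is not the pipeline from the talk. The helper names (build_tf_config, train_partition, run_barrier_stage), the base port 2222, and the choice to derive one worker port per task are assumptions made for this example, and the training step itself is only a placeholder. The Spark calls it relies on (RDD.barrier(), BarrierTaskContext.get(), getTaskInfos(), and barrier()) are the ones added in Spark 2.4.

```python
import json


def build_tf_config(addresses, task_index, base_port=2222):
    """Build a TF_CONFIG-style dict for one worker.

    Hypothetical helper: `addresses` are the "host:port" strings of all
    barrier tasks. The original port is dropped and replaced with
    base_port + position, an assumption made to avoid port clashes
    when several tasks share a host.
    """
    hosts = [addr.split(":")[0] for addr in addresses]
    workers = ["{}:{}".format(host, base_port + i) for i, host in enumerate(hosts)]
    return {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": task_index},
    }


def train_partition(iterator):
    """Runs inside every task of the barrier stage."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import BarrierTaskContext

    context = BarrierTaskContext.get()
    # Task infos are ordered by partition ID, so the partition ID can
    # serve as the worker index.
    addresses = [info.address for info in context.getTaskInfos()]
    tf_config = build_tf_config(addresses, context.partitionId())

    # Global synchronization point: wait until every task in the stage
    # has reached this line.
    context.barrier()

    # Placeholder for the training step. A real job would export
    # TF_CONFIG and start a TensorFlow worker on this partition's data.
    rows = sum(1 for _ in iterator)
    yield json.dumps({"rows": rows, "tf_config": tf_config})


def run_barrier_stage(spark_context, data, num_tasks=2):
    """Run one stage in barrier mode: all tasks are launched together."""
    rdd = spark_context.parallelize(data, num_tasks)
    return rdd.barrier().mapPartitions(train_partition).collect()
```

With an existing SparkContext named sc that has at least two free task slots, run_barrier_stage(sc, range(100)) would return one JSON string per task. Calling mapPartitions without barrier() on the same RDD runs the stage in the usual MapReduce mode, where tasks are scheduled independently.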


Yanbo Liang
Staff Software Engineer
Yanbo is a staff software engineer at Hortonworks. He works at the intersection of systems and algorithms for machine learning and deep learning. He is an Apache Spark PMC member and contributes to several open source projects, such as TensorFlow, Keras, and XGBoost. He delivered the implementations of several major Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom, working on machine learning and distributed systems.