Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning

Thursday, March 21
11:00 AM - 11:40 AM
Room 120-121

Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is that big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don't play well together, because big data jobs and deep learning jobs are executed in fundamentally different ways.

Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can tell Spark whether to use MapReduce mode or barrier mode at each stage of the pipeline, which makes it easy to embed distributed deep learning training as a Spark stage and simplify the training workflow. In this talk, I will demonstrate, step by step, how to build a real-world pipeline that combines data processing in Spark with deep learning training in TensorFlow. I will also share best practices and hands-on experience that show the power of this new feature, and open the topic up for further discussion.

Robert Hryniewicz
AI Evangelist
Robert is an AI evangelist at Cloudera with over 12 years of experience working on projects related to artificial intelligence, robotics, IoT, and enterprise and embedded software. His primary focus at Cloudera is building communities around IoT, big data, and data science, and enabling enterprises to accelerate adoption of cutting-edge open-source technologies, from edge to AI.