Apache Spark 2.3 boosts advanced analytics and deep learning with Python

Apache Spark 2.3 boosts advanced analytics and deep learning with Python

Wednesday, April 18
11:00 AM - 11:40 AM
Convention Hall I - C

Python is one of the most popular programming languages for advanced analytics, data science, machine learning, and deep learning. One of Python’s greatest assets is its extensive set of libraries, such as Numpy, Pandas, Scikit-learn, Theano, TensorFlow, Keras, and so on. Apache Spark is becoming the core component for big data processing and playing important role to help data scientists solve complicated problems. It has a great significance and strong demand to integrate Spark with the extremely rich Python ecosystems to handle challenges in artificial intelligence. In the latest Spark 2.3, some very exciting features were put in, for example: vectorized UDF in PySpark, which leverages Apache Arrow to provide high performance interoperability between Spark and Pandas/Numpy; Image format in dataFrame/dataset, which can improve Spark and TensorFlow (or other deep learning libraries) interoperability; high-efficiency parallel modeling tuning with Spark MLlib, etc. In this talk, we'll share best practice on real use cases and hands-on experiences to illustrate the power of these new features and bring more discussions on this topic.

Presentation Video


Yanbo Liang
Staff Software Engineer
Yanbo is a staff software engineer at Hortonworks. He is working on the intersection of system and algorithm for machine learning and deep learning. He is an Apache Spark PMC member and contributes to several open source projects such as TensorFlow, Keras and XGBoost. He delivered the implementation of some major Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom working on machine learning and distributed system.