Running distributed TensorFlow in production: challenges and solutions on YARN 3.0

Wednesday, June 20
11:00 AM - 11:40 AM
Grand Ballroom 220A

Deep learning is increasingly popular, and TensorFlow is one of the most widely used deep learning platforms. More and more enterprises are trying TensorFlow to solve their use cases, but putting it into production is hard. This is especially true for distributed TensorFlow, where teams need to answer questions like:

• How to choose machine configurations, GPUs, etc.
• How many parameter servers are needed
• How to do parallel hyperparameter tuning
• How to choose between Docker and non-Docker containers
• How to place containers for better performance
• How to work efficiently alongside other big data applications such as Hive and Spark
• How to reserve resources for predictability

With the latest features added to YARN, such as GPU isolation, placement constraints (for wisely placing workers and parameter servers to better leverage resources), Docker container integration, and native service support, there is now a lot that can be done within YARN to better support deep learning and machine learning workloads.

In this talk, we will discuss the challenges of running TensorFlow in a production environment and how to use Apache Hadoop YARN 3.0 to address them.


Wangda Tan
Staff Software Engineer
Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and a Staff Software Engineer at Hortonworks. His main areas of work are Hadoop YARN GPU isolation and the resource scheduler, and he has contributed to features such as node labeling, resource preemption, and container resizing. Before joining Hortonworks, he worked at Pivotal on integrating OpenMPI and GraphLab with Hadoop YARN. Before that, he worked at Alibaba Cloud Computing, where he helped create a large-scale machine learning, matrix, and statistics computation platform using MapReduce and MPI.
Yanbo Liang
Staff Software Engineer
Yanbo is a staff software engineer at Hortonworks. His main interests center on implementing effective machine learning and deep learning algorithms and models in areas such as recommendation systems and natural language processing. He is an Apache Spark PMC member and contributes to many other open source projects, such as TensorFlow and Apache MXNet. He delivered the implementation of some core Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom, working on machine learning and distributed systems.