Deep learning is widely used, and TensorFlow is one of the most popular deep learning platforms. More and more enterprises are trying TensorFlow for their use cases, but putting TensorFlow into production is hard. This is especially true of distributed TensorFlow, which raises questions such as:
• How to choose machine configurations, GPUs, etc.
• How many parameter servers are needed
• How to do parallel hyperparameter tuning
• How to choose between Docker and non-Docker containers
• How to do container placement for better performance
• How to work efficiently alongside other big data applications such as Hive and Spark
• How to do resource reservation for predictability
With the latest features added to YARN, such as GPU isolation, placement constraints (which control how workers and parameter servers are placed to make better use of resources), Docker container integration, and native service support, there is now a lot that can be done within YARN to better support deep learning and machine learning workloads.
In this talk, we will discuss the challenges of running TensorFlow in a production environment and how to use Apache Hadoop YARN 3.0 to address them.