Accelerating TensorFlow with RDMA for high-performance deep learning

Thursday, April 19
2:50 PM - 3:30 PM
Room IV

Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.

In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. We then propose a unified way of achieving high performance by enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. With our proposed RDMA-gRPC design, TensorFlow needs to run only over the gRPC channel to obtain optimal performance. Our design includes advanced features such as message pipelining, message coalescing, and zero-copy transmission. Performance evaluations show that our proposed design can improve gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we achieve up to 35% performance improvement for TensorFlow training with CNN models.
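To give a flavor of one of the techniques named above, the sketch below illustrates message coalescing in the abstract: many small gradient payloads are batched into larger buffers so the transport pays per-message overhead once per batch rather than once per tensor. This is a minimal, hypothetical illustration; the function and threshold names are illustrative and not taken from the talk's actual RDMA-gRPC implementation.

```python
# Hypothetical sketch of message coalescing (illustrative only; not the
# talk's actual RDMA-gRPC code). Small payloads are packed into batches
# so per-message transport overhead is amortized across many tensors.

MAX_BATCH_BYTES = 64 * 1024  # illustrative coalescing threshold


def coalesce(messages):
    """Group small byte payloads into batches of at most MAX_BATCH_BYTES.

    A payload already larger than the threshold is emitted as its own batch.
    """
    batches, current, size = [], [], 0
    for msg in messages:
        # Flush the current batch if adding this payload would overflow it.
        if current and size + len(msg) > MAX_BATCH_BYTES:
            batches.append(b"".join(current))
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(b"".join(current))
    return batches


# Ten 8 KiB gradient shards (80 KiB total) coalesce into two sends
# instead of ten: one full 64 KiB batch and one 16 KiB remainder.
shards = [b"\x00" * 8192 for _ in range(10)]
batches = coalesce(shards)
```

In a real transport the win comes from issuing one RDMA operation (or one gRPC message) per batch; the batching logic itself is this simple.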



Dhabaleswar K (DK) Panda
Professor and University Distinguished Scholar
The Ohio State University
Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high-performance computing, communication protocols, big data, deep learning, file systems, network-based computing, and Quality of Service. He has published over 450 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, Omni-Path, High-Speed Ethernet, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with national laboratories and leading InfiniBand and Ethernet/iWARP companies on designing various subsystems of next-generation high-end systems. The MVAPICH2 (High-Performance MPI over InfiniBand, iWARP, and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,925 organizations worldwide (in 86 countries). This software has enabled several InfiniBand clusters (including the first one) to get into the latest TOP500 ranking. These software packages are also available with the OpenFabrics stack for network vendors (InfiniBand and iWARP), server vendors, and Linux distributors. The new RDMA-enabled Apache Hadoop and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, and Memcached, are publicly available. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, Sun, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available at
Xiaoyi Lu
Research Assistant Professor
The Ohio State University
Dr. Xiaoyi Lu is a Research Assistant Professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high-performance interconnects and protocols, big data, the Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS), virtualization, cloud computing, and deep learning. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities (PC Co-Chair, PC Member, and Reviewer) for academic journals and conferences. Dr. Lu is leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached, and the OSU HiBD micro-benchmarks, which are publicly available. These libraries are currently being used by more than 290 organizations from 34 countries, and more than 27,700 downloads have taken place from the project site. He is a core member of the MVAPICH2 (High-Performance MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE) project, where he leads the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds). He is a member of IEEE and ACM. More details about Dr. Lu are available at