YARN federation: taming a beasty fleet with global optimizations

Thursday, June 21
9:30 AM - 10:10 AM
Executive Ballroom 210D/H

Operating one cluster is hard; operating a few large clusters is harder; operating many massive clusters is... terrible. The burden hits both operators (e.g., workload balancing, data replication, HA strategies) and users (e.g., Where is my job running? Where is my data stored? Where do I have machine capacity?).

In this talk, we address this problem, drawing on our experience at Microsoft operating several clusters with tens of thousands of nodes each, and present a new set of mechanisms and policies that make the operation of such clusters seamless for users and painless for operators. From a user perspective, there is only "one cluster", and all the capacity purchased can be consumed anywhere in the company fleet. From an operator perspective, the fleet is organized in sub-clusters that self-balance and adapt to achieve global objectives.

A new component, the Global Policy Generator (GPG), oversees the operations of an entire, virtually unified federated cluster. GPG leverages lightweight coordination among the resource managers (RMs) of the sub-clusters to achieve high cluster utilization without compromising on important scheduling invariants and without incurring high preemption costs. In particular, GPG extends the familiar notion of hierarchical dominant resource fairness (HDRF) across sub-clusters and makes scheduling decisions that are close to consistent with global HDRF. Moreover, it optimizes for sub-cluster load balance and job-to-data locality. The architecture is designed to tolerate component failures and network partitions while remaining highly available.
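
As a rough illustration of what a "global" fairness view means, the sketch below computes each queue's dominant share against the capacity summed over all sub-clusters, then picks the queue with the lowest share as the next to serve. This is plain, flat dominant resource fairness, not the hierarchical variant and not GPG's actual algorithm; the function names, resource names, and numbers are all hypothetical.

```python
from typing import Dict, Tuple

# Resource vector, e.g. {"memory_gb": 100.0, "vcores": 50.0}. Names are illustrative.
Resources = Dict[str, float]


def aggregate(per_subcluster: Dict[str, Resources]) -> Resources:
    """Sum resource vectors over all sub-clusters."""
    total: Resources = {}
    for resources in per_subcluster.values():
        for name, amount in resources.items():
            total[name] = total.get(name, 0.0) + amount
    return total


def dominant_share(usage: Resources, capacity: Resources) -> float:
    """Largest fraction of any single resource type that `usage` occupies."""
    return max(
        (usage.get(name, 0.0) / cap for name, cap in capacity.items() if cap > 0),
        default=0.0,
    )


def neediest_queue(
    usage_by_queue: Dict[str, Dict[str, Resources]],
    capacity_by_subcluster: Dict[str, Resources],
) -> Tuple[str, Dict[str, float]]:
    """Return the queue with the lowest global dominant share, plus all shares."""
    global_capacity = aggregate(capacity_by_subcluster)
    shares = {
        queue: dominant_share(aggregate(usage), global_capacity)
        for queue, usage in usage_by_queue.items()
    }
    return min(shares, key=shares.get), shares


# Hypothetical fleet: two sub-clusters, two queues.
capacity = {
    "sc1": {"memory_gb": 100.0, "vcores": 50.0},
    "sc2": {"memory_gb": 100.0, "vcores": 50.0},
}
usage = {
    "A": {"sc1": {"memory_gb": 40.0, "vcores": 10.0},
          "sc2": {"memory_gb": 20.0, "vcores": 10.0}},
    "B": {"sc1": {"memory_gb": 10.0, "vcores": 25.0}},
}
queue, shares = neediest_queue(usage, capacity)
print(queue, shares)  # B has the lower dominant share (0.25 vs 0.30)
```

Note that queue A is memory-dominant and queue B is vcore-dominant; comparing them on their dominant shares, measured against fleet-wide capacity rather than any one sub-cluster, is what makes the comparison global.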

We track the progress of this work in YARN-7402.


Carlo Curino
Principal Scientist
Carlo A. Curino received a Bachelor's degree in Computer Science from Politecnico di Milano. He participated in a joint project between University of Illinois at Chicago (UIC) and Politecnico di Milano, obtaining a Master's degree in Computer Science from UIC and the Laurea Specialistica (cum laude) from Politecnico di Milano. During his PhD at Politecnico di Milano, he spent almost two years as a visiting researcher at University of California, Los Angeles (UCLA), working with Prof. Carlo Zaniolo (UCLA) and Prof. Alin Deutsch (UCSD). He then spent two years as a Postdoctoral Associate at CSAIL MIT, working with Prof. Samuel Madden and Prof. Hari Balakrishnan. At MIT he was also the primary lecturer for the course on databases CS630, taught in collaboration with Mike Stonebraker. He spent a year as a Research Scientist at Yahoo! Research. Carlo is currently a Principal Research Scientist in the Microsoft Cloud and Information Services Lab (CISL). His recent research interests include large-scale distributed systems, performance tuning, and scheduling. In the past he worked on mobile+cloud platforms, entity dedup at scale, relational databases and cloud computing, workload management and performance analysis, schema evolution, and temporal databases.
Subru Krishnan
Principal Research Engineer
Subru Krishnan is a Principal Research Engineer at Microsoft in the Cloud and Information Services Lab (CISL), currently focusing on YARN, specifically scaling it to 100K+ nodes and providing SLA guarantees. He has been working on the Hadoop ecosystem since 2007. Prior to Microsoft, he worked at Yahoo!, where he contributed to Oozie's precursor, near real-time stream processing on Hadoop, and HBase replication.