Operating one cluster is hard, operating a few large clusters is harder, and operating many massive clusters is... terrible. The burden falls on both operators (e.g., workload load balancing, data replication, HA strategies) and users (e.g., Where is my job running? Where is my data stored? Where do I have machine capacity?).
In this talk, we address this problem, drawing on our experience at Microsoft operating several clusters of tens of thousands of nodes each, and present a new set of mechanisms and policies that make the operation of such clusters seamless for users and painless for operators. From a user's perspective, there is only "one cluster", and all purchased capacity can be consumed anywhere in the company fleet. From an operator's perspective, the fleet is organized into sub-clusters that self-balance and adapt to achieve global objectives.
A new component, the Global Policy Generator (GPG), oversees the operation of the entire virtually unified federated cluster. The GPG leverages lightweight coordination among the resource managers (RMs) of the sub-clusters to achieve high cluster utilization without compromising important scheduling invariants and without incurring high preemption costs. In particular, the GPG extends the familiar notion of hierarchical dominant resource fairness (HDRF) across sub-clusters and makes scheduling decisions that closely approximate global HDRF. Moreover, it optimizes for sub-cluster load balance and job-to-data locality. The architecture is designed to tolerate component failures and network partitions while remaining highly available.
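To make the fairness notion concrete, the sketch below illustrates the core of dominant resource fairness (DRF), the building block that HDRF generalizes hierarchically: each tenant's dominant share is the largest fraction it holds of any single resource, and the scheduler favors the tenant with the smallest dominant share. This is an illustrative toy, not the GPG implementation; the capacity figures and tenant names are hypothetical.

```python
# Toy DRF sketch (hypothetical numbers; not the actual GPG/HDRF code).
# Dominant share = max fraction of any one resource a tenant holds;
# DRF grants the next allocation to the tenant with the lowest one.

TOTAL = {"cpu": 1000, "mem_gb": 4000}  # assumed aggregate capacity

def dominant_share(alloc):
    """Largest fraction of any single resource held by a tenant."""
    return max(alloc[r] / TOTAL[r] for r in TOTAL)

def next_tenant(allocations):
    """Pick the tenant with the smallest dominant share."""
    return min(allocations, key=lambda t: dominant_share(allocations[t]))

allocations = {
    "tenant_a": {"cpu": 300, "mem_gb": 400},   # dominant share 0.30 (cpu)
    "tenant_b": {"cpu": 100, "mem_gb": 1600},  # dominant share 0.40 (mem)
}
print(next_tenant(allocations))  # tenant_a: lower dominant share
```

HDRF applies the same comparison recursively over a queue hierarchy; the GPG's contribution is keeping this global invariant approximately satisfied while allocations are actually made by independent per-sub-cluster RMs.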
We track the progress of this work in YARN-7402.