The growing popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications need precise control over where their allocations are placed, which they express through complex constraints. Examples include:
• Deep learning applications need to run on GPU machines with specific GPU models and driver/kernel versions.
• Hive or Spark applications benefit from being collocated on the same rack to reduce network cost and thus speed up their execution. At the same time, it is desirable to limit the number of allocations per machine to minimize resource interference.
• Low-latency services such as HBase need to be allocated across failure domains to improve their availability.
• A DNS service might need to run on machines with a public IP address.
In this talk, we present the newly added support for expressive placement constraints in YARN. We show how applications can leverage such constraints to achieve complex placements, such as collocating their allocations on the same node/rack (affinity), spreading their allocations across nodes/racks (anti-affinity), or allowing up to a specific number of allocations per node group (cardinality) to strike a balance between the two. We describe real use cases from production clusters and show the benefits of placement constraints on large clusters using popular applications in both on-prem and cloud settings.
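To make the three constraint types concrete, the following is a minimal Python sketch of one way to model them. It is not YARN's API: the `Constraint` class, the helper functions, and the `satisfies` check are all hypothetical names introduced for illustration. The sketch assumes that each constraint counts the allocations carrying a given tag that already exist within a scope (a node or a rack), excluding the allocation being placed, and requires that count to fall within a minimum and maximum. Under that assumption, affinity and anti-affinity are simply special cases of cardinality.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Hypothetical model: bound the number of existing allocations
    tagged `target_tag` within the candidate's scope group."""
    scope: str        # "node" or "rack"
    target_tag: str   # tag of the allocations being counted
    min_card: int     # minimum allowed count in the scope group
    max_card: float   # maximum allowed count (math.inf for unbounded)


def affinity(scope, tag):
    # Place only where at least one allocation with `tag` already exists.
    return Constraint(scope, tag, 1, math.inf)


def anti_affinity(scope, tag):
    # Place only where no allocation with `tag` exists.
    return Constraint(scope, tag, 0, 0)


def cardinality(scope, tag, lo, hi):
    # Place only where between `lo` and `hi` allocations with `tag` exist.
    return Constraint(scope, tag, lo, hi)


def satisfies(constraint, candidate, rack_of, tags_on):
    """Return True if placing a new allocation on `candidate` meets the constraint.

    rack_of: dict mapping node name -> rack name
    tags_on: dict mapping node name -> list of tags of existing allocations
    """
    if constraint.scope == "node":
        group = [candidate]
    else:  # "rack": every node that shares the candidate's rack
        group = [n for n in rack_of if rack_of[n] == rack_of[candidate]]
    count = sum(tags_on.get(n, []).count(constraint.target_tag) for n in group)
    return constraint.min_card <= count <= constraint.max_card


# Illustrative cluster: two racks, three nodes, a few existing allocations.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
tags_on = {"n1": ["hbase"], "n2": ["spark", "spark"]}
```

With this toy cluster, `satisfies(affinity("rack", "hbase"), "n2", rack_of, tags_on)` holds because `n1` on the same rack already runs an `hbase` allocation, while `satisfies(anti_affinity("node", "hbase"), "n1", rack_of, tags_on)` does not. A real scheduler would also have to handle resource availability, constraint conflicts, and requests that arrive concurrently, none of which this sketch attempts.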