As Apache Hadoop clusters become central to an organization’s operations, organizations increasingly run clusters in more than one data center. Historically, this has been driven largely by business continuity planning or geo-localization requirements. More recently, it has also attracted interest from a hybrid cloud perspective, where organizations augment their traditional on-premises setup with cloud-based additions. A robust replication solution is a fundamental requirement in all of these cases.
Seamless disaster recovery poses several challenges. Data, metadata, and transaction information need to be moved in sync. It should also be easy for users and applications to reason about the state of the replica. “Hadoop scale” brings its own challenges, because bandwidth between clusters can be a limiting factor. Data transfer therefore has to be minimized in replication, failover, and failback scenarios.
In this talk, we will discuss how these challenges are addressed to support seamless replication and disaster recovery for Hive.