Seamless replication and disaster recovery for Apache Hive Warehouse

Thursday, June 21
9:30 AM - 10:10 AM
Meeting Room 211A/B/C/D

As Apache Hadoop clusters become central to an organization’s operations, organizations increasingly run clusters in more than one data center. Historically, this has been driven largely by business continuity planning or geo-localization requirements. More recently it has also gained interest from a hybrid cloud perspective, wherein a traditional on-premises setup is augmented with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.

Seamless disaster recovery poses several challenges. Data, metadata, and transaction information must be moved in sync, and it should be easy for users and applications to reason about the state of the replica. “Hadoop scale” also brings unique challenges, since bandwidth between clusters can be a limiting factor: data transfer has to be minimized for replication, failover, and failback scenarios.

In this talk, we will discuss how these challenges are addressed to support seamless replication and disaster recovery for Hive.
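As background, Hive exposes replication through a small set of HiveQL commands. The sketch below shows the general shape of a bootstrap dump, an incremental dump, and a load on the target cluster; the database name `sales_db` and the dump path are hypothetical, and the exact syntax shown follows Hive's replication v2 as of Hive 2.x/3.x (later releases revised the REPL LOAD form):

```sql
-- On the source cluster: bootstrap dump of the database to be replicated.
-- The command returns a dump location and the last replicated event ID.
REPL DUMP sales_db;

-- On the target cluster: load the dump to create or refresh the replica.
-- (The path below is a hypothetical dump location.)
REPL LOAD sales_db FROM '/apps/hive/repl/dump_dir_1';

-- Later, on the source cluster: an incremental dump ships only the
-- metastore events recorded after <last_event_id>, minimizing transfer.
REPL DUMP sales_db FROM <last_event_id>;

-- On either cluster: check how far the replica has caught up.
REPL STATUS sales_db;
```

The bootstrap/incremental split is what keeps cross-cluster bandwidth usage low: after the initial full copy, only the event log of changes needs to move.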


Sankar Hariappan
Staff Software Engineer
A core member of the R&D Engineering Group at Hortonworks, primarily working on HDP (Hortonworks Data Platform) and DPS (Data Plane Service). An active contributor and committer on the Apache Hive project, with major contributions to Hive replication and ACID features. Also has rich experience with distributed systems and in-memory database technologies.
Anishek Agarwal
Engineering Manager
Working on Hive replication for the past 1.5 years, building an effective, easy-to-use replication/DR solution. Previously built and managed the data platform pipelines for an AdTech company.