HDFS is well designed to operate efficiently at scale for normal hardware failures within a datacenter, but it is not designed to handle significant negative events, such as datacenter failures. To overcome this defect, a common practice of HDFS disaster recovery (DR) is replicating data from one location to another through DistCp, which provides a robust and reliable backup capability for HDFS data through batch operations. However, DistCp also has several drawbacks: (1) Taking HDFS Snapshots is time and space consuming on large HDFS cluster. (2) Applying file changes though MapReduce may introduce additional execution overhead and potential issues. (3) DistCp requires administrator intervene to trigger, perform, and verify DistCp jobs, which is not user-friendly in practice.
In this presentation, we will share our experience in HDFS DR and introduce our light-weighted HDFS disaster recovery system that addresses afore-mentioned problems. Different from DistCp, our light-weighted DR system is designed based on HDFS logs (e.g. edit log and Inotify), light-weighted producer/consumer framework, and FileSystem API. During synchronization, it fetches limited subsets of namespace and incremental file changes from NameNode, then our executors apply these changes incrementally to remote clusters through FileSystem API. Furthermore, it also provides a powerful user interface with trigger conditions, path filters and jobs scheduler, etc. Compared to DistCp, it is more straightforward, light-weighted, reliable, efficient, and user-friendly.