HDFS tiered storage

Tuesday, June 19
11:50 AM - 12:30 PM
Executive Ballroom 210A/E

Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure or Amazon S3, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, which requires complex workflows to synchronize data between filesystems, whether for business continuity planning (BCP) or to support hybrid cloud architectures, in order to meet business goals for durability, performance, and coordination.

To resolve this complexity, HDFS-9806 added a PROVIDED storage tier that mounts external storage systems in the HDFS NameNode. Building on this functionality, remote namespaces can now be synchronized with HDFS: writes are propagated to the remote storage asynchronously, and a local application that accesses file data stored remotely can read it back synchronously and transparently. In this talk, which covers the work in progress under HDFS-12090, we present how a Hadoop admin can manage storage tiering between clusters and how HDFS handles it internally, using the snapshotting mechanism and asynchronously satisfying the storage policy.


SPEAKERS

Chris Douglas
Principal Research Software Engineer
Microsoft
Chris Douglas has worked on Apache Hadoop since 2007, starting as a frequent contributor to the MapReduce data path. He is one of the original designers of YARN. As a member of the Cloud and Information Services Lab (CISL) at Microsoft, his research focuses on systems for large-scale analytics. His current work builds storage abstractions for big data workloads in cloud settings.
Thomas Demoor
Object Storage Architect
Western Digital
Thomas architects the S3 layer and the Hadoop integration of Western Digital's object storage system, ActiveScale. Together with his team, he has contributed multiple improvements to the Apache Hadoop s3a connector and co-architected the HDFS Provided Storage feature. He joined WD through the Amplidata acquisition. Previously, he obtained a PhD in Computer Science, specializing in queueing theory, at Ghent University, Belgium.