HDFS tiered storage: mounting object stores in HDFS

Wednesday, April 18
4:50 PM - 5:30 PM
Room III

Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters. This requires a complex workflow for synchronizing data between filesystems, whether for business continuity planning (BCP), for supporting hybrid cloud architectures, or both, in order to meet business goals for durability, performance, and coordination.

To resolve this complexity, HDFS-9806 added a PROVIDED storage tier to HDFS that allows external namespaces, both object stores and other HDFS clusters, to be mounted. Building on this functionality, remote namespaces can now be synchronized with HDFS: writes are propagated asynchronously to the remote storage, and data stored remotely can be read back synchronously and transparently by a local application. This talk, which corresponds to the work in progress under HDFS-12090, presents how a Hadoop admin can manage storage tiering between clusters and how HDFS handles it internally through the snapshotting mechanism and by asynchronously satisfying the storage policy.

Presentation Video

SPEAKERS

Thomas Demoor
Object Storage Architect
Western Digital
Thomas architects the S3 layer and the Hadoop integration of Western Digital's object storage system 'ActiveScale'. Together with the team, he has contributed multiple improvements to the Apache Hadoop s3a connector and has co-architected the HDFS Provided Storage feature. He joined WD through the Amplidata acquisition. Previously, he obtained a Computer Science PhD in Queueing Theory at Ghent University, Belgium.
Ewan Higgs
Software Architect
Western Digital
Ewan is a Software Architect at Western Digital, where he works on tiered storage between HDFS and S3. Previously, he worked on embedded systems, held various roles in the finance industry (mostly managing market data), and worked as a systems administrator for Ghent University's HPC group.