Expedia Group is migrating its Hadoop infrastructure from a single organization-wide on-premises cluster to many smaller in-cloud clusters. We've also moved from a centralized operating model, where one team was responsible for our Hadoop platform, to a distributed approach where infrastructure is owned and operated by our different brands: Hotels.com, Expedia.com, HomeAway.com, etc. This segmentation of our data platforms has given us greater agility and resource elasticity, and has reduced costs. However, it has also produced architectural fragmentation, creating cloud-based data silos that impede our ability to explore, discover, and share data across our organization. We describe these technical challenges and the solutions we've developed to provide our users with a virtual, unified view of our many data lakes. We also present Apiary, an open source project we developed that provides a standardized pattern for deploying and operating data lakes that support:
- federated data set sharing across accounts, regions, and clouds
- a "Bring Your Own Tool" culture, supporting a broad range of data processing platforms in the Hadoop ecosystem
- replication of data sets for disaster recovery
- data access security