This presentation describes the journey taken by the Hotels.com big data platform team when tasked with migrating big data sets and pipelines from on-premises clusters to cloud-based platforms. We present two open-source tools that we built to overcome the unexpected challenges we faced.
The first of these is Circus Train, a dataset replication tool that copies Hive tables between clusters and clouds. We will also discuss other options for dataset replication and the features that set Circus Train apart from them. The second tool is Waggle Dance, a federated Hive query service that enables querying of data stored across multiple Hive metastores. We will demonstrate how Waggle Dance differs from existing federated SQL query engines and which use cases it enables. Using real-world examples, we will describe how we've used these tools to successfully build a petabyte-scale platform that is now also used by other brands within the Expedia organisation. We focus on actual problems and solutions that have arisen in a huge, organically grown corporation, rather than on idealised architectures.