Tools and approaches for migrating big datasets to the cloud

Wednesday, April 18
2:50 PM - 3:30 PM
Room V

This presentation describes the journey taken by the big data platform team when tasked with migrating big datasets and pipelines from on-premises clusters to cloud-based platforms. We present two open source tools that we built to overcome the unexpected challenges we faced along the way.

The first of these is Circus Train, a dataset replication tool that copies Hive tables between clusters and clouds. We will discuss the other options available for dataset replication and the features that make Circus Train unique. The second tool is Waggle Dance, a federated Hive query service that enables querying of data stored across multiple Hive metastores. We will demonstrate how Waggle Dance differs from existing federated SQL query engine tools and the use cases it enables.

Using real-world examples, we will describe how we've used these tools to successfully build a petabyte-scale platform that is now also being used by other brands within the Expedia organisation. We focus on actual problems and solutions that have arisen in a huge, organically grown corporation, rather than idealised architectures.
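To give a flavour of how these tools are driven, the sketches below show the general shape of their YAML configuration. They are illustrative only: the field names follow the projects' public documentation as best we recall it, and all hostnames, buckets, database names, and table names are invented, so consult each project's README for the authoritative schema.

```yaml
# Illustrative Circus Train replication config (all endpoints and names invented).
# Describes a source metastore, a replica metastore, and the tables to copy.
source-catalog:
  name: on-premises-cluster
  hive-metastore-uris: thrift://on-prem-metastore.example.com:9083
replica-catalog:
  name: cloud-platform
  hive-metastore-uris: thrift://cloud-metastore.example.com:9083
table-replications:
  - source-table:
      database-name: bookings
      table-name: transactions
    replica-table:
      database-name: bookings
      table-name: transactions
      table-location: s3://example-bucket/bookings/transactions/
```

Waggle Dance is configured along similar lines: a primary metastore is federated with one or more remote metastores, whose databases are typically exposed under a prefix so that names do not collide.

```yaml
# Illustrative Waggle Dance federation config (all endpoints and names invented).
primary-meta-store:
  name: primary
  remote-meta-store-uris: thrift://primary-metastore.example.com:9083
federated-meta-stores:
  - name: remote-brand
    remote-meta-store-uris: thrift://remote-metastore.example.com:9083
    database-prefix: remote_
```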


Adrian Woodhead
Principal Engineer
Adrian is a principal engineer based in London, where he works with teams focusing on the services powering their big data processing systems. Prior to this, Adrian led a big data team and has been using Hadoop and various other parts of the big data ecosystem since 2007. He has previously spoken at Strata and co-wrote a chapter in the early editions of the seminal “Hadoop: The Definitive Guide”.
Elliot West
Principal Engineer
Elliot is a principal engineer based in London, where he designs tooling and platforms in the big data space. Prior to this, Elliot worked in a data team developing services for managing large volumes of music metadata.