Running Hive queries fast in the cloud

Wednesday, June 20
4:00 PM - 4:40 PM
Grand Ballroom 220C

More and more companies are storing their datasets in native cloud storage such as S3 and WASB. Querying those datasets in place has always been possible, but making such queries efficient and secure means clearing many hurdles. Cloud deployments bring further challenges: rapid cluster deployment, scaling, security, and noisy neighbors.

In this talk, we'll cover the lessons we've learned along the way. We'll use a Tableau workbook to illustrate a typical BI scenario and show what's happening behind the scenes. We'll dive into S3 caching, partitioning strategy, query tuning, configuration, and Ranger integration, and point out pitfalls to avoid. Lastly, we'll discuss the internals of ACID merge, how it works against S3 buckets, and the key metrics to monitor during cloud operations.

The overall goal of this talk is to equip people in the community with the knowledge to operate Hive confidently in the cloud.


Nita Dembla
Senior Software Engineer
I am an IT professional with varied experience and a passion for faster data retrieval. After a decade of optimizing queries and building and implementing statistics-gathering algorithms and the optimizer at IBM-Informix, I'm now an Apache contributor focused on running benchmarks on Hive and related SQL-on-Hadoop technologies such as Impala and Presto.